Audio Model Integration - StepFun Documentation

In addition to inference models, Step Plan supports audio synthesis and speech recognition models through dedicated paths. All requests use the /step_plan/v1/... path prefix, and the domain is always https://api.stepfun.ai.
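
As a minimal sketch of the path convention above, the helper below joins the fixed domain and the /step_plan/v1 prefix with an endpoint path. The name `step_plan_url` and its `scheme` parameter are illustrative assumptions, not part of any StepFun SDK.

```python
STEP_PLAN_HOST = "api.stepfun.ai"
STEP_PLAN_PREFIX = "/step_plan/v1"


def step_plan_url(path: str, scheme: str = "https") -> str:
    """Build a Step Plan URL for an endpoint path such as 'audio/speech'.

    Hypothetical helper for illustration; not part of any StepFun SDK.
    """
    # Strip a leading slash so "audio/speech" and "/audio/speech" both work.
    return f"{scheme}://{STEP_PLAN_HOST}{STEP_PLAN_PREFIX}/{path.lstrip('/')}"


print(step_plan_url("audio/speech"))
print(step_plan_url("realtime/audio", scheme="wss"))
```

Keeping the prefix in one place makes it harder to call the open platform path by mistake when a Step Plan path is intended.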

Prerequisites

You have subscribed to a Step Plan.
You have obtained an API Key.

Audio synthesis models

Supported Models

Model	Description
`stepaudio-2.5-tts`	Next-generation Contextual TTS based on context understanding, supporting dual-level control of global and in-text contexts. It generates human-like expressions with natural breathing rhythm, proper emphasis, and emotional arcs.

Endpoint Paths

Capability	Request Method	Step Plan Path
Non-streaming Audio Synthesis	POST	`https://api.stepfun.ai/step_plan/v1/audio/speech`
Streaming Audio Synthesis	WebSocket	`wss://api.stepfun.ai/step_plan/v1/realtime/audio`
Voice Preview	POST	`https://api.stepfun.ai/step_plan/v1/audio/voices/preview`
Voice Cloning	POST	`https://api.stepfun.ai/step_plan/v1/audio/voices`

The endpoint parameters are identical to those of the open platform. For details, refer to the API documentation of each endpoint: Audio Synthesis, Streaming Audio Synthesis, Voice Cloning Preview, Voice Cloning.

Billing Instructions

Billing follows the same logic as the open platform: the amount billed on the open platform is converted into consumption of your Step Plan total quota. For unit prices, refer to Pricing and Rate Limits.

Examples

The three examples below use, in order: curl, Python (OpenAI SDK), and Python (WebSocket Streaming).

curl -X POST 'https://api.stepfun.ai/step_plan/v1/audio/speech' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "model": "stepaudio-2.5-tts",
    "input": "The weather is great today, perfect for a walk.",
    "voice": "cixingnansheng",
    "instruction": "gentle tone, slightly slow speed",
    "response_format": "mp3"
}' \
--output speech.mp3

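import os

from openai import OpenAI

client = OpenAI(
    # Read the key from the environment, as in the curl example.
    api_key=os.environ["STEP_API_KEY"],
    # Step Plan uses the /step_plan/v1 path
    base_url="https://api.stepfun.ai/step_plan/v1",
)

# Stream the response body straight to disk instead of buffering it in memory.
with client.audio.speech.with_streaming_response.create(
    model="stepaudio-2.5-tts",
    input="The weather is great today, perfect for a walk.",
    voice="cixingnansheng",
    response_format="mp3",
    # "instruction" is not an OpenAI SDK parameter, so it is passed via extra_body.
    extra_body={
        "instruction": "gentle tone, slightly slow speed",
    },
) as response:
    response.stream_to_file("speech.mp3")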
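
Where the OpenAI SDK is not available, the same non-streaming request can be assembled with the Python standard library. This is a sketch under assumptions: the helper name `build_speech_request` is hypothetical, the body fields simply mirror the curl example, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

SPEECH_URL = "https://api.stepfun.ai/step_plan/v1/audio/speech"


def build_speech_request(api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper for illustration only.
    """
    body = {
        "model": "stepaudio-2.5-tts",
        "input": text,
        "voice": "cixingnansheng",
        "instruction": "gentle tone, slightly slow speed",
        "response_format": "mp3",
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_speech_request(
    os.environ.get("STEP_API_KEY", "YOUR_STEP_API_KEY"),
    "The weather is great today, perfect for a walk.",
)
# To send it and save the audio:
# with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
#     f.write(resp.read())
```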

import websocket
import rel
import json

headers = {
    "Authorization": "Bearer YOUR_STEP_API_KEY"
}

def get_start_event(sid):
    return json.dumps({
        "type": "tts.create",
        "data": {
            "session_id": sid,
            "voice_id": "cixingnansheng",
            "response_format": "wav",
            "volume_ratio": 1.0,
            "speed_ratio": 1.0,
            "sample_rate": 16000,
            "instruction": "gentle tone, slightly slow speed"
        },
    })

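def on_message(ws, message):
    data = json.loads(message)
    event_type = data.get("type")
    # Not every event is guaranteed to carry a "data" object, so read it defensively.
    session_id = (data.get("data") or {}).get("session_id")

    # Once the connection is established, start a synthesis session.
    if event_type == "tts.connection.done":
        ws.send(get_start_event(session_id))

    print(message)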

def on_error(ws, error):
    print(error)

if __name__ == "__main__":
    ws = websocket.WebSocketApp(
        # Step Plan uses the /step_plan/v1 path
        "wss://api.stepfun.ai/step_plan/v1/realtime/audio?model=stepaudio-2.5-tts",
        header=headers,
        on_message=on_message,
        on_error=on_error,
    )

    ws.run_forever(dispatcher=rel, reconnect=5)
    rel.signal(2, rel.abort)
    rel.dispatch()

Speech recognition models

Supported Models

Model	Description
`stepaudio-2.5-asr`	Next-generation streaming ASR model with a 4B MTP architecture, targeting near-real-time transcription with low latency and high recognition accuracy.

Endpoint Paths

Capability	Request Method	Step Plan Path
Speech Recognition (Streaming Output)	POST	`https://api.stepfun.ai/step_plan/v1/audio/asr/sse`

The endpoint parameters are identical to those of the open platform. See Speech Recognition (Streaming Output) for details.

Capability Limitations

Under Step Plan, stepaudio-2.5-asr supports only the HTTP + SSE call method, which matches what the open platform supports. Other real-time transports, such as WebSocket, are not supported.
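
Because results arrive as a server-sent event stream, a client has to split the response body into events. The sketch below parses the `data:` lines of the standard SSE wire format. It assumes each event's data is a JSON object, which this page does not confirm, and the name `iter_sse_json` is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per SSE event.

    Assumption: every event carries JSON in its data field(s).
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # Per the SSE format, one optional space after the colon is dropped.
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line ends the event; multiple data lines join with "\n".
            yield json.loads("\n".join(buffer))
            buffer = []
        # Other fields (event:, id:, retry:) and ":" comments are ignored here.


# Placeholder payloads; the real event schema is defined by the API reference.
sample = ['data: {"n": 1}\n', "\n", ": keep-alive\n", 'data: {"n": 2}\n', "\n"]
for event in iter_sse_json(sample):
    print(event)
```

In practice the `lines` argument would be the decoded line iterator of the HTTP response body.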

Billing Instructions

Examples

curl

curl -X POST 'https://api.stepfun.ai/step_plan/v1/audio/asr/sse' \
-H 'Content-Type: application/json' \
-H 'Accept: text/event-stream' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "audio": {
        "data": "base64_encoded_audio",
        "input": {
            "transcription": {
                "model": "stepaudio-2.5-asr",
                "language": "zh",
                "enable_itn": true
            },
            "format": {
                "type": "pcm",
                "codec": "pcm_s16le",
                "rate": 16000,
                "bits": 16,
                "channel": 1
            }
        }
    }
}'
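
The `data` field in the request above holds the audio as a base64 string. A minimal sketch of building that body from raw PCM bytes follows. It assumes standard (not URL-safe) base64 and audio that is already 16 kHz, 16-bit, mono; the helper name `build_asr_payload` is hypothetical, and the field names simply mirror the curl example.

```python
import base64
import json


def build_asr_payload(pcm_bytes: bytes, language: str = "zh") -> dict:
    """Build the JSON body for /audio/asr/sse from raw PCM audio.

    Hypothetical helper; field names are copied from the curl example.
    """
    return {
        "audio": {
            # Assumption: standard base64, decoded to an ASCII string.
            "data": base64.b64encode(pcm_bytes).decode("ascii"),
            "input": {
                "transcription": {
                    "model": "stepaudio-2.5-asr",
                    "language": language,
                    "enable_itn": True,
                },
                "format": {
                    "type": "pcm",
                    "codec": "pcm_s16le",
                    "rate": 16000,
                    "bits": 16,
                    "channel": 1,
                },
            },
        }
    }


# Two silent 16-bit samples stand in for real audio.
body = json.dumps(build_asr_payload(b"\x00\x00\x00\x00"))
print(body)
```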
