In addition to inference models, Step Plan provides access to audio synthesis and speech recognition models through dedicated paths. All requests use the /step_plan/v1/... path prefix, and the domain is always https://api.stepfun.ai.
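
Because the domain and prefix are fixed, endpoint URLs can be assembled from a relative path. The following is a minimal sketch; `step_plan_url` and the `STEP_PLAN_*` variables are hypothetical names introduced for illustration, not part of any official tooling.

```shell
# Hypothetical helper: build a Step Plan URL from a relative path.
# Host and prefix are taken from the text above.
STEP_PLAN_HOST='api.stepfun.ai'
STEP_PLAN_PREFIX='/step_plan/v1'

step_plan_url() {
    path=${1#/}            # drop one leading slash, if present
    scheme=${2:-https}     # pass "wss" for the WebSocket endpoint
    printf '%s://%s%s/%s\n' "$scheme" "$STEP_PLAN_HOST" "$STEP_PLAN_PREFIX" "$path"
}
```

For example, `step_plan_url audio/speech` yields the non-streaming synthesis URL, and `step_plan_url realtime/audio wss` yields the WebSocket one.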

Prerequisites

  1. You have subscribed to a Step Plan.
  2. You have obtained an API key.

Audio synthesis models

Supported Models

| Model | Description |
| --- | --- |
| stepaudio-2.5-tts | Next-generation contextual TTS based on context understanding, supporting dual-level control of global and in-text contexts. It generates human-like expressions with natural breathing rhythm, proper emphasis, and emotional arcs. |

Endpoint Paths

| Capability | Request Method | Step Plan Path |
| --- | --- | --- |
| Non-streaming Audio Synthesis | POST | https://api.stepfun.ai/step_plan/v1/audio/speech |
| Streaming Audio Synthesis | WebSocket | wss://api.stepfun.ai/step_plan/v1/realtime/audio |
| Voice Preview | POST | https://api.stepfun.ai/step_plan/v1/audio/voices/preview |
| Voice Cloning | POST | https://api.stepfun.ai/step_plan/v1/audio/voices |

The endpoint parameters are exactly the same as on the open platform. For details, refer to the API documentation of each endpoint: Audio Synthesis, Streaming Audio Synthesis, Voice Cloning Preview, Voice Cloning.

Billing Instructions

The billing logic is the same as on the open platform. The amount billed under open-platform pricing is converted into consumption of your total Step Plan quota. For specific unit prices, refer to Pricing and Rate Limits.

Examples

curl -X POST 'https://api.stepfun.ai/step_plan/v1/audio/speech' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "model": "stepaudio-2.5-tts",
    "input": "The weather is great today, perfect for a walk.",
    "voice": "cixingnansheng",
    "instruction": "gentle tone, slightly slow speed",
    "response_format": "mp3"
}' \
--output speech.mp3

Speech recognition models

Supported Models

| Model | Description |
| --- | --- |
| stepaudio-2.5-asr | New-generation streaming ASR model with a 4B MTP architecture, targeting near-realtime transcription with low latency and high recognition accuracy. |

Endpoint Paths

| Capability | Request Method | Step Plan Path |
| --- | --- | --- |
| Speech Recognition (Streaming Output) | POST | https://api.stepfun.ai/step_plan/v1/audio/asr/sse |

The endpoint parameters are exactly the same as on the open platform. See Speech Recognition (Streaming Output) for details.

Capability Limitations

Under Step Plan, stepaudio-2.5-asr supports only the HTTP + SSE call method, which matches what the open platform supports. Other real-time transports, such as WebSocket, are not supported.
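
Since results arrive over SSE, a client reads the response as a stream of `data:` lines. The following is a minimal sketch that assumes standard SSE framing; the JSON schema of each event is not described on this page, so payloads are passed through unparsed. `sse_data_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: print the payload of every SSE "data:" line read from stdin.
# Other SSE fields (event:, id:, comments) and blank separators are skipped.
sse_data_lines() {
    cr=$(printf '\r')
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%"$cr"}              # tolerate CRLF line endings
        case "$line" in
            data:*)
                payload=${line#data:}
                payload=${payload# }    # SSE allows one optional space after the colon
                printf '%s\n' "$payload"
                ;;
        esac
    done
}

# Example (not run here): -N disables curl's output buffering so events print as they arrive.
# curl -N -X POST 'https://api.stepfun.ai/step_plan/v1/audio/asr/sse' ... | sse_data_lines
```

Each printed line can then be handed to a JSON tool of your choice once the event schema is known from the Speech Recognition (Streaming Output) documentation.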

Billing Instructions

The billing logic is the same as on the open platform. The amount billed under open-platform pricing is converted into consumption of your total Step Plan quota. For specific unit prices, refer to Pricing and Rate Limits.

Examples

curl -X POST 'https://api.stepfun.ai/step_plan/v1/audio/asr/sse' \
-H 'Content-Type: application/json' \
-H 'Accept: text/event-stream' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "audio": {
        "data": "base64_encoded_audio",
        "input": {
            "transcription": {
                "model": "stepaudio-2.5-asr",
                "language": "zh",
                "enable_itn": true
            },
            "format": {
                "type": "pcm",
                "codec": "pcm_s16le",
                "rate": 16000,
                "bits": 16,
                "channel": 1
            }
        }
    }
}'
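
The request above leaves `data` as the placeholder `base64_encoded_audio`. The following is a minimal sketch of one way to fill it from a local file, assuming the file is raw PCM that already matches the `format` block (16 kHz, 16-bit, mono). `build_asr_body` and `sample.pcm` are hypothetical names; the JSON fields are copied from the example above.

```shell
# Hypothetical helper: print the ASR request body for a raw PCM file.
# Assumption: the file is already 16 kHz, 16-bit, mono PCM (pcm_s16le).
build_asr_body() {
    pcm_file=$1
    # Read via stdin and strip line wraps so the result is one base64 string.
    audio_b64=$(base64 < "$pcm_file" | tr -d '\n')
    # Base64 text contains no characters that need JSON escaping.
    printf '{"audio":{"data":"%s","input":{"transcription":{"model":"stepaudio-2.5-asr","language":"zh","enable_itn":true},"format":{"type":"pcm","codec":"pcm_s16le","rate":16000,"bits":16,"channel":1}}}}\n' "$audio_b64"
}

# Example (not run here): send the body on stdin to avoid command-line length limits.
# build_asr_body sample.pcm | curl -N -X POST 'https://api.stepfun.ai/step_plan/v1/audio/asr/sse' \
#   -H 'Content-Type: application/json' -H 'Accept: text/event-stream' \
#   -H "Authorization: Bearer $STEP_API_KEY" --data-binary @-
```

Piping the body with `--data-binary @-` keeps large audio payloads out of the argument list, which matters once the base64 string exceeds the shell's argument size limit.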