This API allows you to generate audio using our Text-to-Speech (TTS) model.
Endpoint
POST https://api.stepfun.ai/v1/audio/speech
For Step Plan, use
POST https://api.stepfun.ai/step_plan/v1/audio/speech
Request body
-
model string required
The ID of the model to use. Currently supports step-tts-2, step-tts-mini, and stepaudio-2.5-tts. The step-tts-vivid model name is deprecated, but existing user requests will continue to be supported.
-
input string required
The text to generate audio for. The maximum length is 1,000 characters. When using stepaudio-2.5-tts, content inside parentheses () is treated as instructions and is not spoken. If you need the text itself to be spoken, do not wrap it in parentheses.
-
voice string required
The voice to use for generation. Supports both official voices and custom cloned voices.
-
response_format string optional
The audio format for the returned output. Supported formats: wav, mp3, flac, opus, pcm. Default: mp3.
-
speed float optional
The speed of the generated audio. Range: 0.5 to 2.0. Default: 1.0. A value of 0.5 means half speed.
-
volume float optional
The volume of the generated audio. Range: 0.1 to 2.0. Default: 1.0. A value of 0.1 reduces the volume to 10%; 2.0 increases it to 200%.
-
voice_label object optional
Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time; combinations are not yet supported.
  language string optional
  Language. Supported values: Cantonese, Sichuanese, Japanese.
  emotion string optional
  Emotion tag. Supports up to 11 options, such as Happy and Angry. Supported values may vary by model; see voice tags.
  style string optional
  Supports up to 17 speaking rates or delivery styles. Supported values may vary by model; see voice tags.
-
instruction string optional
Global natural-language guidance that sets the overall emotional tone, character persona, and so on for the entire audio. Only effective with the stepaudio-2.5-tts model; other models return an error if this parameter is passed. Maximum length: 200 characters.
-
sample_rate integer optional
The sampling rate in Hz. Supports 8000, 16000, 22050, 24000, 48000. Default: 24000. Higher rates improve audio quality but increase file size. 48000 was added in recent iterations.
-
pronunciation_map object array optional
Defines pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second tone, 3 for the third tone, 4 for the fourth tone, and 5 for the neutral tone.
  tone string required
  Specific pronunciation mapping rules, separated by /. Example: ["LOL/laugh out loudly"].
-
stream_format string optional
Streaming return mode. By default, audio is returned directly. Supported values: sse, audio. Default: audio. When sse is specified, audio is returned via Server-Sent Events (SSE) as data packets of the following event types:
  speech.audio.delta: Audio chunk. The audio field contains the Base64-encoded binary data of this chunk; concatenate all chunks to form the complete audio.
  speech.audio.done: Generation complete; audio is an empty string.
  speech.audio.error: An error occurred during generation.
-
markdown_filter bool optional
Whether to enable Markdown filtering.
-
return_url bool optional
Only effective for non-streaming requests. When set to true, returns a URL to the audio file instead of the binary audio stream. The URL is valid for 12 hours.
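The limits listed in the request body can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: build_speech_payload is a hypothetical helper name, the voice value is left to the caller, and counting characters with Python's len() is an assumption about how the service measures the 1,000 and 200 character limits.

```python
# Allowed values copied from the parameter descriptions above.
FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 48000}


def build_speech_payload(model, text, voice, *, response_format="mp3",
                         speed=1.0, volume=1.0, sample_rate=24000,
                         instruction=None):
    """Hypothetical helper: validate arguments and return the JSON body as a dict."""
    # Assumption: the documented limit maps to len() of the Python string.
    if len(text) > 1000:
        raise ValueError("input exceeds 1,000 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")

    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "volume": volume,
        "sample_rate": sample_rate,
    }
    if instruction is not None:
        # Other models are documented to return an error for this parameter.
        if model != "stepaudio-2.5-tts":
            raise ValueError("instruction is only supported by stepaudio-2.5-tts")
        if len(instruction) > 200:
            raise ValueError("instruction exceeds 200 characters")
        payload["instruction"] = instruction
    return payload
```

Failing early on the client avoids a round trip for requests that the documented limits already rule out; the server remains the authority on what is actually accepted.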
Response
Audio file.
Examples
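When stream_format is set to sse, the response is a sequence of events rather than a single audio file. A minimal sketch of reassembling the audio from those events follows. The exact packet layout is not shown in this page, so the code assumes that each SSE data: line carries one JSON object with a type field naming the event and an audio field; collect_sse_audio is a hypothetical helper name, and the parsing should be adjusted to the real packets.

```python
import base64
import json


def collect_sse_audio(lines):
    """Reassemble audio bytes from an iterable of SSE text lines.

    Assumption: every `data:` line holds one JSON object with a `type`
    field (the event type) and an `audio` field; the real layout may differ.
    """
    chunks = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        event = json.loads(line[len("data:"):])
        kind = event.get("type")
        if kind == "speech.audio.delta":
            # Each delta carries one Base64-encoded chunk of the audio.
            chunks.append(base64.b64decode(event["audio"]))
        elif kind == "speech.audio.done":
            break  # generation complete; audio is an empty string here
        elif kind == "speech.audio.error":
            raise RuntimeError(f"generation failed: {event}")
    # Concatenating all chunks in order yields the complete audio.
    return b"".join(chunks)
```

The function accepts any iterable of decoded text lines, so it can be fed from a streaming HTTP response or from a recorded transcript of events when testing.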
- python
- js
- curl
- stepaudio-2.5-tts (python)
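A minimal non-streaming request might look like the sketch below, which uses only the Python standard library. Several details are assumptions rather than documented facts: bearer-token authentication in the Authorization header, the helper names make_speech_request and synthesize_to_file, the STEPFUN_API_KEY environment variable, and the placeholder voice ID.

```python
import json
import urllib.request

API_URL = "https://api.stepfun.ai/v1/audio/speech"


def make_speech_request(api_key, payload, url=API_URL):
    """Build (but do not send) a POST request for the speech endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: bearer-token auth; this page does not specify it.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key, payload, path):
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(make_speech_request(api_key, payload)) as resp:
        audio = resp.read()
    with open(path, "wb") as f:
        f.write(audio)


# Example call (not executed here; needs a valid key and voice ID):
#
#   import os
#   payload = {
#       "model": "step-tts-mini",
#       "input": "Hello from the speech endpoint.",
#       "voice": "your-voice-id",  # placeholder, not a real voice
#   }
#   synthesize_to_file(os.environ["STEPFUN_API_KEY"], payload, "speech.mp3")
```

For Step Plan, the url argument can be set to https://api.stepfun.ai/step_plan/v1/audio/speech. Because the default response_format is mp3, the example writes to a .mp3 file; change the extension if another format is requested.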