> ## Documentation Index
> Fetch the complete documentation index at: https://platform.stepfun.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Text-to-Speech

This API allows you to generate audio using our Text-to-Speech (TTS) model.

### Endpoint

`POST https://api.stepfun.ai/v1/audio/speech`

For Step Plan, use `POST https://api.stepfun.ai/step_plan/v1/audio/speech`

### Request body

* `model` `string` ***required***
  The ID of the model to use. Currently supports `step-tts-2` and `stepaudio-2.5-tts`. The `step-tts-vivid` model name is deprecated but existing user requests will continue to be supported.

* `input` `string` ***required***
  The text to generate audio for. The maximum length is 1,000 characters. When using `stepaudio-2.5-tts`, content inside parentheses `()` will be treated as instructions and will not be spoken. If you need the text itself to be spoken, do not wrap it in parentheses.

* `voice` `string` ***required***
  The voice to use for generation. Supports both [official voices](/en/guides/developer/tts#system-voice-id-list) and custom cloned voices.

* `response_format` `string` ***optional***
  The audio format for the returned output. Supported formats: `wav`, `mp3`, `flac`, `opus`, `pcm`. Default: `mp3`.

* `speed` `float` ***optional***
  The speed of the generated audio. Range: 0.5 to 2.0. Default: 1.0. 0.5 means half speed.

* `volume` `float` ***optional***
  The volume of the generated audio. Range: 0.1 to 2.0. Default: 1.0. 0.1 reduces the volume to 10%; 2.0 increases it to 200%.

* `voice_label` `object` ***optional***
  Voice tags. Required when using a custom voice. Only one of `language`, `emotion`, or `style` can be set at a time; combinations are not yet supported.

  * `language` `string` ***optional***
    Language. Supported values: `Cantonese`, `Sichuanese`, `Japanese`.

  * `emotion` `string` ***optional***
    Emotion tag. Supports up to 11 options, such as `Happy` and `Angry`. Supported values may vary by model; see [voice tags](/en/guides/developer/tts#voice-tags-list).

  * `style` `string` ***optional***
    Style tag. Supports up to 17 speaking-rate and delivery-style options. Supported values may vary by model; see [voice tags](/en/guides/developer/tts#voice-tags-list).

  ⚠️ Note: The `stepaudio-2.5-tts` model does not support `voice_label`; passing it will cause an error. If you are using `stepaudio-2.5-tts`, use the `instruction` field or inline `()` prompts in the text to control emotion and style instead. For other models, see [voice tags](/en/guides/developer/tts#voice-tags-list).

* `instruction` `string` ***optional***
  Global natural language guidance. Only effective when using the `stepaudio-2.5-tts` model; other models will return an error if this parameter is passed. Used to set the overall emotional tone, character persona, etc. for the entire audio. Maximum length: 200 characters.

* `sample_rate` `integer` ***optional***
  The sampling rate. Supports `8000`, `16000`, `22050`, `24000`, `48000`. Default: `24000`. Higher rates improve audio quality but increase file size. `48000` was added in recent iterations.

* `pronunciation_map` `object array` ***optional***
  Defines pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second tone, 3 for the third tone, 4 for the fourth tone, and 5 for the neutral tone.

  * `tone` `string` ***required***
    Pronunciation mapping rules. Within each rule, the original text and its reading are separated by `/`. Example: `["LOL/laugh out loudly"]`.

* `stream_format` `string` ***optional***
  Streaming return mode. By default, audio is returned directly. Supported values: `sse`, `audio`. Default: `audio`. When `sse` is specified, audio is returned via Server-Sent Events (SSE) with the following data packet format:

  ```text theme={null}
  data: {"type":"speech.audio.delta","audio":""}
  data: {"type":"speech.audio.delta","audio":""}
  data: {"type":"speech.audio.done","audio":""}
  data: [DONE]
  ```

  Event types:

  * `speech.audio.delta`: Audio chunk. The `audio` field contains the Base64-encoded binary data of this chunk; concatenate all chunks to form the complete audio.
  * `speech.audio.done`: Generation complete; `audio` is an empty string.
  * `speech.audio.error`: An error occurred during generation.

* `markdown_filter` `bool` ***optional***
  Whether to enable Markdown filtering.

* `return_url` `bool` ***optional***
  Only effective for non-streaming requests. When set to `true`, returns a URL to the audio file instead of the binary audio stream. The URL is valid for 12 hours.

### Response

Audio file.

### Examples

```python theme={null}
from pathlib import Path

from openai import OpenAI

speech_file_path = Path("step-tts.mp3")

client = OpenAI(
    api_key="STEP_API_KEY",  # replace with your API key
    base_url="https://api.stepfun.ai/v1",
)

response = client.audio.speech.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
    # Fields outside the OpenAI schema are passed through extra_body
    extra_body={
        "volume": 1.0,
        "voice_label": {
            "language": "Cantonese",  # choose one of language / emotion / style
        },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)

# Write the returned audio to disk
response.stream_to_file(speech_file_path)
```

```js theme={null}
import OpenAI from "openai";
import fs from "fs";
import path from "path";

const STEP_API_KEY = "STEP_API_KEY"; // replace with your API key
const STEP_API_MODEL = "step-tts-2";

const openai = new OpenAI({
  apiKey: STEP_API_KEY,
  baseURL: "https://api.stepfun.ai/v1"
});

async function main() {
  const speechFile = path.resolve("./speech.mp3");
  const mp3 = await openai.audio.speech.create({
    model: STEP_API_MODEL,
    voice: "lively-girl",
    input: "StepFun is building the next generation of AGI.",
    // The Node SDK has no extra_body option; fields outside the OpenAI
    // schema are sent as-is at the top level of the request body.
    volume: 2.0,
    voice_label: {
      language: "Cantonese" // Optional: choose one of language/emotion/style
    },
    pronunciation_map: {
      tone: [
        "LOL/laugh out loudly"
      ]
    }
  });
  console.log(speechFile);
  // Write the returned audio to disk
  const buffer = Buffer.from(await mp3.arrayBuffer());
  await fs.promises.writeFile(speechFile, buffer);
}

main();
```

```bash theme={null}
curl --location 'https://api.stepfun.ai/v1/audio/speech' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $STEP_API_KEY" \
  --data '{
    "model": "step-tts-2",
    "input": "StepFun is building the next generation of AGI.",
    "voice": "lively-girl"
  }' \
  --output "step.mp3"
```

```python theme={null}
from pathlib import Path

from openai import OpenAI

speech_file_path = Path(__file__).parent / "step-tts.mp3"

client = OpenAI(
    api_key="STEP_API_KEY",  # replace with your API key
    base_url="https://api.stepfun.ai/v1",
)

response = client.audio.speech.create(
    # Use the stepaudio-2.5-tts model
    model="stepaudio-2.5-tts",
    voice="cixingnansheng",
    # Max length: 1,000 characters. Content in () is treated as instructions (not spoken).
    input="(cold laugh) Do you think our technology at StepFun Beijing is a joke?!",
    extra_body={
        "volume": 1.0,  # volume is in extra_body
        # Global instruction for stepaudio-2.5-tts (max 200 characters)
        "instruction": "Extremely angry tone, strong pressure, slightly fast pace",
        # ⚠️ Note: stepaudio-2.5-tts does not support voice_label; do not pass it
        # "voice_label": {
        #     "language": "Cantonese",
        #     "emotion": "Happy",
        #     "style": "Slow"
        # },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)

# Write the returned audio to disk
response.stream_to_file(speech_file_path)
```
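
When `stream_format` is set to `sse`, the client has to reassemble the audio from the `speech.audio.delta` events described in the request body section. The following is a minimal sketch of that step, not an official client. The helper name `decode_sse_audio` is hypothetical, it assumes each chunk is Base64-encoded independently of the others, and the sample events carry made-up bytes rather than real audio.

```python
import base64
import json
from typing import Iterable


def decode_sse_audio(lines: Iterable[str]) -> bytes:
    """Collect Base64 audio chunks from SSE `data:` lines into one byte string."""
    audio = bytearray()
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "speech.audio.delta":
            # Assumption: each chunk decodes on its own
            audio.extend(base64.b64decode(event["audio"]))
        elif kind == "speech.audio.error":
            # The error payload shape is not documented here, so report it whole
            raise RuntimeError(f"TTS stream failed: {event}")
        # speech.audio.done carries an empty `audio` field; nothing to append
    return bytes(audio)


# Offline demonstration with made-up chunks (not real audio)
sample = [
    'data: {"type":"speech.audio.delta","audio":"'
    + base64.b64encode(b"abc").decode() + '"}',
    "",
    'data: {"type":"speech.audio.delta","audio":"'
    + base64.b64encode(b"def").decode() + '"}',
    'data: {"type":"speech.audio.done","audio":""}',
    "data: [DONE]",
]
print(decode_sse_audio(sample))  # b'abcdef'
```

In a real request, the lines would come from a streaming HTTP response instead of a list. With the `requests` library, for example, `response.iter_lines(decode_unicode=True)` yields strings that could be passed to this helper directly.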