# Text-to-Speech

This API generates audio from text using StepFun's Text-to-Speech (TTS) models.

### Endpoint

`POST https://api.stepfun.ai/v1/audio/speech`

<Note>
  For Step Plan, use `POST https://api.stepfun.ai/step_plan/v1/audio/speech`
</Note>

### Request body

* `model` `string` ***required***<br />The ID of the model to use. Currently supports `step-tts-2` and `stepaudio-2.5-tts`.

  <Note>The `step-tts-vivid` model name is deprecated but existing user requests will continue to be supported.</Note>

* `input` `string` ***required***<br />The text to generate audio for. The maximum length is 1,000 characters. When using `stepaudio-2.5-tts`, content inside parentheses `()` will be treated as instructions and will not be spoken. If you need the text itself to be spoken, do not wrap it in parentheses.

* `voice` `string` ***required***<br />The voice to use for generation. Supports both [official voices](/en/guides/developer/tts#system-voice-id-list) and custom cloned voices.

* `response_format` `string` ***optional***<br />The audio format for the returned output. Supported formats: `wav`, `mp3`, `flac`, `opus`, `pcm`. Default: `mp3`.

* `speed` `float` ***optional***<br />The speed of the generated audio. Range: 0.5 to 2.0. Default: 1.0. `0.5` is half speed; `2.0` is double speed.

* `volume` `float` ***optional***<br />The volume of the generated audio. Range: 0.1 to 2.0. Default: 1.0. 0.1 reduces the volume to 10%; 2.0 increases it to 200%.

* `voice_label` `object` ***optional***<br />Voice tags. Required when using a custom voice. Only one of `language`, `emotion`, or `style` can be set at a time; combinations are not yet supported.
  * `language` `string` ***optional***<br />Language. Supported values: `Cantonese`, `Sichuanese`, `Japanese`.
  * `emotion` `string` ***optional***<br />Emotion tag. Supports up to 11 options, such as `Happy` and `Angry`. Supported values may vary by model; see [voice tags](/en/guides/developer/tts#voice-tags-list).
  * `style` `string` ***optional***<br />Supports up to 17 speaking rates or delivery styles. Supported values may vary by model; see [voice tags](/en/guides/developer/tts#voice-tags-list).

<Warning>
  The `stepaudio-2.5-tts` model does not support the `voice_label` field; passing it causes an error. With `stepaudio-2.5-tts`, use the `instruction` field or inline `()` prompts in the text to control emotion and style instead. For other models, see [voice tags](/en/guides/developer/tts#voice-tags-list).
</Warning>
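
The constraints in this section (input length, the model restrictions on `voice_label` and `instruction`, and the `speed` and `volume` ranges) can be checked on the client before a request is sent. The following is a minimal sketch, not part of any StepFun SDK: `build_speech_body` and its constants are hypothetical names, and it assumes Python's `len()` counts characters the same way the service does, which this page does not state.

```python
from typing import Any

MAX_INPUT_CHARS = 1000        # documented limit for `input`
MAX_INSTRUCTION_CHARS = 200   # documented limit for `instruction`
INSTRUCTION_MODEL = "stepaudio-2.5-tts"
VOICE_LABEL_KEYS = {"language", "emotion", "style"}


def build_speech_body(model: str, text: str, voice: str, **options: Any) -> dict[str, Any]:
    """Hypothetical helper: check the documented constraints and return a JSON-ready body."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")

    voice_label = options.get("voice_label")
    instruction = options.get("instruction")

    if model == INSTRUCTION_MODEL:
        # This model rejects voice_label and takes `instruction` instead.
        if voice_label is not None:
            raise ValueError(f"{model} does not support voice_label; use instruction")
        if instruction is not None and len(instruction) > MAX_INSTRUCTION_CHARS:
            raise ValueError(f"instruction exceeds {MAX_INSTRUCTION_CHARS} characters")
    else:
        if instruction is not None:
            raise ValueError(f"instruction is only supported by {INSTRUCTION_MODEL}")
        if voice_label is not None:
            # Only one of language / emotion / style may be set at a time.
            if len(voice_label) != 1 or not set(voice_label) <= VOICE_LABEL_KEYS:
                raise ValueError("voice_label takes exactly one of language, emotion, style")

    if not 0.5 <= options.get("speed", 1.0) <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= options.get("volume", 1.0) <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")

    return {"model": model, "input": text, "voice": voice, **options}


body = build_speech_body(
    "step-tts-2",
    "StepFun is building the next generation of AGI.",
    "lively-girl",
    volume=1.0,
    voice_label={"language": "Cantonese"},
)
print(body)
```

The returned dictionary can be sent as the JSON body of the `POST` request. The server remains the authority on what is valid; this check only catches the cases described on this page.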

* `instruction` `string` ***optional***<br />A natural-language instruction applied to the entire audio, used to set the overall emotional tone, character persona, and so on. Only supported by the `stepaudio-2.5-tts` model; other models return an error if this parameter is passed. Maximum length: 200 characters.

* `sample_rate` `integer` ***optional***<br />The sampling rate. Supports `8000`, `16000`, `22050`, `24000`, `48000`. Default: `24000`. Higher rates improve audio quality but increase file size. `48000` was added in recent iterations.

* `pronunciation_map` `object array` ***optional*** <br />Pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are written as numbers: 1 for the first tone, 2 for the second tone, 3 for the third tone, 4 for the fourth tone, and 5 for the neutral tone.
  * `tone` `string` ***required*** <br /> Pronunciation mapping rules. Each rule places the text to match and its intended reading on either side of a `/`. Example: `["LOL/laugh out loudly"]`.

* `stream_format` `string` ***optional*** <br /> How the audio is returned. Supported values: `audio` (default), which returns the audio data directly, and `sse`, which returns the audio via Server-Sent Events (SSE) using the following data packet format:

  ```text theme={null}
  data: {"type":"speech.audio.delta","audio":"<BASE64-encoded audio chunk>"}

  data: {"type":"speech.audio.delta","audio":"<BASE64-encoded audio chunk>"}

  data: {"type":"speech.audio.done","audio":""}

  data: [DONE]
  ```

  Event types:

  * `speech.audio.delta`: Audio chunk. The `audio` field contains the Base64-encoded binary data of this chunk; decode each chunk and concatenate the results in order to form the complete audio.
  * `speech.audio.done`: Generation complete; `audio` is an empty string.
  * `speech.audio.error`: An error occurred during generation.

* `markdown_filter` `bool` ***optional*** <br /> Whether to enable Markdown filtering.

* `return_url` `bool` ***optional*** <br /> Only effective for non-streaming requests. When set to `true`, returns a URL to the audio file instead of the binary audio stream. The URL is valid for 12 hours.
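
The `sse` packet format described under `stream_format` can be consumed with a small line parser. The following is a minimal sketch under stated assumptions: `collect_sse_audio` is a hypothetical helper name, the input is any iterable of already-decoded text lines, and because the fields of a `speech.audio.error` event are not documented here, the whole event is surfaced in the exception.

```python
import base64
import json
from typing import Iterable


def collect_sse_audio(lines: Iterable[str]) -> bytes:
    """Decode `speech.audio.delta` events from SSE lines and join the audio bytes."""
    chunks: list[bytes] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "speech.audio.delta":
            # Decode each chunk on its own, then join the raw bytes.
            chunks.append(base64.b64decode(event["audio"]))
        elif kind == "speech.audio.error":
            raise RuntimeError(f"TTS stream failed: {event}")
        # speech.audio.done carries an empty `audio` string, so there is nothing to append
    return b"".join(chunks)


# Synthetic stream in the documented packet format (the payload bytes are made up).
sample = [
    'data: {"type":"speech.audio.delta","audio":"' + base64.b64encode(b"abc").decode() + '"}',
    "",
    'data: {"type":"speech.audio.delta","audio":"' + base64.b64encode(b"def").decode() + '"}',
    "",
    'data: {"type":"speech.audio.done","audio":""}',
    "",
    "data: [DONE]",
]
print(collect_sse_audio(sample))  # b'abcdef'
```

Decoding chunk by chunk, rather than joining the Base64 strings first, avoids problems with padding characters that may appear at the end of individual chunks. With an HTTP client that exposes the response as lines, such as `iter_lines(decode_unicode=True)` in `requests`, the line iterator can be passed to the function directly.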

### Response

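The audio file, in the format set by `response_format`. If `stream_format` is `sse`, the audio is delivered as the SSE events described above. If `return_url` is `true`, the response contains a URL to the audio file instead of the binary audio.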

### Examples

<Tabs>
  <Tab title="python">
    ```python theme={null}
    from pathlib import Path
    from openai import OpenAI

    speech_file_path = Path("step-tts.mp3")

    client = OpenAI(
        api_key="STEP_API_KEY",
        base_url="https://api.stepfun.ai/v1",
    )

    response = client.audio.speech.create(
        model="step-tts-2",
        voice="lively-girl",
        input="StepFun is building the next generation of AGI.",
        extra_body={
            "volume": 1.0,  # volume is in extra_body
            "voice_label": {
                "language": "Cantonese",  # choose one of language / emotion / style
            },
            "pronunciation_map": {
                "tone": [
                    "LOL/laugh out loudly",
                ],
            },
        },
    )

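    # write_to_file saves the full response body; stream_to_file is deprecated in recent openai SDK releases
    response.write_to_file(speech_file_path)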
    ```
  </Tab>

  <Tab title="js">
    ```js theme={null}
    import OpenAI from "openai";
    import fs from "fs";
    import path from "path";

    const STEP_API_KEY = "STEP_API_KEY";
    const STEP_API_MODEL = "step-tts-2";

    const openai = new OpenAI({
        apiKey: STEP_API_KEY,
        baseURL: "https://api.stepfun.ai/v1"
    });

    async function main() {
        const speechFile = path.resolve("./speech.mp3");
        const mp3 = await openai.audio.speech.create({
            model: STEP_API_MODEL,
            voice: "lively-girl",
            input: "StepFun is building the next generation of AGI.",
            // The Node SDK has no extra_body option; extra fields in this object are sent as-is in the request body
            volume: 2.0,
            voice_label: {
                language: "Cantonese" // Optional: choose one of language/emotion/style
            },
            pronunciation_map: {
                tone: [
                    "LOL/laugh out loudly"
                ]
            }
        });
        console.log(speechFile);
        const buffer = Buffer.from(await mp3.arrayBuffer());
        await fs.promises.writeFile(speechFile, buffer);
    }

    main();
    ```
  </Tab>

  <Tab title="curl">
    ```bash theme={null}
    curl --location 'https://api.stepfun.ai/v1/audio/speech' \
      --header 'Content-Type: application/json' \
      --header "Authorization: Bearer $STEP_API_KEY" \
      --data '{
        "model": "step-tts-2",
        "input": "StepFun is building the next generation of AGI.",
        "voice": "lively-girl"
      }' \
      --output "step.mp3"
    ```
  </Tab>

  <Tab title="stepaudio-2.5-tts (python)">
    ```python theme={null}
    from pathlib import Path
    from openai import OpenAI

    speech_file_path = Path(__file__).parent / "step-tts.mp3"

    client = OpenAI(
        api_key="STEP_API_KEY",
        base_url="https://api.stepfun.ai/v1"
    )
    response = client.audio.speech.create(
        model="stepaudio-2.5-tts",
        voice="cixingnansheng",
        # input is limited to 1,000 characters; content in () is treated as an instruction and is not spoken
        input="(cold laugh) Do you think our technology at StepFun Beijing is a joke?!",
        extra_body={
            "volume": 1.0,  # volume is in extra_body
            # Global instruction for stepaudio-2.5-tts (max 200 characters)
            "instruction": "Extremely angry tone, strong pressure, slightly fast pace",

            # stepaudio-2.5-tts does not support voice_label; passing it causes an error

            "pronunciation_map": {
                "tone": [
                    "LOL/laugh out loudly"
                ]
            }
        }
    )
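    # write_to_file saves the full response body; stream_to_file is deprecated in recent openai SDK releases
    response.write_to_file(speech_file_path)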
    ```
  </Tab>
</Tabs>
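
Because `input` is limited to 1,000 characters, longer text has to be sent as several requests. The following is a minimal sketch of one way to split text at sentence boundaries; `split_for_tts` is a hypothetical helper, not part of the API, and it assumes Python's `len()` counts characters the same way the service does.

```python
import re

MAX_INPUT_CHARS = 1000  # documented limit for `input`

# A piece ends after Latin punctuation plus whitespace, after CJK punctuation,
# or at the end of the text. Trailing whitespace stays attached to its piece.
_SENTENCE = re.compile(r".+?(?:[.!?]\s+|[。！？]|\Z)", re.S)


def split_for_tts(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Greedily pack sentences into chunks of at most `limit` characters."""
    chunks: list[str] = []
    current = ""
    for piece in _SENTENCE.findall(text):
        # Fallback: a single sentence longer than the limit is cut at the limit.
        while len(piece) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        if len(current) + len(piece) > limit:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


long_text = "StepFun is building the next generation of AGI. " * 50
for i, chunk in enumerate(split_for_tts(long_text)):
    print(i, len(chunk))
```

Each chunk can then be passed as `input` in its own request, as in the examples above. The splitter does not know about inline `()` instructions for `stepaudio-2.5-tts`, so a parenthesized instruction that contains sentence punctuation or crosses a chunk boundary may be cut; joining the resulting audio files is format-specific and is not covered here.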
