Text-to-Speech

This API allows you to generate audio using our Text-to-Speech (TTS) model.

Endpoint

POST https://api.stepfun.ai/v1/audio/speech

For Step Plan, use POST https://api.stepfun.ai/step_plan/v1/audio/speech

Request body

model string required
The ID of the model to use. Currently supports step-tts-2, step-tts-mini, and stepaudio-2.5-tts.
The step-tts-vivid model name is deprecated, but requests from existing users that specify it will continue to be supported.
input string required
The text to generate audio for. The maximum length is 1,000 characters. When using stepaudio-2.5-tts, content inside parentheses () will be treated as instructions and will not be spoken. If you need the text itself to be spoken, do not wrap it in parentheses.
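Because input is capped at 1,000 characters, longer text has to be sent as several requests. The following is a minimal sketch of one way to split text on sentence boundaries. The helper name split_for_tts is hypothetical, and it assumes the limit counts characters the way Python's len() does, which this page does not confirm.

```python
import re

MAX_INPUT_CHARS = 1000  # limit stated for the `input` field


def split_for_tts(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Sentences stay whole where possible; a single sentence longer than
    `limit` is cut at the limit. Joining the chunks reproduces the input.
    """
    # Each match is one sentence plus its trailing whitespace (ASCII and CJK punctuation).
    sentences = re.findall(r".+?(?:[.!?。！？]+\s*|\Z)", text, flags=re.S)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in a single request.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as input in its own request. With stepaudio-2.5-tts, note that a hard cut could land inside a parenthesized instruction, so such text may need manual splitting.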
voice string required
The voice to use for generation. Supports both official voices and custom cloned voices.
response_format string optional
The audio format for the returned output. Supported formats: wav, mp3, flac, opus, pcm. Default: mp3.
speed float optional
The speed of the generated audio. Range: 0.5 to 2.0. Default: 1.0. 0.5 means half speed; 2.0 means double speed.
volume float optional
The volume of the generated audio. Range: 0.1 to 2.0. Default: 1.0. 0.1 reduces the volume to 10%; 2.0 increases it to 200%.
voice_label object optional
Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time; combinations are not yet supported.
- language string optional
  Language. Supported values: Cantonese, Sichuanese, Japanese.
- emotion string optional
  Emotion tag. Supports up to 11 options such as Happy, Angry, etc. Supported values may vary by model; see voice tags.
- style string optional
  Supports up to 17 speaking rates or delivery styles. Supported values may vary by model; see voice tags.

⚠️ Note: The stepaudio-2.5-tts model does not support this field. Passing voice_label will cause an error. If you are using stepaudio-2.5-tts, use the instruction field or inline () prompts in the text to control emotion and style instead. For other models, see voice tags.

instruction string optional
Global natural language guidance. Only effective when using the stepaudio-2.5-tts model; other models will return an error if this parameter is passed. Used to set the overall emotional tone, character persona, etc. for the entire audio. Maximum length: 200 characters.
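Since voice_label and instruction are each rejected by some models, a client-side check before sending can catch the mismatch early. The following is a rough sketch of the rules described above; the function name check_style_fields and its message strings are hypothetical and not part of the API.

```python
from typing import Optional

VOICE_LABEL_KEYS = {"language", "emotion", "style"}
INSTRUCTION_MODEL = "stepaudio-2.5-tts"
MAX_INSTRUCTION_CHARS = 200


def check_style_fields(
    model: str,
    voice_label: Optional[dict] = None,
    instruction: Optional[str] = None,
) -> list:
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    if model == INSTRUCTION_MODEL:
        # stepaudio-2.5-tts: instruction is allowed, voice_label is not.
        if voice_label is not None:
            problems.append("stepaudio-2.5-tts does not support voice_label")
        if instruction is not None and len(instruction) > MAX_INSTRUCTION_CHARS:
            problems.append("instruction exceeds 200 characters")
    else:
        # Other models: voice_label is allowed, instruction is not.
        if instruction is not None:
            problems.append("instruction is only supported by stepaudio-2.5-tts")
        if voice_label is not None:
            unknown = set(voice_label) - VOICE_LABEL_KEYS
            if unknown:
                problems.append(f"unknown voice_label keys: {sorted(unknown)}")
            if len(voice_label) > 1:
                problems.append("set only one of language, emotion, or style")
    return problems
```

For example, check_style_fields("step-tts-2", voice_label={"language": "Cantonese"}) returns an empty list, while passing the same voice_label with stepaudio-2.5-tts returns one problem.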
sample_rate integer optional
The sampling rate. Supports 8000, 16000, 22050, 24000, 48000. Default: 24000. Higher rates improve audio quality but increase file size. 48000 was added in recent iterations.
pronunciation_map object array optional
Pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second tone, 3 for the third tone, 4 for the fourth tone, and 5 for the neutral tone.
- tone string required
  Pronunciation mapping rules. Each rule gives the original text and its intended reading, separated by /. Example: ["LOL/laugh out loudly"].
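As an illustration, the rules can be assembled from text-to-reading pairs. This sketch assumes the payload shape used in the examples on this page ({"tone": [...]}) and that the text before the / is what appears in input while the text after it is the intended reading; the helper name build_pronunciation_map is hypothetical.

```python
def build_pronunciation_map(rules: dict) -> dict:
    """Turn {text: reading} pairs into {"tone": ["text/reading", ...]}."""
    tone = []
    for source, reading in rules.items():
        # "/" is the separator, so it cannot appear inside either side of a rule.
        if "/" in source or "/" in reading:
            raise ValueError(f"rule for {source!r} must not contain '/'")
        tone.append(f"{source}/{reading}")
    return {"tone": tone}
```

The result can be passed as the pronunciation_map value, for instance build_pronunciation_map({"LOL": "laugh out loudly"}).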
stream_format string optional
The format of the returned stream. Supported values: sse, audio. Default: audio, which returns the audio data directly. When sse is specified, audio is returned via Server-Sent Events (SSE) in the following data packet format:
```
data: {"type":"speech.audio.delta","audio":"<BASE64-encoded audio chunk>"}

data: {"type":"speech.audio.delta","audio":"<BASE64-encoded audio chunk>"}

data: {"type":"speech.audio.done","audio":""}

data: [DONE]
```
Event types:
- speech.audio.delta: Audio chunk. The audio field contains the Base64-encoded binary data of this chunk; decode each chunk and concatenate the decoded bytes in order to form the complete audio.
- speech.audio.done: Generation complete; audio is an empty string.
- speech.audio.error: An error occurred during generation.
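The event handling above can be sketched as a small parser over the lines of an SSE response. This is an illustrative sketch, not an official client: the function name collect_sse_audio is hypothetical, and because this page does not document the fields of a speech.audio.error event, the whole event is included in the raised error.

```python
import base64
import json
from typing import Iterable


def collect_sse_audio(lines: Iterable[str]) -> bytes:
    """Decode speech.audio.delta events from SSE lines and join the audio bytes."""
    audio = bytearray()
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end of stream
        event = json.loads(payload)
        kind = event.get("type")
        if kind == "speech.audio.delta":
            # Decode each chunk on its own, then append the raw bytes in order.
            audio.extend(base64.b64decode(event["audio"]))
        elif kind == "speech.audio.error":
            raise RuntimeError(f"TTS stream failed: {event}")
        # speech.audio.done carries an empty audio string; nothing to append.
    return bytes(audio)
```

The function accepts any iterable of text lines, so it could be fed from an HTTP client that yields decoded lines of a streamed response.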
markdown_filter bool optional
Whether to enable Markdown filtering.
return_url bool optional
Only effective for non-streaming requests. When set to true, returns a URL to the audio file instead of the binary audio stream. The URL is valid for 12 hours.

Response

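The audio file content, encoded in the requested response_format. When stream_format is sse, the audio arrives as SSE events instead. When return_url is true on a non-streaming request, the response contains a URL to the audio file rather than the binary audio.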

Examples

Python

from pathlib import Path
from openai import OpenAI

speech_file_path = Path("step-tts.mp3")

client = OpenAI(
    api_key="STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.audio.speech.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
    extra_body={
        "volume": 1.0,  # volume is in extra_body
        "voice_label": {
            "language": "Cantonese",  # choose one of language / emotion / style
        },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)

response.stream_to_file(speech_file_path)

JavaScript

import OpenAI from "openai";
import fs from "fs";
import path from "path";

const STEP_API_KEY = "STEP_API_KEY";
const STEP_API_MODEL = "step-tts-2";

const openai = new OpenAI({
    apiKey: STEP_API_KEY,
    baseURL: "https://api.stepfun.ai/v1"
});

async function main() {
    const speechFile = path.resolve("./speech.mp3");
    const mp3 = await openai.audio.speech.create({
        model: STEP_API_MODEL,
        voice: "lively-girl",
        input: "StepFun is building the next generation of AGI.",
        extra_body: {
            volume: 2.0, // volume is in extra_body
            voice_label: {
                language: "Cantonese" // Optional: choose one of language/emotion/style
            },
            pronunciation_map: {
                tone: [
                    "LOL/laugh out loudly"
                ]
            }
        }
    });
    console.log(speechFile);
    const buffer = Buffer.from(await mp3.arrayBuffer());
    await fs.promises.writeFile(speechFile, buffer);
}

main();

curl

curl --location 'https://api.stepfun.ai/v1/audio/speech' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $STEP_API_KEY" \
  --data '{
    "model": "step-tts-2",
    "input": "StepFun is building the next generation of AGI.",
    "voice": "lively-girl"
  }' \
  --output "step.mp3"

Python (stepaudio-2.5-tts)

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "step-tts.mp3"

client = OpenAI(
    api_key="STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1"
)
response = client.audio.speech.create(
    # Use the stepaudio-2.5-tts model
    model="stepaudio-2.5-tts",
    voice="cixingnansheng",
    # Maximum length is 1,000 characters. Content in () is treated as instructions and is not spoken.
    input="(cold laugh) Do you think our technology at StepFun Beijing is a joke?!",
    extra_body={
        "volume": 1.0,  # volume is in extra_body
        # Global instruction, stepaudio-2.5-tts only (max 200 characters)
        "instruction": "Extremely angry tone, strong pressure, slightly fast pace",

        # ⚠️ Note: stepaudio-2.5-tts does not support voice_label; do not pass it
        # "voice_label": {
        #     "language": "Cantonese",
        #     "emotion": "Happy",
        #     "style": "Slow"
        # },

        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly"
            ]
        }
    }
)
response.stream_to_file(speech_file_path)
