This API allows you to generate audio using our Text-to-Speech (TTS) model.

Endpoint

POST https://api.stepfun.ai/v1/audio/speech
For Step Plan, use POST https://api.stepfun.ai/step_plan/v1/audio/speech
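
As a rough illustration of choosing between the two endpoints, the sketch below builds (but does not send) a POST request with the standard library. The helper name build_speech_request is hypothetical, and the Bearer-token Authorization header is an assumption based on the OpenAI-compatible client used in the example at the end of this page.

```python
import json
import urllib.request

STANDARD_URL = "https://api.stepfun.ai/v1/audio/speech"
STEP_PLAN_URL = "https://api.stepfun.ai/step_plan/v1/audio/speech"


def build_speech_request(api_key: str, payload: dict, step_plan: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech endpoint."""
    url = STEP_PLAN_URL if step_plan else STANDARD_URL
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer-token auth, as sent by OpenAI-compatible clients.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request(
    "YOUR_API_KEY",  # placeholder, not a real key
    {"model": "step-tts-2", "input": "Hello.", "voice": "lively-girl"},
    step_plan=True,
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request; that step is left out here so the sketch runs without network access.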

Request body

  • model string required
    The ID of the model to use. Currently supported: step-tts-2, step-tts-mini, and stepaudio-2.5-tts.
    The step-tts-vivid model name is deprecated, but existing requests that use it will continue to work.
  • input string required
    The text to generate audio for. The maximum length is 1,000 characters. When using stepaudio-2.5-tts, content inside parentheses () will be treated as instructions and will not be spoken. If you need the text itself to be spoken, do not wrap it in parentheses.
  • voice string required
    The voice to use for generation. Supports both official voices and custom cloned voices.
  • response_format string optional
    The audio format for the returned output. Supported formats: wav, mp3, flac, opus, pcm. Default: mp3.
  • speed float optional
    The speed of the generated audio. Range: 0.5 to 2.0. Default: 1.0. 0.5 means half speed.
  • volume float optional
    The volume of the generated audio. Range: 0.1 to 2.0. Default: 1.0. 0.1 reduces the volume to 10%; 2.0 increases it to 200%.
  • voice_label object optional
    Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time; combinations are not yet supported.
    • language string optional
      Language. Supported values: Cantonese, Sichuanese, Japanese.
    • emotion string optional
      Emotion tag. Supports up to 11 options such as Happy, Angry, etc. Supported values may vary by model; see voice tags.
    • style string optional
      Style tag. Supports up to 17 speaking rates or delivery styles. Supported values may vary by model; see voice tags.
    ⚠️ Note: The stepaudio-2.5-tts model does not support voice_label; passing it causes an error. With stepaudio-2.5-tts, use the instruction field or inline () prompts in the input text to control emotion and style instead. For other models, see voice tags.
  • instruction string optional
    Global natural language guidance. Only effective when using the stepaudio-2.5-tts model; other models will return an error if this parameter is passed. Used to set the overall emotional tone, character persona, etc. for the entire audio. Maximum length: 200 characters.
  • sample_rate integer optional
    The sampling rate in Hz. Supported values: 8000, 16000, 22050, 24000, 48000. Default: 24000. Higher rates improve audio quality but increase file size. Support for 48000 was added recently.
  • pronunciation_map object array optional
    Defines pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second tone, 3 for the third tone, 4 for the fourth tone, and 5 for the neutral tone.
    • tone string array required
      The pronunciation mapping rules. Each rule pairs the source text with its intended reading, separated by /. Example: ["LOL/laugh out loudly"].
  • stream_format string optional
    How the audio is returned. Supported values: sse, audio. Default: audio, which returns the audio directly. When sse is specified, audio is returned via Server-Sent Events (SSE) with the following data packet format:
    data: {"type":"speech.audio.delta","audio":"<BASE64-encoded audio chunk>"}
    
    data: {"type":"speech.audio.delta","audio":"<BASE64-encoded audio chunk>"}
    
    data: {"type":"speech.audio.done","audio":""}
    
    data: [DONE]
    
    Event types:
    • speech.audio.delta: Audio chunk. The audio field contains the BASE64-encoded binary data of this chunk; concatenate all chunks to form the complete audio.
    • speech.audio.done: Generation complete; audio is an empty string.
    • speech.audio.error: An error occurred during generation.
  • markdown_filter bool optional
    Whether to enable Markdown filtering.
  • return_url bool optional
    Only effective for non-streaming requests. When set to true, returns a URL to the audio file instead of the binary audio stream. The URL is valid for 12 hours.
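
The limits listed above can be checked on the client before a request is sent. The following is a minimal sketch, not an official validator: the function name validate_speech_payload is hypothetical, only the constraints stated in this section are covered, and the server may count characters differently from Python's len().

```python
SUPPORTED_FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 48000}
INSTRUCTION_MODEL = "stepaudio-2.5-tts"


def validate_speech_payload(payload: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    for field in ("model", "input", "voice"):
        if not payload.get(field):
            problems.append(f"{field} is required")
    if len(payload.get("input", "")) > 1000:
        problems.append("input exceeds 1,000 characters")
    if payload.get("response_format", "mp3") not in SUPPORTED_FORMATS:
        problems.append("unsupported response_format")
    if not 0.5 <= payload.get("speed", 1.0) <= 2.0:
        problems.append("speed must be between 0.5 and 2.0")
    if not 0.1 <= payload.get("volume", 1.0) <= 2.0:
        problems.append("volume must be between 0.1 and 2.0")
    if payload.get("sample_rate", 24000) not in SUPPORTED_SAMPLE_RATES:
        problems.append("unsupported sample_rate")

    is_instruction_model = payload.get("model") == INSTRUCTION_MODEL
    label = payload.get("voice_label")
    if label is not None:
        if is_instruction_model:
            problems.append("voice_label is not supported by stepaudio-2.5-tts")
        elif len(label) > 1:
            problems.append("voice_label may set only one of language, emotion, style")
    instruction = payload.get("instruction")
    if instruction is not None:
        if not is_instruction_model:
            problems.append("instruction is only supported by stepaudio-2.5-tts")
        elif len(instruction) > 200:
            problems.append("instruction exceeds 200 characters")
    return problems


print(validate_speech_payload(
    {"model": "step-tts-2", "input": "Hello.", "voice": "lively-girl", "speed": 3.0}
))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass.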

Response

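The audio file, encoded in the requested response_format. With stream_format set to sse, the audio arrives as SSE events instead; with return_url set to true, a URL to the audio file is returned.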
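
For SSE responses, the packet format described under stream_format could be decoded roughly as follows. This is a sketch under assumptions: decode_sse_audio is a hypothetical helper, it takes an iterable of already-received text lines rather than a live connection, and because the payload of speech.audio.error is not specified here, the whole event is surfaced in the exception.

```python
import base64
import json


def decode_sse_audio(lines):
    """Concatenate the audio carried by an iterable of SSE text lines."""
    audio = bytearray()
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(data)
        if event["type"] == "speech.audio.delta":
            # Decode each chunk on its own, then append the raw bytes.
            audio.extend(base64.b64decode(event["audio"]))
        elif event["type"] == "speech.audio.error":
            raise RuntimeError(f"TTS stream error: {event}")
        # speech.audio.done carries an empty audio field; nothing to append.
    return bytes(audio)


# Synthetic stream: the "audio" here is just the bytes b"abc" and b"def".
sample = [
    'data: {"type":"speech.audio.delta","audio":"' + base64.b64encode(b"abc").decode() + '"}',
    "",
    'data: {"type":"speech.audio.delta","audio":"' + base64.b64encode(b"def").decode() + '"}',
    "",
    'data: {"type":"speech.audio.done","audio":""}',
    "",
    "data: [DONE]",
]
print(decode_sse_audio(sample))  # b'abcdef'
```

Each chunk is Base64-decoded separately before concatenation, since joining the encoded strings first can produce invalid Base64 when a chunk ends with padding.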

Examples

from pathlib import Path
from openai import OpenAI

speech_file_path = Path("step-tts.mp3")

client = OpenAI(
    api_key="STEP_API_KEY",  # placeholder: replace with your StepFun API key
    base_url="https://api.stepfun.ai/v1",
)

response = client.audio.speech.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
    extra_body={
        "volume": 1.0,  # volume is in extra_body
        "voice_label": {
            "language": "Cantonese",  # choose one of language / emotion / style
        },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)

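# Save the returned audio; stream_to_file() on this response type is deprecated in recent openai SDK versions.
response.write_to_file(speech_file_path)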
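
For stepaudio-2.5-tts, voice_label is replaced by the instruction field and inline () prompts. The sketch below only builds the keyword arguments for client.audio.speech.create(); the helper name stepaudio_speech_kwargs, the wording of the prompts, and the availability of the lively-girl voice for this model are illustrative assumptions.

```python
def stepaudio_speech_kwargs(text: str, voice: str, instruction: str) -> dict:
    """Build keyword arguments for client.audio.speech.create()."""
    if len(instruction) > 200:
        raise ValueError("instruction is limited to 200 characters")
    return {
        "model": "stepaudio-2.5-tts",
        "voice": voice,
        "input": text,
        # instruction is not a named parameter of the OpenAI SDK method,
        # so it is passed through extra_body, like volume in the example above.
        "extra_body": {"instruction": instruction},
    }


kwargs = stepaudio_speech_kwargs(
    # Text in parentheses is treated as an inline instruction and is not spoken.
    text="(whispering) The results are in. (excited) We won!",
    voice="lively-girl",  # assumption: this voice works with stepaudio-2.5-tts
    instruction="A warm, friendly narrator.",
)
print(kwargs)
```

With the client from the example above, the request would then be sent as client.audio.speech.create(**kwargs).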