
Text-to-Speech

This API allows you to generate audio using our Text-to-Speech (TTS) model.

Endpoint

POST https://api.stepfun.ai/v1/audio/speech

Request body

  • model string required
    The ID of the model to use. Currently, only step-tts-2 is supported for the overseas region.

  • input string required
    The text to generate audio for. The maximum length is 10,000 characters.

  • voice string required
    The voice to use for generation. Supports both official voices and custom cloned voices.

  • response_format string optional
    The audio format for the returned output. Supported formats: mp3 (default), opus, aac, flac, wav, pcm.

  • speed float optional
    The speed of the generated audio. Select a value from 0.5 to 2.0. Default is 1.0.

  • volume float optional
    The volume of the generated audio. The valid range is 0.1 to 2.0, with a default value of 1.0. 0.1 reduces the volume to 10%, while 2.0 increases it to 200%.

  • voice_label object optional
    Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time.

    • language string optional
      Language. Supports Cantonese, Sichuanese, and Japanese. If not specified, the system automatically detects whether the input text is English or Chinese.
    • emotion string optional
      Emotion. Supports up to 11 options, such as Happy and Angry.
    • style string optional
      Speaking style. Supports up to 17 speaking rates or delivery styles.
  • sample_rate integer optional
    The sampling rate. Supports 8000, 16000, 22050, 24000. Default is 24000. Higher rates improve quality but increase file size.

  • pronunciation_map object optional
    Defines pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by the numbers 1–5.

    • tone string array required
      Pronunciation mapping rules, each written as a source and its reading separated by /. Example: ["omg/oh my god"].
  • stream_format string optional
    Streaming mode. Supported values: audio (default), sse. With audio, the binary audio is returned directly; with sse, the audio is returned incrementally via Server-Sent Events (SSE).
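
The documented constraints on the request body can be checked client-side before sending. The sketch below is illustrative (the helper name and structure are not part of the API); only the field names, allowed values, and ranges come from the parameter list above:

```python
# Client-side validation sketch for the /v1/audio/speech request body.
# Field names, defaults, and ranges follow the parameter list above;
# the helper itself is illustrative, not part of the API.

ALLOWED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 24000}

def build_tts_payload(
    model: str,
    input_text: str,
    voice: str,
    response_format: str = "mp3",
    speed: float = 1.0,
    volume: float = 1.0,
    sample_rate: int = 24000,
) -> dict:
    """Build a request body for POST /v1/audio/speech, enforcing documented limits."""
    if len(input_text) > 10_000:
        raise ValueError("input must be at most 10,000 characters")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    return {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "volume": volume,
        "sample_rate": sample_rate,
    }

payload = build_tts_payload("step-tts-2", "Hello!", "lively-girl", speed=1.25)
```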

Response

The generated audio, in the format specified by response_format.

Examples

from pathlib import Path

from openai import OpenAI

speech_file_path = Path("step-tts.mp3")
client = OpenAI(
    api_key="STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)
response = client.audio.speech.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
    extra_body={
        "volume": 1.0,  # volume is passed via extra_body
        "voice_label": {
            "language": "Cantonese",  # set only one of language / emotion / style
        },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)
response.stream_to_file(speech_file_path)
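When stream_format is set to sse, the response arrives as Server-Sent Events rather than a single audio body. The sketch below implements only the generic SSE wire framing (data: lines accumulate into one event, a blank line terminates it); the schema of each event's payload is not specified above, so decoding it into audio bytes is left as a placeholder:

```python
from typing import Iterable, Iterator

def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event.

    Generic SSE framing only: consecutive 'data:' lines accumulate into
    one event, and a blank line terminates the event. How each payload
    is decoded (e.g. into audio bytes) depends on the service.
    """
    buf: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            buf.append(line[len("data:"):].lstrip(" "))
        elif line == "" and buf:
            yield "\n".join(buf)
            buf = []
    if buf:  # stream ended without a trailing blank line
        yield "\n".join(buf)

# Example with a synthetic stream:
events = list(iter_sse_data(["data: chunk-1", "", "data: chunk-2", ""]))
# events == ["chunk-1", "chunk-2"]
```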