This API allows you to generate audio using our Text-to-Speech (TTS) model.

Endpoint

POST https://api.stepfun.ai/v1/audio/speech

Request body

  • model string required
    The ID of the model to use. Currently, only step-tts-2 is supported for the overseas region.
  • input string required
    The text to generate audio for. The maximum length is 10,000 characters.
  • voice string required
    The voice to use for generation. Supports both official voices and custom cloned voices.
  • response_format string optional
    The audio format for the returned output. Supported formats: mp3 (default), opus, aac, flac, wav, pcm.
  • speed float optional
    The speed of the generated audio. The valid range is 0.5 to 2.0, with a default value of 1.0.
  • volume float optional
    The volume of the generated audio. The valid range is 0.1 to 2.0, with a default value of 1.0. 0.1 reduces the volume to 10%, while 2.0 increases it to 200%.
  • voice_label object optional
    Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time.
    • language string optional
      Language. Supports Cantonese, Sichuanese, and Japanese. If not specified, the system automatically determines whether the input text is English or Chinese.
    • emotion string optional
      Emotion. Supports up to 11 options, such as Happy and Angry. Supported emotion and style labels can be found in the Voice Label Reference.
    • style string optional
      Style. Supports up to 17 speaking rates or delivery styles. Supported emotion and style labels can be found in the Voice Label Reference.
  • sample_rate integer optional
    The sampling rate in Hz. Supported values: 8000, 16000, 22050, 24000, and 48000. Default is 24000. Higher rates improve quality but increase file size.
  • pronunciation_map object array optional
    Defines a pronunciation rule to annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by numbers 1–5.
    • tone string required
      Pronunciation mapping rules, each written as the original text and its intended reading separated by /. Rules are passed as a list of strings. Example: ["omg/oh my god"].
  • stream_format string optional
    Output delivery mode. Supported values: audio (default), which returns the audio directly, and sse, which returns the audio via Server-Sent Events (SSE).
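
The ranges and allowed values listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: build_speech_payload and its constants are hypothetical names, and it assumes voice_label may carry at most one of the three keys, as described above.

```python
from __future__ import annotations

from typing import Any, Optional

# Limits copied from the parameter list above.
MAX_INPUT_CHARS = 10_000
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 48000}
STREAM_FORMATS = {"audio", "sse"}
LABEL_KEYS = {"language", "emotion", "style"}


def build_speech_payload(
    input_text: str,
    voice: str,
    model: str = "step-tts-2",
    response_format: str = "mp3",
    speed: float = 1.0,
    volume: float = 1.0,
    sample_rate: int = 24000,
    stream_format: str = "audio",
    voice_label: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate arguments and return the JSON request body."""
    if not input_text or len(input_text) > MAX_INPUT_CHARS:
        raise ValueError("input must be 1 to 10,000 characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if stream_format not in STREAM_FORMATS:
        raise ValueError(f"unsupported stream_format: {stream_format}")

    payload: dict[str, Any] = {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "volume": volume,
        "sample_rate": sample_rate,
        "stream_format": stream_format,
    }
    if voice_label is not None:
        # Only one of language / emotion / style may be set at a time.
        if not set(voice_label) <= LABEL_KEYS or len(voice_label) > 1:
            raise ValueError("voice_label takes one of language, emotion, style")
        payload["voice_label"] = voice_label
    return payload
```

Calling build_speech_payload("Hello", "lively-girl", voice_label={"emotion": "Happy"}) returns a dictionary that can be serialized as the request body, while an out-of-range value such as speed=3.0 raises ValueError before any network call is made.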

Response

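The generated audio, in the format given by response_format. When stream_format is sse, the audio is delivered via Server-Sent Events instead.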
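
For callers who do not use an SDK, the request and response can be handled with the Python standard library alone. This is a sketch under assumptions: the Bearer authorization header is inferred from the endpoint's compatibility with the OpenAI client and is not stated in this reference, and make_speech_request and fetch_speech are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "https://api.stepfun.ai/v1/audio/speech"


def make_speech_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the speech endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer auth, as sent by the OpenAI client.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_speech(api_key: str, payload: dict, out_path: str) -> int:
    """Send the request, write the returned audio bytes, return the byte count."""
    request = make_speech_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the default stream_format of audio, the response body is the audio itself, so writing the bytes to a file with a matching extension (for example .mp3) is enough.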

Examples

from pathlib import Path
from openai import OpenAI

speech_file_path = Path("step-tts.mp3")

client = OpenAI(
  api_key="STEP_API_KEY",  # placeholder: replace with your StepFun API key
  base_url="https://api.stepfun.ai/v1",
)

response = client.audio.speech.create(
  model="step-tts-2",
  voice="lively-girl",
  input="StepFun is building the next generation of AGI.",
  extra_body={
      "volume": 1.0,  # parameters not in the OpenAI SDK signature go in extra_body
      "voice_label": {
          "language": "Cantonese",  # choose one of language / emotion / style
      },
      "pronunciation_map": {
          "tone": [
              "LOL/laugh out loud",
          ],
      },
  },
)

response.stream_to_file(speech_file_path)
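
For long inputs, the openai Python package also provides a streaming-response variant that writes audio to disk as it arrives rather than buffering the whole body first. The sketch below assumes the endpoint behaves like the OpenAI speech endpoint in this respect, which this reference does not confirm. synthesize_to_file is a hypothetical helper that takes an already configured client such as the one created above.

```python
from pathlib import Path


def synthesize_to_file(client, text: str, out_path: Path, voice: str = "lively-girl") -> Path:
    """Stream synthesized audio to out_path using an OpenAI-style client."""
    # with_streaming_response defers reading the body until it is consumed.
    with client.audio.speech.with_streaming_response.create(
        model="step-tts-2",
        voice=voice,
        input=text,
        extra_body={"volume": 1.0},  # StepFun-specific parameters go here
    ) as response:
        response.stream_to_file(out_path)
    return out_path
```

A call such as synthesize_to_file(client, "StepFun is building the next generation of AGI.", Path("step-tts.mp3")) produces the same file as the example above.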