Generate a preview audio clip from a reference audio file (WAV or MP3) to quickly verify the voice cloning result. This endpoint does not create a permanent voice asset; it is only for previewing.

Endpoint

POST https://api.stepfun.ai/v1/audio/voices/preview
For Step Plan, use POST https://api.stepfun.ai/step_plan/v1/audio/voices/preview

Request parameters

  • model string required
    Model to use for cloning. Options: step-tts-2, step-tts-mini, stepaudio-2.5-tts.
  • file_id string required
    Reference audio file ID. Obtain it by uploading the audio through the file upload endpoint with purpose set to storage.
  • text string optional
    Transcript of the reference audio. If omitted, automatic speech recognition is used. For best results, we recommend providing the transcript.
  • sample_text string required
    Text to synthesize for the preview. Recommended length: under 50 characters.
  • response_format string optional
    Audio format for the response. Options: wav, mp3, flac, opus, pcm. Default: mp3.
  • speed float optional
    Speaking rate. Range: 0.5–2.0. Default: 1.0.
  • volume float optional
    Volume level. Range: 0.1–2.0. Default: 1.0.
  • voice_label object optional
    Voice style tags for emotion and style control. Only one of language, emotion, or style may be set at a time.
    • language string optional
      Language or dialect. Options: Cantonese, Sichuan dialect, Japanese.
    • emotion string optional
      Emotion tag. See voice tags for supported options per model.
    • style string optional
      Speaking or delivery style. See voice tags for supported options per model.
Note: stepaudio-2.5-tts does not support the voice_label parameter. Use the instruction field, or inline instructions written in parentheses () within the text, instead.
  • instruction string optional
    Global natural language guidance. Only effective with stepaudio-2.5-tts; other models will return an error if this parameter is provided. Sets the overall emotional tone and character for the audio. Max length: 200 characters.
  • sample_rate integer optional
    Sample rate in Hz. Options: 8000, 16000, 22050, 24000, 48000. Default: 24000. Higher values produce better quality but larger files.
  • pronunciation_map object array optional
    Custom pronunciation rules for specific characters or symbols.
    • tone string required
      Pronunciation mapping, with the original text and its intended pronunciation separated by /. Example: ["word/wɜːrd"].
  • markdown_filter bool optional
    Whether to enable Markdown filtering for the input text.
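
The constraints listed above (allowed models, numeric ranges, and the mutual exclusion between voice_label and instruction) can be checked on the client before sending a request. The following is a minimal sketch, not part of the StepFun API: build_preview_payload and the constant names are hypothetical helpers, and the checks only mirror what this page states.

```python
from typing import Any

# Values copied from the parameter list on this page.
MODELS = {"step-tts-2", "step-tts-mini", "stepaudio-2.5-tts"}
RESPONSE_FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 48000}
VOICE_LABEL_KEYS = {"language", "emotion", "style"}
INSTRUCTION_MODEL = "stepaudio-2.5-tts"
OPTIONAL_KEYS = {
    "text", "response_format", "speed", "volume", "voice_label",
    "instruction", "sample_rate", "pronunciation_map", "markdown_filter",
}


def build_preview_payload(model: str, file_id: str, sample_text: str,
                          **options: Any) -> dict:
    """Hypothetical helper: validate documented constraints, return the JSON body."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if not file_id or not sample_text:
        raise ValueError("file_id and sample_text are required")
    unknown = set(options) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    # Range and enum checks; omitted options fall back to the documented defaults.
    if not 0.5 <= options.get("speed", 1.0) <= 2.0:
        raise ValueError("speed must be within 0.5-2.0")
    if not 0.1 <= options.get("volume", 1.0) <= 2.0:
        raise ValueError("volume must be within 0.1-2.0")
    if options.get("response_format", "mp3") not in RESPONSE_FORMATS:
        raise ValueError("unsupported response_format")
    if options.get("sample_rate", 24000) not in SAMPLE_RATES:
        raise ValueError("unsupported sample_rate")

    # voice_label: not supported by stepaudio-2.5-tts, at most one tag at a time.
    label = options.get("voice_label")
    if label is not None:
        if model == INSTRUCTION_MODEL:
            raise ValueError("stepaudio-2.5-tts does not support voice_label")
        if len(label) > 1 or not set(label) <= VOICE_LABEL_KEYS:
            raise ValueError("set only one of language, emotion, or style")

    # instruction: only for stepaudio-2.5-tts, max 200 characters.
    instruction = options.get("instruction")
    if instruction is not None:
        if model != INSTRUCTION_MODEL:
            raise ValueError("instruction requires stepaudio-2.5-tts")
        if len(instruction) > 200:
            raise ValueError("instruction exceeds 200 characters")

    return {"model": model, "file_id": file_id,
            "sample_text": sample_text, **options}
```

Rejecting invalid combinations locally avoids a round trip for errors the server would be expected to return anyway, such as sending instruction to a model other than stepaudio-2.5-tts.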

Response

  • sample_text string
    The text used for the preview audio.
  • sample_audio string
    Base64-encoded preview audio (WAV). Decode it and write the bytes to a file for playback.
  • request_id string
    Unique identifier for this request.
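
To play the preview, the sample_audio string has to be decoded from base64 and written to disk. A minimal sketch, assuming the response body has already been parsed into a dict; save_preview_audio is a hypothetical helper name, and the .wav extension follows the WAV note above (if a different response_format was requested, the bytes may be in that format instead).

```python
import base64
from pathlib import Path


def save_preview_audio(response: dict, path: str) -> int:
    """Decode the base64 sample_audio field and write it to path.

    Returns the number of bytes written.
    """
    audio_bytes = base64.b64decode(response["sample_audio"])
    return Path(path).write_bytes(audio_bytes)
```

For example, save_preview_audio(result, "preview.wav") writes the decoded clip next to the script, where result is the parsed JSON response.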

Example

curl -L 'https://api.stepfun.ai/v1/audio/voices/preview' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "file_id": "file-Ckyl3cV09A",
    "model": "stepaudio-2.5-tts",
    "text": "StepFun intelligence, amplifying every possibility tenfold",
    "sample_text": "Nice weather today",
    "instruction": "Gentle tone, slightly slow pace"
}'
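
The same request can be issued from Python using only the standard library. This is a sketch rather than an official client: build_preview_request and send_preview_request are hypothetical helper names, the 60-second timeout is an arbitrary assumption, and the URL and headers are taken from the curl example above.

```python
import json
import urllib.request

PREVIEW_URL = "https://api.stepfun.ai/v1/audio/voices/preview"


def build_preview_request(payload: dict, api_key: str,
                          url: str = PREVIEW_URL) -> urllib.request.Request:
    """Build the POST request with the same headers as the curl example."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_preview_request(request: urllib.request.Request,
                         timeout: float = 60.0) -> dict:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A typical call reads the key from the STEP_API_KEY environment variable, passes the same JSON fields as the curl body to build_preview_request, and hands the result to send_preview_request. For Step Plan, pass the Step Plan URL from the Endpoint section as the url argument.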