Voice clone preview - StepFun Documentation

Generate a preview audio clip from a reference audio file (WAV or MP3) to quickly verify the voice cloning result. This endpoint does not create a permanent voice asset; it is for previewing only.

Endpoint

POST https://api.stepfun.ai/v1/audio/voices/preview

For Step Plan, use POST https://api.stepfun.ai/step_plan/v1/audio/voices/preview

Request parameters

model string required
Model to use for cloning. Options: step-tts-2, stepaudio-2.5-tts.
file_id string required
Reference audio file ID. Obtain it by uploading the audio through the file upload API with purpose set to storage.
text string optional
Transcript of the reference audio. If omitted, automatic speech recognition is used. For best results, we recommend providing the transcript.
sample_text string required
Text to synthesize for the preview. Recommended length: under 50 characters.
response_format string optional
Audio format for the response. Options: wav, mp3, flac, opus, pcm. Default: mp3.
speed float optional
Speaking rate. Range: 0.5–2.0. Default: 1.0.
volume float optional
Volume level. Range: 0.1–2.0. Default: 1.0.
voice_label object optional
Voice style tags for emotion and style control. Only one of language, emotion, or style may be set at a time.
- language string optional
  Language option: Cantonese, Sichuan dialect, Japanese.
- emotion string optional
  Emotion tag. See voice tags for supported options per model.
- style string optional
  Speaking or delivery style. See voice tags for supported options per model.

Note: stepaudio-2.5-tts does NOT support the voice_label parameter. With this model, use the instruction field or inline instructions written in parentheses () within the text instead.

instruction string optional
Global natural-language guidance that sets the overall emotional tone and character of the audio. Only effective with stepaudio-2.5-tts; other models return an error if this parameter is provided. Max length: 200 characters.
sample_rate integer optional
Sample rate in Hz. Options: 8000, 16000, 22050, 24000, 48000. Default: 24000. Higher values produce better quality but larger files.
pronunciation_map object array optional
Custom pronunciation rules for specific characters or symbols.
- tone string required
  Pronunciation mapping separated by /. Example: ["word/wɜːrd"].
markdown_filter bool optional
Whether to enable Markdown filtering for the input text.
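
The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: the helper name build_preview_payload is hypothetical, the allowed values are copied from the parameter list, and the server remains the authority on what is accepted. pronunciation_map and markdown_filter are left out for brevity.

```python
from typing import Any, Dict, Optional

# Allowed values copied from the parameter list above.
MODELS = {"step-tts-2", "stepaudio-2.5-tts"}
RESPONSE_FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 48000}
VOICE_LABEL_KEYS = {"language", "emotion", "style"}


def build_preview_payload(
    model: str,
    file_id: str,
    sample_text: str,
    *,
    text: Optional[str] = None,
    response_format: Optional[str] = None,
    speed: Optional[float] = None,
    volume: Optional[float] = None,
    voice_label: Optional[Dict[str, str]] = None,
    instruction: Optional[str] = None,
    sample_rate: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate documented constraints, return a JSON-ready dict."""
    if model not in MODELS:
        raise ValueError(f"unsupported model: {model}")
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if speed is not None and not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within 0.5-2.0")
    if volume is not None and not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be within 0.1-2.0")
    if sample_rate is not None and sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")

    if voice_label is not None:
        # stepaudio-2.5-tts does not support voice_label.
        if model == "stepaudio-2.5-tts":
            raise ValueError("voice_label is not supported by stepaudio-2.5-tts")
        # Only one of language, emotion, or style may be set at a time.
        if len(voice_label) != 1 or not set(voice_label) <= VOICE_LABEL_KEYS:
            raise ValueError("voice_label needs exactly one of language, emotion, style")

    if instruction is not None:
        # instruction is only effective with stepaudio-2.5-tts.
        if model != "stepaudio-2.5-tts":
            raise ValueError("instruction is only supported by stepaudio-2.5-tts")
        if len(instruction) > 200:
            raise ValueError("instruction must be at most 200 characters")

    payload: Dict[str, Any] = {
        "model": model,
        "file_id": file_id,
        "sample_text": sample_text,
    }
    optional = {
        "text": text,
        "response_format": response_format,
        "speed": speed,
        "volume": volume,
        "voice_label": voice_label,
        "instruction": instruction,
        "sample_rate": sample_rate,
    }
    # Omit unset fields so the server applies its own defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Leaving unset optional fields out of the payload, rather than sending the documented defaults explicitly, keeps the request minimal and avoids hard-coding defaults that may change.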

Response

sample_text string
The text used for the preview audio.
sample_audio string
Preview audio in base64 format (WAV). Convert to a file for playback.
request_id string
Unique identifier for this request.
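
Because sample_audio arrives as a base64 string, it has to be decoded before it can be played. A minimal sketch of that step follows; the function name save_preview_audio is hypothetical, and it assumes the response body has already been parsed from JSON into a dict with the fields listed above. The response description says the audio is WAV, so a .wav file name is a reasonable default; adjust it if the bytes turn out to be in another format.

```python
import base64
from pathlib import Path
from typing import Any, Dict


def save_preview_audio(response_body: Dict[str, Any], path: str) -> int:
    """Hypothetical helper: decode sample_audio and write it to `path`.

    Returns the number of bytes written.
    """
    # sample_audio is documented as a base64-encoded string.
    audio_bytes = base64.b64decode(response_body["sample_audio"])
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

For example, save_preview_audio(body, "preview.wav") writes the decoded clip next to the script so it can be opened in any audio player.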

Example

curl

curl -L 'https://api.stepfun.ai/v1/audio/voices/preview' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "file_id": "file-Ckyl3cV09A",
    "model": "stepaudio-2.5-tts",
    "text": "StepFun intelligence, amplifying every possibility tenfold",
    "sample_text": "Nice weather today",
    "instruction": "Gentle tone, slightly slow pace"
}'
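
Python

The same request can be sketched in Python with only the standard library. This is an illustrative equivalent of the curl call above, not an official client: the function names make_preview_request and send_preview_request and the step_plan flag are assumptions introduced here, and error handling is omitted.

```python
import json
import urllib.request
from typing import Any, Dict

# Endpoints as listed in the Endpoint section.
PREVIEW_URL = "https://api.stepfun.ai/v1/audio/voices/preview"
STEP_PLAN_PREVIEW_URL = "https://api.stepfun.ai/step_plan/v1/audio/voices/preview"


def make_preview_request(
    payload: Dict[str, Any], api_key: str, step_plan: bool = False
) -> urllib.request.Request:
    """Build (but do not send) the POST request for a voice clone preview."""
    url = STEP_PLAN_PREVIEW_URL if step_plan else PREVIEW_URL
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_preview_request(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call reads the key from the STEP_API_KEY environment variable, passes the same JSON body as the curl example to make_preview_request, and then hands the result to send_preview_request. Splitting construction from sending makes the request easy to inspect before any network traffic occurs.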
