Voice cloning - StepFun Documentation

Clone a voice from a previously uploaded WAV or MP3 file so it can be used for TTS audio generation.

Endpoint

POST https://api.stepfun.ai/v1/audio/voices

Request parameters

model string required
TTS model to use. Options: step-tts-2.
text string optional
Transcript of the source audio file. If omitted, the transcript is produced by automatic speech recognition. For best results, provide the transcript yourself.
file_id string required
File ID of the source audio used for cloning. Obtain the ID by uploading the audio file with purpose set to storage. Supported formats: mp3, wav. The audio should be 5–10 seconds long.
sample_text string optional
Text (max 50 characters) used to create a preview clip.
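
The limits above (source audio of 5–10 seconds, sample_text of at most 50 characters) can be checked locally before uploading. The following is a minimal sketch for WAV input only, using Python's standard wave module. The helper names are hypothetical, and it assumes that "characters" corresponds to Python's len() of the string, which the API may count differently.

```python
import io
import wave
from typing import List

MAX_SAMPLE_TEXT_CHARS = 50          # limit stated for sample_text
RECOMMENDED_SECONDS = (5.0, 10.0)   # recommended source audio length


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of a PCM WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clone_inputs(wav_bytes: bytes, sample_text: str = "") -> List[str]:
    """Collect readable problems; an empty list means the inputs look fine."""
    problems = []
    duration = wav_duration_seconds(wav_bytes)
    low, high = RECOMMENDED_SECONDS
    if not low <= duration <= high:
        problems.append(
            f"audio is {duration:.1f}s, expected {low:.0f} to {high:.0f}s"
        )
    # Assumption: the 50-character limit matches len() of the Python string.
    if len(sample_text) > MAX_SAMPLE_TEXT_CHARS:
        problems.append(
            f"sample_text has {len(sample_text)} characters, "
            f"max is {MAX_SAMPLE_TEXT_CHARS}"
        )
    return problems
```

An MP3 source would need a third-party decoder to measure its duration, so this sketch leaves that case out.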

Response

id string
Voice ID for subsequent audio generation.
object string
Object type, always audio.voice.
duplicated boolean
Whether this request duplicates an earlier one. Returned on repeated calls.
sample_text string
Text used for the preview audio.
sample_audio string
Preview audio as a base64-encoded WAV. Decode it and save it to a file to play it.
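
Because sample_audio arrives as base64 text, it has to be decoded before it can be played. A minimal sketch, assuming the response body has already been parsed from JSON into a dict; the function names and the default output path are illustrative, not part of the API.

```python
import base64
from pathlib import Path


def decode_sample_audio(response_body: dict) -> bytes:
    """Decode the base64 `sample_audio` field into raw WAV bytes."""
    return base64.b64decode(response_body["sample_audio"])


def save_sample_audio(response_body: dict, out_path: str = "preview.wav") -> Path:
    """Write the decoded preview audio to `out_path` and return the path."""
    path = Path(out_path)
    path.write_bytes(decode_sample_audio(response_body))
    return path
```

The saved file can then be opened in any audio player that supports WAV.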

Example

curl

curl -L 'https://api.stepfun.ai/v1/audio/voices' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $STEP_API_KEY" \
-d '{
    "file_id":"file-Ckyl3cV09A",
    "model":"step-tts-2",
    "text":"StepFun intelligence, 10x possibilities for everyone.",
    "sample_text":"Nice weather today"
}'
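
Python

The same request can be sketched in Python with only the standard library. This is an illustrative equivalent of the curl call above, not an official client: build_clone_request and clone_voice are hypothetical helper names, and error handling is omitted. Unlike curl -L, urllib does not re-send a POST body when following redirects.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.stepfun.ai/v1/audio/voices"


def build_clone_request(
    api_key: str,
    file_id: str,
    model: str = "step-tts-2",
    text: Optional[str] = None,
    sample_text: Optional[str] = None,
) -> urllib.request.Request:
    """Build the POST request; optional fields are sent only when provided."""
    payload = {"file_id": file_id, "model": model}
    if text is not None:
        payload["text"] = text
    if sample_text is not None:
        payload["sample_text"] = sample_text
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def clone_voice(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

To use it, read the key from the STEP_API_KEY environment variable, pass it to build_clone_request together with the file ID, and hand the result to clone_voice. The returned dict should then carry the id, duplicated, sample_text, and sample_audio fields described under Response.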
