Text-to-Speech Guide
StepFun provides developers with voice interaction models that support audio generation, voice cloning, and automatic speech recognition. By integrating these models, applications can go beyond text-only large language model understanding and support voice interaction.
Quick Start
Quickly Generate an Audio Clip
Copy the following code to quickly generate an audio file.
curl --location 'https://api.stepfun.ai/v1/audio/speech' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $STEP_API_KEY" \
--data '{
"model":"step-tts-2",
"input":"StepFun is building the next generation of AGI.",
"voice":"lively-girl"
}' \
--output "step.mp3"
Voice Recommendations by Scenario
1. Emotional Companionship Scenarios
Emotional companionship requires voices that are warm, gentle, and empathetic, capable of providing users with comfort and psychological support. Our TTS model features delicate, soothing voice timbres with strong emotional expressiveness, helping you create a safe and comforting interaction environment for users.
Recommended voices
| Voice name | Voice ID | Sample audio |
|---|---|---|
| Soft-spoken Gentleman | soft-spoken-gentleman | "The weather is perfect today. How about we go for a walk later? Just you and me." "It’s okay to take a break. You don’t have to carry everything on your own. Lean on me." |
2. Video Dubbing Scenarios
Video dubbing requires voices that are expressive, rhythmic, and visually evocative, enabling them to blend seamlessly with the visual content. Our TTS model excels in precise emotional delivery and fine-grained speech rhythm control, enhancing the impact and overall appeal of your videos.
Recommended voices
| Voice name | Voice ID | Sample audio |
|---|---|---|
| Vibrant Youth | vibrant-youth | "Every street, every face, every passing moment tells a story, if you’re willing to look a little closer." "This city never really slows down. It just changes rhythm, and today, it’s inviting you to move with it." |
| Magnetic-voiced Male | magnetic-voiced-male | "War is not about who is right. It is about who is left. We hold the line." "Legends aren’t born. They are forged in fire and written in blood." |
3. Audiobook Scenarios
Audiobooks require voices that are expressive and emotionally engaging, capable of vividly bringing different characters and story atmospheres to life. Our TTS model stands out with its delicate emotional expression and versatile vocal styles, enabling listeners to fully immerse themselves in the world of the story.
Recommended voices
| Voice name | Voice ID | Sample audio |
|---|---|---|
| Lively Girl | lively-girl | "In the quiet moments before dawn, she realized that some journeys don’t begin with excitement, but with a calm decision to finally move forward." "Years later, when the details had faded, she could still remember the sound of the wind that afternoon, as if it had never truly left her." |
System Voice ID List
| Language | Voice name | Voice ID | Supported models | Recommended use cases |
|---|---|---|---|---|
| English | Lively Girl | lively-girl | step-tts-2 | Audiobook, video dubbing |
| English | Vibrant Youth | vibrant-youth | step-tts-2 | Audiobook, video dubbing |
| English | Soft-spoken Gentleman | soft-spoken-gentleman | step-tts-2 | Audiobook, emotional companionship |
| English | Magnetic-voiced Male | magnetic-voiced-male | step-tts-2 | Audiobook, video dubbing |
| Chinese | Gentle Lady | elegantgentle-female | step-tts-2 | Customer service and transaction handling, voice-over and broadcasting, education and training, emotional companionship |
| Chinese | Breezy Girl | livelybreezy-female | step-tts-2 | Emotional companionship, customer service and transaction handling, education and training, advertising |
| Chinese | Confident Gentleman | zixinnansheng | step-tts-2 | Audiobook, video dubbing |
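The table above can be turned into a small lookup so an application picks a voice ID by language and use case, then sends the same request as the Quick Start. The following is a minimal sketch using only the Python standard library. The helper names (`find_voices`, `build_request`, `synthesize`) are hypothetical and not part of any StepFun SDK. The endpoint, model name, request fields, and the `STEP_API_KEY` environment variable are taken from the Quick Start.

```python
import json
import os
import urllib.request

# Endpoint and model name are taken from the Quick Start request.
API_URL = "https://api.stepfun.ai/v1/audio/speech"

# Voice ID -> (language, recommended use cases), copied from the table above.
VOICES = {
    "lively-girl": ("English", {"audiobook", "video dubbing"}),
    "vibrant-youth": ("English", {"audiobook", "video dubbing"}),
    "soft-spoken-gentleman": ("English", {"audiobook", "emotional companionship"}),
    "magnetic-voiced-male": ("English", {"audiobook", "video dubbing"}),
    "elegantgentle-female": ("Chinese", {
        "customer service and transaction handling",
        "voice-over and broadcasting",
        "education and training",
        "emotional companionship",
    }),
    "livelybreezy-female": ("Chinese", {
        "emotional companionship",
        "customer service and transaction handling",
        "education and training",
        "advertising",
    }),
    "zixinnansheng": ("Chinese", {"audiobook", "video dubbing"}),
}


def find_voices(language, use_case):
    """Return the sorted voice IDs that match a language and a use case."""
    return sorted(
        voice_id
        for voice_id, (lang, use_cases) in VOICES.items()
        if lang == language and use_case.lower() in use_cases
    )


def build_request(text, voice, model="step-tts-2"):
    """Build (but do not send) the POST request for one utterance."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice ID: {voice}")
    body = json.dumps({"model": model, "input": text, "voice": voice})
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('STEP_API_KEY', '')}",
    }
    return urllib.request.Request(
        API_URL, data=body.encode("utf-8"), headers=headers, method="POST"
    )


def synthesize(text, voice, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_request(text, voice)) as response:
        audio = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(audio)
```

For example, `find_voices("English", "emotional companionship")` narrows the list to `soft-spoken-gentleman`, and `synthesize("Lean on me.", "soft-spoken-gentleman", "step.mp3")` would then write the audio to disk, provided `STEP_API_KEY` is set.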
Voice Tags List
Voice tags support three categories: speaking style, emotion, and language. Emotion tags must be set in the voice_label.emotion field, while speaking-style tags must be set in the voice_label.style field.
| No. | Tag Name | Tag Type |
|---|---|---|
| 1 | Happy | Emotion |
| 2 | Very Happy | Emotion |
| 3 | Sad | Emotion |
| 4 | Angry | Emotion |
| 5 | Very Angry | Emotion |
| 6 | Coquettish | Emotion |
| 7 | Fearful | Emotion |
| 8 | Surprised | Emotion |
| 9 | Excited | Emotion |
| 10 | Admiring | Emotion |
| 11 | Confused | Emotion |
| 12 | Slow | Speaking Style |
| 13 | Very Slow | Speaking Style |
| 14 | Fast | Speaking Style |
| 15 | Very Fast | Speaking Style |
| 16 | Cold | Delivery Style |
| 17 | Embarrassed | Delivery Style |
| 18 | Frustrated | Delivery Style |
| 19 | Proud | Delivery Style |
| 20 | Tender | Delivery Style |
| 21 | Sweet | Delivery Style |
| 22 | Outgoing | Delivery Style |
| 23 | Serious | Delivery Style |
| 24 | Arrogant | Delivery Style |
| 25 | Elderly | Delivery Style |
| 26 | Shouting | Delivery Style |
| 27 | Sarcastic | Delivery Style |
| 28 | Stuttering | Delivery Style |
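As an illustration of where tags go in a request body, the sketch below attaches an emotion tag and a speaking-style tag under `voice_label.emotion` and `voice_label.style`, the two fields named above. It rests on several assumptions: that `voice_label` is a top-level object in the request body, that the API accepts the English tag names exactly as written in the table, and that only the four tags typed "Speaking Style" belong in `voice_label.style`. The "Delivery Style" tags are left out because this guide does not say which field they use. `build_tagged_payload` is a hypothetical helper name.

```python
# Tag names copied from the table above. Whether the API expects these exact
# English strings is an assumption; check the API reference before relying on it.
EMOTION_TAGS = {
    "Happy", "Very Happy", "Sad", "Angry", "Very Angry", "Coquettish",
    "Fearful", "Surprised", "Excited", "Admiring", "Confused",
}
SPEAKING_STYLE_TAGS = {"Slow", "Very Slow", "Fast", "Very Fast"}


def build_tagged_payload(text, voice, emotion=None, style=None, model="step-tts-2"):
    """Return a request body with optional voice_label.emotion / voice_label.style."""
    payload = {"model": model, "input": text, "voice": voice}
    voice_label = {}
    if emotion is not None:
        if emotion not in EMOTION_TAGS:
            raise ValueError(f"not an emotion tag: {emotion}")
        voice_label["emotion"] = emotion
    if style is not None:
        if style not in SPEAKING_STYLE_TAGS:
            raise ValueError(f"not a speaking-style tag: {style}")
        voice_label["style"] = style
    # Only add voice_label when at least one tag was requested.
    if voice_label:
        payload["voice_label"] = voice_label
    return payload
```

Rejecting a tag that is filed under the wrong category (for example `"Slow"` passed as an emotion) catches the most likely mistake locally, before a request is sent.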
Output Format
StepFun TTS models support audio output in wav, mp3, flac, opus, and pcm formats. The default format is mp3. You can choose the format that best suits your use case.
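A client can validate the requested format before sending anything and name the output file to match. The sketch below shows one way to do that. This guide does not name the request field that selects the format, so `response_format` is an assumed field name passed in as a default argument, and both helper functions are hypothetical.

```python
# Formats and default taken from the paragraph above.
SUPPORTED_FORMATS = ("wav", "mp3", "flac", "opus", "pcm")
DEFAULT_FORMAT = "mp3"


def with_output_format(payload, fmt=DEFAULT_FORMAT, field="response_format"):
    """Return a copy of payload with the output format set.

    The field name "response_format" is an assumption, not confirmed by this guide.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    updated = dict(payload)  # leave the caller's dict untouched
    updated[field] = fmt
    return updated


def output_filename(stem, fmt=DEFAULT_FORMAT):
    """Build a file name whose extension matches the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{stem}.{fmt}"
```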
Output Languages
StepFun TTS models support generating audio in Chinese, English, mixed Chinese-English, and Japanese.
FAQ
Do I own the audio I generate?
Yes. You own the audio you create. However, we recommend informing your end users that the audio was generated by AI so they are aware of its nature.
How do I adjust the volume of the generated audio?
You can set the volume parameter when calling the generation API. Valid values range from 0.1 to 2.0, representing 10% volume to 200% volume.
How do I adjust the speaking rate of the generated audio?
You can set the speed parameter when calling the generation API. Valid values range from 0.5 to 2.0, representing half-speed to double-speed.
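The two ranges above are easy to check on the client side before a request is sent. The following is a minimal sketch that assumes `volume` and `speed` are top-level fields of the request body, alongside `model`, `input`, and `voice`. `build_payload_with_levels` is a hypothetical helper name.

```python
# Valid ranges taken from the two answers above (inclusive bounds assumed).
VOLUME_RANGE = (0.1, 2.0)  # 10% to 200% volume
SPEED_RANGE = (0.5, 2.0)   # half speed to double speed


def _check_range(name, value, bounds):
    """Raise ValueError if value falls outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_payload_with_levels(text, voice, volume=None, speed=None, model="step-tts-2"):
    """Return a request body, adding volume and speed only when they are given."""
    payload = {"model": model, "input": text, "voice": voice}
    if volume is not None:
        _check_range("volume", volume, VOLUME_RANGE)
        payload["volume"] = volume
    if speed is not None:
        _check_range("speed", speed, SPEED_RANGE)
        payload["speed"] = speed
    return payload
```

Leaving both parameters out of the body when they are not set lets the service apply its own defaults, which this guide does not specify.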