
Text-to-Speech Guide

StepFun provides developers with voice interaction models that support audio generation, voice cloning, and automatic speech recognition. By integrating these models, applications can go beyond the text understanding of a standard large language model and support voice interaction.

Quick Start

Quickly Generate an Audio Clip

Copy the following code to quickly generate an audio file.

curl --location 'https://api.stepfun.ai/v1/audio/speech' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $STEP_API_KEY" \
  --data '{
    "model": "step-tts-2",
    "input": "StepFun is building the next generation of AGI.",
    "voice": "lively-girl"
  }' \
  --output "step.mp3"
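For applications that call the API from code rather than a shell, the same request can be sketched in Python with only the standard library. The endpoint, model, voice, and JSON field names below are copied from the curl command above; the helper names `build_speech_request` and `synthesize` are hypothetical, and the sketch assumes the API key is available in the `STEP_API_KEY` environment variable and that the response body is the raw audio.

```python
import json
import os
import urllib.request

API_URL = "https://api.stepfun.ai/v1/audio/speech"  # endpoint from the curl example


def build_speech_request(text, voice="lively-girl", model="step-tts-2", api_key=None):
    """Build (but do not send) the POST request for the speech endpoint."""
    # Fall back to the same environment variable the curl example uses.
    key = api_key if api_key is not None else os.environ.get("STEP_API_KEY", "")
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )


def synthesize(text, out_path="step.mp3", **kwargs):
    """Send the request and write the returned bytes to out_path.

    Assumption: the response body is the audio file itself, as the
    --output flag in the curl example suggests.
    """
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path
```

Calling `synthesize("StepFun is building the next generation of AGI.")` would then write `step.mp3` to the current directory, matching the curl example. Separating request construction from sending keeps the payload easy to inspect before any network call is made.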

Voice Recommendations by Scenario

1. Emotional Companionship Scenarios

Emotional companionship requires voices that are warm, gentle, and empathetic, capable of providing users with comfort and psychological support. Our TTS model features delicate, soothing voice timbres with strong emotional expressiveness, helping you create a safe and comforting interaction environment for users.

Recommended voices

| Voice name | Voice ID | Sample audio |
| --- | --- | --- |
| Soft-spoken Gentleman | soft-spoken-gentleman | "The weather is perfect today. How about we go for a walk later? Just you and me." "It’s okay to take a break. You don’t have to carry everything on your own. Lean on me." |

2. Video Dubbing Scenarios

Video dubbing requires voices that are expressive, rhythmic, and visually evocative, enabling them to blend seamlessly with the visual content. Our TTS model excels in precise emotional delivery and fine-grained speech rhythm control, enhancing the impact and overall appeal of your videos.

Recommended voices

| Voice name | Voice ID | Sample audio |
| --- | --- | --- |
| Vibrant Youth | vibrant-youth | "Every street, every face, every passing moment tells a story, if you’re willing to look a little closer." "This city never really slows down. It just changes rhythm, and today, it’s inviting you to move with it." |
| Magnetic-voiced Male | magnetic-voiced-male | "War is not about who is right. It is about who is left. We hold the line." "Legends aren’t born. They are forged in fire and written in blood." |

3. Audiobook Scenarios

Audiobooks require voices that are expressive and emotionally engaging, capable of vividly bringing different characters and story atmospheres to life. Our TTS model stands out with its delicate emotional expression and versatile vocal styles, enabling listeners to fully immerse themselves in the world of the story.

Recommended voices

| Voice name | Voice ID | Sample audio |
| --- | --- | --- |
| Lively Girl | lively-girl | "In the quiet moments before dawn, she realized that some journeys don’t begin with excitement, but with a calm decision to finally move forward." "Years later, when the details had faded, she could still remember the sound of the wind that afternoon, as if it had never truly left her." |

System Voice ID List

| Language | Voice name | Voice ID | Supported models | Recommended use cases |
| --- | --- | --- | --- | --- |
| English | Lively Girl | lively-girl | step-tts-2 | Audiobook, video dubbing |
| English | Vibrant Youth | vibrant-youth | step-tts-2 | Audiobook, video dubbing |
| English | Soft-spoken Gentleman | soft-spoken-gentleman | step-tts-2 | Audiobook, emotional companionship |
| English | Magnetic-voiced Male | magnetic-voiced-male | step-tts-2 | Audiobook, video dubbing |
| Chinese | Gentle Lady | elegantgentle-female | step-tts-2 | Customer service and transaction handling, voice-over and broadcasting, education and training, emotional companionship |
| Chinese | Breezy Girl | livelybreezy-female | step-tts-2 | Emotional companionship, customer service and transaction handling, education and training, advertising |
| Chinese | Confident Gentleman | zixinnansheng | step-tts-2 | Audiobook, video dubbing |
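When an application picks a voice at runtime, the list above can be kept as a small lookup table. The following is a minimal sketch with the rows transcribed from the table; the structure `SYSTEM_VOICES` and the helper `find_voices` are hypothetical names, not part of the StepFun API.

```python
# Rows transcribed from the System Voice ID List; all voices support step-tts-2.
SYSTEM_VOICES = [
    {"language": "English", "name": "Lively Girl", "id": "lively-girl",
     "use_cases": ["audiobook", "video dubbing"]},
    {"language": "English", "name": "Vibrant Youth", "id": "vibrant-youth",
     "use_cases": ["audiobook", "video dubbing"]},
    {"language": "English", "name": "Soft-spoken Gentleman", "id": "soft-spoken-gentleman",
     "use_cases": ["audiobook", "emotional companionship"]},
    {"language": "English", "name": "Magnetic-voiced Male", "id": "magnetic-voiced-male",
     "use_cases": ["audiobook", "video dubbing"]},
    {"language": "Chinese", "name": "Gentle Lady", "id": "elegantgentle-female",
     "use_cases": ["customer service and transaction handling",
                   "voice-over and broadcasting", "education and training",
                   "emotional companionship"]},
    {"language": "Chinese", "name": "Breezy Girl", "id": "livelybreezy-female",
     "use_cases": ["emotional companionship",
                   "customer service and transaction handling",
                   "education and training", "advertising"]},
    {"language": "Chinese", "name": "Confident Gentleman", "id": "zixinnansheng",
     "use_cases": ["audiobook", "video dubbing"]},
]


def find_voices(language=None, use_case=None):
    """Return voice IDs, in table order, matching the optional filters."""
    matches = []
    for voice in SYSTEM_VOICES:
        if language is not None and voice["language"] != language:
            continue
        if use_case is not None and use_case.lower() not in voice["use_cases"]:
            continue
        matches.append(voice["id"])
    return matches
```

For example, `find_voices("English", "emotional companionship")` returns the single ID `soft-spoken-gentleman`, which can be passed as the `voice` field of the request.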

Voice Tags List

Voice tags support three categories: speaking style, emotion, and language. Emotion tags must be set in the voice_label.emotion field, while speaking-style tags must be set in the voice_label.style field.

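| No. | Tag Name | Tag Type |
| --- | --- | --- |
| 1 | Happy | Emotion |
| 2 | Very Happy | Emotion |
| 3 | Sad | Emotion |
| 4 | Angry | Emotion |
| 5 | Very Angry | Emotion |
| 6 | Coquettish | Emotion |
| 7 | Fearful | Emotion |
| 8 | Surprised | Emotion |
| 9 | Excited | Emotion |
| 10 | Admiring | Emotion |
| 11 | Confused | Emotion |
| 12 | Slow | Speaking Style |
| 13 | Very Slow | Speaking Style |
| 14 | Fast | Speaking Style |
| 15 | Very Fast | Speaking Style |
| 16 | Cold | Delivery Style |
| 17 | Embarrassed | Delivery Style |
| 18 | Frustrated | Delivery Style |
| 19 | Proud | Delivery Style |
| 20 | Tender | Delivery Style |
| 21 | Sweet | Delivery Style |
| 22 | Outgoing | Delivery Style |
| 23 | Serious | Delivery Style |
| 24 | Arrogant | Delivery Style |
| 25 | Elderly | Delivery Style |
| 26 | Shouting | Delivery Style |
| 27 | Sarcastic | Delivery Style |
| 28 | Stuttering | Delivery Style |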
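To illustrate how the `voice_label.emotion` and `voice_label.style` fields might be populated, here is a minimal sketch that builds a request payload with tags. Several things are assumptions rather than documented behavior: that `voice_label` is a top-level object in the request body, and that tag values are sent exactly as the names listed above (the API may expect a different spelling). The helper `build_tagged_payload` is hypothetical. Because the text does not say which field Delivery Style tags belong in, the sketch covers only Emotion and Speaking Style tags.

```python
# Tag names copied from the Voice Tags List (Emotion and Speaking Style only).
EMOTION_TAGS = {
    "Happy", "Very Happy", "Sad", "Angry", "Very Angry", "Coquettish",
    "Fearful", "Surprised", "Excited", "Admiring", "Confused",
}
SPEAKING_STYLE_TAGS = {"Slow", "Very Slow", "Fast", "Very Fast"}


def build_tagged_payload(text, voice, emotion=None, style=None, model="step-tts-2"):
    """Return a request payload with optional voice_label tags.

    Assumption: voice_label sits at the top level of the JSON body and
    takes the tag names verbatim.
    """
    if emotion is not None and emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion tag: {emotion!r}")
    if style is not None and style not in SPEAKING_STYLE_TAGS:
        raise ValueError(f"unknown speaking-style tag: {style!r}")

    payload = {"model": model, "input": text, "voice": voice}
    voice_label = {}
    if emotion is not None:
        voice_label["emotion"] = emotion  # emotion tags go in voice_label.emotion
    if style is not None:
        voice_label["style"] = style  # speaking-style tags go in voice_label.style
    if voice_label:
        payload["voice_label"] = voice_label
    return payload
```

Validating tags locally means a misplaced tag, such as a speaking-style name passed as an emotion, fails before the request is sent.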

Output Format

StepFun TTS models support audio output in wav, mp3, flac, opus, and pcm formats. The default format is mp3. You can choose the format that best suits your use case.
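As a rough sketch of choosing a format in code, the helper below validates the format against the list above and derives a matching output file name. The request field name `response_format` is an assumption, since this guide does not name the parameter; check the API reference before relying on it. The helper names are hypothetical.

```python
SUPPORTED_FORMATS = ("wav", "mp3", "flac", "opus", "pcm")  # from the text above
DEFAULT_FORMAT = "mp3"


def with_output_format(payload, audio_format=DEFAULT_FORMAT, field="response_format"):
    """Return a copy of payload with the output format set.

    Assumption: the request field is called "response_format"; pass a
    different `field` if the API reference says otherwise.
    """
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    updated = dict(payload)  # leave the caller's dict untouched
    updated[field] = audio_format
    return updated


def output_filename(stem, audio_format=DEFAULT_FORMAT):
    """Build a file name whose extension matches the requested format."""
    if audio_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    return f"{stem}.{audio_format}"
```

Keeping the file extension in step with the requested format avoids saving, for example, WAV data under an `.mp3` name.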

Output Languages

StepFun TTS models support generating audio in Chinese, English, mixed Chinese-English, and Japanese.

FAQ

Do I own the audio I generate?

Yes. You own the audio you create. However, we recommend informing your end users that the audio was generated by AI so they are aware of its nature.

How do I adjust the volume of the generated audio?

You can set the volume parameter when calling the generation API. Valid values range from 0.1 to 2.0, representing 10% volume to 200% volume.

How do I adjust the speaking rate of the generated audio?

You can set the speed parameter when calling the generation API. Valid values range from 0.5 to 2.0, representing half-speed to double-speed.
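The two answers above can be combined into a small client-side check. This is a minimal sketch: the parameter names `volume` and `speed` and their ranges come from the text, while their placement as top-level fields of the request body is an assumption, and `with_speed_and_volume` is a hypothetical helper.

```python
VOLUME_RANGE = (0.1, 2.0)  # 10% to 200% volume
SPEED_RANGE = (0.5, 2.0)   # half-speed to double-speed


def _check_range(name, value, bounds):
    """Raise ValueError if value falls outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def with_speed_and_volume(payload, speed=None, volume=None):
    """Return a copy of payload with validated speed and volume fields.

    Assumption: both parameters are top-level fields of the JSON body.
    """
    updated = dict(payload)
    if speed is not None:
        _check_range("speed", speed, SPEED_RANGE)
        updated["speed"] = speed
    if volume is not None:
        _check_range("volume", volume, VOLUME_RANGE)
        updated["volume"] = volume
    return updated
```

For instance, `with_speed_and_volume(payload, speed=1.5, volume=0.8)` would request audio at 1.5 times normal speed and 80% volume, while a value such as `speed=3.0` is rejected locally instead of by the API.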
