Voice interaction developer guide - StepFun Documentation

StepFun provides developers with voice interaction models that support audio generation and voice cloning. By integrating these models, applications can extend beyond standard large language model understanding and enable voice interaction.

Quick Start

Quickly Generate an Audio Clip

Copy the following command to generate an audio file. It reads your API key from the STEP_API_KEY environment variable and saves the result as step.mp3.

curl --location 'https://api.stepfun.ai/v1/audio/speech' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $STEP_API_KEY" \
--data '{
   "model":"step-tts-2",
   "input":"StepFun is building the next generation of AGI.",
   "voice":"lively-girl"
}' \
--output "step.mp3"

Voice Recommendations by Scenario

StepFun offers dozens of recommended voices across seven major scenarios. You can preview each voice through the audio samples listed in the tables below and use it via the API by passing its voice ID. We strongly recommend using voice cloning to create custom voices. The step-tts-2 model delivers industry-leading cloning performance, and cloned voices support all emotion and style controls at no additional cost.

1. Marketing

Marketing scenarios require voices with charisma, persuasiveness, and warmth that can effectively convey product value and inspire purchase intent. Step-TTS delivers full emotional expression, building trust and professionalism to make marketing content more compelling.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Lively Breezy	livelybreezy-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Upright Youth	zhengpaiqingnian	Sample 1 · Sample 2

2. Customer Service

Customer service scenarios require voices that are warm, patient, and professional, capable of calming users and providing clear solutions. The step-tts-2 customer service voices stand out for their rich audio quality, full emotion, and lifelike human feel. The first four recommendations below are especially suited to phone scenarios.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Straightforward Male	shuangkuainansheng	Sample 1 · Sample 2 · Sample 3
stepaudio-2.5-tts / step-tts-2	Capable Female	ganliannvsheng	Sample 1 · Sample 2 · Sample 3
stepaudio-2.5-tts / step-tts-2	Warm Female	qinhenvsheng	Sample 1 · Sample 2 · Sample 3
stepaudio-2.5-tts / step-tts-2	Energetic Female	huolinvsheng	Sample 1 · Sample 2 · Sample 3
stepaudio-2.5-tts / step-tts-2	Elegant Gentle	elegantgentle-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Lively Breezy	livelybreezy-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Gentle Male	wenrounansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Classic Female	jingdiannvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Mature Gentle	wenroushunv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Sweet Female	tianmeinvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Pure Girl	qingchunshaonv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Spirited Male	yuanqinansheng	Sample 1 · Sample 2

3. Audiobook

Audiobooks require voices that are expressive and emotionally engaging, capable of vividly bringing different characters and story atmospheres to life. Our TTS stands out with its delicate emotional expression and versatile vocal styles, enabling listeners to fully immerse themselves in the world of the story.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Lively Girl	lively-girl	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Scholarly Gentleman	ruyananshi	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Gentle Female	wenrounvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Tender Gentleman	wenrougongzi	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Magnetic Male	cixingnansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Spirited Girl	yuanqishaonv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Upright Youth	zhengpaiqingnian	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Spirited Male	yuanqinansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Broadcast Male	boyinnansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Deep Male	shenchennanyin	Sample 1 · Sample 2

4. Emotional Companionship

Emotional companionship requires voices that are warm, gentle, and empathetic, capable of providing users with comfort and psychological support. Our TTS features delicate, soothing voice timbres with strong emotional expressiveness, helping you create a safe and comforting interaction environment for users.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Soft-spoken Gentleman	soft-spoken-gentleman	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Elegant Gentle	elegantgentle-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Lively Breezy	livelybreezy-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Gentle Male	wenrounansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Tender Gentleman	wenrougongzi	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Classic Female	jingdiannvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Friendly Female	qinqienvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Sweet Female	tianmeinvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Magnetic Male	cixingnansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Spirited Girl	yuanqishaonv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Girl Next Door	linjiajiejie	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Scholarly Gentleman	ruyananshi	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Deep Male	shenchennanyin	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Gentle Female	wenrounvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Cute Soft Female	ruanmengnvsheng	Sample 1 · Sample 2

5. Voice Assistant

Voice assistant scenarios require voices that are clear, natural, and efficient, capable of accurately understanding and responding to user commands. Our TTS features natural prosody and full emotional expression, making your voice assistant both professional and approachable.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Elegant Gentle	elegantgentle-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Lively Breezy	livelybreezy-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Pure Girl	qingchunshaonv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Spirited Girl	yuanqishaonv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Girl Next Door	linjiajiejie	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Scholarly Gentleman	ruyananshi	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Clever Girl	jilingshaonv	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Cute Soft Female	ruanmengnvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Kid Sister	linjiameimei	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Intellectual Lady	zhixingjiejie	Sample 1 · Sample 2

6. Video Dubbing

Video dubbing requires voices that are expressive, rhythmic, and visually evocative, capable of blending seamlessly with visual content. Our TTS excels in precise emotional delivery and fine-grained speech rhythm control, enhancing the impact and overall appeal of your videos.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Vibrant Youth	vibrant-youth	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Magnetic-voiced Male	magnetic-voiced-male	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Girl Next Door	linjiajiejie	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Kid Sister	linjiameimei	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	College Student	qingniandaxuesheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Cute Soft Female	ruanmengnvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Elegant Female	youyanvsheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Cool Beauty	lengyanyujie	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Intellectual Lady	zhixingjiejie	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Bold Sister	shuangkuaijiejie	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Quiet Scholar	wenjingxuejie	Sample 1 · Sample 2

7. Education & Training

Education and training scenarios require voices that are clear, accurate, and inspiring, capable of effectively conveying knowledge and sparking learning interest. Our TTS excels at capturing the vocal characteristics of instructors across different emotional states.

Supported Models	Voice Name	Voice ID	Audio Samples
stepaudio-2.5-tts / step-tts-2	Elegant Gentle	elegantgentle-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Gentle Male	wenrounansheng	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Lively Breezy	livelybreezy-female	Sample 1 · Sample 2
stepaudio-2.5-tts / step-tts-2	Mature Gentle	wenroushunv	Sample 1 · Sample 2

System Voice ID List

Voice Name	Voice ID	Supported Models	Recommended Use Cases
Vibrant Youth	vibrant-youth	stepaudio-2.5-tts, step-tts-2	Audiobook, video dubbing
Lively Girl	lively-girl	stepaudio-2.5-tts, step-tts-2	Audiobook, video dubbing
Soft-spoken Gentleman	soft-spoken-gentleman	stepaudio-2.5-tts, step-tts-2	Emotional companionship, audiobook
Magnetic-voiced Male	magnetic-voiced-male	stepaudio-2.5-tts, step-tts-2	Audiobook, video dubbing
Confident Male	zixinnansheng	stepaudio-2.5-tts, step-tts-2	Audiobook, emotional companionship, education, marketing
Elegant Gentle	elegantgentle-female	stepaudio-2.5-tts, step-tts-2	Customer service, voice-over, education, emotional companionship
Lively Breezy	livelybreezy-female	stepaudio-2.5-tts, step-tts-2	Emotional companionship, customer service, education, marketing
Gentle Male	wenrounansheng	stepaudio-2.5-tts, step-tts-2	Voice-over, emotional companionship, customer service, education
Tender Gentleman	wenrougongzi	stepaudio-2.5-tts, step-tts-2	Emotional companionship, audiobook
Spirited Male	yuanqinansheng	stepaudio-2.5-tts, step-tts-2	Audiobook, voice-over, customer service
Classic Female	jingdiannvsheng	stepaudio-2.5-tts, step-tts-2	Customer service, emotional companionship
Mature Gentle	wenroushunv	stepaudio-2.5-tts, step-tts-2	Customer service, voice-over, education
Sweet Female	tianmeinvsheng	stepaudio-2.5-tts, step-tts-2	Emotional companionship, customer service
Pure Girl	qingchunshaonv	stepaudio-2.5-tts, step-tts-2	Customer service, voice assistant
Magnetic Male	cixingnansheng	stepaudio-2.5-tts, step-tts-2	Audiobook, emotional companionship
Spirited Girl	yuanqishaonv	stepaudio-2.5-tts, step-tts-2	Audiobook, emotional companionship, voice assistant
Girl Next Door	linjiajiejie	stepaudio-2.5-tts, step-tts-2	Voice-over, emotional companionship, voice assistant, video dubbing
Upright Youth	zhengpaiqingnian	stepaudio-2.5-tts, step-tts-2	Marketing, audiobook
College Student	qingniandaxuesheng	stepaudio-2.5-tts, step-tts-2	Voice-over
Broadcast Male	boyinnansheng	stepaudio-2.5-tts, step-tts-2	Audiobook, voice-over
Scholarly Gentleman	ruyananshi	stepaudio-2.5-tts, step-tts-2	Audiobook, emotional companionship, voice-over, voice assistant
Deep Male	shenchennanyin	stepaudio-2.5-tts, step-tts-2	Emotional companionship, audiobook
Friendly Female	qinqienvsheng	stepaudio-2.5-tts, step-tts-2	Voice-over
Gentle Female	wenrounvsheng	stepaudio-2.5-tts, step-tts-2	Audiobook, emotional companionship
Clever Girl	jilingshaonv	stepaudio-2.5-tts, step-tts-2	Voice assistant, voice-over
Cute Soft Female	ruanmengnvsheng	stepaudio-2.5-tts, step-tts-2	Emotional companionship, voice assistant, video dubbing
Elegant Female	youyanvsheng	stepaudio-2.5-tts, step-tts-2	Video dubbing
Cool Beauty	lengyanyujie	stepaudio-2.5-tts, step-tts-2	Video dubbing
Bold Sister	shuangkuaijiejie	stepaudio-2.5-tts, step-tts-2	Voice-over
Quiet Scholar	wenjingxuejie	stepaudio-2.5-tts, step-tts-2	Voice-over
Kid Sister	linjiameimei	stepaudio-2.5-tts, step-tts-2	Video dubbing, voice-over, voice assistant
Intellectual Lady	zhixingjiejie	stepaudio-2.5-tts, step-tts-2	Video dubbing, voice-over, voice assistant
Straightforward Male	shuangkuainansheng	stepaudio-2.5-tts, step-tts-2	Customer service, voice assistant
Capable Female	ganliannvsheng	stepaudio-2.5-tts, step-tts-2	Customer service, voice assistant
Warm Female	qinhenvsheng	stepaudio-2.5-tts, step-tts-2	Customer service, voice assistant
Energetic Female	huolinvsheng	stepaudio-2.5-tts, step-tts-2	Customer service, voice assistant

Voice Tags List

Voice tags support three categories: speaking style, emotion, and language. Emotion tags must be set in the voice_label.emotion field, while speaking-style tags must be set in the voice_label.style field.

stepaudio-2.5-tts does NOT support voice tags. Use the instruction parameter for emotion and style control instead.

No.	Tag Name	Tag Type	step-tts-2
1	Happy	Emotion	✓
2	Very Happy	Emotion	✓
3	Sad	Emotion	✓
4	Angry	Emotion	✓
5	Very Angry	Emotion	✓
6	Coquettish	Emotion	✓
7	Slow	Speaking Style	✓
8	Very Slow	Speaking Style	✓
9	Fast	Speaking Style	✓
10	Very Fast	Speaking Style	✓
11	Fearful	Emotion	✓
12	Surprised	Emotion	✓
13	Excited	Emotion	✓
14	Admiring	Emotion	✓
15	Confused	Emotion	✓
16	Cold	Delivery Style	✓
17	Embarrassed	Delivery Style	✓
18	Frustrated	Delivery Style	✓
19	Proud	Delivery Style	✓
20	Tender	Delivery Style	✓
21	Sweet	Delivery Style	✓
22	Outgoing	Delivery Style	✓
23	Serious	Delivery Style	✓
24	Arrogant	Delivery Style	✓
25	Elderly	Delivery Style	✓
26	Shouting	Delivery Style	✓
27	Sarcastic	Delivery Style	✓
28	Stuttering	Delivery Style	✓
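
The tag rules above can be sketched as a small shell helper that builds the request body. This is an illustration under stated assumptions, not the official client: the function names are hypothetical, the tag is passed through exactly as written in the table (the exact string the API expects is not stated here), and the instruction parameter is assumed to take a free-text string.

```shell
#!/bin/sh
# Sketch only: build a /v1/audio/speech request body with emotion control.
# Assumptions (not confirmed by this guide): the tag string is sent as shown
# in the table, and "instruction" accepts a free-text string.

# Escape backslashes and double quotes so single-line text is JSON-safe.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Usage: tts_emotion_payload MODEL TEXT VOICE_ID EMOTION
tts_emotion_payload() {
  model=$1
  text=$(json_escape "$2")
  voice=$(json_escape "$3")
  emotion=$(json_escape "$4")
  case $model in
    step-tts-2)
      # step-tts-2 takes emotion tags in voice_label.emotion.
      printf '{"model":"%s","input":"%s","voice":"%s","voice_label":{"emotion":"%s"}}\n' \
        "$model" "$text" "$voice" "$emotion"
      ;;
    stepaudio-2.5-tts)
      # stepaudio-2.5-tts does not support voice tags; use instruction instead.
      printf '{"model":"%s","input":"%s","voice":"%s","instruction":"%s"}\n' \
        "$model" "$text" "$voice" "$emotion"
      ;;
    *)
      echo "unsupported model: $model" >&2
      return 1
      ;;
  esac
}

# Example (needs network access and STEP_API_KEY):
#   tts_emotion_payload step-tts-2 "Great news!" lively-girl Happy |
#     curl --fail --location 'https://api.stepfun.ai/v1/audio/speech' \
#       --header 'Content-Type: application/json' \
#       --header "Authorization: Bearer $STEP_API_KEY" \
#       --data @- --output step.mp3
```

Branching on the model name keeps the two control mechanisms apart, so a voice_label is never sent to a model that does not support it. The escaping helper handles only single-line text; multi-line input would need a fuller JSON encoder.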

Output Format

StepFun TTS models support audio output in wav, mp3, flac, opus, and pcm formats. The default format is mp3. You can choose the format that best suits your use case.
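
As a rough sketch of how the format choice might be handled in a script, the helper below checks a requested format against the list above and derives a matching output file name. The function name is hypothetical, and the response_format field in the commented example is an assumption: this guide does not name the request parameter that selects the format.

```shell
#!/bin/sh
# Sketch only: pick an output file name for a chosen audio format.

# Usage: tts_output_name BASENAME [FORMAT]
# Prints BASENAME.FORMAT; FORMAT falls back to mp3, the documented default.
tts_output_name() {
  base=$1
  format=${2:-mp3}
  case $format in
    wav|mp3|flac|opus|pcm)
      printf '%s.%s\n' "$base" "$format"
      ;;
    *)
      echo "unsupported format: $format" >&2
      return 1
      ;;
  esac
}

# Example (needs network access and STEP_API_KEY; "response_format" is an
# assumed parameter name, not confirmed by this guide):
#   out=$(tts_output_name step wav) &&
#     curl --fail --location 'https://api.stepfun.ai/v1/audio/speech' \
#       --header 'Content-Type: application/json' \
#       --header "Authorization: Bearer $STEP_API_KEY" \
#       --data '{"model":"step-tts-2","input":"Hello.","voice":"lively-girl","response_format":"wav"}' \
#       --output "$out"
```

Rejecting unknown formats before the request avoids saving a response under a misleading file extension.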

Output Languages

StepFun TTS models support generating audio in Chinese, English, mixed Chinese-English, and Japanese.

FAQ

Do I own the audio I generate?
Yes. You own the audio you create. However, we recommend informing your end users that the audio was generated by AI so they are aware of its nature.

How do I adjust the volume of the generated audio?
You can set the volume parameter when calling the generation API. Valid values range from 0.1 to 2.0, representing 10% volume to 200% volume.

How do I adjust the speaking rate of the generated audio?
You can set the speed parameter when calling the generation API. Valid values range from 0.5 to 2.0, representing half speed to double speed.
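
The volume and speed limits given in the FAQ can be checked locally before a request is sent. The following is a minimal sketch with hypothetical function names. It assumes both parameters are sent as plain JSON numbers, which this guide does not state.

```shell
#!/bin/sh
# Sketch only: validate volume and speed against the documented ranges.

# Usage: in_range VALUE MIN MAX
# Succeeds when VALUE is a plain decimal number within [MIN, MAX].
in_range() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?$' || return 1
  awk -v v="$1" -v lo="$2" -v hi="$3" \
    'BEGIN { exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0) }'
}

check_volume() { in_range "$1" 0.1 2.0; }  # 10% to 200% volume
check_speed()  { in_range "$1" 0.5 2.0; }  # half speed to double speed

# Usage: tts_tuning_fields VOLUME SPEED
# Prints the two JSON fields, or fails if either value is out of range.
tts_tuning_fields() {
  check_volume "$1" || { echo "volume must be between 0.1 and 2.0: $1" >&2; return 1; }
  check_speed "$2" || { echo "speed must be between 0.5 and 2.0: $2" >&2; return 1; }
  printf '"volume":%s,"speed":%s\n' "$1" "$2"
}
```

For example, tts_tuning_fields 1.5 0.8 prints "volume":1.5,"speed":0.8, which can be spliced into the request body next to model, input, and voice.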
