Text-to-Speech
This API allows you to generate audio using our Text-to-Speech (TTS) model.
Endpoint
POST https://api.stepfun.ai/v1/audio/speech
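If you are calling the endpoint without an SDK, the request is a plain JSON POST. Below is a minimal sketch using Python's requests library; it assumes standard Bearer-token authentication and that the response body contains the raw audio bytes, and the parameter values are illustrative (the parameters themselves are documented under Request body).

import requests

API_KEY = "STEP_API_KEY"  # replace with your StepFun API key

resp = requests.post(
    "https://api.stepfun.ai/v1/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "step-tts-2",
        "input": "Hello from StepFun.",
        "voice": "lively-girl",  # illustrative voice name, reused from the example below
        "response_format": "mp3",
    },
    timeout=60,
)
resp.raise_for_status()

# Assumption: the response body is the encoded audio file itself.
with open("speech.mp3", "wb") as f:
    f.write(resp.content)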
Request body
- model (string, required)
  The ID of the model to use. Currently, only step-tts-2 is supported for the overseas region.
- input (string, required)
  The text to generate audio for. The maximum length is 10,000 characters.
- voice (string, required)
  The voice to use for generation. Supports both official voices and custom cloned voices.
- response_format (string, optional)
  The audio format for the returned output. Supported formats: mp3 (default), opus, aac, flac, wav, pcm.
- speed (float, optional)
  The speed of the generated audio. Valid range is 0.5 to 2.0. Default is 1.0.
- volume (float, optional)
  The volume of the generated audio. Valid range is 0.1 to 2.0, with a default value of 1.0. A value of 0.1 reduces the volume to 10%, while 2.0 increases it to 200%.
- voice_label (object, optional)
  Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time.
  - language (string, optional)
    Language. Supports Cantonese, Sichuanese, and Japanese. If not specified, the system automatically determines whether the input text is English or Chinese.
  - emotion (string, optional)
    Emotion. Supports up to 11 options such as Happy, Angry, and more.
  - style (string, optional)
    Style. Supports up to 17 speaking rates or delivery styles.
- sample_rate (integer, optional)
  The sampling rate. Supported values: 8000, 16000, 22050, 24000. Default is 24000. Higher rates improve quality but increase file size.
- pronunciation_map (object array, optional)
  Defines pronunciation rules to annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by the numbers 1–5.
  - tone (string, required)
    Specific pronunciation mapping rules, separated by /. Example: ["omg/oh my god"].
- stream_format (string, optional)
  Streaming return mode. Supported values: sse, audio. Default is audio, which returns the audio directly. When sse is specified, audio is returned via Server-Sent Events (SSE). A sketch for streaming the response follows the example below.
Response
The generated audio file, returned in the format specified by response_format (mp3 by default).
Examples
from pathlib import Path

from openai import OpenAI

speech_file_path = Path("step-tts.mp3")

client = OpenAI(
    api_key="STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

response = client.audio.speech.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
    extra_body={
        "volume": 1.0,  # StepFun-specific parameters are passed via extra_body
        "voice_label": {
            "language": "Cantonese",  # set only one of language / emotion / style
        },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)

response.stream_to_file(speech_file_path)
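To avoid buffering the whole file in memory, you can also stream the audio to disk as it is generated. The sketch below uses the OpenAI SDK's with_streaming_response helper, which is generic SDK functionality rather than a documented StepFun feature, and keeps stream_format at its default audio mode; the output file name is illustrative.

from openai import OpenAI

client = OpenAI(
    api_key="STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)

# Write audio chunks to disk as they arrive instead of waiting for the full response.
with client.audio.speech.with_streaming_response.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
) as response:
    response.stream_to_file("step-tts-stream.mp3")

When stream_format is set to sse, the same endpoint instead returns Server-Sent Events, which require an SSE-aware client to consume; that mode is not shown here.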