
Text-to-Speech

This API allows you to generate audio using our Text-to-Speech (TTS) model.

Endpoint

POST https://api.stepfun.ai/v1/audio/speech

Request body

  • model string required
    The ID of the model to use. Currently, only step-tts-2 is supported for the overseas region.

  • input string required
    The text to generate audio for. The maximum length is 10,000 characters.

  • voice string required
    The voice to use for generation. Supports both official voices and custom cloned voices.

  • response_format string optional
    The audio format for the returned output. Supported formats: mp3 (default), opus, aac, flac, wav, pcm.

  • speed float optional
    The speed of the generated audio. Select a value from 0.5 to 2.0. Default is 1.0.

  • volume float optional
    The volume of the generated audio. The valid range is 0.1 to 2.0, with a default value of 1.0. 0.1 reduces the volume to 10%, while 2.0 increases it to 200%.

  • voice_label object optional
    Voice tags. Required when using a custom voice. Only one of language, emotion, or style can be set at a time.

    • language string optional
      Language. Supports Cantonese, Sichuanese, and Japanese. If not specified, the system automatically detects whether the input text is English or Chinese.
    • emotion string optional
      Emotion. Supports up to 11 options, such as Happy and Angry.
    • style string optional
      Speaking style. Supports up to 17 speaking rates or delivery styles.
  • sample_rate integer optional
    The sampling rate. Supports 8000, 16000, 22050, 24000. Default is 24000. Higher rates improve quality but increase file size.

  • pronunciation_map object optional
    Defines pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by the numbers 1–5.

    • tone string array required
      Pronunciation mapping rules, each written as a source and its reading separated by /. Example: ["omg/oh my god"].
  • stream_format string optional
    Streaming mode. Supported values: audio (default), sse. With audio, the binary audio is returned directly; with sse, the audio is returned incrementally via Server-Sent Events (SSE).
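
The documented constraints on the request body can be checked client-side before sending. The sketch below is illustrative (the helper name and structure are not part of the API); only the field names, allowed values, and ranges come from the parameter list above:

```python
# Client-side validation sketch for the /v1/audio/speech request body.
# Field names, defaults, and ranges follow the parameter list above;
# the helper itself is illustrative, not part of the API.

ALLOWED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 24000}

def build_tts_payload(
    model: str,
    input_text: str,
    voice: str,
    response_format: str = "mp3",
    speed: float = 1.0,
    volume: float = 1.0,
    sample_rate: int = 24000,
) -> dict:
    """Build a request body for POST /v1/audio/speech, enforcing documented limits."""
    if len(input_text) > 10_000:
        raise ValueError("input must be at most 10,000 characters")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    return {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "volume": volume,
        "sample_rate": sample_rate,
    }

payload = build_tts_payload("step-tts-2", "Hello!", "lively-girl", speed=1.25)
```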

Response

The generated audio, in the format specified by response_format.

Examples

from pathlib import Path

from openai import OpenAI

speech_file_path = Path("step-tts.mp3")
client = OpenAI(
    api_key="STEP_API_KEY",
    base_url="https://api.stepfun.ai/v1",
)
response = client.audio.speech.create(
    model="step-tts-2",
    voice="lively-girl",
    input="StepFun is building the next generation of AGI.",
    extra_body={
        "volume": 1.0,  # volume is passed via extra_body
        "voice_label": {
            "language": "Cantonese",  # set only one of language / emotion / style
        },
        "pronunciation_map": {
            "tone": [
                "LOL/laugh out loudly",
            ],
        },
    },
)
response.stream_to_file(speech_file_path)
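When stream_format is set to sse, the response arrives as Server-Sent Events rather than a single audio body. The sketch below implements only the generic SSE wire framing (data: lines accumulate into one event, a blank line terminates it); the schema of each event's payload is not specified above, so decoding it into audio bytes is left as a placeholder:

```python
from typing import Iterable, Iterator

def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event.

    Generic SSE framing only: consecutive 'data:' lines accumulate into
    one event, and a blank line terminates the event. How each payload
    is decoded (e.g. into audio bytes) depends on the service.
    """
    buf: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            buf.append(line[len("data:"):].lstrip(" "))
        elif line == "" and buf:
            yield "\n".join(buf)
            buf = []
    if buf:  # stream ended without a trailing blank line
        yield "\n".join(buf)

# Example with a synthetic stream:
events = list(iter_sse_data(["data: chunk-1", "", "data: chunk-2", ""]))
# events == ["chunk-1", "chunk-2"]
```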