StepAudio 2.5 TTS - StepFun Documentation

StepAudio 2.5 TTS is a speech synthesis model built for vocal performance. StepFun describes it as the first model to integrate contextual understanding into the entire speech generation pipeline. It combines two levels of control, Global Context and Inline Context, with zero-shot voice cloning, so the model performs text rather than simply reading it.

Online Demo

Visit the official demo page to experience the model’s capabilities firsthand.

API Quick Start

Jump to minimal runnable curl / WebSocket call examples.

Step Plan Integration

Step Plan subscribers can use this model directly.

Key Information

Model Type

Contextual TTS
Text-to-speech with contextual understanding

Max Input per Request

1,000 characters

Instruction Limit

200 characters
Natural-language guidance for the Global Context (the instruction parameter)
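
The two limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the StepFun API: the helper name check_limits is hypothetical, and it assumes that a "character" is one Python string element and that inline parenthesized instructions count toward the input limit, neither of which this page confirms.

```python
MAX_INPUT_CHARS = 1000        # "Max Input per Request" from Key Information
MAX_INSTRUCTION_CHARS = 200   # "Instruction Limit" from Key Information


def check_limits(text: str, instruction: str = "") -> None:
    """Raise ValueError if either field exceeds its documented limit.

    Assumption: length is measured with len(), i.e. in Unicode code points,
    and inline (...) instructions are counted as part of the input.
    """
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(
            f"input has {len(text)} characters; limit is {MAX_INPUT_CHARS}"
        )
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError(
            f"instruction has {len(instruction)} characters; "
            f"limit is {MAX_INSTRUCTION_CHARS}"
        )
```

Failing early on the client avoids a round trip for requests that would likely be rejected; the service's own counting rules remain authoritative.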

Core Capabilities

🎭 Dual-level Context Control

Global Context sets the overall tone for an entire passage; Inline Context uses parentheses () for fine-grained, per-sentence control of emotion, pauses, and breathing. Natural-language descriptions replace tag matching, so complex intents such as “restrained sadness, no sobbing, with a slight tremble” can be expressed directly.
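
To see which parts of a text would act as inline instructions and which would be spoken, the parenthesized spans can be separated on the client. This is only an approximation for previewing input, not the service's parser: the function name split_inline is hypothetical, and the sketch assumes ASCII parentheses that are not nested.

```python
import re

# Assumption: inline instructions are non-nested spans in ASCII parentheses.
_INLINE = re.compile(r"\(([^()]*)\)")


def split_inline(text: str) -> tuple[list[str], str]:
    """Return (inline instructions, remaining text with instructions removed)."""
    instructions = [span.strip() for span in _INLINE.findall(text)]
    # Replace each instruction with a space, then collapse runs of whitespace.
    spoken = re.sub(r"\s+", " ", _INLINE.sub(" ", text)).strip()
    return instructions, spoken
```

For the input "(lowered voice) Hey... look at my phone. (short gasp) Am I seeing things?", this returns the instructions ["lowered voice", "short gasp"] and the remaining text "Hey... look at my phone. Am I seeing things?".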

🎨 Zero-shot Voice Cloning

Clone any voice from just a 3-second reference audio clip. Cloned voices inherit full Global and Inline Context control and are not limited to a fixed voice library or preset characters.

🎙️ Every Word Is a Performance

Pauses, stress, rhythm, and tonal transitions are all improved. Upgraded underlying voice quality delivers clear, natural output without the “plastic feel” and “AI sound” of traditional TTS.

API Endpoints

Non-streaming TTS

POST /v1/audio/speech
Generates a complete audio file in a single request and offers the best audio quality.

Streaming TTS

WebSocket /v1/realtime/audio
Low-latency streaming playback, ideal for conversational and real-time scenarios.

Voice Clone Preview

POST /v1/audio/voices/preview
Quickly preview synthesis results from a reference audio sample without creating a permanent voice asset.

Pricing

See the pricing page for current rates on contextual TTS and voice cloning.

Quick Start

Context control has two entry points. The instruction parameter sets the overall expressive tone (Global Context). Parentheses () inside the text to be synthesized (the input field in non-streaming requests, the text field in streaming messages) carry per-sentence instructions (Inline Context). Content inside parentheses is treated as an instruction only and is not spoken aloud.

The examples below cover three calls, in this order: non-streaming (curl), streaming (WebSocket), and voice clone preview (curl).

curl https://api.stepfun.ai/v1/audio/speech \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stepaudio-2.5-tts",
    "voice": "cixingnansheng",
    "input": "(lowered voice) Hey... look at my phone. (short gasp) Am I seeing things? (feigning calm) ...Never mind, must be a scam text.",
    "instruction": "Voice is extremely tense, as if desperately holding back uncontrollable excitement; fast and halting pace with noticeable restraint"
  }' \
  --output step-tts-contextual.mp3

instruction defines the overall context, while parenthesized text in input serves as inline instructions. The model synthesizes emotion, pauses, breathing, and subtext together.
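
The same non-streaming request can be built from Python using only the standard library. This is a sketch under assumptions rather than an official client: the function names build_speech_request and synthesize are hypothetical, and it assumes, based on the curl --output flag, that the response body is the raw audio file.

```python
import json
import urllib.request

API_URL = "https://api.stepfun.ai/v1/audio/speech"


def build_speech_request(api_key, text, instruction, voice="cixingnansheng"):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": "stepaudio-2.5-tts",
        "voice": voice,
        "input": text,               # may contain inline (...) instructions
        "instruction": instruction,  # Global Context
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request, out_path):
    """Send the request and write the response body to out_path.

    Assumption: the response body is the audio file itself.
    """
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made. With an API key read from the STEP_API_KEY environment variable, passing the built request and a file name such as step-tts-contextual.mp3 to synthesize would reproduce the curl call.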

Streaming (WebSocket) connection URL:

wss://api.stepfun.ai/v1/realtime/audio?model=stepaudio-2.5-tts

After connecting, send tts.create to start a session with a global instruction:

{
  "type": "tts.create",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "voice_id": "cixingnansheng",
    "response_format": "wav",
    "sample_rate": 24000,
    "instruction": "Ice-cold tone, strong pressure, slightly slow pace"
  }
}

Then send text with inline instructions via tts.text.delta:

{
  "type": "tts.text.delta",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "text": "(excited) The weather is great today, and I want to learn about StepFun's large model technologies!"
  }
}
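
The two message types shown for the streaming API can be generated in order with a small helper. This sketch only builds the JSON strings; actually sending them requires a WebSocket client library, which is not shown. The function names are hypothetical, and this page does not state how the session_id is obtained, how audio is returned, or how a session is closed, so those steps are left out.

```python
import json

WS_URL = "wss://api.stepfun.ai/v1/realtime/audio?model=stepaudio-2.5-tts"


def tts_create(session_id, voice_id, instruction,
               response_format="wav", sample_rate=24000):
    """Serialize a tts.create message carrying the global instruction."""
    return json.dumps({
        "type": "tts.create",
        "data": {
            "session_id": session_id,
            "voice_id": voice_id,
            "response_format": response_format,
            "sample_rate": sample_rate,
            "instruction": instruction,
        },
    }, ensure_ascii=False)


def tts_text_delta(session_id, text):
    """Serialize a tts.text.delta message; text may hold inline (...) instructions."""
    return json.dumps({
        "type": "tts.text.delta",
        "data": {"session_id": session_id, "text": text},
    }, ensure_ascii=False)


def session_messages(session_id, voice_id, instruction, chunks):
    """Yield tts.create first, then one tts.text.delta per text chunk."""
    yield tts_create(session_id, voice_id, instruction)
    for chunk in chunks:
        yield tts_text_delta(session_id, chunk)
```

Each yielded string would be sent as one text frame over the connection opened at WS_URL.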

Voice clone preview request:

curl https://api.stepfun.ai/v1/audio/voices/preview \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stepaudio-2.5-tts",
    "file_id": "file-Ckyl3cV09A",
    "text": "StepFun intelligence, amplifying every possibility tenfold",
    "sample_text": "Nice weather today",
    "instruction": "Gentle tone, slightly slow pace"
  }'

The voice clone preview endpoint generates a preview audio clip only; it does not create a permanent voice asset.

Related Resources

Audio Models Overview

View all TTS models and their capabilities.

Pricing Details

View pricing for speech, text, image, and all other models.

Voice List

Browse available voices and their supported parameters.
