StepAudio 2.5 TTS is a speech synthesis model with true vocal performance capabilities, and the first to integrate contextual understanding into the entire speech generation pipeline. It combines two levels of control, Global Context and Inline Context, with zero-shot voice cloning, so that AI does not just read text but performs it.

Online Demo

Visit the official demo page to experience the model’s capabilities firsthand.

API Quick Start

Jump to minimal runnable curl / WebSocket call examples.

Step Plan Integration

Step Plan subscribers can use this model directly.

Key Information

Model Type

Contextual TTS
Text-to-speech with contextual understanding

Max Input per Request

1,000 characters

Instruction Limit

200 characters
Natural-language guidance for the Global Context
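
The two limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: the helper name `check_limits` is hypothetical, and it assumes that "characters" corresponds to Python's `len()` (Unicode code points) and that parenthesized inline instructions count toward the input limit. Neither assumption is confirmed on this page.

```python
# Limits copied from the Key Information section.
MAX_INPUT_CHARS = 1000        # Max Input per Request
MAX_INSTRUCTION_CHARS = 200   # Instruction Limit (Global Context)


def check_limits(text: str, instruction: str = "") -> list[str]:
    """Return a list of problems; an empty list means both limits are met.

    Assumption: length is measured with len(), i.e. in Unicode code points.
    The service may count characters differently.
    """
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append(
            f"input has {len(text)} characters, limit is {MAX_INPUT_CHARS}"
        )
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        problems.append(
            f"instruction has {len(instruction)} characters, "
            f"limit is {MAX_INSTRUCTION_CHARS}"
        )
    return problems
```

Returning a list rather than raising lets a caller report both violations at once, for example when validating a form that collects the text and the instruction together.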

Core Capabilities

🎭 Dual-level Context Control

Global Context sets the overall tone for an entire passage; Inline Context uses parentheses () for fine-grained, per-sentence control of emotion, pauses, and breathing. Natural-language descriptions replace tag matching, so complex intents such as “restrained sadness, no sobbing, with a slight tremble” are supported.

🎨 Zero-shot Voice Cloning

Clone any voice from a reference audio clip as short as 3 seconds. The cloned voice inherits full Global and Inline Context control and is not limited to fixed voice libraries or preset characters.

🎙️ Every Word Is a Performance

Comprehensive improvements in pauses, stress, rhythm, and tonal transitions. Upgraded underlying voice quality delivers clear, natural output free from the “plastic feel” and “AI sound” of traditional TTS.
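
To preview how a script divides into inline instructions and spoken text, the parenthesized spans can be separated locally. This is an illustrative sketch only: the function name `split_inline_context` is hypothetical, and it assumes ASCII parentheses with no nesting. The model's own parsing rules are not documented here and may differ.

```python
import re

# Matches one non-nested, ASCII-parenthesized span such as "(short gasp)".
_INLINE = re.compile(r"\(([^()]*)\)")


def split_inline_context(text: str) -> tuple[str, list[str]]:
    """Split text into (spoken_text, inline_instructions).

    Assumption: every "(...)" span is an inline instruction. This only
    approximates what the service does with the input.
    """
    instructions = [span.strip() for span in _INLINE.findall(text)]
    spoken = _INLINE.sub("", text)
    # Collapse the whitespace left behind where instructions were removed.
    spoken = " ".join(spoken.split())
    return spoken, instructions


spoken, instructions = split_inline_context(
    "(lowered voice) Hey... look at my phone. (short gasp) Am I seeing things?"
)
print(spoken)        # Hey... look at my phone. Am I seeing things?
print(instructions)  # ['lowered voice', 'short gasp']
```

A check like this is useful mainly for proofreading scripts, for instance to confirm that a literal parenthetical meant to be read aloud has not been written in a way the model would treat as an instruction.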

API Endpoints

Non-streaming TTS

POST /v1/audio/speech
Generate a complete audio file in a single request — best audio quality.

Streaming TTS

WebSocket /v1/realtime/audio
Low-latency streaming playback, ideal for conversational and real-time scenarios.

Voice Clone Preview

POST /v1/audio/voices/preview
Quickly preview synthesis results from a reference audio sample without creating a permanent voice asset.

Pricing

See the pricing page for current rates on contextual TTS and voice cloning.

Quick Start

The model has two control entry points. The instruction parameter sets the overall expressive tone (Global Context), while parentheses () inside the input / text field carry per-sentence instructions (Inline Context). Content inside parentheses is treated as an instruction only and is not spoken aloud.

curl https://api.stepfun.ai/v1/audio/speech \
  -H "Authorization: Bearer $STEP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stepaudio-2.5-tts",
    "voice": "cixingnansheng",
    "input": "(lowered voice) Hey... look at my phone. (short gasp) Am I seeing things? (feigning calm) ...Never mind, must be a scam text.",
    "instruction": "Voice is extremely tense, as if desperately holding back uncontrollable excitement; fast and halting pace with noticeable restraint"
  }' \
  --output step-tts-contextual.mp3

In this request, instruction defines the overall context and the parenthesized text in input serves as inline instructions. The model synthesizes emotion, pauses, breathing, and subtext together.
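
The same request can be issued from Python using only the standard library. This is a sketch under stated assumptions rather than an official client: the URL, headers, and JSON fields mirror the curl call above, the helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and the response body is assumed to be the raw audio bytes (as the curl `--output` flag suggests). Error handling and retries are omitted.

```python
import json
import os
import urllib.request

API_URL = "https://api.stepfun.ai/v1/audio/speech"


def build_speech_request(text, instruction, voice="cixingnansheng",
                         model="stepaudio-2.5-tts", api_key=None):
    """Build (but do not send) the POST request shown in the curl example."""
    # Fall back to the STEP_API_KEY environment variable used by the curl call.
    key = api_key if api_key is not None else os.environ.get("STEP_API_KEY", "")
    payload = {
        "model": model,
        "voice": voice,
        "input": text,               # may contain (inline instructions)
        "instruction": instruction,  # Global Context
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request, path):
    """Send the request and write the response body to path.

    Assumption: a successful response body is the audio file itself.
    """
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


# Example usage (performs a network call, so it is left commented out):
# req = build_speech_request(
#     "(lowered voice) Hey... look at my phone.",
#     "Voice is extremely tense; fast and halting pace",
# )
# synthesize_to_file(req, "step-tts-contextual.mp3")
```

Separating request construction from sending keeps the payload easy to inspect or log before any network traffic occurs.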

Audio Models Overview

View all TTS models and their capabilities.

Pricing Details

View pricing for speech, text, image, and all other models.

Voice List

Browse available voices and their supported parameters.