StepAudio 2.5 TTS is a speech synthesis model with true vocal performance capabilities, and the first to integrate contextual understanding into the entire speech generation pipeline. By combining dual-level control (Global Context plus Inline Context) with zero-shot voice cloning, it lets AI perform text rather than merely read it.
Online Demo
Visit the official demo page to experience the model’s capabilities firsthand.
API Quick Start
Jump to minimal runnable curl / WebSocket call examples.
Step Plan Integration
Step Plan subscribers can use this model directly.
Key Information
Model Type
Contextual TTS
Text-to-speech with contextual understanding
Max Input per Request
1,000 characters
Instruction Limit
200 characters
Global context natural language guidance
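The two limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the official SDK: the helper name `check_limits` is hypothetical, and it assumes the limits are counted in Unicode characters (Python's `len`) and that parenthesized inline instructions count toward the 1,000-character input limit. The source does not state either point, so the sketch takes the conservative reading.

```python
# Limits taken from the Key Information section.
MAX_INPUT_CHARS = 1000        # Max Input per Request
MAX_INSTRUCTION_CHARS = 200   # Instruction Limit (Global Context)


def check_limits(text: str, instruction: str = "") -> list[str]:
    """Return human-readable limit violations; an empty list means OK.

    Assumption: characters are counted with len(), and inline
    instructions in parentheses are counted as part of the input.
    """
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append(
            f"input is {len(text)} characters; limit is {MAX_INPUT_CHARS}"
        )
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        problems.append(
            f"instruction is {len(instruction)} characters; "
            f"limit is {MAX_INSTRUCTION_CHARS}"
        )
    return problems
```

Longer passages would need to be split into several requests; where to split (sentence or paragraph boundaries) is left to the caller.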
Core Capabilities
🎭 Dual-level Context Control
Global Context sets the overall tone for an entire passage; Inline Context uses parentheses () for per-sentence, fine-grained control of emotion, pauses, and breathing. Natural language descriptions replace tag matching, so complex intents such as “restrained sadness, no sobbing, with a slight tremble” are supported.
🎨 Zero-shot Voice Cloning
Clone any voice from just a 3-second reference audio clip, with full Global / Inline context control inherited. Not limited by fixed voice libraries or preset characters.
🎙️ Every Word Is a Performance
Comprehensive improvements in pauses, stress, rhythm, and tonal transitions. Upgraded underlying voice quality delivers clear, natural output free from the “plastic feel” and “AI sound” of traditional TTS.
API Endpoints
Non-streaming TTS
POST /v1/audio/speech
Generates a complete audio file in a single request, with the best audio quality.
Streaming TTS
WebSocket /v1/realtime/audio
Low-latency streaming playback, ideal for conversational and real-time scenarios.
Voice Clone Preview
POST /v1/audio/voices/preview
Quickly preview synthesis results from a reference audio sample without creating a permanent voice asset.
Pricing
See the pricing page for current rates on contextual TTS and voice cloning.
Quick Start
There are two key entry points: the instruction parameter defines the overall expressive tone (Global Context), while parentheses () inside the input / text field insert per-sentence instructions (Inline Context). Content inside parentheses is treated as instructions only and is not spoken aloud.
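To preview which part of a script would be spoken and which part would be treated as Inline Context, the parenthesized spans can be separated locally. This is only an illustrative sketch: `split_inline_context` is a hypothetical helper, not the model's own parser, and it assumes ASCII parentheses with no nesting.

```python
import re

# Assumption: inline instructions use ASCII "(" and ")" and are not nested.
_INLINE = re.compile(r"\(([^()]*)\)")


def split_inline_context(text: str) -> tuple[str, list[str]]:
    """Split a script into (spoken_text, inline_instructions)."""
    # Collect the content of every parenthesized span, in order.
    instructions = [m.strip() for m in _INLINE.findall(text)]
    # Remove the spans, then collapse the whitespace they leave behind.
    spoken = _INLINE.sub("", text)
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return spoken, instructions


script = (
    "(softly, after a short pause) I never thought you'd come back. "
    "(voice trembling) Not after all this time."
)
spoken, notes = split_inline_context(script)
print(spoken)  # I never thought you'd come back. Not after all this time.
print(notes)   # ['softly, after a short pause', 'voice trembling']
```

A preview like this can help catch cases where ordinary parenthetical text in a script would be interpreted as an instruction and silently dropped from the audio.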
- Non-streaming (curl)
- Streaming (WebSocket)
- Voice Clone Preview (curl)
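As a rough Python counterpart to the non-streaming call, the sketch below builds a POST request to /v1/audio/speech with an instruction (Global Context) and an input containing parenthesized Inline Context. Only the path and the `instruction` and `input` field names come from this page. The host, the `model` and `voice` fields and their values, the Bearer authorization header, and the assumption that the response body is raw audio bytes are all placeholders that should be checked against the API reference.

```python
import json
import urllib.request

# Hypothetical placeholders: host, model id, and voice id are not given here.
BASE_URL = "https://api.example.com"
MODEL_ID = "your-model-id"
VOICE_ID = "your-voice-id"


def build_speech_request(api_key: str, text: str, instruction: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming TTS request."""
    body = {
        "model": MODEL_ID,           # assumed field name
        "voice": VOICE_ID,           # assumed field name
        "input": text,               # may contain (inline instructions)
        "instruction": instruction,  # Global Context, up to 200 characters
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, text: str, instruction: str, out_path: str) -> None:
    """Send the request and save the response body to a file.

    Assumption: the response body is the audio file itself.
    """
    req = build_speech_request(api_key, text, instruction)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any audio is generated.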
instruction defines the overall context, while parenthesized text in input serves as inline instructions. The model synthesizes emotion, pauses, breathing, and subtext together.
Related Resources
Audio Models Overview
View all TTS models and their capabilities.
Pricing Details
View pricing for speech, text, image, and all other models.
Voice List
Browse available voices and their supported parameters.