All Audio Models - StepFun Documentation

Model overview

StepFun audio models provide text-to-speech (TTS), voice cloning, and automatic speech recognition (ASR) APIs for audio-driven experiences. Common use cases include smart customer service, audiobooks, A/V production, and game NPCs. We currently offer the following models; see the guide for details:

Models

StepAudio 2.5 TTS

Contextual TTS — a speech synthesis model with true vocal performance capabilities. It is the first to integrate contextual understanding into the entire speech generation pipeline, letting AI not just read text, but perform it. Supports natural language descriptions for global context setting and fine-grained in-text control, producing human-level speech with natural breathing, dynamic emphasis, and emotional arcs.

Key improvements

Dual-level context control (everyone can be a voice director):
- Global Context sets the overall mood and character relationships for an entire passage, while Inline Context fine-tunes how each word and phrase is delivered.
- Instead of matching against a fixed tag set, you describe the voice expression you want in natural language. The model interprets emotion, style, scene, and speaking state from that description.
- Supports complex, mixed, layered expression intents such as “restrained sadness, no sobbing, with a slight tremble” or “tentatively playful, not too clingy, with a hint of stubbornness”, enabling more open, continuous, and context-aware emotion control.

Zero-shot Clone (any voice, fully controllable):
- About 3 seconds of reference audio is enough to clone a voice precisely, and the cloned voice inherits full Global / Inline context control, so you are not limited to fixed voice libraries or preset characters.

Every word is a performance, every sentence authentic, zero AI sound:
- Comprehensive improvements across pauses, stress, rhythm, and tonal transitions. Synthesized speech has natural breathing, dynamic emphasis, and emotional flow.
- Upgraded underlying voice quality: output is clearer and more natural, free from the “plastic feel” and “AI sound” common in traditional speech synthesis.

Target scenarios

Audiobooks, short drama dubbing, ad narration, emotional storytelling, content remixing, and other scenarios that demand high vocal expressiveness.

step-tts-2

A streamlined, efficient TTS model that replaces traditional speaker/emotion embedding modules with a pure next-token-prediction (NTP), end-to-end speech generation approach, significantly reducing system complexity. It supports a broad range of voices, emotions, styles, and languages, and improves emotion and style controllability, emotional expressiveness, and voice cloning quality.

Key improvements

11 emotions, 17 styles, 3 languages, with precise control and natural prosody:
- 11 built-in emotions and 17 built-in styles cover a wide range, from gentle and sweet to serious and bold. Tone, prosody, and pacing closely follow natural human expression.
- Well suited to dubbing and dialogue that require emotional depth.
- Supports Cantonese, Sichuan dialect, and Japanese.

10-second audio cloning, with emotion and style control at no extra cost:
- About 10 seconds of reference audio is enough to replicate a voice precisely, and all emotion and style controls remain available at no additional cost.
- Ideal for scenarios requiring cloned voices with multi-emotion delivery, such as short video dubbing, emotional chat, and marketing narration.

Accent-faithful voice cloning:
- The LLM-based architecture reproduces speaker accent details more accurately than similar products.
- Delivers a more authentic, natural voice interaction experience for live commerce scenarios, enhancing audience immersion and trust.

stepaudio-2.5-asr

StepFun’s new-generation streaming ASR model, built on a 4B-parameter MTP architecture, for streaming and near-realtime transcription. Features:

- Supports one-shot audio submission over HTTP + SSE, with text streamed back incrementally.
- Fits realtime captions, voice input, meeting transcription, and backend batch processing.
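
Incremental text delivered over SSE can be read with a small Server-Sent Events parser. The sketch below is a minimal, library-free illustration rather than StepFun's client: the function names `iter_sse_data` and `collect_transcript`, the JSON payload shape, and the `text` field are assumptions made for this example, so check the ASR guide for the actual event schema.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete Server-Sent Event.

    `lines` is any iterable of text lines, for example the decoded lines
    of a streaming HTTP response body.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # The SSE format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event not terminated by a blank line is dropped.


def collect_transcript(lines: Iterable[str], text_field: str = "text") -> str:
    """Concatenate incremental text pieces from a stream of SSE lines.

    ASSUMPTION: each event carries a JSON object whose `text_field` key
    holds the next piece of text. Payloads that are not JSON objects
    (for example an end-of-stream marker) are skipped.
    """
    parts: List[str] = []
    for payload in iter_sse_data(lines):
        try:
            event = json.loads(payload)
        except ValueError:
            continue
        if isinstance(event, dict):
            parts.append(str(event.get(text_field, "")))
    return "".join(parts)
```

Because the parser only needs an iterable of lines, it can be exercised against a recorded response before being wired to a live HTTP stream.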

stepaudio-2-asr-pro

A 32B-parameter ASR Pro model.

Usage limits

Max characters per request: TTS models support up to 1000 characters per call.
Output formats: wav, mp3, flac, opus; default is mp3.
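
Text longer than the per-call limit has to be split on the client before it is sent. The following is a minimal sketch of one way to do that, preferring sentence boundaries; the helper names `split_for_tts` and `resolve_format` are hypothetical, and it assumes the limit counts Unicode characters the way Python's `len` does, which the limits above do not specify.

```python
MAX_CHARS = 1000  # per-request limit for TTS models, as stated above
OUTPUT_FORMATS = ("wav", "mp3", "flac", "opus")
DEFAULT_FORMAT = "mp3"
SENTENCE_ENDINGS = ".!?。！？\n"


def split_for_tts(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Prefers to cut after sentence-ending punctuation, then at a space,
    and only falls back to a hard cut when neither is available.
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        window = remaining[:limit]
        cut = max(window.rfind(ch) for ch in SENTENCE_ENDINGS)
        if cut <= 0:
            cut = window.rfind(" ")
        if cut <= 0:
            cut = limit - 1  # hard cut at the limit
        chunks.append(remaining[:cut + 1].strip())
        remaining = remaining[cut + 1:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


def resolve_format(fmt=None):
    """Return a supported output format, defaulting to mp3."""
    if fmt is None:
        return DEFAULT_FORMAT
    fmt = fmt.lower()
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    return fmt
```

Each chunk can then be submitted as its own request, and the resulting audio segments joined in order.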

Quickstart

StepAudio 2.5 TTS Overview

Explore the contextual TTS model with dual-level context control and zero-shot voice cloning.

StepAudio 2.5 ASR Overview

Explore the new-generation ASR series with 4B MTP and 32B Pro models.

Voice interaction developer guide

Get started with speech generation, voice cloning, and automatic speech recognition.
