# All Audio Models

## Model overview

StepFun audio models provide text-to-speech (TTS), voice cloning, and automatic speech recognition (ASR) APIs for audio-driven experiences. Common use cases include smart customer service, audiobooks, audio/video production, and game NPCs.

The following models are currently available; the guides linked under Quickstart cover each one in detail.

## Models

### StepAudio 2.5 TTS

StepAudio 2.5 TTS is a contextual speech synthesis model built for expressive vocal performance. It is the first to integrate **contextual understanding** into the entire speech generation pipeline, so the model does not just read text but *performs* it. It supports **natural language descriptions for global context setting and fine-grained in-text control**, and produces human-level speech with natural breathing, dynamic emphasis, and emotional arcs.

#### Key improvements

1. **Dual-level context control — everyone can be a voice director**:
   * Global Context sets the overall mood and character relationships for an entire passage, while Inline Context fine-tunes how individual words and phrases are delivered.
   * Instead of matching against a fixed set of tags, you describe the vocal expression you want in natural language. The model interprets emotion, style, scene, and speaking state from that description and renders them accordingly.
   * Supports complex, mixed, layered expression intents such as "restrained sadness, no sobbing, with a slight tremble" or "tentatively playful, not too clingy, with a hint of stubbornness", enabling more open, continuous, and context-aware emotion control.
2. **Zero-shot Clone — any voice fully controllable, cloned at will**:
   * About 3 seconds of reference audio is enough to clone a voice precisely, and the cloned voice inherits full Global and Inline context control. You are not limited to a fixed voice library or preset characters.
3. **Every word is a performance, every sentence authentic, zero AI sound**:
   * Comprehensive improvements across pauses, stress, rhythm, and tonal transitions. Synthesized speech has natural breathing, dynamic emphasis, and emotional flow.
   * Upgraded underlying voice quality — output is clearer and more natural, free from the "plastic feel" and "AI sound" common in traditional speech synthesis.
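
As a rough illustration of the two levels, the sketch below keeps a passage-wide direction separate from per-line directions on the application side. The `Script` and `Line` classes, the `global_context` and `text` keys, and the parenthesized inline-note format are all hypothetical placeholders, not the documented request schema; check the StepAudio 2.5 TTS guide for the actual field names and inline syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Line:
    """One spoken line with an optional inline direction (hypothetical model)."""
    text: str
    direction: Optional[str] = None  # inline context: applies to this line only


@dataclass
class Script:
    """A passage with one global direction and any number of lines."""
    global_context: str  # global context: mood, scene, character relationships
    lines: List[Line] = field(default_factory=list)

    def render(self) -> Dict[str, str]:
        # ASSUMPTION: inline directions are written as a parenthesized note
        # in front of the text they affect. The real inline syntax may differ.
        body = " ".join(
            f"({line.direction}) {line.text}" if line.direction else line.text
            for line in self.lines
        )
        # ASSUMPTION: the key names below are placeholders, not real API fields.
        return {"global_context": self.global_context, "text": body}


script = Script(
    global_context="A quiet farewell at a train station",
    lines=[
        Line("I guess this is it.", "restrained sadness, no sobbing, with a slight tremble"),
        Line("Take care."),
    ],
)
payload = script.render()
```

Keeping the two levels as separate fields until the last moment makes it easy to reuse one global direction across many passages while varying only the inline notes.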

#### Target scenarios

Audiobooks, short drama dubbing, ad narration, emotional storytelling, content remixing, and other scenarios that demand high vocal expressiveness.

### step-tts-2

A streamlined, efficient TTS model that replaces traditional speaker and emotion embedding modules with a pure NTP (next-token prediction) end-to-end speech generation approach, significantly reducing system complexity. It supports a broad range of voices, emotions, styles, and languages, and further improves **emotion and style controllability**, **emotional expressiveness**, and **voice cloning quality**.

#### Key improvements

1. **11 emotions, 17 styles, 3 languages — precise control with natural prosody**:
   * Includes 11 built-in emotions and 17 styles, ranging from gentle and sweet to serious and bold. Tone, prosody, and pacing closely follow natural human expression.
   * Well suited to dubbing and dialogue that require emotional depth.
   * Supports Cantonese, Sichuan dialect, and Japanese.
2. **10-second audio, precise cloning with zero-cost emotion/style control**:
   * About 10 seconds of reference audio is enough to replicate a voice precisely, and all emotion and style controls work on the cloned voice at no additional cost.
   * Ideal for scenarios requiring cloned voices with multi-emotion delivery, such as short video dubbing, emotional chat, and marketing narration.
3. **Accent-faithful voice cloning**:
   * The LLM-based architecture reproduces a speaker's accent details more accurately than comparable products.
   * Delivers a more authentic, natural voice interaction experience for **live commerce** scenarios, enhancing audience immersion and trust.
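
A pre-flight check on reference clip length might look like the following sketch. It assumes the clip is an uncompressed WAV file readable by Python's standard `wave` module, and it treats the roughly 10-second figure above (and the roughly 3-second figure quoted for StepAudio 2.5 TTS) as guidelines rather than enforced minimums. The `GUIDELINE_SECONDS` table and both function names are illustrative, not part of any StepFun SDK.

```python
import wave

# Approximate reference lengths quoted on this page.
# ASSUMPTION: these are guidelines, not limits enforced by the API.
GUIDELINE_SECONDS = {"StepAudio 2.5 TTS": 3.0, "step-tts-2": 10.0}


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV clip given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def meets_guideline(source, model: str) -> bool:
    """True if the clip is at least as long as the guideline for `model`."""
    return wav_duration_seconds(source) >= GUIDELINE_SECONDS[model]
```

For example, `meets_guideline("reference.wav", "step-tts-2")` reports whether a local clip reaches the 10-second guideline; the file name here is only a placeholder.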

### stepaudio-2.5-asr

StepFun's new-generation streaming ASR model, based on a 4B-parameter MTP (multi-token prediction) architecture, for streaming and near-real-time transcription.

Features:

* Accepts the full audio in a single HTTP request and streams the transcript back incrementally over SSE (server-sent events).
* Suits real-time captions, voice input, meeting transcription, and backend batch processing.
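
On the client side, consuming an SSE response comes down to reading `data:` lines and joining the text increments. The sketch below parses the standard SSE framing from any iterable of decoded lines. The JSON payload shape and the `delta` key used in the example are assumptions for illustration only; the real event fields are defined in the stepaudio-2.5-asr guide.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from decoded text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    if buffer:
        # Lenient: flush an unterminated final event instead of discarding it.
        yield "\n".join(buffer)


def collect_text(lines: Iterable[str], key: str) -> str:
    """Join the text increments found under `key` in each JSON event.

    ASSUMPTION: each event carries a JSON object whose `key` field holds the
    next piece of transcript. Non-JSON payloads are skipped.
    """
    parts = []
    for data in iter_sse_data(lines):
        try:
            event = json.loads(data)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            parts.append(str(event.get(key, "")))
    return "".join(parts)


sample = [
    'data: {"delta": "Hel"}\n', "\n",
    ": keep-alive\n",
    'data: {"delta": "lo"}\n', "\n",
]
transcript = collect_text(sample, key="delta")
```

Because `iter_sse_data` only needs an iterable of strings, the same function works with a streamed HTTP response body, a file, or a test fixture like `sample`.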

### stepaudio-2-asr-pro

A 32B-parameter ASR Pro model.

## Usage limits

1. **Max characters per request**: TTS models support up to 1000 characters per call.
2. **Output formats**: wav, mp3, flac, opus; default is mp3.
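
Text longer than the per-call limit has to be split before synthesis. The sketch below breaks input at sentence boundaries where possible and falls back to a hard cut for any single sentence that exceeds the limit. It assumes the limit counts Unicode characters as Python's `len` does, which this page does not specify; `split_for_tts` and `resolve_format` are illustrative helper names.

```python
import re
from typing import List, Optional

MAX_CHARS = 1000  # per-call limit stated above
OUTPUT_FORMATS = ("wav", "mp3", "flac", "opus")
DEFAULT_FORMAT = "mp3"

# A sentence ends at Latin punctuation followed by whitespace,
# at CJK punctuation, or at the end of the text.
_SENTENCE = re.compile(r".+?(?:[.!?]+\s+|[。！？]+|$)", re.S)


def split_for_tts(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters."""
    if limit < 1:
        raise ValueError("limit must be positive")
    chunks: List[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current.strip():
            chunks.append(current.strip())
        current = ""

    for piece in _SENTENCE.findall(text.strip()):
        # Hard-cut any single sentence that is longer than the limit.
        while len(piece) > limit:
            flush()
            current = piece[:limit]
            flush()
            piece = piece[limit:]
        # Start a new chunk when adding this sentence would overflow.
        if len(current) + len(piece) > limit:
            flush()
        current += piece
    flush()
    return chunks


def resolve_format(fmt: Optional[str] = None) -> str:
    """Return a supported output format, defaulting to mp3."""
    chosen = (fmt or DEFAULT_FORMAT).lower()
    if chosen not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported format: {chosen!r}")
    return chosen
```

Each returned chunk can then be sent as its own request and the resulting audio segments concatenated in order.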

## Quickstart

<Columns cols={2}>
  <Card title="StepAudio 2.5 TTS Overview" href="/en/guides/models/stepaudio-2.5-tts">
    Explore the contextual TTS model with dual-level context control and zero-shot voice cloning.
  </Card>

  <Card title="StepAudio 2.5 ASR Overview" href="/en/guides/models/stepaudio-2.5-asr">
    Explore the new-generation ASR series with 4B MTP and 32B Pro models.
  </Card>

  <Card title="Voice interaction developer guide" href="/en/guides/developer/tts">
    Get started with speech generation, voice cloning, and automatic speech recognition.
  </Card>
</Columns>
