Audio Models

Model overview

Stepfun audio models provide text-to-speech (TTS) and voice cloning APIs for building audio-driven experiences. Common use cases include customer-service assistants, audiobooks, audio and video production, and game NPCs.

One audio model is currently available; see the audio synthesis guide for details:

step-tts-2
Next-generation TTS model with voice cloning support for natural voice interaction.

Usage limits

Max characters per request: step-tts-2 supports up to 1000 characters per call.
Output formats: wav, mp3, flac, opus; default is mp3.
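
Text longer than the per-request limit has to be split before it is sent. The sketch below shows one way to do that on the client side and to validate the requested output format against the list above. The helper names `split_text` and `resolve_format` are hypothetical and not part of any Stepfun SDK, and the code assumes the limit counts Unicode characters as Python's `len()` does, which this page does not confirm.

```python
# Limits taken from the "Usage limits" section above.
MAX_CHARS = 1000
SUPPORTED_FORMATS = ("wav", "mp3", "flac", "opus")
DEFAULT_FORMAT = "mp3"


def resolve_format(fmt=None):
    """Return a supported output format, falling back to the default (mp3)."""
    if fmt is None:
        return DEFAULT_FORMAT
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    return fmt


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Prefers to break at a space; falls back to a hard cut when a run of
    text longer than `limit` contains no space (assumption: a hard cut is
    acceptable for such input).
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the chunk within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut at the limit
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each chunk returned by `split_text` can then be submitted as a separate request, and the resulting audio segments concatenated in order. Splitting at sentence boundaries rather than at arbitrary spaces would likely give more natural prosody; the space-based split is used here only to keep the sketch short.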

Quickstart

Audio synthesis guide