Audio Models
Model overview
Stepfun audio models use leading speech-generation technologies to provide text-to-speech and voice cloning APIs for audio-driven experiences. Common use cases include smart customer service, audiobooks, A/V production, and game NPCs.
We currently offer the following model; see the guide for details:
step-tts-2
Next-generation TTS model with voice cloning support for natural voice interaction.
Usage limits
- Max characters per request: step-tts-2 supports up to 1000 characters per call.
- Output formats: wav, mp3, flac, opus; default is mp3.
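The limits above can be checked on the client before sending a request. The following is a minimal sketch, not part of the Stepfun SDK: the names MAX_CHARS, SUPPORTED_FORMATS, check_format, and split_text are hypothetical helpers introduced for illustration, and it assumes that "characters" corresponds to Python's len() on a str, which the page does not confirm. It makes no network calls; see the audio synthesis guide for the actual request format.

```python
# Hypothetical client-side helpers; none of these names come from the Stepfun API.
MAX_CHARS = 1000  # per-call limit stated for step-tts-2
SUPPORTED_FORMATS = ("wav", "mp3", "flac", "opus")
DEFAULT_FORMAT = "mp3"


def check_format(fmt=None):
    """Return a supported output format, falling back to the default (mp3)."""
    if fmt is None:
        return DEFAULT_FORMAT
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    return fmt


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Assumption: the service counts characters the way len() does.
    Prefers to break at a space; falls back to a hard cut when no
    space is available within the limit (e.g. unspaced CJK text).
    """
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space at or before the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard cut at the limit
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each chunk returned by split_text could then be sent as a separate synthesis request, with the resulting audio segments concatenated afterwards. Splitting at spaces avoids cutting words in half, but for natural-sounding output you may prefer to split at sentence boundaries instead.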
Quickstart
Audio synthesis guide