Audio Models

Model overview

Stepfun audio models build on speech-generation technology to provide text-to-speech and voice cloning APIs for audio-driven applications. Common use cases include smart customer service, audiobooks, audio/video production, and game NPCs.

We currently offer the following model; see the guide for details:

step-tts-2
Next-generation TTS model with voice cloning support for natural voice interaction.

Usage limits

  1. Max characters per request: step-tts-2 supports up to 1000 characters per call.
  2. Output formats: wav, mp3, flac, opus; default is mp3.

Quickstart


Audio synthesis guide