Model overview

Stepfun audio models provide text-to-speech and voice cloning APIs for building audio-driven experiences. Common use cases include intelligent customer service, audiobooks, audio and video production, and game NPCs. One model is currently available; see the developer guide for details.

Models

step-tts-2

Next-generation TTS model with voice cloning support for natural voice interaction.

Usage limits

  1. Max characters per request: step-tts-2 accepts up to 1000 characters of input text per call.
  2. Output formats: wav, mp3, flac, and opus. The default is mp3.
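The limits above can be enforced on the client before any request is sent. The following is a minimal sketch, not part of the Stepfun SDK: the helper names `resolve_format` and `split_text` are hypothetical, and it assumes the 1000-character limit matches Python's `len()` count (how the service counts characters is not stated here).

```python
MAX_CHARS = 1000  # per-request limit for step-tts-2, taken from the list above
SUPPORTED_FORMATS = {"wav", "mp3", "flac", "opus"}
DEFAULT_FORMAT = "mp3"
SENTENCE_ENDINGS = ".!?。！？"


def resolve_format(fmt=None):
    """Return a supported output format, falling back to the documented default."""
    if fmt is None:
        return DEFAULT_FORMAT
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    return fmt


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters (hypothetical helper).

    Cuts after the last sentence-ending mark inside each window when there is
    one, otherwise after the last space, otherwise exactly at the limit.
    """
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        window = rest[:limit]
        # Index just past the last sentence-ending mark; 0 if none was found.
        cut = max(window.rfind(ch) for ch in SENTENCE_ENDINGS) + 1
        if cut <= 0:
            cut = window.rfind(" ") + 1  # fall back to a word boundary
        if cut <= 0:
            cut = limit  # no boundary at all: hard cut
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks
```

Each chunk returned by `split_text` can then be sent as its own request, with the resulting audio segments concatenated in order. Cutting at sentence boundaries is a design choice meant to keep pauses natural; whether it affects prosody across chunks depends on the model and is not covered here.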

Quickstart

Voice interaction developer guide

Get started with speech generation, voice cloning, and automatic speech recognition.