FAQ - StepFun Documentation

Audio models

What model should I use for text-to-speech?

For new projects, use stepaudio-2.5-tts, our flagship contextual TTS model with zero-shot voice cloning and natural-language control over emotion and style. Use step-tts-2 if you rely on tag-based voice or emotion control, or on preset voice libraries. See Audio Models for a full comparison.

Where can I find the TTS API parameters?

See Generate audio for the full request schema and examples.

What audio formats are supported?

Supported formats are wav, mp3, flac, opus, and pcm. The default is mp3.
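If you accept a format choice from users or config, it can help to validate it on the client before sending a request. The following is a minimal sketch, not part of the StepFun SDK: `resolve_format` and the constant names are hypothetical helpers, and only the format list and the mp3 default come from this FAQ.

```python
from typing import Optional

# Formats and default as listed in this FAQ.
SUPPORTED_FORMATS = ("wav", "mp3", "flac", "opus", "pcm")
DEFAULT_FORMAT = "mp3"


def resolve_format(requested: Optional[str] = None) -> str:
    """Return a supported format name, falling back to the default.

    Hypothetical helper: accepts values such as "WAV" or ".mp3" and
    normalizes them before checking against the supported list.
    """
    if requested is None:
        return DEFAULT_FORMAT
    fmt = requested.strip().lower().lstrip(".")
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported format {requested!r}; "
            f"expected one of {', '.join(SUPPORTED_FORMATS)}"
        )
    return fmt
```

Check Generate audio for the actual name of the request field that carries the format; this sketch deliberately does not assume one.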

Is there a limit on input length?

Yes. The maximum input length is 1,000 characters per request.
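For longer text, one common approach is to split the input into chunks that fit the limit and send one request per chunk. The sketch below is an illustration under assumptions, not an official utility: `split_for_tts` is a hypothetical helper, and it assumes the limit counts characters the way Python's `len()` does, which may differ from how the API counts them.

```python
import re

MAX_CHARS = 1000  # per-request limit stated in this FAQ


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries (punctuation followed by whitespace) and
    falls back to a hard cut for any single sentence longer than the limit.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    sentences = re.split(r"(?<=[.!?。！？])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A sentence longer than the limit cannot be kept whole: cut it.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio segments joined in order. Splitting at sentence boundaries tends to avoid audible cuts mid-phrase, although prosody may still differ slightly between segments.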
