5 category views covering 19 public models
Recommended models
All models
Text & reasoning
Vision
Audio
Reasoning / text (Recommended)
step-3.5-flash
Flagship reasoning
A flagship reasoning model built for agents. Its reasoning depth rivals leading closed-source models while also delivering ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.
Reasoning, Tool Calling, Web Search
Reasoning / text (Recommended)
step-3
Multimodal reasoning
Combines visual perception with complex reasoning for cross-modal analysis and knowledge-intensive tasks.
Reasoning, Image Understanding
Text / coding (Recommended)
step-2-mini
High-speed text
An ultra-fast MFA-attention model that delivers strong general-task and coding performance at lower cost.
Text, Coding, Tool Calling
Vision (Recommended)
step-1o-turbo-vision
Recommended vision
The recommended vision model for image and video understanding, with a lighter footprint and faster output.
Image Understanding, Video Understanding, Text Output
Speech synthesis (Recommended)
stepaudio-2.5-tts
Contextual TTS
The first model to integrate contextual understanding into the full speech generation pipeline. Supports Global Context + Inline Context dual-level control via natural language descriptions for precise emotion and style control. Ideal for audiobooks, drama dubbing, ad narration, and other high-expressiveness scenarios.
Speech synthesis, Context control, Zero-shot Clone
Speech synthesis (Recommended)
step-tts-mini
Expressive TTS
A TTS model focused on emotional expressiveness and controllable style, suitable for multi-emotion voice output and cloning.
Speech Synthesis, Voice Cloning, Emotion & Style
Speech synthesis (Recommended)
step-tts-vivid
High-fidelity TTS
A speech-synthesis model optimized for highly human-like output and strong realism in outbound-call scenarios.
Speech Synthesis, Human-like Voice, Emotion & Style
Reasoning / text (Recommended)
step-3.5-flash
Flagship reasoning
A flagship reasoning model built for agents. Its reasoning depth rivals leading closed-source models while also delivering ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.
Reasoning, Tool Calling, Web Search
Reasoning / text (Recommended)
step-3
Multimodal reasoning
Combines visual perception with complex reasoning for cross-modal analysis and knowledge-intensive tasks.
Reasoning, Image Understanding
Text / coding (Recommended)
step-2-mini
High-speed text
An ultra-fast MFA-attention model that delivers strong general-task and coding performance at lower cost.
Text, Coding, Tool Calling
Text
step-2-16k
Trillion-parameter text
The production step-2 model with stronger overall quality, feel, and planning ability.
Text, General, Planning
The 8K variant of the step-1 family, tuned for short-context generation, Q&A, and tool-oriented workflows.
Text, Math, Coding
Text
step-1-32k
Mid-context text
The 32K step-1 variant extends the classic family with a larger window for longer conversations and document-heavy tasks.
Text, Math, Coding
Reasoning / text
step-r1-v-mini
Vision reasoning
A reasoning model for image understanding and deep thinking, with strong performance on vision, math, and coding tasks.
Reasoning, Image Understanding, Coding
Vision (Recommended)
step-1o-turbo-vision
Recommended vision
The recommended vision model for image and video understanding, with a lighter footprint and faster output.
Image Understanding, Video Understanding, Text Output
Vision
step-1o-vision-32k
Image understanding
A strong image-understanding model with text-and-image input and text-only output, positioned above the step-1v series in visual quality.
Image Understanding, Text Output
Vision
step-1v-8k
Image understanding 8K
The 8K member of the step-1v series for short-context image Q&A and visual analysis.
Image Understanding, Short context
Vision
step-1v-32k
Image understanding 32K
The 32K step-1v variant handles more images and longer histories for vision understanding tasks.
Image Understanding, Longer context
Image generation
step-1x-medium
Text-to-image
A text-to-image model with strong Chinese-language support and high-resolution output for general image generation.
Image Generation, Chinese Support
Image generation
step-2x-large
Next-gen text-to-image
A new-generation image model focused on more realistic output and stronger Chinese-and-English text rendering.
Image Generation, Chinese Prompting
Image generation
step-1x-edit
Image editing
An image-editing model that modifies and enhances input images based on image-plus-text instructions.
Image Editing, Image Enhancement
Speech synthesis (Recommended)
stepaudio-2.5-tts
Contextual TTS
The first model to integrate contextual understanding into the full speech generation pipeline. Supports Global Context + Inline Context dual-level control via natural language descriptions for precise emotion and style control. Ideal for audiobooks, drama dubbing, ad narration, and other high-expressiveness scenarios.
Speech synthesis, Context control, Zero-shot Clone
Speech synthesis (Recommended)
step-tts-mini
Expressive TTS
A TTS model focused on emotional expressiveness and controllable style, suitable for multi-emotion voice output and cloning.
Speech Synthesis, Voice Cloning, Emotion & Style
Speech synthesis (Recommended)
step-tts-vivid
High-fidelity TTS
A speech-synthesis model optimized for highly human-like output and strong realism in outbound-call scenarios.
Speech Synthesis, Human-like Voice, Emotion & Style
Speech recognition (Recommended)
stepaudio-2.5-asr
New-generation streaming ASR
StepFun’s new-generation streaming speech recognition model, based on a 4B MTP architecture that balances recognition accuracy with low latency. Supports Chinese and English recognition with ITN text normalization — well suited to realtime captions, voice input, and meeting transcription where both speed and accuracy matter.
High accuracy, Low latency, Chinese & English
Speech recognition
stepaudio-2-asr-pro
32B ASR Pro
32B-parameter ASR Pro model.
Large model
A mature first-generation end-to-end voice model suited to foundational voice interaction and content generation.
Realtime Interaction, Tool Calling
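The capability tags in the catalog above lend themselves to a simple lookup when choosing a model. The following is a minimal sketch, not part of any StepFun SDK: the names ModelCard, CATALOG, and find_models are hypothetical, and the data is a hand-transcribed subset of the text and vision entries listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """One catalog entry (hypothetical structure, for illustration only)."""
    name: str
    category: str
    tags: frozenset
    recommended: bool = False


# Hand-transcribed subset of the catalog; not an official or complete listing.
CATALOG = [
    ModelCard("step-3.5-flash", "Reasoning / text",
              frozenset({"Reasoning", "Tool Calling", "Web Search"}), True),
    ModelCard("step-3", "Reasoning / text",
              frozenset({"Reasoning", "Image Understanding"}), True),
    ModelCard("step-2-mini", "Text / coding",
              frozenset({"Text", "Coding", "Tool Calling"}), True),
    ModelCard("step-2-16k", "Text",
              frozenset({"Text", "General", "Planning"})),
    ModelCard("step-r1-v-mini", "Reasoning / text",
              frozenset({"Reasoning", "Image Understanding", "Coding"})),
    ModelCard("step-1o-turbo-vision", "Vision",
              frozenset({"Image Understanding", "Video Understanding", "Text Output"}), True),
    ModelCard("step-1o-vision-32k", "Vision",
              frozenset({"Image Understanding", "Text Output"})),
]


def find_models(required_tags, recommended_only=False):
    """Return names of models whose tags include every required tag, in catalog order."""
    required = frozenset(required_tags)
    return [
        m.name
        for m in CATALOG
        if required <= m.tags and (m.recommended or not recommended_only)
    ]


if __name__ == "__main__":
    print(find_models({"Tool Calling"}))
    print(find_models({"Image Understanding"}, recommended_only=True))
```

Filtering on tags rather than on model names keeps the selection logic unchanged when entries are added to or removed from the catalog.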