Model Capability Overview - StepFun Documentation

Recommended

2

Indexed models

3

Max context

256K

4 category views covering 3 public models

Recommended models
All models
Text & reasoning
Audio

Reasoning / multimodalRecommended

Step 3.7 Flash

Flagship multimodal reasoning

StepFun’s flagship multimodal reasoning model. Building on step-3.5-flash’s high-throughput reasoning and tool calling, it adds native multimodal input — understanding images and videos directly, without an additional vision MCP or auxiliary model. Three reasoning effort levels (low / medium / high) make it a fast and dependable choice for agent, coding, and multimodal workloads.

ReasoningMultimodalAgentImage understandingVideo understanding

Related entry points

Model page

Step 3.7 Flash

Quickstart

Multimodal quickstart

Guide

Reasoning model best practices

Reasoning / textRecommended

step-3.5-flash

Flagship reasoning

A flagship reasoning model built for agents. Its reasoning depth rivals leading closed-source models while also delivering ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.

ReasoningTool CallingWeb Search

Related entry points

Model page

Reasoning models

Guide

Reasoning model best practices

API reference

Chat Completion API

Speech synthesisRecommended

stepaudio-2.5-tts

Contextual TTS

The first model to integrate contextual understanding into the full speech generation pipeline. Supports Global Context + Inline Context dual-level control via natural language descriptions for precise emotion and style control. Ideal for audiobooks, drama dubbing, ad narration, and other high-expressiveness scenarios.

Speech synthesisContext controlZero-shot Clone

Related links

ModelsAudio Models GuideTTS Developer Guide API DocsText-to-Speech API

Reasoning / multimodalRecommended

Step 3.7 Flash

Flagship multimodal reasoning

StepFun’s flagship multimodal reasoning model. Building on step-3.5-flash’s high-throughput reasoning and tool calling, it adds native multimodal input — understanding images and videos directly, without an additional vision MCP or auxiliary model. Three reasoning effort levels (low / medium / high) make it a fast and dependable choice for agent, coding, and multimodal workloads.

ReasoningMultimodalAgentImage understandingVideo understanding

Related entry points

Model page

Step 3.7 Flash

Quickstart

Multimodal quickstart

Guide

Reasoning model best practices

Reasoning / textRecommended

step-3.5-flash

Flagship reasoning

A flagship reasoning model built for agents. Its reasoning depth rivals leading closed-source models while also delivering ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.

ReasoningTool CallingWeb Search

Related entry points

Model page

Reasoning models

Guide

Reasoning model best practices

API reference

Chat Completion API

Speech synthesisRecommended

stepaudio-2.5-tts

Contextual TTS

The first model to integrate contextual understanding into the full speech generation pipeline. Supports Global Context + Inline Context dual-level control via natural language descriptions for precise emotion and style control. Ideal for audiobooks, drama dubbing, ad narration, and other high-expressiveness scenarios.

Speech synthesisContext controlZero-shot Clone

Related links

ModelsAudio Models GuideTTS Developer Guide API DocsText-to-Speech API

Speech recognitionRecommended

stepaudio-2.5-asr

New-generation streaming ASR

StepFun’s new-generation streaming speech recognition model, based on a 4B MTP architecture that balances recognition accuracy with low latency. Supports Chinese and English recognition with ITN text normalization — well suited to realtime captions, voice input, and meeting transcription where both speed and accuracy matter.

High accuracyLow latencyChinese & English

Related entry points

Model page

StepAudio 2.5 ASR

API reference

Speech Recognition (Streaming Output)

Reasoning / multimodalRecommended

Step 3.7 Flash

Flagship multimodal reasoning

StepFun’s flagship multimodal reasoning model. Building on step-3.5-flash’s high-throughput reasoning and tool calling, it adds native multimodal input — understanding images and videos directly, without an additional vision MCP or auxiliary model. Three reasoning effort levels (low / medium / high) make it a fast and dependable choice for agent, coding, and multimodal workloads.

ReasoningMultimodalAgentImage understandingVideo understanding

Related entry points

Model page

Step 3.7 Flash

Quickstart

Multimodal quickstart

Guide

Reasoning model best practices

Reasoning / textRecommended

step-3.5-flash

Flagship reasoning

A flagship reasoning model built for agents. Its reasoning depth rivals leading closed-source models while also delivering ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.

ReasoningTool CallingWeb Search

Related entry points

Model page

Reasoning models

Guide

Reasoning model best practices

API reference

Chat Completion API