Model Capability Overview - StepFun Documentation

Recommended

4

Public models

5

Max context

256K

4 category views covering 5 public models

Recommended models
All models
Text & reasoning
Audio

Reasoning / multimodalRecommended

Step 3.7 Flash

Flagship multimodal reasoning

StepFun’s flagship multimodal reasoning model. Building on step-3.5-flash’s high-throughput reasoning and tool calling, it adds native multimodal input: understanding images and videos directly, without an additional vision MCP or auxiliary model. Three reasoning effort levels (low / medium / high) make it a fast and dependable choice for agent, coding, and multimodal workloads.

ReasoningMultimodalAgentImage understandingVideo understanding

Related entry points

Model page

Step 3.7 Flash

Quickstart

Multimodal quickstart

Guide

Reasoning model best practices

Reasoning / textRecommended

step-3.5-flash

Flagship reasoning

A flagship reasoning model built for agents, combining deep reasoning with ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.

ReasoningTool CallingWeb Search

Related entry points

Model page

Reasoning models

Guide

Reasoning model best practices

API reference

Chat Completion API

Speech synthesisRecommended

stepaudio-2.5-tts

Contextual TTS

Integrates contextual understanding into the full speech generation pipeline. Supports Global Context + Inline Context dual-level control via natural language descriptions for precise emotion and style control. Ideal for audiobooks, drama dubbing, ad narration, and other high-expressiveness scenarios.

Speech synthesisContext controlZero-shot Clone

Related links

ModelsAudio Models GuideTTS Developer Guide API DocsText-to-Speech API

Speech understandingRecommended

stepaudio-2.5-chat

End-to-end speech understanding

An end-to-end speech-understanding model served through an OpenAI-compatible Chat Completion API. It accepts audio or text input and returns text, interpreting vocal cues such as intonation, hesitation, and laughter alongside the words. It suits voice assistants, call and meeting analysis, and voice message triage.

Speech understandingParalinguistic cuesAudio & text input

Related entry points

Model page

StepAudio 2.5 Chat

API reference

Chat Completion API

Reasoning / multimodalRecommended

Step 3.7 Flash

Flagship multimodal reasoning

StepFun’s flagship multimodal reasoning model. Building on step-3.5-flash’s high-throughput reasoning and tool calling, it adds native multimodal input: understanding images and videos directly, without an additional vision MCP or auxiliary model. Three reasoning effort levels (low / medium / high) make it a fast and dependable choice for agent, coding, and multimodal workloads.

ReasoningMultimodalAgentImage understandingVideo understanding

Related entry points

Model page

Step 3.7 Flash

Quickstart

Multimodal quickstart

Guide

Reasoning model best practices

Reasoning / textRecommended

step-3.5-flash

Flagship reasoning

A flagship reasoning model built for agents, combining deep reasoning with ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.

ReasoningTool CallingWeb Search

Related entry points

Model page

Reasoning models

Guide

Reasoning model best practices

API reference

Chat Completion API

Speech synthesisRecommended

stepaudio-2.5-tts

Contextual TTS

Integrates contextual understanding into the full speech generation pipeline. Supports Global Context + Inline Context dual-level control via natural language descriptions for precise emotion and style control. Ideal for audiobooks, drama dubbing, ad narration, and other high-expressiveness scenarios.

Speech synthesisContext controlZero-shot Clone

Related links

ModelsAudio Models GuideTTS Developer Guide API DocsText-to-Speech API

Speech recognitionRecommended

stepaudio-2.5-asr

New-generation streaming ASR

StepFun’s new-generation streaming speech recognition model, based on a 4B MTP architecture that maintains SOTA transcription accuracy while sharply reducing latency. Supports Chinese and English recognition with ITN text normalization, and suits real-time captions, voice input, and meeting transcription where both speed and accuracy matter.

SOTA accuracyLow latencyChinese & English

Related entry points

Model page

StepAudio 2.5 ASR

API reference

Speech Recognition (Streaming Output)

Speech understandingRecommended

stepaudio-2.5-chat

End-to-end speech understanding

An end-to-end speech-understanding model served through an OpenAI-compatible Chat Completion API. It accepts audio or text input and returns text, interpreting vocal cues such as intonation, hesitation, and laughter alongside the words. It suits voice assistants, call and meeting analysis, and voice message triage.

Speech understandingParalinguistic cuesAudio & text input

Related entry points

Model page

StepAudio 2.5 Chat

API reference

Chat Completion API

Reasoning / multimodalRecommended

Step 3.7 Flash

Flagship multimodal reasoning

StepFun’s flagship multimodal reasoning model. Building on step-3.5-flash’s high-throughput reasoning and tool calling, it adds native multimodal input: understanding images and videos directly, without an additional vision MCP or auxiliary model. Three reasoning effort levels (low / medium / high) make it a fast and dependable choice for agent, coding, and multimodal workloads.

ReasoningMultimodalAgentImage understandingVideo understanding

Related entry points

Model page

Step 3.7 Flash

Quickstart

Multimodal quickstart

Guide

Reasoning model best practices

Reasoning / textRecommended

step-3.5-flash

Flagship reasoning

A flagship reasoning model built for agents, combining deep reasoning with ultra-fast responses and stable, reliable tool calling. On top of strong general reasoning, it excels at complex project planning and long-horizon task execution.

ReasoningTool CallingWeb Search

Related entry points

Model page

Reasoning models

Guide

Reasoning model best practices

API reference

Chat Completion API