Documentation Index
Fetch the complete documentation index at: https://platform.stepfun.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Model overview
Stepfun offers two voice-enabled dialogue solutions to suit different scenarios:| Characteristic | Chat Completions API | Realtime API |
|---|---|---|
| Connection | HTTP requests | WebSocket long-lived connection |
| ASR | Implement yourself or use 3rd-party | Built-in, automatic ASR |
| Context Management | Manual message list updates | Built-in, automatic history management |
| VAD | Implement yourself | Built-in, automatic detection |
| Web Search | Implement search API manually | Built-in web_search tool |
| Retrieval (RAG) | Implement yourself | Built-in retrieval tool |
| Latency | Moderate (streaming) | Ultra-low (bidirectional streaming) |
| Scenarios | Offline processing, batch tasks, simple integration | Real-time dialogue, voice assistants, customer service |
Models
step-audio-r1.1
- Positioning:Deep Sound Understanding & Thinking
- Capability tags:Thinking-while-speaking, Deep Reasoning
- Studio:Try in Studio
- Boasts powerful acoustic detail analysis and logical reasoning. Supports Audio Reasoning to grasp the intent behind the tone. Significant improvement in understanding sound and emotions through inferred thinking.
- Executes reasoning and speech concurrently to ensure high-quality, high-speed responses.
step-audio-2
- Positioning:Omni-Sensory Understanding & End-to-End Interaction
- Capability tags:Voice Cloning, Tool Calling, Web Search
- Studio:Try in Studio
- Understands Mandarin, dialects, English, and Japanese. Supports Voice Cloning (custom voices via uploaded audio clips). Grasps acoustic events, paralinguistics, and emotions. Infers age from voice, understands music, and controls tone/speed/emotion. Native support for Tool Calling and Web Search.
step-audio-2-mini
- Positioning:Lightweight, Ultra-Fast, and Deep
- Capability tags:Tool Calling, Web Search
- Access:API Access Only
- Shares core capabilities with step-audio-2, including native Tool Calling and Web Search. Optimized for speed and resource efficiency, with slightly lower scores in instruction following and reasoning.
step-1o-audio
- Positioning:Stable, Proven, and Battle-Tested
- Capability tags:Tool Calling
- Access:API Access Only
- 1st-Gen end-to-end speech model. Mature and stable technology, deployed extensively in automotive scenarios. Supports various preset voice styles and Tool Calling. Ideal for fundamental voice interaction and creative content tasks.
What is the Realtime API
The Stepfun Realtime API is a low-latency, interactive voice interface built on the hundred-billion–parameter end-to-end speech model Step-1o-Audio. It enables natural, fluid conversations with real-time interruption for true two-way dialogue.Key features
- Realtime low latency: Hundred-millisecond responses for smooth, natural dialogue
- Bidirectional interruption: Users can interrupt anytime, and the AI adapts, mirroring human conversation
- Multimodal I/O: Flexible handling of speech, text, and mixed inputs/outputs
- Deep voice understanding: Captures tone, rhythm, dialects, and personal speaking habits for human-like speech
- Emotional intelligence: Detects emotion in prosody, understands context, and responds accordingly
- Rich knowledge: Inherits the Stepfun language model knowledge base for reliable answers
- Creative generation: Strong storytelling and improvisation abilities
Example use cases
- Emotional support: Congratulates and empathizes during important moments, asking thoughtful follow-ups when users share major life events.
- Safe driving assistance: Detects signs of driver fatigue in speech and offers safety tips with a supportive tone to alleviate tiredness.
- Dialect interaction: Accurately handles regional accents such as Sichuan dialect, delivering localized interactions with its distinct tone and vocabulary.
- Playful relationship tips: Demonstrates natural, cute, and lighthearted tone for romantic interactions.
- Parent-child support: Eases anxiety in moments like a child’s first day of school and offers practical guidance for parents.
Business scenarios
With real-time interaction and emotional understanding, the Realtime API powers many industries:- Smart cockpits: Natural voice UI for in-car systems, covering queries, chit-chat, and safety reminders
- Smart devices: Real-time voice interaction for IoT hardware, improving user experience
- Social entertainment: Builds companion agents for social and entertainment apps
- Customer service: Highly human-like support to boost efficiency and satisfaction
- Financial mediation: Neutral, professional assistance in financial dispute resolution
Quickstart
Realtime API developer guide
Learn the session model, event flow, and voice-to-voice interaction patterns.
Step Realtime Console
Explore the official demo project for building and testing realtime voice apps.