Realtime Voice Interaction Models

Model overview

Stepfun offers two voice-enabled dialogue solutions to suit different scenarios:

Characteristic	Chat Completions API	Realtime API
Connection	HTTP requests	WebSocket long-lived connection
ASR	Implement yourself or use 3rd-party	Built-in, automatic ASR
Context Management	Manual message list updates	Built-in, automatic history management
VAD	Implement yourself	Built-in, automatic detection
Web Search	Implement search API manually	Built-in web_search tool
Retrieval (RAG)	Implement yourself	Built-in retrieval tool
Latency	Moderate (streaming)	Ultra-low (bidirectional streaming)
Scenarios	Offline processing, batch tasks, simple integration	Real-time dialogue, voice assistants, customer service

You can access the following models via the Chat API and Realtime API.

Models

step-audio-r1.1

Positioning：Deep Sound Understanding & Thinking
Capability tags：Thinking-while-speaking, Deep Reasoning
Studio：Try in Studio
Boasts powerful acoustic detail analysis and logical reasoning. Supports Audio Reasoning to grasp the intent behind the tone. Significant improvement in understanding sound and emotions through inferred thinking.
Executes reasoning and speech concurrently to ensure high-quality, high-speed responses.

step-audio-2

Positioning：Omni-Sensory Understanding & End-to-End Interaction
Capability tags：Voice Cloning, Tool Calling, Web Search
Studio：Try in Studio
Understands Mandarin, dialects, English, and Japanese. Supports Voice Cloning (custom voices via uploaded audio clips). Grasps acoustic events, paralinguistics, and emotions. Infers age from voice, understands music, and controls tone/speed/emotion. Native support for Tool Calling and Web Search.

step-audio-2-mini

Positioning：Lightweight, Ultra-Fast, and Deep
Capability tags：Tool Calling, Web Search
Access：API Access Only
Shares core capabilities with step-audio-2, including native Tool Calling and Web Search. Optimized for speed and resource efficiency, with slightly lower scores in instruction following and reasoning.

step-1o-audio

Positioning：Stable, Proven, and Battle-Tested
Capability tags：Tool Calling
Access：API Access Only
1st-Gen end-to-end speech model. Mature and stable technology, deployed extensively in automotive scenarios. Supports various preset voice styles and Tool Calling. Ideal for fundamental voice interaction and creative content tasks.

What is the Realtime API

The Stepfun Realtime API is a low-latency, interactive voice interface built on the hundred-billion–parameter end-to-end speech model Step-1o-Audio. It enables natural, fluid conversations with real-time interruption for true two-way dialogue.

Key features

Realtime low latency: Hundred-millisecond responses for smooth, natural dialogue
Bidirectional interruption: Users can interrupt anytime, and the AI adapts, mirroring human conversation
Multimodal I/O: Flexible handling of speech, text, and mixed inputs/outputs
Deep voice understanding: Captures tone, rhythm, dialects, and personal speaking habits for human-like speech
Emotional intelligence: Detects emotion in prosody, understands context, and responds accordingly
Rich knowledge: Inherits the Stepfun language model knowledge base for reliable answers
Creative generation: Strong storytelling and improvisation abilities

Example use cases

Emotional support: Congratulates and empathizes during important moments, asking thoughtful follow-ups when users share major life events.

Audio sample: Congrats on Success (Generated by step-audio-2)

Safe driving assistance: Detects signs of driver fatigue in speech and offers safety tips with a supportive tone to alleviate tiredness.

Audio sample: Driver Fatigue Reminder (Generated by step-1o-audio)

Dialect interaction: Accurately handles regional accents such as Sichuan dialect, delivering localized interactions with its distinct tone and vocabulary.

Audio sample: Sichuan Dialect (Generated by step-1o-audio)

Playful relationship tips: Demonstrates natural, cute, and lighthearted tone for romantic interactions.

Audio sample: Playful Coquetry (Generated by step-1o-audio)

Parent-child support: Eases anxiety in moments like a child’s first day of school and offers practical guidance for parents.

Audio sample: Soothing at School Entrance (Generated by step-1o-audio)

Business scenarios

With real-time interaction and emotional understanding, the Realtime API powers many industries:

Smart cockpits: Natural voice UI for in-car systems, covering queries, chit-chat, and safety reminders
Smart devices: Real-time voice interaction for IoT hardware, improving user experience
Social entertainment: Builds companion agents for social and entertainment apps
Customer service: Highly human-like support to boost efficiency and satisfaction
Financial mediation: Neutral, professional assistance in financial dispute resolution

Integrating the Realtime API lets you quickly build applications with human-like conversational ability for immersive voice experiences.

Start

Models

Pricing

Terms and Agreements

Model overview

Models

step-audio-r1.1

step-audio-2

step-audio-2-mini

step-1o-audio

What is the Realtime API

Key features

Example use cases

Business scenarios

Quickstart

Realtime API developer guide

Step Realtime Console

Start

Models

Pricing

Terms and Agreements

Documentation Index

​Model overview

​Models

​step-audio-r1.1

​step-audio-2

​step-audio-2-mini

​step-1o-audio

​What is the Realtime API

​Key features

​Example use cases

​Business scenarios

​Quickstart

Realtime API developer guide

Step Realtime Console

Model overview

Models

step-audio-r1.1

step-audio-2

step-audio-2-mini

step-1o-audio

What is the Realtime API

Key features

Example use cases

Business scenarios

Quickstart