Skip to main content

Documentation Index

Fetch the complete documentation index at: https://platform.stepfun.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

Model overview

Stepfun offers two voice-enabled dialogue solutions to suit different scenarios:
CharacteristicChat Completions APIRealtime API
ConnectionHTTP requestsWebSocket long-lived connection
ASRImplement yourself or use 3rd-partyBuilt-in, automatic ASR
Context ManagementManual message list updatesBuilt-in, automatic history management
VADImplement yourselfBuilt-in, automatic detection
Web SearchImplement search API manuallyBuilt-in web_search tool
Retrieval (RAG)Implement yourselfBuilt-in retrieval tool
LatencyModerate (streaming)Ultra-low (bidirectional streaming)
ScenariosOffline processing, batch tasks, simple integrationReal-time dialogue, voice assistants, customer service
You can access the following models via the Chat API and Realtime API.

Models

step-audio-r1.1

  • Positioning:Deep Sound Understanding & Thinking
  • Capability tags:Thinking-while-speaking, Deep Reasoning
  • Studio:Try in Studio
  • Boasts powerful acoustic detail analysis and logical reasoning. Supports Audio Reasoning to grasp the intent behind the tone. Significant improvement in understanding sound and emotions through inferred thinking.
  • Executes reasoning and speech concurrently to ensure high-quality, high-speed responses.

step-audio-2

  • Positioning:Omni-Sensory Understanding & End-to-End Interaction
  • Capability tags:Voice Cloning, Tool Calling, Web Search
  • Studio:Try in Studio
  • Understands Mandarin, dialects, English, and Japanese. Supports Voice Cloning (custom voices via uploaded audio clips). Grasps acoustic events, paralinguistics, and emotions. Infers age from voice, understands music, and controls tone/speed/emotion. Native support for Tool Calling and Web Search.

step-audio-2-mini

  • Positioning:Lightweight, Ultra-Fast, and Deep
  • Capability tags:Tool Calling, Web Search
  • Access:API Access Only
  • Shares core capabilities with step-audio-2, including native Tool Calling and Web Search. Optimized for speed and resource efficiency, with slightly lower scores in instruction following and reasoning.

step-1o-audio

  • Positioning:Stable, Proven, and Battle-Tested
  • Capability tags:Tool Calling
  • Access:API Access Only
  • 1st-Gen end-to-end speech model. Mature and stable technology, deployed extensively in automotive scenarios. Supports various preset voice styles and Tool Calling. Ideal for fundamental voice interaction and creative content tasks.

What is the Realtime API

The Stepfun Realtime API is a low-latency, interactive voice interface built on the hundred-billion–parameter end-to-end speech model Step-1o-Audio. It enables natural, fluid conversations with real-time interruption for true two-way dialogue.

Key features

  • Realtime low latency: Hundred-millisecond responses for smooth, natural dialogue
  • Bidirectional interruption: Users can interrupt anytime, and the AI adapts, mirroring human conversation
  • Multimodal I/O: Flexible handling of speech, text, and mixed inputs/outputs
  • Deep voice understanding: Captures tone, rhythm, dialects, and personal speaking habits for human-like speech
  • Emotional intelligence: Detects emotion in prosody, understands context, and responds accordingly
  • Rich knowledge: Inherits the Stepfun language model knowledge base for reliable answers
  • Creative generation: Strong storytelling and improvisation abilities

Example use cases

  • Emotional support: Congratulates and empathizes during important moments, asking thoughtful follow-ups when users share major life events.
Audio sample: Congrats on Success (Generated by step-audio-2)
  • Safe driving assistance: Detects signs of driver fatigue in speech and offers safety tips with a supportive tone to alleviate tiredness.
Audio sample: Driver Fatigue Reminder (Generated by step-1o-audio)
  • Dialect interaction: Accurately handles regional accents such as Sichuan dialect, delivering localized interactions with its distinct tone and vocabulary.
Audio sample: Sichuan Dialect (Generated by step-1o-audio)
  • Playful relationship tips: Demonstrates natural, cute, and lighthearted tone for romantic interactions.
Audio sample: Playful Coquetry (Generated by step-1o-audio)
  • Parent-child support: Eases anxiety in moments like a child’s first day of school and offers practical guidance for parents.
Audio sample: Soothing at School Entrance (Generated by step-1o-audio)

Business scenarios

With real-time interaction and emotional understanding, the Realtime API powers many industries:
  • Smart cockpits: Natural voice UI for in-car systems, covering queries, chit-chat, and safety reminders
  • Smart devices: Real-time voice interaction for IoT hardware, improving user experience
  • Social entertainment: Builds companion agents for social and entertainment apps
  • Customer service: Highly human-like support to boost efficiency and satisfaction
  • Financial mediation: Neutral, professional assistance in financial dispute resolution
Integrating the Realtime API lets you quickly build applications with human-like conversational ability for immersive voice experiences.

Quickstart

Realtime API developer guide

Learn the session model, event flow, and voice-to-voice interaction patterns.

Step Realtime Console

Explore the official demo project for building and testing realtime voice apps.