Use the Realtime API for low-latency, event-driven interactions. Models emit lifecycle events while generating so you can stream partial results (e.g., response.text.delta).
Voice-to-voice sessions
A realtime session is a stateful interaction between a client and the model. It consists of:
- Session: configuration such as model, voice, and other options.
- Conversation: user inputs and model outputs collected during the session.
- Responses: audio or text items added to the conversation.
- Input audio buffer: when using WebSocket for audio, send base64 audio events to feed the buffer.
Quick start: create a WebSocket connection
Connect with:

Connection details
| Field | Value |
|---|---|
| URL | wss://api.stepfun.ai/v1/realtime |
| Query | model – e.g., step-1o-audio |
| Header | Authorization: Bearer YOUR_API_KEY |
Node.js (ws)
Python (websocket-client)
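A minimal connection sketch for this section, assuming the URL, query parameter, and header shown in the table above; `YOUR_API_KEY` is a placeholder and the event-dispatch details are illustrative:

```python
import json

BASE_URL = "wss://api.stepfun.ai/v1/realtime"

def build_url(model: str) -> str:
    """Endpoint URL with the model passed as a query parameter."""
    return f"{BASE_URL}?model={model}"

def connect(api_key: str, model: str = "step-1o-audio"):
    """Open a realtime session (requires `pip install websocket-client`)."""
    import websocket

    def on_message(ws, message):
        # Every server event is a JSON object with a "type" field.
        event = json.loads(message)
        print("event:", event.get("type"))

    return websocket.WebSocketApp(
        build_url(model),
        header={"Authorization": f"Bearer {api_key}"},
        on_message=on_message,
    )

# connect("YOUR_API_KEY").run_forever()  # blocks; server events arrive in on_message
```

Once connected, all client and server traffic on this socket is JSON events of the kinds listed in the sections below.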
Session lifecycle events
After connecting, the server sends session.created when the session is ready. Update configuration anytime via session.update; after an update, the server emits session.updated.
| Client events | Server events |
|---|---|
| session.update | session.created, session.updated |
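A sketch of a session.update payload. The event name and the voice option come from this page; the exact nesting of the "session" object is an assumption about the schema:

```python
import json

def session_update(voice: str = "qingchunshaonv") -> str:
    # "session.update" and the voice names are documented above;
    # the payload shape under "session" is an assumption.
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice},
    })

# ws.send(session_update("wenrounansheng"))  # server then emits session.updated
```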
Text input and output
Add user text to the conversation, then ask the model to respond. Create a user message with conversation.item.create and request a reply with response.create; listen for response.text.delta and response.done for streaming updates.
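The text round trip can be sketched with two event builders. conversation.item.create, response.create, response.text.delta, and response.done appear on this page; the item's content layout (role, input_text) is an assumption:

```python
import json

def user_text_item(text: str) -> str:
    # The "item" structure (message/role/content) is an assumed shape.
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    })

def create_response() -> str:
    """Ask the model to respond to the conversation so far."""
    return json.dumps({"type": "response.create"})

# ws.send(user_text_item("Hello!"))
# ws.send(create_response())
# Then read response.text.delta events until response.done arrives.
```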
Audio input and output
Voice options
Realtime sessions can be configured to use one of several built-in voices for audio output. You can control the model’s voice by setting the voice parameter in a session.update request or when calling response.create. Current voice options include qingchunshaonv, wenrounansheng, elegantgentle-female, livelybreezy-female, and others.
Note: Once the model has generated audio in a session, the voice parameter for that session cannot be changed.
The step-audio-2 model supports voice cloning. You can create a custom voice ID by uploading an audio file and then use it in the voice parameter during a realtime session. For details, see Voice cloning.
Send streaming audio
Send base64 chunks into the input buffer.

Send a complete audio message
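A sketch covering both cases: streaming chunks into the buffer, then committing them as one complete user turn. input_audio_buffer.commit is named later on this page; the .append event name and its "audio" field are assumptions modeled on the buffer described above:

```python
import base64
import json

def append_audio(pcm_chunk: bytes) -> str:
    # "input_audio_buffer.append" / "audio" are assumed names.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def commit_audio() -> str:
    """Close out the buffered chunks as one user audio message."""
    return json.dumps({"type": "input_audio_buffer.commit"})

# for chunk in mic_chunks:
#     ws.send(append_audio(chunk))
# ws.send(commit_audio())
# ws.send(json.dumps({"type": "response.create"}))
```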
Handle audio output
Audio is streamed via events like response.audio.delta (base64 chunks) and response.done.
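A minimal handler for those two events, assuming the base64 chunk arrives in a "delta" field (that field name is not confirmed by this page):

```python
import base64
import json

def handle_audio_event(message: str, pcm_out: bytearray) -> bool:
    """Append response.audio.delta chunks to pcm_out; True once response.done."""
    event = json.loads(message)
    if event.get("type") == "response.audio.delta":
        # "delta" as the field holding the base64 chunk is an assumption.
        pcm_out.extend(base64.b64decode(event["delta"]))
    return event.get("type") == "response.done"
```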
Voice activity detection
VAD is enabled by default to auto-start and end turns based on speech. To handle your own start/stop, disable VAD and send input_audio_buffer.commit / response.create manually.
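A sketch of manual turn handling; nulling a turn_detection setting via session.update is an assumption about the config schema, not confirmed by this page:

```python
import json

def disable_vad() -> str:
    # Turning VAD off by nulling "turn_detection" is an assumed shape;
    # adjust to the actual session schema.
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},
    })

# With VAD off, end each turn yourself:
# ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
# ws.send(json.dumps({"type": "response.create"}))
```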
Tool Calls
Built-in web search
Enable internet search by adding the tool definition. Search results are returned in tool_calls.function.results and are used as context.
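One possible way to attach the tool, assuming tools are registered on the session and that the built-in search tool is identified by a "type" string; the exact value of that string is an assumption:

```python
import json

def enable_web_search() -> str:
    # The tool "type" value "web_search" is an assumption; check the
    # tool reference for the exact built-in definition.
    return json.dumps({
        "type": "session.update",
        "session": {"tools": [{"type": "web_search"}]},
    })
```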
Knowledge base retrieval
Use the retrieval tool to ground answers on a vector store. Configure with your store ID and prompt template.

Custom function calls
Define functions the model can call, execute them in your app, and feed results back.
- Detect a tool call in tool_calls.
- Extract arguments and run your function.
- Send the result back as a new conversation item.
- Ask the model to continue with the new context.
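The steps above can be sketched as follows. The get_weather function is a hypothetical example, and the shape of the result item ("function_call_output" with a call_id) is an assumption about the conversation-item schema:

```python
import json

WEATHER_TOOL = {
    "type": "function",
    "name": "get_weather",  # hypothetical function, for illustration only
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def run_tool(name: str, arguments: str) -> str:
    """Step 2: extract the JSON arguments and run your function."""
    args = json.loads(arguments)
    if name == "get_weather":
        return json.dumps({"city": args["city"], "temp_c": 21})  # stub result
    raise ValueError(f"unknown tool {name}")

def tool_result_item(call_id: str, output: str) -> str:
    # Step 3: the "function_call_output" item shape is an assumption.
    return json.dumps({
        "type": "conversation.item.create",
        "item": {"type": "function_call_output", "call_id": call_id, "output": output},
    })

# Step 4: ws.send(json.dumps({"type": "response.create"})) to continue.
```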
Development tips
Add an opening greeting
Send an initial conversation.item.create with input_text and then create a response to have the model greet first.
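A sketch of that opening sequence; the item's content layout is an assumed shape, and the prompt text is illustrative:

```python
import json

def greeting_events(prompt="Greet the user warmly."):
    """An input_text item followed by response.create, so the model speaks first."""
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": prompt}],
            },
        }),
        json.dumps({"type": "response.create"}),
    ]

# for event in greeting_events():
#     ws.send(event)
```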
More flexible VAD
Tune VAD thresholds or durations in session.input_audio_config (e.g., silence_duration_ms) to fit your UX.
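A sketch of that tuning via session.update; input_audio_config and silence_duration_ms come from the sentence above, while the default value here is illustrative only:

```python
import json

def tune_vad(silence_ms=700):
    # 700 ms is an arbitrary example value, not a documented default.
    return json.dumps({
        "type": "session.update",
        "session": {"input_audio_config": {"silence_duration_ms": silence_ms}},
    })
```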
Play audio in the browser
Use the Web Audio API: convert base64 to an ArrayBuffer, decode to AudioBuffer, and play via AudioContext.
Error handling
- Watch for HTTP errors (400/401/429/etc.) and back off or fix credentials as needed.
- In streaming, handle finish_reason values (stop, length, content_filter, tool_calls).
- Log Trace IDs from headers for support.