Use the Realtime API for low-latency, event-driven interactions. Models emit lifecycle events while generating so you can stream partial results (e.g., response.text.delta).
Voice-to-voice sessions
A realtime session is a stateful interaction between a client and the model. It consists of:
- Session: configuration such as model, voice, and other options.
- Conversation: user inputs and model outputs collected during the session.
- Responses: audio or text items added to the conversation.
- Input audio buffer: when using WebSocket for audio, send base64 audio events to feed the buffer.
Quick start: create a WebSocket connection
Connect with:

Connection details
| Field | Value |
|---|---|
| URL | wss://api.stepfun.ai/v1/realtime |
| Query | model – e.g., step-1o-audio |
| Header | Authorization: Bearer YOUR_API_KEY |
Node.js (ws)
Python (websocket-client)
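A minimal connection sketch for this section, assuming the URL, query parameter, and header shown in the table above; `YOUR_API_KEY` is a placeholder and the event-dispatch details are illustrative:

```python
import json

BASE_URL = "wss://api.stepfun.ai/v1/realtime"

def build_url(model: str) -> str:
    """Endpoint URL with the model passed as a query parameter."""
    return f"{BASE_URL}?model={model}"

def connect(api_key: str, model: str = "step-1o-audio"):
    """Open a realtime session (requires `pip install websocket-client`)."""
    import websocket

    def on_message(ws, message):
        # Every server event is a JSON object with a "type" field.
        event = json.loads(message)
        print("event:", event.get("type"))

    return websocket.WebSocketApp(
        build_url(model),
        header={"Authorization": f"Bearer {api_key}"},
        on_message=on_message,
    )

# connect("YOUR_API_KEY").run_forever()  # blocks; server events arrive in on_message
```

Once connected, all client and server traffic on this socket is JSON events of the kinds listed in the sections below.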
Session lifecycle events
After connecting, the server sends session.created when the session is ready. Update configuration anytime via session.update; after an update, the server emits session.updated.
| Client events | Server events |
|---|---|
| session.update | session.created, session.updated |
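A sketch of a session.update payload. The event name and the voice option come from this page; the exact nesting of the "session" object is an assumption about the schema:

```python
import json

def session_update(voice: str = "qingchunshaonv") -> str:
    # "session.update" and the voice names are documented above;
    # the payload shape under "session" is an assumption.
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice},
    })

# ws.send(session_update("wenrounansheng"))  # server then emits session.updated
```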
Text input and output
Add user text to the conversation, then ask the model to respond. Create a user message with conversation.item.create and request a reply with response.create; listen for response.text.delta and response.done for streaming updates.
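The text round trip can be sketched with two event builders. conversation.item.create, response.create, response.text.delta, and response.done appear on this page; the item's content layout (role, input_text) is an assumption:

```python
import json

def user_text_item(text: str) -> str:
    # The "item" structure (message/role/content) is an assumed shape.
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    })

def create_response() -> str:
    """Ask the model to respond to the conversation so far."""
    return json.dumps({"type": "response.create"})

# ws.send(user_text_item("Hello!"))
# ws.send(create_response())
# Then read response.text.delta events until response.done arrives.
```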
Audio input and output
Voice options
Realtime sessions can be configured to use one of several built-in voices for audio output. You can control the model’s voice by setting the voice parameter in a session.update request or when calling response.create. Current voice options include qingchunshaonv, wenrounansheng, elegantgentle-female, livelybreezy-female, and others.
Note: Once the model has generated audio in a session, the voice parameter for that session cannot be changed.
The step-audio-2 model supports voice cloning. You can create a custom voice ID by uploading an audio file and then use it in the voice parameter during a realtime session. For details, see Voice cloning.
Send streaming audio
Send base64 chunks into the input buffer.

Send a complete audio message
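A sketch covering both cases: streaming chunks into the buffer, then committing them as one complete user turn. input_audio_buffer.commit is named later on this page; the .append event name and its "audio" field are assumptions modeled on the buffer described above:

```python
import base64
import json

def append_audio(pcm_chunk: bytes) -> str:
    # "input_audio_buffer.append" / "audio" are assumed names.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def commit_audio() -> str:
    """Close out the buffered chunks as one user audio message."""
    return json.dumps({"type": "input_audio_buffer.commit"})

# for chunk in mic_chunks:
#     ws.send(append_audio(chunk))
# ws.send(commit_audio())
# ws.send(json.dumps({"type": "response.create"}))
```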
Handle audio output
Audio is streamed via events like response.audio.delta (base64 chunks) and response.done.
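A minimal handler for those two events, assuming the base64 chunk arrives in a "delta" field (that field name is not confirmed by this page):

```python
import base64
import json

def handle_audio_event(message: str, pcm_out: bytearray) -> bool:
    """Append response.audio.delta chunks to pcm_out; True once response.done."""
    event = json.loads(message)
    if event.get("type") == "response.audio.delta":
        # "delta" as the field holding the base64 chunk is an assumption.
        pcm_out.extend(base64.b64decode(event["delta"]))
    return event.get("type") == "response.done"
```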
Voice activity detection
VAD is enabled by default to auto-start and end turns based on speech. To handle your own start/stop, disable VAD and send input_audio_buffer.commit / response.create manually.
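A sketch of manual turn handling; nulling a turn_detection setting via session.update is an assumption about the config schema, not confirmed by this page:

```python
import json

def disable_vad() -> str:
    # Turning VAD off by nulling "turn_detection" is an assumed shape;
    # adjust to the actual session schema.
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},
    })

# With VAD off, end each turn yourself:
# ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
# ws.send(json.dumps({"type": "response.create"}))
```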
Tool Calls
Built-in web search
Enable internet search by adding the tool definition. Search results are returned in tool_calls.function.results and are used as context.
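One possible way to attach the tool, assuming tools are registered on the session and that the built-in search tool is identified by a "type" string; the exact value of that string is an assumption:

```python
import json

def enable_web_search() -> str:
    # The tool "type" value "web_search" is an assumption; check the
    # tool reference for the exact built-in definition.
    return json.dumps({
        "type": "session.update",
        "session": {"tools": [{"type": "web_search"}]},
    })
```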
Knowledge base retrieval
Use the retrieval tool to ground answers on a vector store. Configure with your store ID and prompt template.

Custom function calls
Define functions the model can call, execute them in your app, and feed results back.
- Detect a tool call in tool_calls.
- Extract arguments and run your function.
- Send the result back as a new conversation item.
- Ask the model to continue with the new context.
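The steps above can be sketched as follows. The get_weather function is a hypothetical example, and the shape of the result item ("function_call_output" with a call_id) is an assumption about the conversation-item schema:

```python
import json

WEATHER_TOOL = {
    "type": "function",
    "name": "get_weather",  # hypothetical function, for illustration only
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def run_tool(name: str, arguments: str) -> str:
    """Step 2: extract the JSON arguments and run your function."""
    args = json.loads(arguments)
    if name == "get_weather":
        return json.dumps({"city": args["city"], "temp_c": 21})  # stub result
    raise ValueError(f"unknown tool {name}")

def tool_result_item(call_id: str, output: str) -> str:
    # Step 3: the "function_call_output" item shape is an assumption.
    return json.dumps({
        "type": "conversation.item.create",
        "item": {"type": "function_call_output", "call_id": call_id, "output": output},
    })

# Step 4: ws.send(json.dumps({"type": "response.create"})) to continue.
```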
Development tips
Add an opening greeting
Send an initial conversation.item.create with input_text and then create a response to have the model greet first.
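A sketch of that opening sequence; the item's content layout is an assumed shape, and the prompt text is illustrative:

```python
import json

def greeting_events(prompt="Greet the user warmly."):
    """An input_text item followed by response.create, so the model speaks first."""
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": prompt}],
            },
        }),
        json.dumps({"type": "response.create"}),
    ]

# for event in greeting_events():
#     ws.send(event)
```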
More flexible VAD
Tune VAD thresholds or durations in session.input_audio_config (e.g., silence_duration_ms) to fit your UX.
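A sketch of that tuning via session.update; input_audio_config and silence_duration_ms come from the sentence above, while the default value here is illustrative only:

```python
import json

def tune_vad(silence_ms=700):
    # 700 ms is an arbitrary example value, not a documented default.
    return json.dumps({
        "type": "session.update",
        "session": {"input_audio_config": {"silence_duration_ms": silence_ms}},
    })
```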
Play audio in the browser
Use the Web Audio API: convert base64 to an ArrayBuffer, decode to AudioBuffer, and play via AudioContext.
Error handling
- Watch for HTTP errors (400/401/429/etc.) and back off or fix credentials as needed.
- In streaming, handle finish_reason values (stop, length, content_filter, tool_calls).
- Log Trace IDs from headers for support.