Realtime API

Use the Realtime API for low-latency, event-driven interactions. Models emit lifecycle events while generating so you can stream partial results (e.g., response.text.delta).

Voice-to-voice sessions

A realtime session is a stateful interaction between a client and the model.
  • Session: configuration such as model, voice, and other options.
  • Conversation: user inputs and model outputs collected during the session.
  • Responses: audio or text items added to the conversation.
  • Input audio buffer: when using WebSocket for audio, send base64 audio events to feed the buffer.

Quick start: create a WebSocket connection

Connection details

  • URL: wss://api.stepfun.ai/v1/realtime
  • Query: model – e.g., step-1o-audio
  • Header: Authorization: Bearer YOUR_API_KEY
Examples:

Node.js (ws)

import WebSocket from 'ws'

const url = 'wss://api.stepfun.ai/v1/realtime?model=step-1o-audio'
const ws = new WebSocket(url, {
	headers: { Authorization: 'Bearer ' + process.env.STEPFUN_API_KEY },
})

ws.on('open', () => console.log('Connected'))
ws.on('message', msg => console.log(JSON.parse(msg.toString())))

Python (websocket-client)

import os, json, websocket

STEPFUN_API_KEY = os.environ.get("STEPFUN_API_KEY")
url = "wss://api.stepfun.ai/v1/realtime?model=step-1o-audio"
headers = ["Authorization: Bearer " + STEPFUN_API_KEY]

def on_open(ws):
    print("Connected")

def on_message(ws, message):
    print(json.loads(message))

ws = websocket.WebSocketApp(url, header=headers, on_open=on_open, on_message=on_message)
ws.run_forever()

Session lifecycle events

After connecting, the server sends session.created when the session is ready. Update configuration anytime via session.update; after an update, the server emits session.updated.
const event = {
	type: 'session.update',
	session: { instructions: "Do not use the word 'moist'." },
}
ws.send(JSON.stringify(event))
  • Client events: session.update
  • Server events: session.created, session.updated

Text input and output

Add user text to the conversation, then ask the model to respond. Create a user message:
ws.send(
	JSON.stringify({
		type: 'conversation.item.create',
		item: {
			type: 'message',
			role: 'user',
			content: [{ type: 'input_text', text: 'Which Prince album sold the most?' }],
		},
	}),
)
Trigger a text-only response:
ws.send(
	JSON.stringify({
		type: 'response.create',
		response: { modalities: ['text'] },
	}),
)
Listen for completion:
ws.on('message', msg => {
	const evt = JSON.parse(msg.toString())
	if (evt.type === 'response.done') {
		console.log(evt.response.output[0])
	}
})
During generation, you’ll see events like response.text.delta and response.done for streaming updates.
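For instance, a minimal sketch that prints the streamed text as it arrives, assuming each response.text.delta event carries its text fragment in a delta field:

ws.on('message', msg => {
	const evt = JSON.parse(msg.toString())
	// Print each streamed fragment (field name assumed; response.done signals completion).
	if (evt.type === 'response.text.delta') {
		process.stdout.write(evt.delta)
	}
})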

Audio input and output

Voice options

Realtime sessions can be configured to use one of several built-in voices for audio output. You can control the model’s voice by setting the voice parameter in a session.update request or when calling response.create. Current voice options include qingchunshaonv, wenrounansheng, elegantgentle-female, livelybreezy-female, and others.
Note: Once the model has generated audio in a session, the voice parameter for that session cannot be changed.
The step-audio-2 model supports voice cloning. You can create a custom voice ID by uploading an audio file and then use it in the voice parameter during a realtime session. For details, see Voice cloning.
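For example, selecting a built-in voice before any audio has been generated (a minimal sketch; qingchunshaonv is one of the options listed above):

ws.send(
	JSON.stringify({
		type: 'session.update',
		// Must be set before the model produces audio in this session.
		session: { voice: 'qingchunshaonv' },
	}),
)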

Send streaming audio

Send base64 chunks into the input buffer:
ws.send(
	JSON.stringify({
		type: 'input_audio_buffer.append',
		audio: '<base64_pcm16_audio>',
	}),
)
Start a response that returns both audio and text:
ws.send(
	JSON.stringify({
		type: 'response.create',
		response: { modalities: ['audio', 'text'] },
	}),
)

Send a complete audio message

ws.send(
	JSON.stringify({
		type: 'conversation.item.create',
		item: {
			type: 'message',
			role: 'user',
			content: [{ type: 'input_audio', audio: '<base64_pcm16_audio>' }],
		},
	}),
)

Handle audio output

Audio is streamed via events like response.audio.delta (base64 chunks) and response.done.
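A sketch for collecting the streamed audio in Node.js, assuming each response.audio.delta event carries its base64 chunk in a delta field:

const chunks = []
ws.on('message', msg => {
	const evt = JSON.parse(msg.toString())
	if (evt.type === 'response.audio.delta') {
		// Decode and buffer each base64 PCM16 chunk (field name assumed).
		chunks.push(Buffer.from(evt.delta, 'base64'))
	}
	if (evt.type === 'response.done') {
		const pcm = Buffer.concat(chunks) // the complete PCM16 clip
		console.log('received', pcm.length, 'bytes of audio')
	}
})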

Voice activity detection

VAD is enabled by default to auto-start and end turns based on speech. To handle your own start/stop, disable VAD:
ws.send(JSON.stringify({ type: 'session.update', session: { input_audio_vad: false } }))
Then call input_audio_buffer.commit / response.create manually.
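With VAD disabled, a manual turn looks like this (a sketch: append as much audio as you have, then commit the buffer and request a response yourself):

// Stream one or more chunks into the buffer...
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: '<base64_pcm16_audio>' }))
// ...then end the turn manually.
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }))
ws.send(JSON.stringify({ type: 'response.create', response: { modalities: ['audio', 'text'] } }))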

Tool calls

Enable internet search by adding the tool definition:
const tools = [
	{
		type: 'web_search',
		function: { description: 'Search the web for fresh information' },
	},
]
The model decides when to search; results appear in tool_calls.function.results and are used as context.
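This page does not show where the tools array is passed; a plausible sketch, assuming tools are registered via session.update like other session options:

// Assumption: tools are attached to the session alongside other options.
ws.send(JSON.stringify({ type: 'session.update', session: { tools } }))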

Knowledge base retrieval

Use the retrieval tool to ground answers on a vector store. Configure with your store ID and prompt template:
const tools = [
	{
		type: 'retrieval',
		function: {
			name: 'kb_lookup',
			description: 'Look up answers in the nutrition knowledge base',
			options: {
				vector_store_id: 'your_vector_store_id',
				prompt_template: "Answer {{query}} using document {{knowledge}}; if missing, say you couldn't find it.",
			},
		},
	},
]

Custom function calls

Define functions the model can call, execute them in your app, and feed results back.
const tools = [
	{
		type: 'function',
		function: {
			name: 'get_weather',
			description: 'Get current weather by city name',
			parameters: {
				type: 'object',
				properties: { city: { type: 'string', description: 'City name' } },
				required: ['city'],
			},
		},
	},
]
Workflow (sketched after this list):
  1. Detect a tool call in tool_calls.
  2. Extract arguments and run your function.
  3. Send the result back as a new conversation item.
  4. Ask the model to continue with the new context.
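A sketch of this loop. The item shapes used here (function_call, function_call_output, an arguments JSON string) are conventional for realtime APIs but not confirmed by this page, and getWeather stands in for your own hypothetical implementation:

ws.on('message', async msg => {
	const evt = JSON.parse(msg.toString())
	if (evt.type !== 'response.done') return
	for (const item of evt.response.output) {
		// 1. Detect a tool call (item shape assumed; adapt to the real payload).
		if (item.type !== 'function_call' || item.name !== 'get_weather') continue
		// 2. Extract arguments and run your function.
		const { city } = JSON.parse(item.arguments)
		const weather = await getWeather(city) // your own implementation
		// 3. Send the result back as a new conversation item.
		ws.send(
			JSON.stringify({
				type: 'conversation.item.create',
				item: { type: 'function_call_output', output: JSON.stringify(weather) },
			}),
		)
		// 4. Ask the model to continue with the new context.
		ws.send(JSON.stringify({ type: 'response.create', response: { modalities: ['text'] } }))
	}
})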

Development tips

Add an opening greeting

Send an initial conversation.item.create with input_text and then create a response to have the model greet first.
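For instance (a minimal sketch using the same events as the text quick start):

// Seed the conversation with a prompt, then have the model speak first.
ws.send(
	JSON.stringify({
		type: 'conversation.item.create',
		item: {
			type: 'message',
			role: 'user',
			content: [{ type: 'input_text', text: 'Greet the user warmly and introduce yourself.' }],
		},
	}),
)
ws.send(JSON.stringify({ type: 'response.create', response: { modalities: ['audio', 'text'] } }))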

More flexible VAD

Tune VAD thresholds or durations in session.input_audio_config (e.g., silence_duration_ms) to fit your UX.
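For example (a sketch; silence_duration_ms is the only field named here, and the value is illustrative):

ws.send(
	JSON.stringify({
		type: 'session.update',
		session: {
			// Wait 800 ms of silence before ending the user's turn.
			input_audio_config: { silence_duration_ms: 800 },
		},
	}),
)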

Play audio in the browser

Use the Web Audio API: convert base64 to an ArrayBuffer, decode to AudioBuffer, and play via AudioContext.
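A browser-side sketch, assuming 24 kHz mono PCM16 (check your session's actual output format). Raw PCM has no container, so the samples are converted by hand into an AudioBuffer rather than passed through decodeAudioData:

const audioCtx = new AudioContext()

function playBase64Pcm16(b64, sampleRate = 24000) {
	// base64 -> raw bytes
	const bytes = Uint8Array.from(atob(b64), c => c.charCodeAt(0))
	// bytes -> 16-bit samples -> floats in [-1, 1]
	const pcm = new Int16Array(bytes.buffer)
	const floats = Float32Array.from(pcm, s => s / 32768)
	// Wrap the samples in an AudioBuffer and play it.
	const buffer = audioCtx.createBuffer(1, floats.length, sampleRate)
	buffer.copyToChannel(floats, 0)
	const src = audioCtx.createBufferSource()
	src.buffer = buffer
	src.connect(audioCtx.destination)
	src.start()
}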

Error handling

  • Watch for HTTP errors (400/401/429/etc.) and back off or fix credentials as needed.
  • In streaming, handle finish_reason values (stop, length, content_filter, tool_calls).
  • Log Trace IDs from headers for support.
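A defensive sketch for the WebSocket side (that server-side problems arrive as events with type 'error' is a common convention, not confirmed by this page):

ws.on('message', msg => {
	const evt = JSON.parse(msg.toString())
	// Assumption: the server reports problems as 'error' events.
	if (evt.type === 'error') console.error('Realtime error:', evt)
})
// Transport-level failures (bad credentials, rate limits at upgrade time, etc.)
ws.on('error', err => console.error('WebSocket error:', err))
ws.on('close', (code, reason) => console.log('Closed:', code, reason.toString()))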