Start Real-Time Voice Calls
Enable real-time voice calling with voice and text input, and audio output.
Quick Demo
We provide a quick demo; click the link below to try it.
Request Method
WebSocket
Endpoint
wss://api.stepfun.ai/v1/realtime
Request Headers
`Authorization` (string, required)
Auth key; value: `Bearer STEP_API_KEY`
Request Parameters
`model` (string, required)
Model name to use; currently supports `step-1o-audio`, `step-audio-2`, and `step-audio-2-mini`.
Usage
After the service connection is successful, the Realtime API requires sending the corresponding Client Event and receiving the corresponding Server Event to complete the interaction.
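To make the flow concrete, here is a minimal Python sketch of building and serializing client events for this protocol. The `websockets` package, the `STEP_API_KEY` environment variable, passing `model` as a query parameter, and all event IDs are illustrative assumptions, not details confirmed by this document.

```python
import json

def client_event(event_type: str, event_id: str, **fields) -> str:
    """Serialize a client event; every event carries event_id and type."""
    return json.dumps({"event_id": event_id, "type": event_type, **fields})

def session_update(event_id: str, instructions: str, voice: str) -> str:
    """Build a session.update event to configure the session before sending audio."""
    return client_event(
        "session.update",
        event_id,
        session={
            "modalities": ["text", "audio"],
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
        },
    )

# A typical round-trip (connection sketch only, assumed API usage, not run here):
#   import os, base64, websockets
#   headers = {"Authorization": f"Bearer {os.environ['STEP_API_KEY']}"}
#   async with websockets.connect(
#       "wss://api.stepfun.ai/v1/realtime?model=step-1o-audio",  # model-passing assumed
#       additional_headers=headers,
#   ) as ws:
#       await ws.send(session_update("event_001", "Be concise.", "qingchunshaonv"))
#       await ws.send(client_event("input_audio_buffer.append", "event_002",
#                                  audio=base64.b64encode(pcm_bytes).decode()))
#       await ws.send(client_event("input_audio_buffer.commit", "event_003"))
#       await ws.send(client_event("response.create", "event_004"))
#       async for message in ws:  # server events arrive as JSON text frames
#           event = json.loads(message)
```

The helper keeps serialization in one place so every outgoing frame carries the common `event_id` and `type` fields described above.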
Common Parameters
The following are common parameters for Client Events and Server Events.
| Field | Type | Description |
|---|---|---|
| event_id | string | Event ID |
| type | string | Event type; options listed below |
Client Event List
Create/Update Session
type: session.update
Send this event to create or update the default session configuration. The client can send this at any time to update the session configuration; any field may be updated at any time except “voice”. The server responds with a session.updated event.
- `modalities` (array<string>)
  Modalities the model can use. Fixed to `["text", "audio"]`.
- `instructions` (string)
  Default system instructions (system message) attached before model calls. This lets the client guide the model to get the desired response. The model can be guided on content and format (e.g., "be very concise," "be friendly," "here are examples of good replies") and audio behavior (e.g., "speak quickly," "add emotion to your voice," "laugh often"). The model is not guaranteed to follow the instructions, but they guide the desired behavior.
- `voice` (string)
  Voice to use during generation. Supports official voices and custom voices; pass the corresponding voice ID for a custom voice. You can view available IDs via list voices.
  When using step-audio-2 or step-audio-2-mini, only `qingchunshaonv` and `wenrounansheng` are supported, and you need to append "please use the default male voice to talk to the user" or "please use the default female voice to talk to the user" to the end of the instructions.
- `turn_detection` (object, optional)
  Server VAD parameters; off by default.
  - `type` (string, required)
    Currently only supports `server_vad`; enables server-side VAD when configured. Threshold configuration is not supported yet.
- `input_audio_format` (string)
  Format of input audio. Currently only supports `pcm16`.
- `output_audio_format` (string)
  Format of output audio. Currently only supports `pcm16`.
- `tools` (object array, optional)
  List of functions available for tool calls.
  - `type` (string)
    Tool type; either `function` or `retrieval`.
  - `function` (object)
    Description of the function.

  When type is `function`, the `function` object contains:
  - `name` (string)
    Function name; must contain only alphanumeric, `_`, and `-` characters, preferably under 64 characters.
  - `description` (string)
    Function description; tells the model what the function does and its purpose.
  - `parameters` (object)
    Function parameters.
    - `type` (string)
      Parameter schema type, generally `object`.
    - `properties` (object)
      Function parameter definitions; keys are parameter names, each described by `type` and `description`.
      - `type` (string)
        One of `string`, `number`, `integer`, `object`, `array`, or `boolean`; see JSON Schema for reference.
      - `description` (string)
        Explains what the parameter means.

  When type is `retrieval`, the `function` object contains:
  - `description` (string)
    Function description; tells the model what the function does and its purpose.
  - `options` (object)
    Retrieval options.
    - `vector_store_id` (string)
      Knowledge base ID.
    - `prompt_template` (string)
      Template for inserting recalled content into the prompt. Default: "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found", where `{{knowledge}}` is the recalled content and `{{query}}` is the user query. Modify as needed.
Sample
{
"event_id": "event_abc",
"type": "session.update",
"session": {
"modalities": ["text", "audio"],
"instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.",
"voice": "linjiajiejie",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"tools": [
{
"type": "retrieval",
"function": {
"description": "This knowledge base can answer 'One Hundred Thousand Whys' type questions.",
"options": {
"vector_store_id": "164643690285936640",
"prompt_template": "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found"
}
}
},
{
"type": "retrieval",
"function": {
"description": "This knowledge base can answer questions about installing Redis, etc.",
"options": {
"vector_store_id": "164643837904470016",
"prompt_template": "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found"
}
}
}
],
"turn_detection": {
"type": "server_vad"
}
}
}
Append Audio Content
type: input_audio_buffer.append
Send this event to append audio bytes to the input audio buffer. The server does not acknowledge this event. In Server VAD mode it will trigger model inference.
`audio` (string)
Base64-encoded audio bytes. Must use the format specified by the session configuration `input_audio_format` field.
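A short sketch of preparing raw pcm16 audio for this event: split it into chunks and base64-encode each one. The 3200-byte chunk size (100 ms of 16 kHz mono 16-bit audio) and the `event_append_*` IDs are illustrative choices; the API's preferred chunking is not specified here.

```python
import base64
import json

def append_events(pcm_bytes: bytes, chunk_size: int = 3200) -> list[str]:
    """Split raw pcm16 audio into serialized input_audio_buffer.append events."""
    events = []
    for i, start in enumerate(range(0, len(pcm_bytes), chunk_size)):
        chunk = pcm_bytes[start:start + chunk_size]
        events.append(json.dumps({
            "event_id": f"event_append_{i}",  # illustrative ID scheme
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events
```

Each returned string can be sent as one WebSocket text frame; the server does not acknowledge these events.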
Sample
{
"event_id": "event_abc",
"type": "input_audio_buffer.append",
"audio": "Base64EncodedAudioData"
}
Submit Audio Content
type: input_audio_buffer.commit
Send this event to commit the user input audio buffer for inference. This creates a new user message item in the conversation. The server responds with input_audio_buffer.committed. If the input audio buffer is empty, this event produces an error.
Sample
{
"event_id": "event_abc",
"type": "input_audio_buffer.commit"
}
Clear Audio Content
type: input_audio_buffer.clear
Send this event to clear the user input audio buffer. The server responds with input_audio_buffer.cleared.
Sample
{
"event_id": "event_abc",
"type": "input_audio_buffer.clear"
}
Add Conversation Item
type: conversation.item.create
Add a new item to the conversation context, including messages, function calls, and function call responses. This can populate conversation “history” or add new message items along the way, but it cannot currently populate Assistant audio messages. If successful, the server responds with conversation.item.created; otherwise it sends an error.
- `previous_item_id` (string)
  Previous item ID.
- `content` (string)
  Message content for message items; see message parameters.
Sample
{
"event_id": "event_abc",
"type": "conversation.item.create",
"item": {
"id": "msg_001",
"type": "message",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello"
}
]
}
}
Delete Conversation Item
type: conversation.item.delete
Send this event when you want to remove any item from the conversation history. The server responds with conversation.item.deleted, and returns an error if the item does not exist.
`item_id` (string)
ID of the message to delete.
Sample
{
"event_id": "event_abc",
"type": "conversation.item.delete",
"item_id": "msg_003"
}
Submit Inference
type: response.create
This event instructs the server to create a Response, which triggers model inference. The server responds with a response.created event, followed by the streaming output events.
Sample
{
"event_id": "event_abc",
"type": "response.create"
}
Cancel Inference
type: response.cancel
Send this event to cancel the response in progress. The server returns response.cancelled or an error if there is nothing to cancel.
{
"event_id": "event_abc",
"type": "response.cancel"
}
Server Event List
Error Event
type: error
Returned when an error occurs during server execution. This may be a client or server problem. The session remains active.
- `type` (string)
  Error type (e.g., `invalid_request_error`, `server_error`).
- `code` (string)
  Error code, if any.
- `message` (string)
  Human-readable error message.
- `event_id` (string)
  event_id of the client event that caused the error, if applicable.
Sample
{
"event_id": "event_bcd",
"type": "error",
"error": {
"type": "invalid_request_error",
"code": "invalid_param",
"message": "Audio content is incomplete",
"event_id": "event_567"
}
}
Session Created
type: session.created
Returned when a Session is created. Automatically emitted as the first server event when a new connection is established. Contains the default Session configuration.
- `modalities` (array<string>)
  Modalities the model can use. Fixed to `["text", "audio"]`.
- `instructions` (string)
  Default system instructions (system message) attached before model calls. This lets the client guide the model to get the desired response. The model can be guided on content and format (e.g., "be very concise," "be friendly," "here are examples of good replies") and audio behavior (e.g., "speak quickly," "add emotion to your voice," "laugh often"). The model is not guaranteed to follow the instructions, but they guide the desired behavior.
- `voice` (string)
  Voice to use during generation; supports official voices, with custom voices coming later.
- `input_audio_format` (string)
  Format of input audio. Currently only supports `pcm16`.
- `output_audio_format` (string)
  Format of output audio. Currently only supports `pcm16`.
Sample
{
"event_id": "event_def",
"type": "session.created",
"session": {
"id": "sess_001",
"object": "realtime.session",
"model": "step-1o-audio",
"modalities": ["text", "audio"],
"instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.",
"voice": "linjiajiejie",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"max_response_output_tokens": "4096"
}
}
Session Updated
type: session.updated
Returned when a Session is updated, in response to a client session.update event. Contains the updated Session configuration.
Sample
{
"event_id": "event_def",
"type": "session.updated",
"session": {
"modalities": ["text", "audio"],
"instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.",
"voice": "linjiajiejie",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"max_response_output_tokens": "4096"
}
}
Audio Input Activation Start (VAD)
type: input_audio_buffer.speech_started
Notification that valid speech input has started in the audio input; typically used for interruption scenarios.
- `audio_start_ms` (int)
  Start time of the speech in the audio, in milliseconds.
- `item_id` (string)
  Item ID.
Sample
{
"event_id": "event_bcd",
"type": "input_audio_buffer.speech_started",
"audio_start_ms": 1000,
"item_id": "msg_003"
}
Audio Input Activation End (VAD)
type: input_audio_buffer.speech_stopped
Notification that valid speech input in the audio has ended.
- `audio_end_ms` (int)
  End time of the speech in the audio, in milliseconds.
- `item_id` (string)
  Item ID.
Sample
{
"event_id": "event_1718",
"type": "input_audio_buffer.speech_stopped",
"audio_end_ms": 2000,
"item_id": "msg_003"
}
Streaming Audio Output
type: response.audio.delta
Returned when the model-generated audio is updated.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
- `output_index` (int)
  Index of the output item in the response.
- `delta` (string)
  Base64-encoded incremental audio data; the audio format matches the session `output_audio_format`.
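On the receiving side, the deltas for one item can be decoded and concatenated back into raw pcm16 audio. A minimal sketch (the class name and method names are illustrative, not part of the API):

```python
import base64

class AudioAccumulator:
    """Collect base64 deltas from response.audio.delta events into one
    pcm16 byte string per item_id."""

    def __init__(self) -> None:
        self.buffers: dict[str, bytearray] = {}

    def on_event(self, event: dict) -> None:
        # Only audio deltas carry incremental base64 data to decode.
        if event.get("type") == "response.audio.delta":
            item_id = event["item_id"]
            self.buffers.setdefault(item_id, bytearray()).extend(
                base64.b64decode(event["delta"]))

    def pcm(self, item_id: str) -> bytes:
        """Raw pcm16 audio, ready for playback once response.audio.done arrives."""
        return bytes(self.buffers.get(item_id, b""))
```

Feeding every parsed server event through `on_event` keeps playback logic independent of how the server chunks the audio.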
Sample
{
"event_id": "event_bcd",
"type": "response.audio.delta",
"item_id": "msg_008",
"delta": "Base64EncodedAudioDelta"
}
Streaming Audio Complete
type: response.audio.done
Returned when the model-generated audio finishes. Also emitted when a Response is interrupted, incomplete, or canceled.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
{
"event_id": "event_bcd",
"type": "response.audio.done",
"response_id": "traceid",
"item_id": "msg_008"
}
Streaming Audio Transcript
type: response.audio_transcript.delta
Returned when the transcript of the model-generated audio is updated.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
- `output_index` (int)
  Index of the output item in the response.
- `delta` (string)
  Transcript delta.
{
"event_id": "event_bcd",
"type": "response.audio_transcript.delta",
"item_id": "msg_002",
"output_index": 0,
"delta": "Hello, how can I a"
}
Audio Transcript Complete
type: response.audio_transcript.done
Returned when the model-generated audio transcript finishes streaming. Also emitted when a Response is interrupted, incomplete, or canceled.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
- `output_index` (int)
  Index of the output item in the response.
- `transcript` (string)
  Complete transcript of the audio.
{
"event_id": "event_4748",
"type": "response.audio_transcript.done",
"response_id": "resp_001",
"item_id": "msg_008",
"content_index": 0,
"transcript": "Hello, how can I assist you today?"
}
Conversation Item Created
type: conversation.item.created
Returned when a conversation item is created. While generating a Response, the server will, if successful, create one or two Items of type message.
- `id` (string)
  Unique message ID; optional, generated by the server if not provided.
- `type` (string)
  Item type, usually `message`.
- `role` (string)
  Sender role (`user`, `assistant`, `system`); only for message items.
- `status` (string)
  Item status (`completed`, `incomplete`). Does not affect the conversation.
- `content` (string)
  Message content for message items.
{
"event_id": "event_bcd",
"type": "conversation.item.created",
"previous_item_id": "msg_001",
"item": {
"id": "msg_002",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello"
}
]
}
}
Conversation Item Deleted
type: conversation.item.deleted
Returned when the client deletes an item in the conversation using conversation.item.delete. This synchronizes the server conversation history with the client.
`item_id` (string)
Conversation message ID.
{
"event_id": "event_bcd",
"type": "conversation.item.deleted",
"item_id": "msg_001"
}
User Audio Transcription Completed
type: conversation.item.input_audio_transcription.completed
This event is the output of the audio transcription (ASR) for user audio written to the user audio buffer. Transcription starts when the client commits the buffered audio, or when buffered audio is committed in server_vad mode. Transcription runs asynchronously with response creation, so this event may occur before or after response events.
- `item_id` (string)
  Conversation message ID.
- `content_index` (int)
  Index of the audio content part.
- `transcript` (string)
  Full transcript of the audio.
{
"event_id": "event_2122",
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "msg_003",
"content_index": 0,
"transcript": "Hello"
}
Audio Commit Response
type: input_audio_buffer.committed
Returned when the client commits the input audio buffer.
- `previous_item_id` (string)
  Previous conversation message ID.
- `item_id` (string)
  Conversation message ID.
{
"event_id": "event_bcd",
"type": "input_audio_buffer.committed",
"previous_item_id": "msg_001",
"item_id": "msg_002"
}
Audio Buffer Cleared
type: input_audio_buffer.cleared
Returned when the client clears the input audio buffer using input_audio_buffer.clear.
{
"event_id": "event_1314",
"type": "input_audio_buffer.cleared"
}
Response Output Item Added
type: response.output_item.added
Returned when a new item is created during response generation.
- `output_index` (int)
  Index of the output item in the response.
- `item` (object)
  Output item object.
  - `id` (string)
    Item ID.
  - `object` (string)
    Always `realtime.item`.
  - `type` (string)
    Item type; currently only supports `message`.
  - `status` (string)
    Item status: `completed`, `incomplete`, or `in_progress`.
  - `role` (string)
    Role for the item; only for message items. Options: `user`, `assistant`, `system`.
  - `content` (array)
    Content of the message; applies to message items.
    - role=system message items only support `input_text` content.
    - role=user message items support `input_text` and `input_audio` content.
    - role=assistant items support `text` content.
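The role/content-type rules above can be encoded in a small lookup for validating items before sending them; the table and function names here are illustrative, not part of the API.

```python
# Allowed content-part types per message role, per the rules above.
ALLOWED_CONTENT = {
    "system": {"input_text"},
    "user": {"input_text", "input_audio"},
    "assistant": {"text"},
}

def valid_item_content(item: dict) -> bool:
    """Check that every content part's type is allowed for the item's role."""
    allowed = ALLOWED_CONTENT.get(item.get("role"), set())
    return all(part.get("type") in allowed for part in item.get("content", []))
```

A client can run this check before conversation.item.create to catch mistakes locally instead of waiting for a server error event.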
{
"event_id": "event_3334",
"type": "response.output_item.added",
"response_id": "resp_001",
"output_index": 0,
"item": {
"id": "msg_007",
"object": "realtime.item",
"type": "message",
"status": "in_progress",
"role": "assistant",
"content": []
}
}
Response Output Item Done
type: response.output_item.done
Returned when an item is completed. Also emitted when a response is interrupted, incomplete, or cancelled.
- `output_index` (int)
  Index of the output item in the response.
- `item` (object)
  Output item object.
  - `id` (string)
    Item ID.
  - `object` (string)
    Always `realtime.item`.
  - `type` (string)
    Item type; currently only supports `message`.
  - `status` (string)
    Item status: `completed`, `incomplete`, or `in_progress`.
  - `role` (string)
    Role for the item; only for message items. Options: `user`, `assistant`, `system`.
  - `content` (array)
    Content of the message; applies to message items.
    - role=system message items only support `input_text` content.
    - role=user message items support `input_text` and `input_audio` content.
    - role=assistant items support `text` content.
{
"event_id": "event_3536",
"type": "response.output_item.done",
"response_id": "resp_001",
"output_index": 0,
"item": {
"id": "msg_007",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Sure, I can help with that."
}
]
}
}
Response Content Part Added
type: response.content_part.added
Returned during response generation when a new content part is added to an assistant message item.
- `response_id` (string)
  Response ID.
- `item_id` (string)
  Corresponding item ID.
- `content_index` (int)
  Index of the content part within the item content array.
- `output_index` (int)
  Index of the output item in the response.
- `part` (object)
  - `type` (string)
    Type; supports `text` or `audio`.
  - `audio` (string)
    Base64-encoded audio data (present when type=audio).
  - `text` (string)
    Generated text content (present when type=text).
  - `transcript` (string)
    Transcript of the audio (present when type=audio).
{
"event_id": "event_3738",
"type": "response.content_part.added",
"response_id": "resp_001",
"item_id": "msg_007",
"output_index": 0,
"content_index": 0,
"part": {
"type": "text",
"text": ""
}
}
Response Content Part Done
type: response.content_part.done
Returned when a content_part completes. Also emitted when the corresponding response is interrupted, incomplete, or cancelled.
- `response_id` (string)
  Response ID.
- `item_id` (string)
  Corresponding item ID.
- `content_index` (int)
  Index of the content part within the item content array.
- `output_index` (int)
  Index of the output item in the response.
- `part` (object)
  - `type` (string)
    Type; supports `text` or `audio`.
  - `audio` (string)
    Base64-encoded audio data (present when type=audio).
  - `text` (string)
    Generated text content (present when type=text).
  - `transcript` (string)
    Transcript of the audio (present when type=audio).
{
"event_id": "event_3940",
"type": "response.content_part.done",
"response_id": "resp_001",
"item_id": "msg_007",
"output_index": 0,
"content_index": 0,
"part": {
"type": "text",
"text": "Sure, I can help with that."
}
}
Response Created
type: response.created
Returned when a new Response is created. This is the first event for a response, with the initial status set to in_progress.
- `id` (string)
  Unique ID for the response.
- `object` (string)
  Object type; must be `realtime.response`.
- `status` (string)
  Status of the response; `in_progress` when first created (final values: `completed`, `cancelled`, `failed`, `incomplete`).
- `output` (list)
  List of output items generated in the response.
{
"event_id": "event_3132",
"type": "response.created",
"response": {
"id": "resp_001",
"object": "realtime.response",
"status": "in_progress",
"status_details": null,
"output": []
}
}
Response Done
type: response.done
Returned when a Response finishes streaming. Always emitted regardless of final status. The Response object in response.done includes all output Items but omits raw audio data.
- `id` (string)
  Unique ID for the response.
- `object` (string)
  Object type; must be `realtime.response`.
- `status` (string)
  Final status of the response (`completed`, `cancelled`, `failed`, `incomplete`).
- `output` (list)
  List of output items generated in the response.
{
"event_id": "event_bcd",
"type": "response.done",
"response": {
"id": "resp_001",
"object": "realtime.response",
"status": "completed",
"status_details": null,
"output": [
{
"id": "msg_006",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello"
}
]
}
]
}
}
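Since response.done carries every output Item but no raw audio, a client typically walks its output list to recover the assistant's text. A minimal sketch (function name is illustrative):

```python
def response_text(event: dict) -> str:
    """Concatenate text parts from assistant message items in a
    response.done event; audio transcripts arrive separately via the
    response.audio_transcript.* events."""
    pieces = []
    for item in event.get("response", {}).get("output", []):
        if item.get("type") == "message" and item.get("role") == "assistant":
            for part in item.get("content", []):
                if part.get("type") == "text":
                    pieces.append(part.get("text", ""))
    return "".join(pieces)
```

Because response.done is always emitted regardless of final status, this is a convenient single place to log or persist the turn's result.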