Start Real-Time Voice Calls
Enable real-time voice calling with voice and text input, and audio output.
Quick Demo
We provide a quick demo; click the link below to try it.
Request Method
WebSocket
Endpoint
wss://api.stepfun.ai/v1/realtime
Request Headers
`Authorization` (string, required)
Auth key; value: `Bearer STEP_API_KEY`
Request Parameters
`model` (string, required)
Model name to use; currently supports `step-1o-audio`, `step-audio-2`, and `step-audio-2-mini`.
Usage
After the service connection is successful, the Realtime API requires sending the corresponding Client Event and receiving the corresponding Server Event to complete the interaction.
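To make the flow concrete, here is a minimal Python sketch of building and serializing client events for this protocol. The `websockets` package, the `STEP_API_KEY` environment variable, passing `model` as a query parameter, and all event IDs are illustrative assumptions, not details confirmed by this document.

```python
import json

def client_event(event_type: str, event_id: str, **fields) -> str:
    """Serialize a client event; every event carries event_id and type."""
    return json.dumps({"event_id": event_id, "type": event_type, **fields})

def session_update(event_id: str, instructions: str, voice: str) -> str:
    """Build a session.update event to configure the session before sending audio."""
    return client_event(
        "session.update",
        event_id,
        session={
            "modalities": ["text", "audio"],
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
        },
    )

# A typical round-trip (connection sketch only, assumed API usage, not run here):
#   import os, base64, websockets
#   headers = {"Authorization": f"Bearer {os.environ['STEP_API_KEY']}"}
#   async with websockets.connect(
#       "wss://api.stepfun.ai/v1/realtime?model=step-1o-audio",  # model-passing assumed
#       additional_headers=headers,
#   ) as ws:
#       await ws.send(session_update("event_001", "Be concise.", "qingchunshaonv"))
#       await ws.send(client_event("input_audio_buffer.append", "event_002",
#                                  audio=base64.b64encode(pcm_bytes).decode()))
#       await ws.send(client_event("input_audio_buffer.commit", "event_003"))
#       await ws.send(client_event("response.create", "event_004"))
#       async for message in ws:  # server events arrive as JSON text frames
#           event = json.loads(message)
```

The helper keeps serialization in one place so every outgoing frame carries the common `event_id` and `type` fields described above.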
Common Parameters
The following are common parameters for Client Events and Server Events.
| Field | Type | Description |
|---|---|---|
| event_id | string | Event ID |
| type | string | Event type; options listed below |
Client Event List
Create/Update Session
type: session.update
Send this event to create or update the default session configuration. The client can send this at any time to update the session configuration; any field may be updated at any time except “voice”. The server responds with a session.updated event.
- `modalities` (array<string>)
  Modalities the model can use. Fixed to `["text", "audio"]`.
- `instructions` (string)
  Default system instructions (system message) attached before model calls. This lets the client guide the model to get the desired response. The model can be guided on content and format (e.g., "be very concise," "be friendly," "here are examples of good replies") and audio behavior (e.g., "speak quickly," "add emotion to your voice," "laugh often"). The model is not guaranteed to follow the instructions, but they guide the desired behavior.
- `voice` (string)
  Voice to use during generation. Supports official voices and custom voices; pass the corresponding voice ID for a custom voice. You can view available IDs via list voices.
  When using step-audio-2 or step-audio-2-mini, only `qingchunshaonv` and `wenrounansheng` are supported, and you need to append "please use the default male voice to talk to the user" or "please use the default female voice to talk to the user" to the end of the instructions.
- `turn_detection` (object, optional)
  Server VAD parameters; off by default.
  - `type` (string, required)
    Currently only supports `server_vad`; enables server-side VAD when configured. Threshold configuration is not supported yet.
- `input_audio_format` (string)
  Format of input audio. Currently only supports `pcm16`.
- `output_audio_format` (string)
  Format of output audio. Currently only supports `pcm16`.
- `tools` (object array, optional)
  List of functions available for tool calls.
  - `type` (string)
    Tool type; either `function` or `retrieval`.
  - `function` (object)
    Description of the function.

  When type is `function`, the `function` object contains:
  - `name` (string)
    Function name; must contain only alphanumeric, `_`, and `-` characters, preferably under 64 characters.
  - `description` (string)
    Function description; tells the model what the function does and its purpose.
  - `parameters` (object)
    Function parameters.
    - `type` (string)
      Parameter schema type, generally `object`.
    - `properties` (object)
      Function parameter definitions; keys are parameter names, each described by `type` and `description`.
      - `type` (string)
        One of `string`, `number`, `integer`, `object`, `array`, or `boolean`; see JSON Schema for reference.
      - `description` (string)
        Explains what the parameter means.

  When type is `retrieval`, the `function` object contains:
  - `description` (string)
    Function description; tells the model what the function does and its purpose.
  - `options` (object)
    Retrieval options.
    - `vector_store_id` (string)
      Knowledge base ID.
    - `prompt_template` (string)
      Template for inserting recalled content into the prompt. Default: "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found", where `{{knowledge}}` is the recalled content and `{{query}}` is the user query. Modify as needed.
Sample
{
"event_id": "event_abc",
"type": "session.update",
"session": {
"modalities": ["text", "audio"],
"instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.",
"voice": "linjiajiejie",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"tools": [
{
"type": "retrieval",
"function": {
"description": "This knowledge base can answer 'One Hundred Thousand Whys' type questions.",
"options": {
"vector_store_id": "164643690285936640",
"prompt_template": "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found"
}
}
},
{
"type": "retrieval",
"function": {
"description": "This knowledge base can answer questions about installing Redis, etc.",
"options": {
"vector_store_id": "164643837904470016",
"prompt_template": "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found"
}
}
}
],
"turn_detection": {
"type": "server_vad"
}
}
}
Append Audio Content
type: input_audio_buffer.append
Send this event to append audio bytes to the input audio buffer. The server does not acknowledge this event. In Server VAD mode it will trigger model inference.
`audio` (string)
Base64-encoded audio bytes. Must use the format specified by the session configuration `input_audio_format` field.
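A short sketch of preparing raw pcm16 audio for this event: split it into chunks and base64-encode each one. The 3200-byte chunk size (100 ms of 16 kHz mono 16-bit audio) and the `event_append_*` IDs are illustrative choices; the API's preferred chunking is not specified here.

```python
import base64
import json

def append_events(pcm_bytes: bytes, chunk_size: int = 3200) -> list[str]:
    """Split raw pcm16 audio into serialized input_audio_buffer.append events."""
    events = []
    for i, start in enumerate(range(0, len(pcm_bytes), chunk_size)):
        chunk = pcm_bytes[start:start + chunk_size]
        events.append(json.dumps({
            "event_id": f"event_append_{i}",  # illustrative ID scheme
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events
```

Each returned string can be sent as one WebSocket text frame; the server does not acknowledge these events.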
Sample
{
"event_id": "event_abc",
"type": "input_audio_buffer.append",
"audio": "Base64EncodedAudioData"
}
Submit Audio Content
type: input_audio_buffer.commit
Send this event to commit the user input audio buffer for inference. This creates a new user message item in the conversation. The server responds with input_audio_buffer.committed. If the input audio buffer is empty, this event produces an error.
Sample
{
"event_id": "event_abc",
"type": "input_audio_buffer.commit"
}
Clear Audio Content
type: input_audio_buffer.clear
Send this event to clear the user input audio buffer. The server responds with input_audio_buffer.cleared.
Sample
{
"event_id": "event_abc",
"type": "input_audio_buffer.clear"
}
Add Conversation Item
type: conversation.item.create
Add a new item to the conversation context, including messages, function calls, and function call responses. This can populate conversation “history” or add new message items along the way, but it cannot currently populate Assistant audio messages. If successful, the server responds with conversation.item.created; otherwise it sends an error.
- `previous_item_id` (string)
  Previous item ID.
- `content` (string)
  Message content for message items; see message parameters.
Sample
{
"event_id": "event_abc",
"type": "conversation.item.create",
"item": {
"id": "msg_001",
"type": "message",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello"
}
]
}
}
Delete Conversation Item
type: conversation.item.delete
Send this event when you want to remove any item from the conversation history. The server responds with conversation.item.deleted, and returns an error if the item does not exist.
`item_id` (string)
ID of the message to delete.
Sample
{
"event_id": "event_abc",
"type": "conversation.item.delete",
"item_id": "msg_003"
}
Submit Inference
type: response.create
This event instructs the server to create a Response, which triggers model inference. The server responds with a response.created event, followed by the streaming output events.
Sample
{
"event_id": "event_abc",
"type": "response.create"
}
Cancel Inference
type: response.cancel
Send this event to cancel the response in progress. The server returns response.cancelled or an error if there is nothing to cancel.
{
"event_id": "event_abc",
"type": "response.cancel"
}
Server Event List
Error Event
type: error
Returned when an error occurs during server execution. This may be a client or server problem. The session remains active.
- `type` (string)
  Error type (e.g., `invalid_request_error`, `server_error`).
- `code` (string)
  Error code, if any.
- `message` (string)
  Human-readable error message.
- `event_id` (string)
  event_id of the client event that caused the error, if applicable.
Sample
{
"event_id": "event_bcd",
"type": "error",
"error": {
"type": "invalid_request_error",
"code": "invalid_param",
"message": "Audio content is incomplete",
"event_id": "event_567"
}
}
Session Created
type: session.created
Returned when a Session is created. Automatically emitted as the first server event when a new connection is established. Contains the default Session configuration.
- `modalities` (array<string>)
  Modalities the model can use. Fixed to `["text", "audio"]`.
- `instructions` (string)
  Default system instructions (system message) attached before model calls. This lets the client guide the model to get the desired response. The model can be guided on content and format (e.g., "be very concise," "be friendly," "here are examples of good replies") and audio behavior (e.g., "speak quickly," "add emotion to your voice," "laugh often"). The model is not guaranteed to follow the instructions, but they guide the desired behavior.
- `voice` (string)
  Voice to use during generation; supports official voices, with custom voices coming later.
- `input_audio_format` (string)
  Format of input audio. Currently only supports `pcm16`.
- `output_audio_format` (string)
  Format of output audio. Currently only supports `pcm16`.
Sample
{
"event_id": "event_def",
"type": "session.created",
"session": {
"id": "sess_001",
"object": "realtime.session",
"model": "step-1o-audio",
"modalities": ["text", "audio"],
"instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.",
"voice": "linjiajiejie",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"max_response_output_tokens": "4096"
}
}
Session Updated
type: session.updated
Returned when a Session is updated, in response to a client session.update event. Contains the updated Session configuration.
Sample
{
"event_id": "event_def",
"type": "session.updated",
"session": {
"modalities": ["text", "audio"],
"instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.",
"voice": "linjiajiejie",
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"max_response_output_tokens": "4096"
}
}
Audio Input Activation Start (VAD)
type: input_audio_buffer.speech_started
Notification that valid speech input has started in the audio input; typically used for interruption scenarios.
- `audio_start_ms` (int)
  Start time of the speech in the audio, in milliseconds.
- `item_id` (string)
  Item ID.
Sample
{
"event_id": "event_bcd",
"type": "input_audio_buffer.speech_started",
"audio_start_ms": 1000,
"item_id": "msg_003"
}
Audio Input Activation End (VAD)
type: input_audio_buffer.speech_stopped
Notification that valid speech input in the audio has ended.
- `audio_end_ms` (int)
  End time of the speech in the audio, in milliseconds.
- `item_id` (string)
  Item ID.
Sample
{
"event_id": "event_1718",
"type": "input_audio_buffer.speech_stopped",
"audio_end_ms": 2000,
"item_id": "msg_003"
}
Streaming Audio Output
type: response.audio.delta
Returned when the model-generated audio is updated.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
- `output_index` (int)
  Index of the output item in the response.
- `delta` (string)
  Base64-encoded incremental audio data; the audio format matches the session `output_audio_format`.
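On the receiving side, the deltas for one item can be decoded and concatenated back into raw pcm16 audio. A minimal sketch (the class name and method names are illustrative, not part of the API):

```python
import base64

class AudioAccumulator:
    """Collect base64 deltas from response.audio.delta events into one
    pcm16 byte string per item_id."""

    def __init__(self) -> None:
        self.buffers: dict[str, bytearray] = {}

    def on_event(self, event: dict) -> None:
        # Only audio deltas carry incremental base64 data to decode.
        if event.get("type") == "response.audio.delta":
            item_id = event["item_id"]
            self.buffers.setdefault(item_id, bytearray()).extend(
                base64.b64decode(event["delta"]))

    def pcm(self, item_id: str) -> bytes:
        """Raw pcm16 audio, ready for playback once response.audio.done arrives."""
        return bytes(self.buffers.get(item_id, b""))
```

Feeding every parsed server event through `on_event` keeps playback logic independent of how the server chunks the audio.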
Sample
{
"event_id": "event_bcd",
"type": "response.audio.delta",
"item_id": "msg_008",
"delta": "Base64EncodedAudioDelta"
}
Streaming Audio Complete
type: response.audio.done
Returned when the model-generated audio finishes. Also emitted when a Response is interrupted, incomplete, or canceled.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
{
"event_id": "event_bcd",
"type": "response.audio.done",
"response_id": "traceid",
"item_id": "msg_008"
}
Streaming Audio Transcript
type: response.audio_transcript.delta
Returned when the transcript of the model-generated audio is updated.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
- `output_index` (int)
  Index of the output item in the response.
- `delta` (string)
  Transcript delta.
{
"event_id": "event_bcd",
"type": "response.audio_transcript.delta",
"item_id": "msg_002",
"output_index": 0,
"delta": "Hello, how can I a"
}
Audio Transcript Complete
type: response.audio_transcript.done
Returned when the model-generated audio transcript finishes streaming. Also emitted when a Response is interrupted, incomplete, or canceled.
- `response_id` (string)
  Response ID; usually a trace ID.
- `item_id` (string)
  Item ID.
- `output_index` (int)
  Index of the output item in the response.
- `transcript` (string)
  Complete transcript of the audio.
{
"event_id": "event_4748",
"type": "response.audio_transcript.done",
"response_id": "resp_001",
"item_id": "msg_008",
"content_index": 0,
"transcript": "Hello, how can I assist you today?"
}
Conversation Item Created
type: conversation.item.created
Returned when a conversation item is created. While generating a Response, the server will, if successful, create one or two Items of type message.
- `id` (string)
  Unique message ID; optional, generated by the server if not provided.
- `type` (string)
  Item type, usually `message`.
- `role` (string)
  Sender role (`user`, `assistant`, `system`); only for message items.
- `status` (string)
  Item status (`completed`, `incomplete`). Does not affect the conversation.
- `content` (string)
  Message content for message items.
{
"event_id": "event_bcd",
"type": "conversation.item.created",
"previous_item_id": "msg_001",
"item": {
"id": "msg_002",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello"
}
]
}
}
Conversation Item Deleted
type: conversation.item.deleted
Returned when the client deletes an item in the conversation using conversation.item.delete. This synchronizes the server conversation history with the client.
`item_id` (string)
Conversation message ID.
{
"event_id": "event_bcd",
"type": "conversation.item.deleted",
"item_id": "msg_001"
}
User Audio Transcription Completed
type: conversation.item.input_audio_transcription.completed
This event is the output of the audio transcription (ASR) for user audio written to the user audio buffer. Transcription starts when the client commits the buffered audio, or when buffered audio is committed in server_vad mode. Transcription runs asynchronously with response creation, so this event may occur before or after response events.
- `item_id` (string)
  Conversation message ID.
- `content_index` (int)
  Index of the audio content part.
- `transcript` (string)
  Full transcript of the audio.
{
"event_id": "event_2122",
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "msg_003",
"content_index": 0,
"transcript": "Hello"
}
Audio Commit Response
type: input_audio_buffer.committed
Returned when the client commits the input audio buffer.
- `previous_item_id` (string)
  Previous conversation message ID.
- `item_id` (string)
  Conversation message ID.
{
"event_id": "event_bcd",
"type": "input_audio_buffer.committed",
"previous_item_id": "msg_001",
"item_id": "msg_002"
}
Audio Buffer Cleared
type: input_audio_buffer.cleared
Returned when the client clears the input audio buffer using input_audio_buffer.clear.
{
"event_id": "event_1314",
"type": "input_audio_buffer.cleared"
}
Response Output Item Added
type: response.output_item.added
Returned when a new item is created during response generation.
- `output_index` (int)
  Index of the output item in the response.
- `item` (object)
  Output item object.
  - `id` (string)
    Item ID.
  - `object` (string)
    Always `realtime.item`.
  - `type` (string)
    Item type; currently only supports `message`.
  - `status` (string)
    Item status: `completed`, `incomplete`, or `in_progress`.
  - `role` (string)
    Role for the item; only for message items. Options: `user`, `assistant`, `system`.
  - `content` (array)
    Content of the message; applies to message items.
    - role=system message items only support `input_text` content.
    - role=user message items support `input_text` and `input_audio` content.
    - role=assistant items support `text` content.
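The role/content-type rules above can be encoded in a small lookup for validating items before sending them; the table and function names here are illustrative, not part of the API.

```python
# Allowed content-part types per message role, per the rules above.
ALLOWED_CONTENT = {
    "system": {"input_text"},
    "user": {"input_text", "input_audio"},
    "assistant": {"text"},
}

def valid_item_content(item: dict) -> bool:
    """Check that every content part's type is allowed for the item's role."""
    allowed = ALLOWED_CONTENT.get(item.get("role"), set())
    return all(part.get("type") in allowed for part in item.get("content", []))
```

A client can run this check before conversation.item.create to catch mistakes locally instead of waiting for a server error event.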
{
"event_id": "event_3334",
"type": "response.output_item.added",
"response_id": "resp_001",
"output_index": 0,
"item": {
"id": "msg_007",
"object": "realtime.item",
"type": "message",
"status": "in_progress",
"role": "assistant",
"content": []
}
}
Response Output Item Done
type: response.output_item.done
Returned when an item is completed. Also emitted when a response is interrupted, incomplete, or cancelled.
- `output_index` (int)
  Index of the output item in the response.
- `item` (object)
  Output item object.
  - `id` (string)
    Item ID.
  - `object` (string)
    Always `realtime.item`.
  - `type` (string)
    Item type; currently only supports `message`.
  - `status` (string)
    Item status: `completed`, `incomplete`, or `in_progress`.
  - `role` (string)
    Role for the item; only for message items. Options: `user`, `assistant`, `system`.
  - `content` (array)
    Content of the message; applies to message items.
    - role=system message items only support `input_text` content.
    - role=user message items support `input_text` and `input_audio` content.
    - role=assistant items support `text` content.
{
"event_id": "event_3536",
"type": "response.output_item.done",
"response_id": "resp_001",
"output_index": 0,
"item": {
"id": "msg_007",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Sure, I can help with that."
}
]
}
}
Response Content Part Added
type: response.content_part.added
Returned during response generation when a new content part is added to an assistant message item.
- `response_id` (string)
  Response ID.
- `item_id` (string)
  Corresponding item ID.
- `content_index` (int)
  Index of the content part within the item content array.
- `output_index` (int)
  Index of the output item in the response.
- `part` (object)
  - `type` (string)
    Type; supports `text` or `audio`.
  - `audio` (string)
    Base64-encoded audio data (present when type=audio).
  - `text` (string)
    Generated text content (present when type=text).
  - `transcript` (string)
    Transcript of the audio (present when type=audio).
{
"event_id": "event_3738",
"type": "response.content_part.added",
"response_id": "resp_001",
"item_id": "msg_007",
"output_index": 0,
"content_index": 0,
"part": {
"type": "text",
"text": ""
}
}
Response Content Part Done
type: response.content_part.done
Returned when a content_part completes. Also emitted when the corresponding response is interrupted, incomplete, or cancelled.
- `response_id` (string)
  Response ID.
- `item_id` (string)
  Corresponding item ID.
- `content_index` (int)
  Index of the content part within the item content array.
- `output_index` (int)
  Index of the output item in the response.
- `part` (object)
  - `type` (string)
    Type; supports `text` or `audio`.
  - `audio` (string)
    Base64-encoded audio data (present when type=audio).
  - `text` (string)
    Generated text content (present when type=text).
  - `transcript` (string)
    Transcript of the audio (present when type=audio).
{
"event_id": "event_3940",
"type": "response.content_part.done",
"response_id": "resp_001",
"item_id": "msg_007",
"output_index": 0,
"content_index": 0,
"part": {
"type": "text",
"text": "Sure, I can help with that."
}
}
Response Created
type: response.created
Returned when a new Response is created. This is the first event for a response, with the initial status set to in_progress.
- `id` (string)
  Unique ID for the response.
- `object` (string)
  Object type; must be `realtime.response`.
- `status` (string)
  Status of the response; `in_progress` when first created (final values: `completed`, `cancelled`, `failed`, `incomplete`).
- `output` (list)
  List of output items generated in the response.
{
"event_id": "event_3132",
"type": "response.created",
"response": {
"id": "resp_001",
"object": "realtime.response",
"status": "in_progress",
"status_details": null,
"output": []
}
}
Response Done
type: response.done
Returned when a Response finishes streaming. Always emitted regardless of final status. The Response object in response.done includes all output Items but omits raw audio data.
- `id` (string)
  Unique ID for the response.
- `object` (string)
  Object type; must be `realtime.response`.
- `status` (string)
  Final status of the response (`completed`, `cancelled`, `failed`, `incomplete`).
- `output` (list)
  List of output items generated in the response.
{
"event_id": "event_bcd",
"type": "response.done",
"response": {
"id": "resp_001",
"object": "realtime.response",
"status": "completed",
"status_details": null,
"output": [
{
"id": "msg_006",
"object": "realtime.item",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "text",
"text": "Hello"
}
]
}
]
}
}
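Since response.done carries every output Item but no raw audio, a client typically walks its output list to recover the assistant's text. A minimal sketch (function name is illustrative):

```python
def response_text(event: dict) -> str:
    """Concatenate text parts from assistant message items in a
    response.done event; audio transcripts arrive separately via the
    response.audio_transcript.* events."""
    pieces = []
    for item in event.get("response", {}).get("output", []):
        if item.get("type") == "message" and item.get("role") == "assistant":
            for part in item.get("content", []):
                if part.get("type") == "text":
                    pieces.append(part.get("text", ""))
    return "".join(pieces)
```

Because response.done is always emitted regardless of final status, this is a convenient single place to log or persist the turn's result.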