
Start Real-Time Voice Calls

Enable real-time voice calling with voice and text input, and audio output.


Request Method

WebSocket

Endpoint

wss://api.stepfun.ai/v1/realtime

Request Headers

  • Authorization string required
    Auth key; value Bearer STEP_API_KEY

Request Parameters

  • model string required
    Model name to use; currently supports step-1o-audio, step-audio-2, and step-audio-2-mini

Usage

Once the WebSocket connection is established, the client drives the interaction by sending Client Events and handling the corresponding Server Events returned by the Realtime API.
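
A minimal connection sketch in Python (using the third-party websocket-client package): the Authorization header follows the request headers above, while passing model as a query parameter on the endpoint URL is an assumption to verify for your integration.

import json
import os

import websocket  # pip install websocket-client

API_KEY = os.environ["STEP_API_KEY"]
# Assumption: the model is selected via a query parameter on the endpoint URL.
URL = "wss://api.stepfun.ai/v1/realtime?model=step-audio-2-mini"

ws = websocket.create_connection(
    URL,
    header=[f"Authorization: Bearer {API_KEY}"],
)

# The first server event on a new connection is session.created.
created = json.loads(ws.recv())
print(created["type"], created.get("session", {}).get("id"))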

Common Parameters

The following are common parameters for Client Events and Server Events.

  • event_id string
    Event ID
  • type string
    Event type; options listed below

Client Event List

Create/Update Session

type: session.update

Send this event to create or update the default session configuration. The client can send this at any time to update the session configuration; any field may be updated at any time except “voice”. The server responds with a session.updated event.

  • modalities array<string>


    Modalities the model can use. Fixed to ["text", "audio"]

  • instructions string


    Default system instructions (system message) attached before model calls. This lets the client guide the model to get the desired response. The model can be guided on content and format (e.g., “be very concise,” “be friendly,” “here are examples of good replies”) and audio behavior (e.g., “speak quickly,” “add emotion to your voice,” “laugh often”). The model is not guaranteed to follow the instructions, but they guide the desired behavior.

  • voice string


    Voice to use during generation. Supports official voices and custom voices. Pass the corresponding voice ID for custom voices. You can view available IDs via list voices.
    When using step-audio-2 or step-audio-2-mini, only qingchunshaonv and wenrounansheng are supported, and you need to append “please use the default male voice to talk to the user” or “please use the default female voice to talk to the user” at the end of the instructions.

  • turn_detection object optional


    Server VAD parameters; off by default

    • type string required
      Currently only supports server_vad; enables server-side VAD when configured. Threshold configuration is not supported yet.
  • input_audio_format string


    Format of input audio. Currently only supports pcm16.

  • output_audio_format string


    Format of output audio. Currently only supports pcm16.

  • tools object array optional


    List of functions supported by Toolcall.

    • type string
      Tool type; always function or retrieval

    • function object
      Description of the function

      When type is function

      • name string
        Function name; must be alphanumeric with _- characters only and preferably under 64 characters.

      • description string
        Function description; used to tell the model what the function does and its purpose.

      • parameters object
        Function parameters

        • type object
          Parameter description, generally an object

        • properties object
          Function parameter content; keys are parameter names, each described by type and description.

          • type string|number|integer|object|array|boolean
            Parameter type; see json-schema for reference.

          • description string
            Function parameter description; explains what the parameter means.

      When type is retrieval

      • description string
        Function description; used to tell the model what the function does and its purpose.

      • options object
        Function parameters

        • vector_store_id string
          Knowledge base ID

        • prompt_template string
          Template used to insert the recalled content into the model prompt. Default is "Find the answer to question {{ query }} from the document {{ knowledge }}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found", where {{ knowledge }} is the recalled content and {{ query }} is the user query. Modify as needed.

Sample

{ "event_id": "event_abc", "type": "session.update", "session": { "modalities": ["text", "audio"], "instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.", "voice": "linjiajiejie", "input_audio_format": "pcm16", "output_audio_format": "pcm16", "tools": [ { # Example knowledge base config "type": "retrieval", "function": { "description": "This knowledge base can answer 'One Hundred Thousand Whys' type questions.", "options": { "vector_store_id": "164643690285936640", "prompt_template": "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found" } } }, { # Example knowledge base config "type": "retrieval", "function": { "description": "This knowledge base can answer questions about installing Redis, etc.", "options": { "vector_store_id": "164643837904470016", "prompt_template": "Find the answer to question {{query}} from the document {{knowledge}}. Find the answer using sentences from the document; if the document does not contain an answer, tell the user it cannot be found" } } }, ], "turn_detection": { "type": "server_vad" } } }
Append Audio Content

type: input_audio_buffer.append

Send this event to append audio bytes to the input audio buffer. The server does not acknowledge this event. In Server VAD mode it will trigger model inference.

  • audio string
    Base64-encoded audio bytes. Must use the format specified in the session configuration input_audio_format field.

Sample

{ "event_id": "event_abc", "type": "input_audio_buffer.append", "audio": "Base64EncodedAudioData" }
Submit Audio Content

type: input_audio_buffer.commit

Send this event to commit the user input audio buffer for inference. This creates a new user message item in the conversation. The server responds with input_audio_buffer.committed. If the input audio buffer is empty, this event produces an error.

Sample

{ "event_id": "event_abc", "type": "input_audio_buffer.commit" }
Clear Audio Content

type: input_audio_buffer.clear

Send this event to clear the user input audio buffer. The server responds with input_audio_buffer.cleared.

Sample

{ "event_id": "event_abc", "type": "input_audio_buffer.clear" }
Add Conversation Item

type: conversation.item.create

Add a new item to the conversation context, including messages, function calls, and function call responses. This can populate conversation “history” or add new message items along the way, but it cannot currently populate Assistant audio messages. If successful, the server responds with conversation.item.created; otherwise it sends an error.

  • previous_item_id string


    Previous item ID

  • content string
    Message content for message items; see message parameters.

Sample

{ "event_id": "event_abc", "type": "conversation.item.create", "item": { "id": "msg_001", "type": "message", "role": "user", "content": [ { "type": "input_text", "text": "Hello" } ] } }
Delete Conversation Item

type: conversation.item.delete

Send this event when you want to remove any item from the conversation history. The server responds with conversation.item.deleted, and returns an error if the item does not exist.

  • item_id string
    ID of the message to delete

Sample

{ "event_id": "event_abc", "type": "conversation.item.delete", "item_id": "msg_003" }
Submit Inference

type: response.create

This event instructs the server to create a Response, which triggers model inference. The server responds with a response.created event, followed by the streamed output events and a final response.done.

Sample

{ "event_id": "event_abc", "type": "response.create" }
Cancel Inference

type: response.cancel

Send this event to cancel the response in progress. The server returns response.cancelled or an error if there is nothing to cancel.

{ "event_id": "event_abc", "type": "response.cancel" }

Server Event List

Error Event

type: error

Returned when an error occurs during server execution. This may be a client or server problem. The session remains active.

  • type string


    Error type (e.g., “invalid_request_error,” “server_error”).

  • code string


    Error code (if any).

  • message string


    Human-readable error message.

  • event_id string
    event_id of the client event that caused the error (if applicable).

Sample

{ "event_id": "event_bcd", "type": "error", "error": { "type": "invalid_request_error", "code": "invalid_param", "message": "Audio content is incomplete", "event_id": "event_567" } }
Session Created

type: session.created

Returned when a Session is created. Automatically emitted as the first server event when a new connection is established. Contains the default Session configuration.

  • modalities array<string>


    Modalities the model can use. Fixed to ["text", "audio"]

  • instructions string


    Default system instructions (system message) attached before model calls. This lets the client guide the model to get the desired response. The model can be guided on content and format (e.g., “be very concise,” “be friendly,” “here are examples of good replies”) and audio behavior (e.g., “speak quickly,” “add emotion to your voice,” “laugh often”). The model is not guaranteed to follow the instructions, but they guide the desired behavior.

  • voice string


    Voice to use during generation; supports official voices, with custom voices coming later.

  • input_audio_format string


    Format of input audio. Currently only supports pcm16.

  • output_audio_format string
    Format of output audio. Currently only supports pcm16.

Sample

{ "event_id": "event_def", "type": "session.created", "session": { "id": "sess_001", "object": "realtime.session", "model": "step-1o-audio", "modalities": ["text", "audio"], "instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.", "voice": "linjiajiejie", "input_audio_format": "pcm16", "output_audio_format": "pcm16", "max_response_output_tokens": "4096" } }
Session Updated

type: session.updated

Returned when a Session is updated in response to a session.update event. Contains the updated Session configuration.

Sample

{ "event_id": "event_def", "type": "session.created", "session": { "modalities": ["text", "audio"], "instructions": "You are an AI chat assistant provided by Stepfun. You are good at conversations in Chinese, English, and many other languages.", "voice": "linjiajiejie", "input_audio_format": "pcm16", "output_audio_format": "pcm16", "max_response_output_tokens": "4096" } }
Audio Input Activation Start (VAD)

type: input_audio_buffer.speech_started

Notification that valid speech input has started in the audio input; typically used for interruption scenarios.

  • audio_start_ms int


    Millisecond offset in the input audio at which speech started

  • item_id string
    Item ID.

Sample

{ "event_id": "event_bcd", "type": "input_audio_buffer.speech_started", "audio_start_ms": 1000, "item_id": "msg_003" }
Audio Input Activation End (VAD)

type: input_audio_buffer.speech_stopped

Notification that valid speech input in the audio has ended.

  • audio_end_ms int


    Millisecond offset in the input audio at which speech stopped

  • item_id string
    Item ID.

Sample

{ "event_id": "event_1718", "type": "input_audio_buffer.speech_stopped", "audio_end_ms": 2000, "item_id": "msg_003" }
Streaming Audio Output

type: response.audio.delta

Returned when the model-generated audio is updated.

  • response_id string


    Usually a trace ID.

  • item_id string


    Item ID.

  • output_index int


    Index of the output item in the response.

  • delta string
    Base64-encoded incremental audio data; audio format matches the session output_audio_format.

Sample

{ "event_id": "event_bcd", "type": "response.audio.delta", "item_id": "msg_008", "delta": "Base64EncodedAudioDelta" }
Streaming Audio Complete

type: response.audio.done

Returned when the model-generated audio finishes. Also emitted when a Response is interrupted, incomplete, or canceled.

  • response_id string


    Usually a trace ID.

  • item_id string
    Item ID.

{ "event_id": "event_bcd", "type": "response.audio.done", "response_id": "traceid", "item_id": "msg_008" }
Streaming Audio Transcript

type: response.audio_transcript.delta

Returned as the model-generated audio transcript streams in; each event carries an incremental transcript delta.

  • response_id string


    Usually a trace ID.

  • item_id string


    Item ID.

  • output_index int


    Index of the output item in the response.

  • delta string
    Transcript delta.

{ "event_id": "event_bcd", "type": "response.audio_transcript.delta", "item_id": "msg_002", "output_index": 0, "delta": "Hello, how can I a" }
Audio Transcript Complete

type: response.audio_transcript.done

Returned when the model-generated audio transcript finishes streaming. Also emitted when a Response is interrupted, incomplete, or canceled.

  • response_id string


    Usually a trace ID.

  • item_id string


    Item ID.

  • output_index int


    Index of the output item in the response.

  • transcript string
    Complete transcript of the audio.

{ "event_id": "event_4748", "type": "response.audio_transcript.done", "response_id": "resp_001", "item_id": "msg_008", "content_index": 0, "transcript": "Hello, how can I assist you today?" }
Conversation Item Created

type: conversation.item.created

Returned when a conversation item is created. This also occurs while the server is generating a Response: a successful Response produces one or two Items of type message.

  • id string


    Unique message ID; optional—if not provided, the server generates one.

  • type string


    Item type, usually message

  • role string


    Sender role (user, assistant, system); only for message items.

  • status string


    Item status (completed, incomplete). These do not affect the conversation.

  • content string


    Message content for message items.

{ "event_id": "event_bcd", "type": "conversation.item.created", "previous_item_id": "msg_001", "item": { "id": "msg_002", "object": "realtime.item", "type": "message", "status": "completed", "role": "user", "content": [ { "type": "input_text", "transcript": "Hello" } ] } }
Conversation Item Deleted

type: conversation.item.deleted

Returned when the client deletes an item in the conversation using conversation.item.delete. This synchronizes the server conversation history with the client.

  • item_id string
    Conversation message ID
{ "event_id": "event_bcd", "type": "conversation.item.deleted", "item_id": "msg_001" }
User Audio Transcription Completed

type: conversation.item.input_audio_transcription.completed

This event carries the audio transcription (ASR) output for user audio written to the input audio buffer. Transcription starts when the client commits the buffered audio, or when the buffer is committed automatically in server_vad mode. Transcription runs asynchronously from response creation, so this event may arrive before or after the response events.

  • item_id string
    Conversation message ID
  • content_index int
    Index of the audio content part.
  • transcript string
    Full transcript of the audio.
{ "event_id": "event_2122", "type": "conversation.item.input_audio_transcription.completed", "item_id": "msg_003", "content_index": 0, "transcript": "Hello" }
Audio Commit Response

type: input_audio_buffer.committed

Returned when the client commits the input audio buffer.

  • previous_item_id string


    Previous conversation message ID

  • item_id string


    Conversation message ID

{ "event_id": "event_bcd", "type": "input_audio_buffer.committed", "previous_item_id": "msg_001", "item_id": "msg_002" }
Audio Buffer Cleared

type: input_audio_buffer.cleared

Returned when the client clears the input audio buffer using input_audio_buffer.clear.

{ "event_id": "event_1314", "type": "input_audio_buffer.cleared" }
Response Output Item Added

type: response.output_item.added

Returned when a new item is created during response generation.

  • output_index int


    Index of the output item in the response.

  • item object


    Output item object.

    • id string
      Item ID.

    • object
      Always realtime.item

    • type
      Item type; currently only supports message

    • status
      Item status; completed, incomplete, in_progress

    • role
      Role for the item; only for message items. Options: user, assistant, system

    • content
      Content of the message; applies to message items.
      role=system message items only support input_text content.
      role=user message items support input_text and input_audio content.
      role=assistant items support text content.

{ "event_id": "event_3334", "type": "response.output_item.added", "response_id": "resp_001", "output_index": 0, "item": { "id": "msg_007", "object": "realtime.item", "type": "message", "status": "in_progress", "role": "assistant", "content": [] } }
Response Output Item Done

type: response.output_item.done

Returned when an item is completed. Also emitted when a response is interrupted, incomplete, or cancelled.

  • output_index int


    Index of the output item in the response.

  • item object


    Output item object.

    • id string
      Item ID.

    • object
      Always realtime.item

    • type
      Item type; currently only supports message

    • status
      Item status; completed, incomplete, in_progress

    • role
      Role for the item; only for message items. Options: user, assistant, system

    • content
      Content of the message; applies to message items.
      role=system message items only support input_text content.
      role=user message items support input_text and input_audio content.
      role=assistant items support text content.

{ "event_id": "event_3536", "type": "response.output_item.done", "response_id": "resp_001", "output_index": 0, "item": { "id": "msg_007", "object": "realtime.item", "type": "message", "status": "completed", "role": "assistant", "content": [ { "type": "text", "text": "Sure, I can help with that." } ] } }
Response Content Part Added

type: response.content_part.added

Returned during response generation when a new content part is added to an assistant message item.

  • response_id string
    Response ID
  • item_id string
    Corresponding item ID
  • content_index int
    Index of the content part within the item content array.
  • output_index int
    Index of the output item in the response.
  • part object
    • type string
      Type; supports text or audio
    • audio string
      audio: base64-encoded audio data (present when type=audio)
    • text string
      Generated text content (present when type=text)
    • transcript string
      Transcript of the audio (present when type=audio)
{ "event_id": "event_3738", "type": "response.content_part.added", "response_id": "resp_001", "item_id": "msg_007", "output_index": 0, "content_index": 0, "part": { "type": "text", "text": "" } }
Response Content Part Done

type: response.content_part.done

Returned when a content_part completes. Also emitted when the corresponding response is interrupted, incomplete, or cancelled.

  • response_id string
    Response ID
  • item_id string
    Corresponding item ID
  • content_index int
    Index of the content part within the item content array.
  • output_index int
    Index of the output item in the response.
  • part object
    • type string
      Type; supports text or audio
    • audio string
      audio: base64-encoded audio data (present when type=audio)
    • text string
      Generated text content (present when type=text)
    • transcript string
      Transcript of the audio (present when type=audio)
{ "event_id": "event_3940", "type": "response.content_part.done", "response_id": "resp_001", "item_id": "msg_007", "output_index": 0, "content_index": 0, "part": { "type": "text", "text": "Sure, I can help with that." } }
Response Created

type: response.created

Returned when a new Response is created. This is the first event for a response, with the initial status set to in_progress.

  • id string


    Unique ID for the response.

  • object string


    Object type must be: realtime.response

  • status string


    Status of the response; in_progress when the response is first created.

  • output list


    List of output items generated in the response.

{ "event_id": "event_3132", "type": "response.done", "response": { "id": "resp_001", "object": "realtime.response", "status": "completed", "status_details": null, "output": [ { "id": "msg_006", "object": "realtime.item", "type": "message", "status": "completed", "role": "assistant", "content": [ { "type": "text", "text": "Sure, how can I assist you today?" } ] } ] } }
Response Done

type: response.done

Returned when a Response finishes streaming. Always emitted regardless of final status. The Response object in response.done includes all output Items but omits raw audio data.

  • id string


    Unique ID for the response.

  • object string


    Object type must be: realtime.response

  • status string


    Final status of the response (completed, cancelled, failed, incomplete).

  • output list


    List of output items generated in the response.

{ "event_id": "event_bcd", "type": "response.done", "response": { "id": "resp_001", "object": "realtime.response", "status": "completed", "status_details": null, "output": [ { "id": "msg_006", "object": "realtime.item", "type": "message", "status": "completed", "role": "assistant", "content": [ { "type": "text", "text": "Hello" } ] } ] } }