
Streaming Text-to-Speech

This API allows you to generate audio using our Streaming Text-to-Speech (TTS) model.

Request Method

WebSocket

Endpoint

wss://api.stepfun.ai/v1/realtime/audio

Request Headers

  • Authorization string required
    The API key used for authentication. Its value should be: Bearer STEP_API_KEY.

Query Parameters

  • model string required
    The name of the model to use, passed as a query parameter on the endpoint URL (for example, ?model=step-tts-2). Currently, only step-tts-2 is supported.

Call Instructions

To generate audio in streaming mode, send Client Events after the WebSocket connection is successfully established. The server responds with the corresponding Server Events, which carry the generated audio.

If there is no activity for 60 consecutive seconds, the system will automatically close the connection.
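The idle limit above can be tracked on the client side. The following is a minimal sketch, not part of the API: IdleTracker and its methods are hypothetical names, and treating every sent or received message as "activity" is an assumption, since the text does not define the term.

```python
import time

IDLE_LIMIT_SECONDS = 60.0  # idle limit stated in the text above


class IdleTracker:
    """Client-side estimate of how long the connection has been idle (hypothetical helper)."""

    def __init__(self, limit=IDLE_LIMIT_SECONDS, clock=time.monotonic):
        self._limit = limit
        self._clock = clock          # injectable so the class can be tested without sleeping
        self._last = clock()

    def touch(self):
        """Call whenever a message is sent or received (assumed to count as activity)."""
        self._last = self._clock()

    def remaining(self):
        """Seconds left before the idle limit is reached; never negative."""
        return max(0.0, self._limit - (self._clock() - self._last))

    def expired(self):
        """True once the estimated idle time has reached the limit."""
        return self.remaining() == 0.0
```

A client could call touch() from its send and receive paths and check remaining() before deciding whether a session is still worth reusing.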

Client Event & Server Event Mapping

A detailed description can be found in the explanation below.


| Message Type | Client Event | Server Event | Description |
| --- | --- | --- | --- |
| Connection Established | - | tts.connection.done | Sent by the server after the WebSocket connection is successfully established. |
| Create Session | tts.create | tts.response.created | Creates a new session. After receiving this response, the client may proceed with generation. |
| Sentence Start | - | tts.response.sentence.start | Triggered when accumulated text meets the generation threshold and sentence generation begins. |
| Send Text | tts.text.delta | tts.response.audio.delta | Sends text increments. The corresponding audio delta is returned and can be played immediately. |
| Sentence End | - | tts.response.sentence.end | Triggered when accumulated text meets the end-of-sentence condition. |
| Flush Buffer | tts.text.flush | tts.text.flushed | Quickly clears the buffer and returns all remaining audio that has not yet been sent. |
| Text Done | tts.text.done | tts.response.audio.done | Marks the end of this generation task. No further audio will be produced, and the server will release the connection. |
| Audio Error | - | tts.response.error | Returned when audio generation encounters an error. |
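The ordering implied by the mapping can be sketched as a small client-side gate that decides which Client Events may be sent at a given point. This is an illustration only: SessionPhase and ALLOWED_AFTER are hypothetical names, and the rule that text events are allowed only after tts.response.created is an interpretation of the table, not a documented server check.

```python
# Server events that move the session forward, and the client events
# each one unlocks (interpretation of the mapping table above).
ALLOWED_AFTER = {
    None: set(),  # nothing received yet
    "tts.connection.done": {"tts.create"},
    "tts.response.created": {"tts.text.delta", "tts.text.flush", "tts.text.done"},
    "tts.response.audio.done": set(),  # task finished, nothing more to send
}


class SessionPhase:
    """Tracks the latest milestone event received from the server (hypothetical helper)."""

    def __init__(self):
        self.last_milestone = None

    def on_server_event(self, event_type):
        # Only milestone events change the phase; audio deltas and
        # sentence start/end events leave it untouched.
        if event_type in ALLOWED_AFTER:
            self.last_milestone = event_type

    def can_send(self, client_event_type):
        return client_event_type in ALLOWED_AFTER[self.last_milestone]
```

Calling on_server_event() from the message handler and checking can_send() before each ws.send() keeps a client from sending text before the session exists.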

Client Event Details

Create Session tts.create

The event used to create a new session. After the WebSocket connection is established and the tts.connection.done Server Event is received, the client should send this event to initiate audio generation.

  • type string required
    Must be set to tts.create.
  • data object required
    Event payload.
    • session_id string required
      The session ID used to identify which conversation the request belongs to. Returned by the tts.connection.done Server Event.
    • voice_id string required
      The ID of the voice to use. Refer to the System Voice ID List for supported voices and samples.
    • response_format string optional
      The audio format. Supported values are: wav, mp3, flac, opus, pcm. Defaults to mp3.
    • sample_rate int optional
      The sampling rate of the output audio. Supported values: 8000, 16000, 22050. Default: 22050.
    • pronunciation_map object optional
      Defines pronunciation rules that annotate or override the reading of specific characters or symbols. In Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second tone, 3 for the third tone, 4 for the fourth tone, and 5 for the neutral tone.
      • tone string array required
        The pronunciation mapping rules. Each entry gives the original text and its intended reading, separated by /. Example: ["omg/oh my god"].
    • speed_ratio float optional
      The speaking rate. Valid range: 0.5 to 2.0. Default: 1.0.
    • volume_ratio float optional
      The volume level. Valid range: 0.1 to 2.0. Default: 1.0.
    • mode string optional
      The generation mode. Supported values: default (character-level generation, suitable for LLM real-time streaming scenarios) and sentence (sentence-level generation, suitable when full sentences are already prepared). Default: default.
    • voice_label object optional
      Voice tags. Required when using a custom voice. Only one of the following fields can be set at a time: language, emotion, or style.
      • language string optional
        The language tag. Supported values: Cantonese, Sichuanese, Japanese. If not specified, the system will automatically determine whether the input text is English or Chinese.
      • emotion string optional
        Emotion tag. Supports up to 11 options such as Happy, Angry, etc. Supported values may vary by model.
      • style string optional
        Speaking or delivery style. Supports up to 17 styles. Supported values may vary by model.

Default mode is designed for scenarios where TTS is used together with a large language model. In this mode, the system automatically buffers and segments sentences. Therefore, it does not return audio immediately; generation begins only when the accumulated input forms a complete sentence. If you need to force an immediate return, you can send tts.text.flush, and the model will promptly return the available audio.

Sentence mode is suitable for scenarios where the full text is already available. In this mode, the system automatically segments the text based on punctuation marks such as . ! ? and generates audio accordingly.

Example:

{
  "type": "tts.create",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "voice_id": "lively-girl",
    "response_format": "wav",
    "volume_ratio": 1.0,
    "speed_ratio": 1.0,
    "sample_rate": 16000,
    "pronunciation_map": {
      "tone": [
        "LOL/laugh out loudly"
      ]
    }
  }
}
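The documented value ranges can be checked on the client before the event is sent. The sketch below assumes the ranges are inclusive, which the field list does not state explicitly; build_create_event is a hypothetical helper, not part of any SDK, and it covers only a subset of the optional fields.

```python
import json

# Allowed values copied from the tts.create field list above.
SUPPORTED_FORMATS = {"wav", "mp3", "flac", "opus", "pcm"}
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050}
SUPPORTED_MODES = {"default", "sentence"}


def build_create_event(session_id, voice_id, response_format="mp3",
                       sample_rate=22050, speed_ratio=1.0, volume_ratio=1.0,
                       mode="default"):
    """Return a tts.create message as a JSON string (hypothetical helper)."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
    # Range bounds are assumed to be inclusive.
    if not 0.5 <= speed_ratio <= 2.0:
        raise ValueError("speed_ratio must be between 0.5 and 2.0")
    if not 0.1 <= volume_ratio <= 2.0:
        raise ValueError("volume_ratio must be between 0.1 and 2.0")
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return json.dumps({
        "type": "tts.create",
        "data": {
            "session_id": session_id,
            "voice_id": voice_id,
            "response_format": response_format,
            "sample_rate": sample_rate,
            "speed_ratio": speed_ratio,
            "volume_ratio": volume_ratio,
            "mode": mode,
        },
    })
```

Validating locally turns a malformed request into an immediate ValueError instead of a server-side error event.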

Generate Audio tts.text.delta

Client Event used to generate audio. During generation, if the TTS engine determines that the conditions for audio generation have been met, it returns a tts.response.sentence.start event to indicate that inference has begun. It then returns one or more tts.response.audio.delta events containing the audio data. After all audio for the sentence has been sent, the engine returns a tts.response.sentence.end event to indicate that the sentence has finished generating.

If the TTS engine determines that the conditions have not been met, no events will be returned.

  • type string required
    Must be set to tts.text.delta.
  • data object required
    Event payload.
    • session_id string required
      The session ID used to identify which conversation the request belongs to. Returned by the tts.connection.done Server Event.
    • text string required
      The text to be synthesized. The maximum length is 10,000 characters.

Example:

{
  "type": "tts.text.delta",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "text": "The weather is great today, and I want to learn StepFun large model technologies."
  }
}
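Text longer than the 10,000-character limit has to be split across several tts.text.delta events. A minimal sketch follows; build_delta_events is a hypothetical helper, and it assumes the server counts characters the same way Python counts code points. It splits at fixed offsets without regard to word boundaries.

```python
import json

MAX_TEXT_CHARS = 10_000  # per-event limit from the field list above


def build_delta_events(session_id, text, max_chars=MAX_TEXT_CHARS):
    """Split text into tts.text.delta messages of at most max_chars characters each."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    events = []
    for start in range(0, len(text), max_chars):
        events.append(json.dumps(
            {
                "type": "tts.text.delta",
                "data": {
                    "session_id": session_id,
                    "text": text[start:start + max_chars],
                },
            },
            ensure_ascii=False,  # keep non-ASCII text readable in the payload
        ))
    return events
```

Each returned string can be passed directly to ws.send() in order.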

Flush Buffer tts.text.flush

Forces the TTS engine to return all audio generated so far by clearing the internal buffer.

  • type string required
    Must be set to tts.text.flush.
  • data object required
    Event payload.
    • session_id string required
      The session ID used to identify which conversation the request belongs to. Returned by the tts.connection.done Server Event.

Example:

{
  "type": "tts.text.flush",
  "data": {
    "session_id": "01956e8dc1d77bb98f9da8d1b642fcf0"
  }
}

Finish Audio Generation tts.text.done

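Marks the end of the generation task. Send this event after all text has been sent; no further audio is produced once the task completes.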

  • type string required
    Must be set to tts.text.done.
  • data object required
    Event payload.
    • session_id string required
      The session ID used to identify which conversation the request belongs to. Returned by the tts.connection.done Server Event.

Example:

{
  "type": "tts.text.done",
  "data": {
    "session_id": "01956e8dc1d77bb98f9da8d1b642fcf0"
  }
}
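Taken together, the three text events form an ordered stream: zero or more tts.text.delta events, an optional tts.text.flush, and a final tts.text.done. The sketch below builds that sequence for a list of text pieces, such as tokens from a language model. plan_text_stream is a hypothetical helper, and whether a flush is needed before tts.text.done is not stated above, so it is left as an opt-in flag.

```python
import json


def _event(event_type, session_id, **fields):
    """Serialize one client event with the shared type/data envelope."""
    return json.dumps(
        {"type": event_type, "data": {"session_id": session_id, **fields}},
        ensure_ascii=False,
    )


def plan_text_stream(session_id, pieces, flush_at_end=False):
    """Return the ordered client messages for one generation task."""
    # Empty pieces are skipped so no empty delta is sent.
    messages = [_event("tts.text.delta", session_id, text=p) for p in pieces if p]
    if flush_at_end and messages:
        messages.append(_event("tts.text.flush", session_id))
    messages.append(_event("tts.text.done", session_id))
    return messages
```

The messages are meant to be sent only after tts.response.created has been received for the same session_id.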

Server Event Details

Connection Established tts.connection.done

Indicates that the WebSocket connection has been successfully established.

  • event_id string required
    The unique ID of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.connection.done.
  • data object required
    Event payload.
    • session_id string required
      The session ID that must be included in subsequent requests.

Example:

{
  "event_id": "01956e73888c7953896a6e176bf3d760",
  "type": "tts.connection.done",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70"
  }
}

Session Created tts.response.created

Indicates that the session has been successfully created.

  • event_id string required
    The unique ID of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.response.created.
  • data object required
    Event payload.
    • session_id string required
      The session ID that must be included in subsequent requests.

Example:

{
  "event_id": "01956e73888c7953896a6e176bf3d760",
  "type": "tts.response.created",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70"
  }
}

Sentence Start tts.response.sentence.start

Indicates that the TTS engine has begun generating a new sentence.

  • event_id string required
    The unique ID of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.response.sentence.start.
  • data object required
    Event payload.
    • session_id string required
      The session ID for the current conversation.
    • text string required
      The text content being generated in this sentence.
    • started_at string required
      The timestamp indicating when sentence generation started.

Example:

{
  "event_id": "01956e73888c7953896a6e176bf3d760",
  "type": "tts.response.sentence.start",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "text": "blah blah",
    "started_at": 10292929292
  }
}

Receive Generated Audio tts.response.audio.delta

Indicates that the server is returning a chunk of generated audio.

  • event_id string required
    The unique identifier of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.response.audio.delta.
  • data object required
    Event payload.
    • session_id string required
      The session ID that must be included in subsequent requests.
    • status string required
      The generation status. Supported values are unfinished and finished.
    • audio string required
      The Base64-encoded audio data.
    • duration float required
      The duration of this audio chunk, in seconds.

Example:

{
  "event_id": "42bd707a-ba16-4ddb-a751-54d84301b474",
  "type": "tts.response.audio.delta",
  "data": {
    "session_id": "01956e8dc1d77bb98f9da8d1b642fcf0",
    "status": "unfinished",
    "audio": "BASE64 audio data",
    "duration": 2.043375
  }
}
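Handling this event on the client comes down to Base64-decoding the audio field and keeping the chunks in arrival order. The following sketch shows one way to do that; AudioCollector is a hypothetical name, and because the text does not say whether status refers to a sentence or to the whole task, the class only records the most recent value.

```python
import base64
import json


class AudioCollector:
    """Accumulates decoded audio chunks from tts.response.audio.delta events."""

    def __init__(self):
        self.chunks = []           # decoded bytes, in arrival order
        self.total_seconds = 0.0   # sum of the reported chunk durations
        self.last_status = None    # "unfinished" or "finished", as last reported

    def feed(self, raw_message):
        """Process one raw WebSocket message; return True if it was an audio delta."""
        event = json.loads(raw_message)
        if event.get("type") != "tts.response.audio.delta":
            return False
        data = event["data"]
        self.chunks.append(base64.b64decode(data["audio"]))
        self.total_seconds += float(data["duration"])
        self.last_status = data.get("status")
        return True
```

For playback, each decoded chunk can be handed to an audio sink as soon as feed() returns True.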

Sentence End tts.response.sentence.end

Indicates that the TTS engine has finished generating a sentence.

  • event_id string required
    The unique identifier of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.response.sentence.end.
  • data object required
    Event payload.
    • session_id string required
      The session ID.
    • text string required
      The text content generated for this sentence.
    • ended_at string required
      The timestamp indicating when the generation of this sentence ended.

Example:

{
  "event_id": "01956e73888c7953896a6e176bf3d760",
  "type": "tts.response.sentence.end",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "text": "blah blah",
    "ended_at": 10292929292
  }
}

Flush Start tts.text.flushed

Indicates that the system has received the flush command and has begun clearing the buffer.

  • event_id string required
    The unique identifier of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.text.flushed.
  • data object required
    Event payload.
    • session_id string required
      The session ID that must be included in subsequent requests.

Example:

{
  "event_id": "01956e8ee1b9788c95d5981b1cfdbf12",
  "type": "tts.text.flushed",
  "data": {
    "session_id": "01956e8dc1d77bb98f9da8d1b642fcf0"
  }
}

Generation Completed tts.response.audio.done

Indicates that the audio generation task has been completed. After the server sends this event, the connection closes automatically. If the connection remains idle for more than 60 seconds, the system also completes the generation and closes the connection.

  • event_id string required
    The unique identifier of this event. When contacting support, providing this ID helps with troubleshooting.
  • type string required
    Must be set to tts.response.audio.done.
  • data object required
    Event payload.
    • session_id string required
      The session ID that must be included in subsequent requests.
    • audio string required
      The Base64-encoded audio data, containing all audio content generated in this session.

Example:

{
  "event_id": "01956e8bf5067d6499cdfa0dad34f805",
  "type": "tts.response.audio.done",
  "data": {
    "session_id": "01956e7388477cfcbdc3aaabf364bc70",
    "audio": ""
  }
}
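The field description says audio holds the full session audio, while the example shows an empty string. A client can guard against both cases, as sketched below. final_audio is a hypothetical helper, and the fallback to the chunks collected from tts.response.audio.delta events is an assumption. Plain byte concatenation is safe for headerless formats such as pcm; whether each delta in a container format carries its own header is not documented here.

```python
import base64


def final_audio(done_event, collected_chunks):
    """Return the session audio as bytes.

    Prefers the audio field of tts.response.audio.done; if it is empty,
    falls back to concatenating the chunks gathered from audio deltas.
    """
    encoded = done_event.get("data", {}).get("audio") or ""
    if encoded:
        return base64.b64decode(encoded)
    return b"".join(collected_chunks)
```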

Error Event tts.response.error

Indicates that an error occurred during audio generation.

{
  "event_id": "01956e8fdb157619a852bdf38028db45",
  "type": "tts.response.error",
  "data": {
    "session_id": "01956e8dc1d77bb98f9da8d1b642fcf0",
    "code": "503",
    "message": "The engine is currently overloaded, please try again later",
    "details": {
      "error": "The engine is currently overloaded, please try again later"
    }
  }
}
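One way to surface this event in client code is to turn it into an exception that carries the code, the message, and the event_id. The sketch below relies only on the fields visible in the example above; TTSError and raise_for_error are hypothetical names.

```python
import json


class TTSError(Exception):
    """Raised when the server returns a tts.response.error event (hypothetical class)."""

    def __init__(self, code, message, event_id=None, details=None):
        super().__init__(f"[{code}] {message}")
        self.code = code
        self.message = message
        self.event_id = event_id   # useful when contacting support
        self.details = details


def raise_for_error(raw_message):
    """Parse a server message; raise TTSError for error events, else return the event."""
    event = json.loads(raw_message)
    if event.get("type") != "tts.response.error":
        return event
    data = event.get("data", {})
    raise TTSError(
        code=str(data.get("code", "")),
        message=data.get("message", ""),
        event_id=event.get("event_id"),
        details=data.get("details"),
    )
```

Calling raise_for_error() at the top of the message handler keeps error handling in one place and leaves retry decisions to the caller.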

Code Example

First, run pip install websocket-client rel (the Python example imports both packages), and then execute the following code.

Python:

import json

import rel
import websocket

# Replace STEP_API_KEY with your API key.
headers = {"Authorization": "Bearer STEP_API_KEY"}


def get_start_event(sid):
    # Build the tts.create event for the given session ID.
    return json.dumps(
        {
            "type": "tts.create",
            "data": {
                "session_id": sid,
                "voice_id": "lively-girl",
                "response_format": "wav",
                "volume_ratio": 1.0,
                "speed_ratio": 1.0,
                "sample_rate": 16000,
            },
        }
    )


def on_message(ws, message):
    data = json.loads(message)
    session_id = data["data"]["session_id"]
    event_type = data["type"]
    # Create the session as soon as the server confirms the connection.
    if event_type == "tts.connection.done":
        start_event = get_start_event(session_id)
        ws.send(start_event)
    print(message)


def on_error(ws, error):
    print(error)


if __name__ == "__main__":
    websocket.enableTrace(True)
    ws = websocket.WebSocketApp(
        "wss://api.stepfun.ai/v1/realtime/audio?model=step-tts-2",
        header=headers,
        on_message=on_message,
        on_error=on_error,
    )
    # Run with the rel dispatcher and reconnect after 5 seconds on disconnect.
    ws.run_forever(dispatcher=rel, reconnect=5)
    rel.signal(2, rel.abort)  # Stop on Ctrl+C (SIGINT).
    rel.dispatch()

JavaScript (Node.js, using the ws package):

const WebSocket = require('ws');

const url = 'wss://api.stepfun.ai/v1/realtime/audio?model=step-tts-2';
// Replace STEP_API_KEY with your API key.
const headers = {
  Authorization: 'Bearer STEP_API_KEY',
  'Content-Type': 'application/json',
};

const ws = new WebSocket(url, { headers: headers });

ws.on('open', () => {
  console.log('Connection established');
});

ws.on('message', (message) => {
  console.log(`Message received: ${message}`);
  const event = JSON.parse(message);
  const session_id = event.data.session_id;
  const event_type = event.type;
  // Create the session as soon as the server confirms the connection.
  if (event_type === 'tts.connection.done') {
    ws.send(JSON.stringify({
      type: 'tts.create',
      data: {
        session_id: session_id,
        voice_id: 'lively-girl',
        response_format: 'wav',
        volume_ratio: 1.0,
        speed_ratio: 1.0,
        sample_rate: 16000
      }
    }));
  }
});

ws.on('error', (error) => {
  console.error(`Error occurred: ${error}`);
});

ws.on('close', (code, reason) => {
  console.log(`Connection closed, code: ${code}, reason: ${reason}`);
});