Overview

This interface provides speech recognition over HTTP + SSE. The client submits the audio in a single HTTP request, and the server pushes transcription results back over Server-Sent Events (SSE).

Key Features

  • One-shot audio submission
  • Transcription results continuously delivered over SSE
  • Supports incremental and final results
  • Fits server-side calls, file transcription, and near-realtime processing

Access Overview

| Capability | HTTP + SSE |
| --- | --- |
| Audio input | One-shot submission |
| Result return | SSE streaming |
| Connection | HTTP POST |
| Session persistence | No |
| Typical client | Backend |

Service Endpoints

| Protocol | Environment | Endpoint | Notes |
| --- | --- | --- | --- |
| HTTP/SSE | Production | https://api.stepfun.ai/v1/audio/asr/sse | Unidirectional streaming response |

For Step Plan, use POST https://api.stepfun.ai/step_plan/v1/audio/asr/sse

HTTP Protocol

Sequence Diagram

Authentication

| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| Content-Type | Yes | Must be application/json | application/json |
| Accept | Yes | Must be text/event-stream | text/event-stream |
| Authorization | Yes | Authentication token | Bearer sk-xxxx |
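
As a rough illustration of the headers above, the following Python sketch builds (but does not send) the POST request using only the standard library. The helper name `build_request` and its `step_plan` flag are hypothetical; the URLs and header values are the ones listed on this page.

```python
import json
import urllib.request

# Endpoints as listed under Service Endpoints.
SSE_URL = "https://api.stepfun.ai/v1/audio/asr/sse"
STEP_PLAN_SSE_URL = "https://api.stepfun.ai/step_plan/v1/audio/asr/sse"


def build_request(api_key, payload, step_plan=False):
    """Build the POST request without sending it (hypothetical helper)."""
    return urllib.request.Request(
        STEP_PLAN_SSE_URL if step_plan else SSE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",   # request body is JSON
            "Accept": "text/event-stream",        # ask for an SSE response
            "Authorization": f"Bearer {api_key}",  # token from the table above
        },
        method="POST",
    )


# Placeholder key and body, for illustration only.
req = build_request("sk-xxxx", {"audio": {"data": "audioData"}})
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen` would send it; the returned response object can then be read line by line as the SSE stream arrives.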

Request Format

{
  "audio": {
    "data": "audioData",
    "input": {
      "transcription": {
        "language": "zh",
        "hotwords": ["hotword1", "hotword2"],
        "prompt": "Please transcribe the audio content.",
        "model": "stepaudio-2.5-asr",
        "enable_itn": true
      },
      "format": {
        "type": "pcm",
        "codec": "pcm_s16le",
        "rate": 16000,
        "bits": 16,
        "channel": 1
      }
    }
  }
}

Parameters

| Path | Type | Description | Example |
| --- | --- | --- | --- |
| audio.data | string | Base64-encoded audio data | "audioData" |
| audio.input.transcription.language | string | Recognition language | "zh" |
| audio.input.transcription.hotwords | array | Hotword list | ["hotword1", "hotword2"] |
| audio.input.transcription.prompt | string | Prompt providing context or technical terminology (only applies to stepaudio-2-asr-pro) | "Please transcribe ..." |
| audio.input.transcription.model | string | Model name. Supports stepaudio-2.5-asr, stepaudio-2-asr-pro | "stepaudio-2.5-asr" |
| audio.input.transcription.enable_itn | bool | Whether to enable inverse text normalization (ITN); default true | true |
| audio.input.format.type | string | Audio container format. Supports ogg, mp3, wav, pcm | "pcm" |
| audio.input.format.codec | string | Encoding format; when type=pcm, typically pcm_s16le | "pcm_s16le" |
| audio.input.format.rate | int | Sample rate in Hz; required for pcm, optional for others | 16000 |
| audio.input.format.bits | int | Bit depth; required for pcm, optional for others | 16 |
| audio.input.format.channel | int | Channel count; required for pcm, optional for others | 1 |

Compatibility notes:
  • For backward compatibility, the SSE endpoint still accepts step-asr-1.1-stream as the model value, which is equivalent to stepaudio-2.5-asr.
  • The full_rerun_on_commit parameter (second-pass recognition correction) is no longer supported on SSE. If legacy clients still send it, the server silently ignores it and recognition proceeds normally.
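
Although the server tolerates both legacy values, a client may prefer to clean them up before sending. The sketch below is one possible way to do that; `modernize` is a hypothetical helper, and because this page does not say where `full_rerun_on_commit` used to sit in the request body, the code simply removes the key wherever it appears. Its placement in the sample is likewise only an assumption.

```python
# Legacy model name and its current equivalent, per the compatibility notes.
MODEL_ALIASES = {"step-asr-1.1-stream": "stepaudio-2.5-asr"}
# Parameters the SSE endpoint no longer acts on.
UNSUPPORTED_KEYS = {"full_rerun_on_commit"}


def modernize(node):
    """Return a cleaned copy of a request body; the input is not modified."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            if key in UNSUPPORTED_KEYS:
                continue  # drop parameters the server would ignore anyway
            if key == "model" and isinstance(value, str):
                value = MODEL_ALIASES.get(value, value)
            cleaned[key] = modernize(value)
        return cleaned
    if isinstance(node, list):
        return [modernize(item) for item in node]
    return node


legacy = {
    "audio": {
        "data": "audioData",
        "input": {
            "transcription": {
                "model": "step-asr-1.1-stream",
                "full_rerun_on_commit": True,  # assumed location, for illustration
            }
        },
    }
}
print(modernize(legacy))
```
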
Additional notes:
  • Audio data must be Base64-encoded.
  • Supported audio formats: ogg, mp3, wav, pcm.
  • When the audio format is pcm, rate, bits, and channel are required.
  • When the audio format is ogg, mp3, or wav, rate, bits, and channel are optional.
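
The rules above can be checked on the client before any audio is uploaded. The following is a minimal sketch, not an official SDK: `build_payload` and its keyword arguments are hypothetical names, and defaulting `codec` to `pcm_s16le` for pcm input is an assumption based on the "typically" wording in the parameter table.

```python
import base64

SUPPORTED_TYPES = {"ogg", "mp3", "wav", "pcm"}


def build_payload(audio_bytes, fmt_type, *, language="zh",
                  model="stepaudio-2.5-asr", rate=None, bits=None, channel=None):
    """Assemble the request body, enforcing the format rules listed above."""
    if fmt_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported audio format: {fmt_type}")
    pcm_fields = (("rate", rate), ("bits", bits), ("channel", channel))
    fmt = {"type": fmt_type}
    if fmt_type == "pcm":
        missing = [name for name, value in pcm_fields if value is None]
        if missing:
            raise ValueError("pcm requires: " + ", ".join(missing))
        fmt["codec"] = "pcm_s16le"  # assumption: the typical codec for pcm
    for name, value in pcm_fields:
        if value is not None:  # optional for ogg, mp3, and wav
            fmt[name] = value
    return {
        "audio": {
            # Raw bytes must be Base64-encoded before being placed in JSON.
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "input": {
                "transcription": {"language": language, "model": model},
                "format": fmt,
            },
        }
    }


# 160 samples of 16-bit silence stand in for real audio.
payload = build_payload(b"\x00\x00" * 160, "pcm", rate=16000, bits=16, channel=1)
print(payload["audio"]["input"]["format"])
```

Failing early with a `ValueError` keeps an incomplete pcm description from reaching the server, where the failure would only surface later as an error event.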

Response Format

The response is an SSE stream that carries the following event types:

Delta Event (transcript.text.delta)

Incremental transcription text.
{
  "type": "transcript.text.delta",
  "meta": {
    "session_id": "sse_1642694400123456789",
    "timestamp": 1642694400123
  },
  "delta": "recognized "
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Event type. Fixed as transcript.text.delta |
| meta.session_id | string | Session ID |
| meta.timestamp | int64 | Unix timestamp (ms) |
| delta | string | Incremental transcription text |
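
This page shows only the JSON payload of each event, not the raw bytes on the wire. Assuming the standard SSE framing (each event's JSON carried in one or more `data:` lines, with a blank line ending the event), a small parser along these lines could turn the stream into dictionaries and join the deltas. `iter_sse_events` is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```python
import json


def iter_sse_events(lines):
    """Yield one parsed JSON object per SSE event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # SSE comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data_parts.append(value)
    if data_parts:
        # Lenient: also parse a final event that lacks its closing blank line.
        yield json.loads("\n".join(data_parts))


sample = [
    'data: {"type": "transcript.text.delta", "meta": {"session_id": "sse_1", "timestamp": 1}, "delta": "recognized "}\n',
    "\n",
    ": keep-alive\n",
    'data: {"type": "transcript.text.delta", "meta": {"session_id": "sse_1", "timestamp": 2}, "delta": "text"}\n',
    "\n",
]
events = list(iter_sse_events(sample))
partial = "".join(e["delta"] for e in events if e["type"] == "transcript.text.delta")
print(partial)
```

When reading from a real HTTP response, each line arrives as bytes and would need decoding (for example as UTF-8) before being passed to this function.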

Done Event (transcript.text.done)

Final transcription text is ready.
{
  "type": "transcript.text.done",
  "meta": {
    "session_id": "sse_1642694400123456789",
    "timestamp": 1642694400456
  },
  "text": "The complete recognized text",
  "usage": {
    "type": "realtime_asr",
    "input_tokens": 1000,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 1000
    },
    "output_tokens": 50,
    "total_tokens": 1050
  }
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Event type. Fixed as transcript.text.done |
| meta.session_id | string | Session ID |
| meta.timestamp | int64 | Unix timestamp (ms) |
| text | string | Complete transcription text |
| usage | object | Usage statistics |
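
To show how the final text and usage figures might be pulled out of a done event, here is a minimal sketch that operates on an already-parsed event dictionary. The `Transcript` dataclass and `parse_done` function are hypothetical names; the field paths follow the example event above, and treating missing usage counters as 0 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    session_id: str
    input_tokens: int
    output_tokens: int
    total_tokens: int


def parse_done(event):
    """Convert a transcript.text.done event into a Transcript record."""
    if event.get("type") != "transcript.text.done":
        raise ValueError("not a done event: " + repr(event.get("type")))
    usage = event.get("usage", {})
    return Transcript(
        text=event["text"],
        session_id=event["meta"]["session_id"],
        # Assumption: absent counters are reported as 0 rather than failing.
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
    )


# The example done event from this page, as a Python dict.
done_event = {
    "type": "transcript.text.done",
    "meta": {"session_id": "sse_1642694400123456789", "timestamp": 1642694400456},
    "text": "The complete recognized text",
    "usage": {
        "type": "realtime_asr",
        "input_tokens": 1000,
        "input_token_details": {"text_tokens": 0, "audio_tokens": 1000},
        "output_tokens": 50,
        "total_tokens": 1050,
    },
}
result = parse_done(done_event)
print(result.text, result.total_tokens)
```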

Error Event (error)

Returned when recognition fails.
{
  "type": "error",
  "meta": {
    "session_id": "sse_1642694400123456789",
    "timestamp": 1642694400789
  },
  "message": "Error description"
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Event type. Fixed as error |
| meta.session_id | string | Session ID |
| meta.timestamp | int64 | Unix timestamp (ms) |
| message | string | Error description |
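
Putting the three event types together, a client loop might dispatch on `type` and turn an error event into an exception. The sketch below works on already-parsed event dictionaries; `AsrStreamError` and `final_text` are hypothetical names, and raising when the stream ends without a done event is a design choice of this example, not behavior described on this page.

```python
class AsrStreamError(Exception):
    """Raised for an error event or an incomplete stream (hypothetical class)."""

    def __init__(self, message, session_id=""):
        super().__init__(message)
        self.session_id = session_id


def final_text(events, on_delta=None):
    """Consume events until the done event and return the complete text."""
    for event in events:
        kind = event.get("type")
        if kind == "transcript.text.delta":
            if on_delta is not None:
                on_delta(event["delta"])  # e.g. update a live display
        elif kind == "transcript.text.done":
            return event["text"]
        elif kind == "error":
            session_id = event.get("meta", {}).get("session_id", "")
            raise AsrStreamError(event.get("message", "unknown error"), session_id)
        # Event types not listed on this page are skipped.
    raise AsrStreamError("stream ended before transcript.text.done")


seen = []
text = final_text(
    [
        {"type": "transcript.text.delta", "delta": "recognized "},
        {"type": "transcript.text.done", "text": "The complete recognized text"},
    ],
    on_delta=seen.append,
)
print(text)
```

Keeping the session ID on the exception makes it easier to correlate a failed request with server-side logs when reporting a problem.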