Transcription API

Realtime API Reference

GET wss://eu2.rt.speechmatics.com/v2/

Protocol overview

A basic Realtime session will have the following message exchanges:

  1. The Client opens a WebSocket connection and sends StartRecognition.
  2. The Server replies with RecognitionStarted.
  3. The Client streams binary AddAudio chunks; the Server acknowledges each with an AudioAdded message.
  4. The Server sends AddPartialTranscript (if partials are enabled) and AddTranscript messages as results become available.
  5. When the Client has no more audio, it sends EndOfStream; the Server finishes with EndOfTranscript and the connection is closed.

Browser based transcription

When starting a Realtime transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.

Because browser WebSocket clients cannot set custom HTTP headers, the temporary key must be provided as a query parameter instead. For example:

wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
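Temporary keys are not browser-only; the same query-parameter form works from any WebSocket client. A minimal sketch using the Python websockets library (the key value is a placeholder; obtaining a temporary key from your own backend is out of scope here):

    # Sketch: connect with a temporary key passed as the jwt query parameter.
    import asyncio
    import websockets

    TEMP_KEY = "<temporary-key>"  # placeholder, issued by your backend

    async def connect_with_temp_key():
        url = f"wss://eu2.rt.speechmatics.com/v2?jwt={TEMP_KEY}"
        async with websockets.connect(url) as ws:
            print("Connected; ready to send StartRecognition")

    asyncio.run(connect_with_temp_key())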

Handshake Responses

Successful Response

  • 101 Switching Protocols - Switch to WebSocket protocol

Here is an example of a successful WebSocket handshake request:

GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1

A successful response should look like:

HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
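Equivalently, a client can let a WebSocket library perform this handshake. A minimal sketch using the Python websockets library (the API key is a placeholder; note that websockets 14+ renames the extra_headers keyword to additional_headers):

    import asyncio
    import websockets

    API_KEY = "YOUR_API_KEY"  # placeholder

    async def connect():
        async with websockets.connect(
            "wss://eu2.rt.speechmatics.com/v2/",
            extra_headers={"Authorization": f"Bearer {API_KEY}"},
        ) as ws:
            # connect() only returns once the 101 Switching Protocols
            # response has been received.
            print("Handshake complete")

    asyncio.run(connect())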

Malformed Request

A malformed handshake request will result in one of the following HTTP responses:

  • 400 Bad Request
  • 401 Unauthorized - when the API key is not valid
  • 405 Method Not Allowed - when the request method is not GET

Client Retry

Following a successful handshake and switch to the WebSocket protocol, the client may receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend retrying the connection after an interval of at least 5-10 seconds (a sketch of this retry behaviour follows the list):

  • 4005 quota_exceeded
  • 4013 job_error
  • 1011 internal_error
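A sketch of that retry behaviour with the Python websockets library, where connect_and_run is a hypothetical coroutine wrapping a full session:

    import asyncio
    import websockets

    RETRYABLE_CLOSE_CODES = {4005, 4013, 1011}  # quota_exceeded, job_error, internal_error
    RETRY_INTERVAL_SECONDS = 10  # within the recommended "at least 5-10 seconds"

    async def run_with_retry():
        while True:
            try:
                await connect_and_run()  # hypothetical: open socket, run session
                return
            except websockets.exceptions.ConnectionClosedError as exc:
                if exc.code not in RETRYABLE_CLOSE_CODES:
                    raise  # other close codes are not worth retrying
                await asyncio.sleep(RETRY_INTERVAL_SECONDS)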

Message Handling

Each message that the Server accepts is a stringified JSON object with the following fields:

  • message (String): The name of the message being sent. Any other fields depend on the value of message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which is referred to as AddAudio.

The following values of the message field are supported:

Sent messages

StartRecognition

Initiates a new recognition session.

  • message (string, required) - Constant value: StartRecognition
  • audio_format (object, required) - One of:
      • Raw audio samples, described by the following additional mandatory fields:
          • type (string, required) - Constant value: raw
          • encoding (string, required) - Possible values: [pcm_f32le, pcm_s16le, mulaw]
          • sample_rate (integer, required) - The sample rate of the audio in Hz.

        Example: {"type":"raw","encoding":"pcm_s16le","sample_rate":44100}

  • transcription_config (object, required):
      • language (string, required)
      • domain (string) - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
      • output_locale (string) - Possible values: non-empty
      • additional_vocab (object[]) - Array; each entry may be a non-empty string.
      • diarization (string) - Possible values: [none, speaker]
      • max_delay (number) - Possible values: >= 0
      • max_delay_mode (string) - Possible values: [flexible, fixed]
      • speaker_diarization_config (object):
          • max_speakers (integer) - Possible values: >= 2 and <= 100
          • prefer_current_speaker (boolean)
          • speaker_sensitivity (float) - Possible values: >= 0 and <= 1
      • audio_filtering_config (object):
          • volume_threshold (float) - Possible values: >= 0 and <= 100
      • transcript_filtering_config (object):
          • remove_disfluencies (boolean)
          • replacements (array[]) - A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
      • enable_partials (boolean) - Default value: false
      • enable_entities (boolean) - Default value: true
      • operating_point (string) - Possible values: [standard, enhanced]
      • punctuation_overrides (object):
          • permitted_marks (string[]) - The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process. Each value must match the regular expression ^(.|all)$.
          • sensitivity (float) - Ranges between zero and one; higher values produce more punctuation. Possible values: >= 0 and <= 1. Default value: 0.5.
      • conversation_config (object) - Enables end-of-utterance detection: the server detects when a speaker has stopped talking.
          • end_of_utterance_silence_trigger (float) - The time in seconds of non-speech after which the server assumes the speaker has finished and emits an EndOfUtterance message. A value of 0 disables the feature. Possible values: >= 0 and <= 2. Default value: 0.
  • translation_config (object):
      • target_languages (string[], required)
      • enable_partials (boolean) - Default value: false
  • audio_events_config (object):
      • types (string[])
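Putting the schema together, a minimal StartRecognition message might look like the following sketch (the configuration values are illustrative, and ws is assumed to be an open connection from the Python websockets library):

    import json

    # Illustrative StartRecognition payload: raw 16-bit PCM at 44.1 kHz,
    # English transcription with partial results enabled.
    async def send_start_recognition(ws):
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {
                "type": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 44100,
            },
            "transcription_config": {
                "language": "en",
                "enable_partials": True,
            },
        }))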

AddAudio

A binary chunk of audio, sent as a binary WebSocket message rather than JSON. The server confirms receipt by sending an AudioAdded message.

  • Payload: binary audio data

EndOfStream

Declares that the client has no more audio to send.

  • message (string, required) - Constant value: EndOfStream
  • last_seq_no (integer, required)
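The two messages above are typically used together: audio goes out in binary frames, and EndOfStream reports the sequence number of the last chunk. A sketch, assuming last_seq_no counts AddAudio chunks starting from 1 and that the file contains raw samples matching the declared audio_format:

    import json

    CHUNK_SIZE = 8192  # bytes per AddAudio frame; an illustrative choice

    async def stream_audio(ws, path):
        seq_no = 0
        with open(path, "rb") as audio:
            while chunk := audio.read(CHUNK_SIZE):
                await ws.send(chunk)  # a binary frame is an AddAudio message
                seq_no += 1
        # Tell the server that no more audio is coming.
        await ws.send(json.dumps({"message": "EndOfStream", "last_seq_no": seq_no}))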

SetRecognitionConfig

Allows the client to re-configure the recognition session mid-stream.

  • message (string, required) - Constant value: SetRecognitionConfig
  • transcription_config (object, required) - Accepts the same fields as the transcription_config object of StartRecognition, described above.
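As a sketch, a client might tighten the transcription latency mid-session like this (the values are illustrative; note that language remains required even when re-configuring):

    import json

    async def lower_latency(ws):
        await ws.send(json.dumps({
            "message": "SetRecognitionConfig",
            "transcription_config": {
                "language": "en",  # required even when re-configuring
                "max_delay": 2.0,
                "max_delay_mode": "fixed",
            },
        }))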

Received messages

RecognitionStarted

Server response to StartRecognition, acknowledging that a recognition session has started.

  • message (string, required) - Constant value: RecognitionStarted
  • orchestrator_version (string)
  • id (string)

AudioAdded

Server response to AddAudio, indicating that audio has been added successfully.

  • message (string, required) - Constant value: AudioAdded
  • seq_no (integer, required)

AddPartialTranscript

Contains a work-in-progress transcript of a part of the audio that the client has sent.

  • message (string, required) - Constant value: AddPartialTranscript
  • format (string) - Speechmatics JSON output format version number. Example: 2.1
  • metadata (object, required):
      • start_time (float, required)
      • end_time (float, required)
      • transcript (string, required)
  • results (object[], required) - Array of:
      • type (string, required) - Possible values: [word, punctuation]
      • start_time (float, required)
      • end_time (float, required)
      • channel (string)
      • attaches_to (string) - Possible values: [next, previous, none, both]
      • is_eos (boolean)
      • alternatives (object[]) - Array of:
          • content (string, required)
          • confidence (float, required)
          • language (string)
          • display (object):
              • direction (string, required) - Possible values: [ltr, rtl]
          • speaker (string)
      • score (float) - Possible values: >= 0 and <= 1
      • volume (float) - Possible values: >= 0 and <= 100
AddTranscript

Contains the final transcript of a part of the audio that the client has sent.

  • message (string, required) - Constant value: AddTranscript
  • format (string) - Speechmatics JSON output format version number. Example: 2.1
  • metadata (object, required) - Same structure as in AddPartialTranscript.
  • results (object[], required) - Same structure as in AddPartialTranscript.
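A sketch of a receive loop that distinguishes the two, printing partial results and keeping finals (EndOfTranscript, described below, signals that no more transcripts will arrive):

    import json

    async def receive_transcripts(ws):
        async for raw in ws:
            msg = json.loads(raw)
            if msg["message"] == "AddPartialTranscript":
                print("partial:", msg["metadata"]["transcript"])
            elif msg["message"] == "AddTranscript":
                print("final:", msg["metadata"]["transcript"])
            elif msg["message"] == "EndOfTranscript":
                break  # the server has sent everything; see EndOfTranscript below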
AddPartialTranslation

Contains a work-in-progress translation of a part of the audio that the client has sent.

  • message (string, required) - Constant value: AddPartialTranslation
  • format (string) - Speechmatics JSON output format version number. Example: 2.1
  • language (string, required)
  • results (object[], required) - Array of:
      • content (string, required)
      • start_time (float, required)
      • end_time (float, required)
      • speaker (string)
AddTranslation

Contains the final translation of a part of the audio that the client has sent. The schema is identical to AddPartialTranslation, except:

  • message (string, required) - Constant value: AddTranslation
EndOfTranscript

Server response to EndOfStream, sent after the server has finished sending all AddTranscript messages.

  • message (string, required) - Constant value: EndOfTranscript
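Together with the sent messages above, this completes the basic session flow from the protocol overview. A minimal end-to-end sketch, reusing the hypothetical helpers from the earlier sketches (send_start_recognition, stream_audio, receive_transcripts):

    import asyncio
    import websockets

    async def transcribe(path):
        async with websockets.connect(
            "wss://eu2.rt.speechmatics.com/v2/",
            extra_headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder
        ) as ws:
            await send_start_recognition(ws)
            # A production client would wait for RecognitionStarted before
            # streaming; this sketch starts sending immediately for brevity.
            sender = asyncio.create_task(stream_audio(ws, path))
            await receive_transcripts(ws)  # returns after EndOfTranscript
            await sender

    asyncio.run(transcribe("audio.raw"))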

AudioEventStarted

Indicates the start of a detected audio event.

  • message (string, required) - Constant value: AudioEventStarted
  • event (object, required):
      • type (string, required)
      • start_time (float, required)
      • confidence (float, required)

AudioEventEnded

Indicates the end of a detected audio event.

  • message (string, required) - Constant value: AudioEventEnded
  • event (object, required):
      • type (string, required)
      • end_time (float, required)

EndOfUtterance

Indicates the end of an utterance, triggered by a configurable period of non-speech (see conversation_config above).

  • message (string, required) - Constant value: EndOfUtterance
  • metadata (object, required):
      • start_time (float)
      • end_time (float)

Info

Additional information sent from the server to the client.

  • message (string, required) - Constant value: Info
  • type (string, required) - Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage, session_transfer]

    The possible info types are:

      • recognition_quality - Informs the client which quality-based model is used to handle the recognition. Sent to the client immediately after the WebSocket handshake is completed.
      • model_redirect - Informs the client that a deprecated language code has been specified and will be handled with a different model. For example, if the model parameter is set to one of en-US, en-GB, or en-AU, the request may be internally redirected to the Global English model (en).
      • deprecated - Informs the client that a feature in use is going to be removed in a future release.
      • concurrent_session_usage - Reports the current concurrent-session usage and quota (see the usage, quota, and last_updated fields below).
      • session_transfer - Informs the client that the session has been seamlessly transferred to another backend, with the reason: "Session has been transferred to a new backend". This typically occurs due to backend maintenance operations or migration from a faulty backend.

  • reason (string, required)
  • code (integer)
  • seq_no (integer)
  • quality (string) - Only set when type is recognition_quality. The quality-based model name, one of "telephony" or "broadcast". The model is selected automatically: the broadcast model is used for high-quality audio (12kHz+), and the telephony model for lower-quality audio.
  • usage (number) - Only set when type is concurrent_session_usage. Indicates the current usage (number of active concurrent sessions).
  • quota (number) - Only set when type is concurrent_session_usage. Indicates the current quota (maximum number of concurrent sessions allowed).
  • last_updated (string) - Only set when type is concurrent_session_usage. The timestamp of the most recent usage update, in the format YYYY-MM-DDTHH:MM:SSZ (UTC). This value is updated even when usage exceeds the quota, as it represents the most recent known data; in some cases it may be empty or outdated due to internal errors preventing a successful update. Example: 2025-03-25T08:45:31Z

Warning

Warning messages sent from the server to the client.

  • message (string, required) - Constant value: Warning
  • type (string, required) - Possible values: [duration_limit_exceeded, unsupported_translation_pair, idle_timeout, session_timeout, empty_translation_target_list, add_audio_after_eos]

    The possible warning types are:

      • duration_limit_exceeded - The maximum allowed duration of a single utterance to process has been exceeded. Any AddAudio messages received beyond this limit are confirmed with AudioAdded but are ignored by the transcription engine. Exceeding the limit triggers the same mechanism as receiving an EndOfStream message, so the server will eventually send an EndOfTranscript message and suspend.
      • unsupported_translation_pair - One of the requested translation target languages is unsupported (given the source audio language). The warning's reason specifies the unsupported language pair.
      • idle_timeout - Informs that the session is approaching the idle duration limit (no audio data sent within the last hour), with a reason of the form: "Session will timeout in {time_remaining}m due to inactivity, no audio sent within the last {time_elapsed}m". The server currently sends these warnings at 15, 10, and 5 minutes before the timeout, then sends a final error message on timeout before closing the connection with code 1008 (see Realtime limits for more information).
      • session_timeout - Informs that the session is approaching the maximum session duration limit (48 hours), with a reason of the form: "Session will timeout in {time_remaining}m due to max duration, session has been active for {time_elapsed}m". The server currently sends these warnings at 45, 30, and 15 minutes before the timeout, then sends a final error message on timeout before closing the connection with code 1008 (see Realtime limits for more information).
      • empty_translation_target_list - No supported translation target languages were specified. Translation will not run.
      • add_audio_after_eos - The protocol does not allow adding audio after EndOfStream has been received; any AddAudio messages sent after it are ignored.

  • reason (string, required)
  • code (integer)
  • seq_no (integer)
  • duration_limit (number) - Only set when type is duration_limit_exceeded. Indicates the limit that was exceeded (in seconds).

Error

Error messages sent from the server to the client. After any error, the transcription is terminated and the connection is closed.

  • message (string, required) - Constant value: Error
  • type (string, required) - Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, invalid_output_format, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, quota_exceeded, timelimit_exceeded, idle_timeout, session_timeout, session_transfer, unknown_error]

    The possible error types are:

      • invalid_message - The message received was not understood.
      • invalid_model - Unable to use the model for the recognition. This can happen if the language is not supported at all, or is not available to the user.
      • invalid_config - The config received contains wrong or unsupported fields, or too many translation target languages were requested.
      • invalid_audio_type - The audio type is not supported, is deprecated, or is malformed.
      • invalid_output_format - The output format is not supported, is deprecated, or is malformed.
      • not_authorised - The user was not recognised, or the API key provided is not valid.
      • insufficient_funds - The user doesn't have enough credits, or some other reason prevents the user from being charged for the job properly.
      • not_allowed - The user is not allowed to use this message (is not allowed to perform the action the message would invoke).
      • job_error - Unable to do any work on this job; the server might have timed out, etc.
      • data_error - Unable to accept the data specified, usually because too much data is being sent at once.
      • buffer_error - Unable to fit the data in a corresponding buffer. This can happen when a client sends input data faster than real-time.
      • protocol_error - The message received was syntactically correct but could not be accepted due to protocol limitations. This is usually caused by messages sent in the wrong order.
      • quota_exceeded - The maximum number of concurrent connections allowed for the contract has been reached.
      • timelimit_exceeded - The usage quota for the contract has been reached.
      • idle_timeout - The idle duration limit was reached (no audio data sent within the last hour); a closing handshake with code 1008 follows this in-band error.
      • session_timeout - The maximum session duration (48 hours) was reached; a closing handshake with code 1008 follows this in-band error.
      • session_transfer - An error occurred while transferring the session to another backend, with the reason: "Session transfer failed". This may occur when moving sessions due to backend maintenance operations or migration from a faulty backend.
      • unknown_error - An error that does not fit any of the types above.

    invalid_message, protocol_error, and unknown_error can be triggered in response to any type of message.

  • reason (string, required)
  • code (integer)
  • seq_no (integer)

WebSocket errors

In the Realtime SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and the associated error types; the error type is provided in the payload of the close message.

  WebSocket Close Code | WebSocket Close Payload
  -------------------- | -----------------------
  1003                 | protocol_error
  1008                 | policy_violation
  1011                 | internal_error
  4001                 | not_authorised
  4003                 | not_allowed
  4004                 | invalid_model
  4005                 | quota_exceeded
  4006                 | timelimit_exceeded
  4013                 | job_error
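As a sketch, the in-band Error message can be inspected before the close frame arrives; the field names follow the Error schema documented above:

    import json

    def log_server_error(raw: str) -> None:
        msg = json.loads(raw)
        if msg["message"] == "Error":
            # After this in-band error the server closes the connection
            # with one of the close codes listed in the table above.
            print(f"error type={msg['type']} reason={msg['reason']}")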