Transcription API

Realtime API Reference

GET wss://eu2.rt.speechmatics.com/v2/

Protocol overview

A basic Realtime session will have the following message exchanges:

  • The client opens a WebSocket connection and sends a StartRecognition message.
  • The server replies with RecognitionStarted.
  • The client streams binary AddAudio chunks; the server confirms each one with AudioAdded and returns AddPartialTranscript and AddTranscript messages as the audio is recognised.
  • The client sends EndOfStream when it has no more audio to send.
  • The server sends any remaining transcripts followed by EndOfTranscript, after which the connection is closed.

Browser-based transcription

When starting a Realtime transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.

To do so, you must provide the temporary key as a query parameter; browsers cannot attach custom headers (such as Authorization) to WebSocket connections. For example:

 wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
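A minimal browser-side sketch of opening the connection this way (fetchTemporaryKey is an assumed helper that requests a temporary key from your own backend; it is not part of this API):

declare function fetchTemporaryKey(): Promise<string>;   // assumed helper

// Hypothetical sketch: connect from the browser using a temporary key.
const jwt = await fetchTemporaryKey();
const ws = new WebSocket(`wss://eu2.rt.speechmatics.com/v2?jwt=${jwt}`);
ws.onopen = () => {
  // The session can now be started with a StartRecognition message (see below).
};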

Handshake responses

Successful Response

  • 101 Switching Protocols - Switch to WebSocket protocol

Here is an example of a successful WebSocket handshake request:

GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1

A successful response should look like:

HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
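For non-browser clients, the API key can be supplied in the Authorization header as in the request above. A minimal Node.js sketch, assuming the third-party ws package is installed:

import WebSocket from "ws";

// Hypothetical sketch: the long-lived API key is read from the environment.
const ws = new WebSocket("wss://eu2.rt.speechmatics.com/v2/", {
  headers: { Authorization: `Bearer ${process.env.SPEECHMATICS_API_KEY}` },
});

ws.on("open", () => console.log("Handshake complete: 101 Switching Protocols"));
ws.on("unexpected-response", (_req, res) => console.error("Handshake failed:", res.statusCode));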

Malformed request

A malformed handshake request will result in one of the following HTTP responses:

  • 400 Bad Request
  • 401 Unauthorized - when the API key is not valid
  • 405 Method Not Allowed - when the request method is not GET

Client Retry

Following a successful handshake and switch to the WebSocket protocol, the client may receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend a client retry interval of at least 5-10 seconds:

  • 4005 quota_exceeded
  • 4013 job_error
  • 1011 internal_error
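A minimal reconnection sketch for these codes (the 5-10 second delay is applied with jitter; the session logic itself is omitted):

// Hypothetical sketch: reconnect only on the close codes listed above.
const RETRYABLE_CODES = new Set([4005, 4013, 1011]);

function openSession(url: string): void {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    // send StartRecognition and start streaming audio here
  };
  ws.onclose = (event) => {
    if (RETRYABLE_CODES.has(event.code)) {
      const delayMs = 5_000 + Math.random() * 5_000;   // 5-10 seconds
      setTimeout(() => openSession(url), delayMs);
    }
  };
}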

Message Handling

Each message that the Server accepts is a stringified JSON object with the following fields:

  • message (String): The name of the message we are sending. Any other fields depend on the value of the message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is a binary message sent from the Client to the Server containing a chunk of audio which will be referred to as AddAudio.
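A sketch of this convention from the client's side (the connection is assumed to be open already):

declare const ws: WebSocket;   // the open Realtime connection

// Control messages in either direction are stringified JSON objects with a "message" field.
function sendMessage(msg: { message: string; [key: string]: unknown }): void {
  ws.send(JSON.stringify(msg));
}

// The only binary frames are audio chunks, which the server interprets as AddAudio.
function sendAudioChunk(chunk: ArrayBuffer): void {
  ws.send(chunk);
}

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data as string);   // every server message is JSON text
  console.log("Received", msg.message);
};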

The following values of the message field are supported:

Sent messages

StartRecognition

Initiates a new recognition session.

message (required)
Constant value: StartRecognition

audio_format (object, required)
One of:

Raw audio samples, described by the following additional mandatory fields:

type (required)
Constant value: raw

encoding (string, required)
Possible values: [pcm_f32le, pcm_s16le, mulaw]

sample_rate (integer, required)
The sample rate of the audio in Hz.

Example: {"type":"raw","encoding":"pcm_s16le","sample_rate":44100}

transcription_config (object, required)

Contains configuration for this recognition session.

language (string, required)
Language model to process the audio input, normally specified as an ISO language code. The value must be consistent with the language code used in the API endpoint URL.
Example: en

domain (string)
Request a specialized model based on 'language' but optimized for a particular field, e.g. finance or medical.

output_locale (string)
Configure locale for outputted transcription. See output formatting.
Possible values: non-empty

additional_vocab (object[])
Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab, especially if you use a large word list. When initializing a session that uses additional_vocab in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
Each array entry is one of:
  • string (non-empty)

diarization (string)
Set to speaker to apply Speaker Diarization to the audio.
Possible values: [none, speaker]
Default value: none

max_delay (number)
This is the delay in seconds between the end of a spoken word and returning the Final transcript results. See Latency for more details.
Possible values: >= 0.7 and <= 4
Default value: 4

max_delay_mode (string)
This allows some additional time for Smart Formatting.
Possible values: [flexible, fixed]
Default value: flexible

speaker_diarization_config (object)

  max_speakers (integer)
  Configure the maximum number of speakers to detect. See Max Speakers.
  Possible values: >= 2 and <= 100
  Default value: 50

  prefer_current_speaker (boolean)
  When set to true, reduces the likelihood of incorrectly switching between similar sounding speakers. See Prefer Current Speaker.
  Default value: false

  speaker_sensitivity (float)
  Possible values: >= 0 and <= 1

audio_filtering_config (object)
Puts a lower limit on the volume of processed audio by using the volume_threshold setting. See Audio Filtering.

  volume_threshold (float)
  Possible values: >= 0 and <= 100

transcript_filtering_config (object)

  remove_disfluencies (boolean)
  When set to true, removes disfluencies from the transcript. See Removing disfluencies.

  replacements (object[])
  A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string. See Word replacement.
  Each array entry has:
    • from (string, required)
    • to (string, required)

enable_partials (boolean)
Whether or not to send Partials (i.e. AddPartialTranscript messages) as well as Finals (i.e. AddTranscript messages). See Partial transcripts.
Default value: false

enable_entities (boolean)
Default value: true

operating_point (string)
Which model you wish to use. See Operating points for more details.
Possible values: [standard, enhanced]
Default value: standard

punctuation_overrides (object)
Options for controlling punctuation in the output transcripts. See Punctuation Settings.

  permitted_marks (string[])
  The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
  Possible values: Value must match regular expression ^(.|all)$

  sensitivity (float)
  Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
  Possible values: >= 0 and <= 1

conversation_config (object)
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.

  end_of_utterance_silence_trigger (float)
  Possible values: >= 0 and <= 2
  Default value: 0

translation_config (object)
Specifies various configuration values for translation. All fields except target_languages are optional, using default values when omitted.

  target_languages (string[], required)
  List of languages to translate to from the source transcription language. Specified as an ISO Language Code.

  enable_partials (boolean)
  Whether or not to send Partials (i.e. AddPartialTranslation messages) as well as Finals (i.e. AddTranslation messages).
  Default value: false

audio_events_config (object)
Contains configuration for Audio Events.

  types (string[])
  List of Audio Event types to enable.
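A sketch of a typical StartRecognition message (the configuration values are illustrative, not defaults):

declare const ws: WebSocket;   // the open Realtime connection

ws.send(JSON.stringify({
  message: "StartRecognition",
  audio_format: { type: "raw", encoding: "pcm_s16le", sample_rate: 16000 },
  transcription_config: {
    language: "en",
    enable_partials: true,
    max_delay: 2,
    diarization: "speaker",
  },
}));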

AddAudio

A binary chunk of audio. The server confirms receipt by sending an AudioAdded message.

Payload: string (binary) — the raw audio data itself.

EndOfStream

Declares that the client has no more audio to send.

message (required)
Constant value: EndOfStream

last_seq_no (integer, required)
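A streaming sketch combining AddAudio and EndOfStream (it assumes last_seq_no is the count of audio chunks sent in the session):

declare const ws: WebSocket;   // the open Realtime connection
let seqNo = 0;

// Each binary frame is an AddAudio message; the server answers with AudioAdded.
function sendChunk(chunk: ArrayBuffer): void {
  ws.send(chunk);
  seqNo += 1;
}

// Declare that no more audio will follow.
function endStream(): void {
  ws.send(JSON.stringify({ message: "EndOfStream", last_seq_no: seqNo }));
}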

SetRecognitionConfig

Allows the client to re-configure the recognition session.

Only the following fields can be set through a SetRecognitionConfig message:

  • max_delay
  • max_delay_mode
  • enable_partials

If you wish to alter any other parameters you must terminate the session and restart with the altered configuration. Attempting otherwise will result in an error.

message (required)
Constant value: SetRecognitionConfig
transcription_config (object, required)

Contains configuration for this recognition session. The schema is identical to the transcription_config object described under StartRecognition above.
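A sketch of a mid-session reconfiguration limited to the fields listed above (values illustrative):

declare const ws: WebSocket;   // the open Realtime connection

ws.send(JSON.stringify({
  message: "SetRecognitionConfig",
  transcription_config: {
    language: "en",            // must stay consistent with the running session
    max_delay: 1,
    max_delay_mode: "fixed",
    enable_partials: false,
  },
}));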

Received messages

RecognitionStarted

Server response to StartRecognition, acknowledging that a recognition session has started.

message (required)
Constant value: RecognitionStarted

orchestrator_version (string)

id (string)

AudioAdded

Server response to AddAudio, indicating that audio has been added successfully.

When clients send audio faster than real-time, the server may read data slower than it's sent. If binary AddAudio messages exceed the server's internal buffer, the server will process other WebSocket messages until buffer space is available. Clients receive AudioAdded responses only after binary data is read. This can fill TCP buffers, potentially causing WebSocket write failures and abrupt connection closure. Clients can monitor the WebSocket's bufferedAmount attribute to prevent this.

message (required)
Constant value: AudioAdded

seq_no (integer, required)
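A browser-side back-pressure sketch based on bufferedAmount (the threshold and polling interval are illustrative):

declare const ws: WebSocket;   // the open Realtime connection
const MAX_BUFFERED_BYTES = 1_000_000;

// Pause sending while too much data is queued locally, so the server's
// buffer and the underlying TCP buffers are not overrun.
async function sendWithBackpressure(chunk: ArrayBuffer): Promise<void> {
  while (ws.bufferedAmount > MAX_BUFFERED_BYTES) {
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  ws.send(chunk);
}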

AddPartialTranscript

A partial transcript is a transcript that can be changed and expanded by future AddPartialTranscript messages as more words are spoken, until the Final AddTranscript message is sent for that audio.

Partials will only be sent if transcription_config.enable_partials is set to true in the StartRecognition message.

The message structure is the same as AddTranscript, with a few limitations.

For AddPartialTranscript messages, the confidence field for alternatives has no meaning and should not be relied on.

message (required)
Constant value: AddPartialTranscript

The remaining fields (format, metadata, results) are identical to those of the AddTranscript message, described below.
AddTranscript

Contains the final transcript of a part of the audio that the client has sent.

message (required)
Constant value: AddTranscript

format (string)
Speechmatics JSON output format version number.
Example: 2.1

metadata (object, required)

  start_time (float, required)

  end_time (float, required)

  transcript (string, required)
  The entire transcript contained in the segment, in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text. However, this transcript lacks the detailed information contained in the results field of the message, such as the timings and confidences for each word.

results (object[], required)
Each array entry has:

  type (string, required)
  Possible values: [word, punctuation]

  start_time (float, required)

  end_time (float, required)

  channel (string)

  attaches_to (string)
  Possible values: [next, previous, none, both]

  is_eos (boolean)

  alternatives (object[])
  Each array entry has:

    content (string, required)
    A word or punctuation mark.

    confidence (float, required)
    A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).

    language (string)
    The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition message.

    display (object)
    Information about how the word/symbol should be displayed.

      direction (string, required)
      Either ltr for words that should be displayed left-to-right, or rtl for words that should be displayed right-to-left.
      Possible values: [ltr, rtl]

    speaker (string)
    Label indicating who said that word. Only set if diarization is enabled.

    tags (string[])
    This is a set list of profanities and disfluencies respectively that cannot be altered by the end user. [disfluency] is only present in English, and [profanity] is present in English, Spanish, and Italian.
    Possible values: [disfluency, profanity]

  score (float)
  Possible values: >= 0 and <= 1

  volume (float)
  Possible values: >= 0 and <= 100
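A consumer sketch for partial and final transcripts (render is an assumed UI helper; partials require enable_partials in the session config):

declare const ws: WebSocket;                     // the open Realtime connection
declare function render(text: string): void;     // assumed UI helper

let committed = "";    // text confirmed by AddTranscript messages
let inProgress = "";   // text from the latest AddPartialTranscript, may still change

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "AddPartialTranscript") {
    inProgress = msg.metadata.transcript;
  } else if (msg.message === "AddTranscript") {
    committed += msg.metadata.transcript;
    inProgress = "";
  }
  render(committed + inProgress);
};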
AddPartialTranslation

Contains a work-in-progress translation of a part of the audio that the client has sent.

message (required)
Constant value: AddPartialTranslation

format (string)
Speechmatics JSON output format version number.
Example: 2.1

language (string, required)
The language that the translation relates to, given as an ISO language code.

results (object[], required)
Each array entry has:

  content (string, required)

  start_time (float, required)
  The start time (in seconds) of the original transcribed audio segment.

  end_time (float, required)
  The end time (in seconds) of the original transcribed audio segment.

  speaker (string)
  The speaker that uttered the speech, if speaker diarization is enabled.
AddTranslation

Contains the final translation of a part of the audio that the client has sent.

message (required)
Constant value: AddTranslation

format (string)
Speechmatics JSON output format version number.
Example: 2.1

language (string, required)
The language that the translation relates to, given as an ISO language code.

results (object[], required)
Each array entry has:

  content (string, required)

  start_time (float, required)
  The start time (in seconds) of the original transcribed audio segment.

  end_time (float, required)
  The end time (in seconds) of the original transcribed audio segment.

  speaker (string)
  The speaker that uttered the speech, if speaker diarization is enabled.
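A minimal consumer sketch for final translations (it assumes translation_config.target_languages was set in StartRecognition):

declare const ws: WebSocket;   // the open Realtime connection

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "AddTranslation") {
    for (const result of msg.results) {
      console.log(`[${msg.language}] ${result.start_time}-${result.end_time}s: ${result.content}`);
    }
  }
};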
EndOfTranscript

Server response to EndOfStream, after the server has finished sending all AddTranscript messages.

message (required)
Constant value: EndOfTranscript

AudioEventStarted

Start of an audio event detected.

message (required)
Constant value: AudioEventStarted

event (object, required)

  type (string, required)
  The type of audio event that has started or ended. See our list of supported Audio Event types.

  start_time (float, required)
  The time (in seconds) of the audio corresponding to the beginning of the audio event.

  confidence (float, required)
  A confidence score assigned to the audio event. Ranges from 0.0 (least confident) to 1.0 (most confident).
  Possible values: >= 0 and <= 1

AudioEventEnded

End of an audio event detected.

message (required)
Constant value: AudioEventEnded

event (object, required)

  type (string, required)
  The type of audio event that has started or ended. See our list of supported Audio Event types.

  end_time (float, required)

EndOfUtterance

Indicates the end of an utterance, triggered by a configurable period of non-speech. The message is sent when no speech has been detected for a short period of time, configurable by the end_of_utterance_silence_trigger parameter in conversation_config (see End Of Utterance).

Like punctuation, an EndOfUtterance has zero duration.

message (required)
Constant value: EndOfUtterance

metadata (object, required)

  start_time (float)
  The time (in seconds) that the end of utterance was detected.

  end_time (float)
  The time (in seconds) that the end of utterance was detected.

Info

Additional information sent from the server to the client.

message (required)
Constant value: Info

type (string, required)

The following are the possible info types:

recognition_quality
  Informs the client what particular quality-based model is used to handle the recognition. Sent to the client immediately after the WebSocket handshake is completed.

model_redirect
  Informs the client that a deprecated language code has been specified, and will be handled with a different model. For example, if the model parameter is set to one of en-US, en-GB, or en-AU, then the request may be internally redirected to the Global English model (en).

deprecated
  Informs about using a feature that is going to be removed in a future release.

session_transfer
  Informs that the session has been seamlessly transferred to another backend, with the reason: Session has been transferred to a new backend. This typically occurs due to backend maintenance operations or migration from a faulty backend.

Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]

reason (string, required)

code (integer)

seq_no (integer)

quality (string)
Only set when type is recognition_quality. Quality-based model name. It is one of "telephony", "broadcast". The model is selected automatically: for high-quality audio (12kHz+) the broadcast model is used; for lower-quality audio the telephony model is used.

usage (number)
Only set when type is concurrent_session_usage. Indicates the current usage (number of active concurrent sessions).

quota (number)
Only set when type is concurrent_session_usage. Indicates the current quota (maximum number of concurrent sessions allowed).

last_updated (string)
Only set when type is concurrent_session_usage. Indicates the timestamp of the most recent usage update, in the format YYYY-MM-DDTHH:MM:SSZ (UTC). This value is updated even when usage exceeds the quota, as it represents the most recent known data. In some cases, it may be empty or outdated due to internal errors preventing a successful update.
Example: 2025-03-25T08:45:31Z

Warning

Warning messages sent from the server to the client.

message (required)
Constant value: Warning

type (string, required)

The following are the possible warning types:

duration_limit_exceeded
  The maximum allowed duration of a single utterance to process has been exceeded. Any AddAudio messages received that exceed this limit are confirmed with AudioAdded, but are ignored by the transcription engine. Exceeding the limit triggers the same mechanism as receiving an EndOfStream message, so the Server will eventually send an EndOfTranscript message and suspend.

unsupported_translation_pair
  One of the requested translation target languages is unsupported (given the source audio language). The error message specifies the unsupported language pair.

idle_timeout
  Informs that the session is approaching the idle duration limit (no audio data sent within the last hour), with a reason of the form:
  Session will timeout in {time_remaining}m due to inactivity, no audio sent within the last {time_elapsed}m
  Currently the server will send messages at 15, 10 and 5 minutes prior to timeout, and will send a final error message on timeout, before closing the connection with the code 1008 (see Realtime limits for more information).

session_timeout
  Informs that the session is approaching the maximum session duration limit (48 hours), with a reason of the form:
  Session will timeout in {time_remaining}m due to max duration, session has been active for {time_elapsed}m
  Currently the server will send messages at 45, 30 and 15 minutes prior to timeout, and will send a final error message on timeout, before closing the connection with the code 1008 (see Realtime limits for more information).

empty_translation_target_list
  No supported translation target languages specified. Translation will not run.

add_audio_after_eos
  The protocol specification doesn't allow adding audio after EndOfStream has been received. Any AddAudio messages after this will be ignored.

Possible values: [duration_limit_exceeded, unsupported_translation_pair, idle_timeout, session_timeout, empty_translation_target_list, add_audio_after_eos]

reason (string, required)

code (integer)

seq_no (integer)

duration_limit (number)
Only set when type is duration_limit_exceeded. Indicates the limit that was exceeded (in seconds).

Error

Error messages sent from the server to the client. After any error, the transcription is terminated and the connection is closed.

message (required)
Constant value: Error

type (string, required)

The following are the possible error types:

invalid_message
  The message received was not understood.

invalid_model
  Unable to use the model for the recognition. This can happen if the language is not supported at all, or is not available for the user.

invalid_config
  The config received contains some wrong or unsupported fields, or too many translation target languages were requested.

invalid_audio_type
  Audio type is not supported, is deprecated, or the audio_type is malformed.

invalid_output_format
  Output format is not supported, is deprecated, or the output_format is malformed.

not_authorised
  User was not recognised, or the API key provided is not valid.

insufficient_funds
  User doesn't have enough credits, or any other reason preventing the user from being charged for the job properly.

not_allowed
  User is not allowed to use this message (is not allowed to perform the action the message would invoke).

job_error
  Unable to do any work on this job; the server might have timed out, etc.

data_error
  Unable to accept the data specified, usually because there is too much data being sent at once.

buffer_error
  Unable to fit the data in a corresponding buffer. This can happen for clients sending the input data faster than real-time.

protocol_error
  Message received was syntactically correct, but could not be accepted due to protocol limitations. This is usually caused by messages sent in the wrong order.

quota_exceeded
  Maximum number of concurrent connections allowed for the contract has been reached.

timelimit_exceeded
  Usage quota for the contract has been reached.

idle_timeout
  Idle duration limit was reached (no audio data sent within the last hour); a closing handshake with code 1008 follows this in-band error.

session_timeout
  Max session duration was reached (maximum session duration of 48 hours); a closing handshake with code 1008 follows this in-band error.

session_transfer
  An error while transferring the session to another backend, with the reason: Session transfer failed. This may occur when moving sessions due to backend maintenance operations or migration from a faulty backend.

unknown_error
  An error that did not fit any of the types above.

invalid_message, protocol_error and unknown_error can be triggered as a response to any type of message.

Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, invalid_output_format, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, quota_exceeded, timelimit_exceeded, idle_timeout, session_timeout, session_transfer, unknown_error]

reason (string, required)

code (integer)

seq_no (integer)

WebSocket errors

In the Realtime SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and associated error types. The error types are provided in the payload of the close message.

WebSocket Close Code    WebSocket Close Payload
1003                    protocol_error
1008                    policy_violation
1011                    internal_error
4001                    not_authorised
4003                    not_allowed
4004                    invalid_model
4005                    quota_exceeded
4006                    timelimit_exceeded
4013                    job_error
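A sketch of mapping in-band errors and close codes back to the table above (logging only; the retryable codes match the Client Retry section):

declare const ws: WebSocket;   // the open Realtime connection
const RETRYABLE_CLOSE_CODES = new Set([4005, 4013, 1011]);

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "Error") {
    // In-band error; a WebSocket close with a matching code usually follows.
    console.error(`Error (${msg.type}): ${msg.reason}`);
  }
};

ws.onclose = (event) => {
  console.log(`Closed with code ${event.code}`);
  if (RETRYABLE_CLOSE_CODES.has(event.code)) {
    // See "Client Retry": wait at least 5-10 seconds before reconnecting.
  }
};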