Realtime API Reference
GET wss://eu2.rt.speechmatics.com/v2/
Protocol overview
A basic Realtime session will have the following message exchanges:
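In outline (each message is defined later in this reference):
1. The client opens the WebSocket connection and sends StartRecognition.
2. The server replies with RecognitionStarted.
3. The client streams binary AddAudio chunks; the server acknowledges each one with AudioAdded.
4. The server sends AddPartialTranscript messages (if Partials are enabled) and AddTranscript messages as results become available.
5. The client sends EndOfStream when it has no more audio to send.
6. The server sends EndOfTranscript after the last AddTranscript, and the connection can then be closed.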
Browser based transcription
When starting a Realtime transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.
Browsers cannot set an Authorization header on a WebSocket connection, so you must provide the temporary key as a query parameter instead. For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
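A minimal browser-side connection using the native WebSocket API might look like the following sketch; temporaryKey and the message handling are placeholders:

// Assumes a temporary key has already been fetched from your own backend.
const temporaryKey = "<temporary-key>";

const socket = new WebSocket(
  `wss://eu2.rt.speechmatics.com/v2?jwt=${encodeURIComponent(temporaryKey)}`
);

socket.addEventListener("open", () => {
  // Send StartRecognition here (see Sent messages below).
});

socket.addEventListener("message", (event) => {
  // Messages from the server are stringified JSON.
  const msg = JSON.parse(event.data as string);
  console.log(msg.message, msg);
});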
Handshake responses
Successful Response
101 Switching Protocols - switch to the WebSocket protocol
Here is an example for a successful WebSocket handshake:
GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
Malformed request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
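A minimal reconnection sketch for these close codes; the connectWithRetry helper and the 7 second delay are illustrative choices within the recommended 5-10 second range:

const RETRYABLE_CLOSE_CODES = new Set([4005, 4013, 1011]);
const RETRY_DELAY_MS = 7_000;

function connectWithRetry(url: string): void {
  const socket = new WebSocket(url);

  socket.addEventListener("close", (event) => {
    if (RETRYABLE_CLOSE_CODES.has(event.code)) {
      // quota_exceeded, job_error or internal_error: retry after a pause.
      setTimeout(() => connectWithRetry(url), RETRY_DELAY_MS);
    }
    // Other close codes are not retried here.
  });
}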
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
message (String): The name of the message being sent. Any other fields depend on the value of message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which will be referred to as AddAudio.
The following values of the message field are supported:
Sent messages
StartRecognition
Initiates a new recognition session.
StartRecognition
audio_format object required
- Raw
- File
Raw audio samples, described by the following additional mandatory fields:
raw
Possible values: [pcm_f32le, pcm_s16le, mulaw]
The sample rate of the audio in Hz.
Example: {"type":"raw","encoding":"pcm_s16le","sample_rate":44100}
Choose this option to send audio encoded in a recognized format. The AddAudio messages have to provide all the file contents, including any headers. The file is usually not accepted all at once, but segmented into reasonably sized messages.
Note: Only the following formats are supported: wav, mp3, aac, ogg, mpeg, amr, m4a, mp4, flac
file
transcription_config object required
Contains configuration for this recognition session.
Language model to process the audio input, normally specified as an ISO language code. The value must be consistent with the language code used in the API endpoint URL.
en
Request a specialized model based on 'language' but optimized for a particular field, e.g. finance or medical.
Configure locale for outputted transcription. See output formatting.
Possible values: non-empty
additional_vocab object[]
Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab
, especially if you use a large word list. When initializing a session that uses additional_vocab
in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
- String
- Object
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
Set to speaker
to apply Speaker Diarization to the audio.
Possible values: [none, speaker]
none
This is the delay in seconds between the end of a spoken word and returning the Final transcript results. See Latency for more details
Possible values: >= 0.7 and <= 4
4
This allows some additional time for Smart Formatting.
Possible values: [flexible, fixed]
flexible
speaker_diarization_config object
Configure the maximum number of speakers to detect. See Max Speakers.
Possible values: >= 2 and <= 100
50
When set to true
, reduces the likelihood of incorrectly switching between similar sounding speakers.
See Prefer Current Speaker.
false
Possible values: >= 0 and <= 1
audio_filtering_config object
Puts a lower limit on the volume of processed audio by using the volume_threshold
setting. See Audio Filtering.
Possible values: >= 0 and <= 100
transcript_filtering_config object
When set to true
, removes disfluencies from the transcript. See Removing disfluencies
replacements object[]
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string. See Word replacement
Whether or not to send Partials (i.e. AddPartialTranscript messages) as well as Finals (i.e. AddTranscript messages). See Partial transcripts.
false
true
Which model you wish to use. See Operating points for more details.
Possible values: [standard, enhanced]
standard
punctuation_overrides object
Options for controlling punctuation in the output transcripts. See Punctuation Settings
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger
is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance
message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
0
translation_config object
Specifies various configuration values for translation. All fields except target_languages
are optional, using default values when omitted.
List of languages to translate to from the source transcription language. Specified as an ISO Language Code.
Whether or not to send Partials (i.e. AddPartialTranslation
messages) as well as Finals (i.e. AddTranslation
messages).
false
audio_events_config object
Contains configuration for Audio Events
List of Audio Event types to enable.
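Putting some of the options above together, an illustrative StartRecognition message might be sent as follows; all values are examples rather than defaults, only configuration blocks named in this reference are shown, and socket is an open WebSocket:

socket.send(JSON.stringify({
  message: "StartRecognition",
  audio_format: { type: "raw", encoding: "pcm_s16le", sample_rate: 44100 },
  transcription_config: {
    language: "en",
    enable_partials: true,
    max_delay: 2,
  },
  translation_config: { target_languages: ["es"] },
}));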
AddAudio
A binary chunk of audio. The server confirms receipt by sending an AudioAdded message.
EndOfStream
Declares that the client has no more audio to send.
EndOfStream
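A sketch of streaming audio and then ending the stream; the chunking is illustrative, and the last_seq_no field (assumed here to count the AddAudio chunks sent) should be checked against the EndOfStream schema:

// Stream raw PCM audio as binary AddAudio frames, then signal EndOfStream.
function streamAudio(socket: WebSocket, pcmChunks: ArrayBuffer[]): void {
  let seqNo = 0;

  for (const chunk of pcmChunks) {
    socket.send(chunk); // a binary frame is an AddAudio message
    seqNo += 1;         // the server acknowledges each chunk with AudioAdded
  }

  // last_seq_no is an assumption for illustration: the number of chunks sent.
  socket.send(JSON.stringify({ message: "EndOfStream", last_seq_no: seqNo }));
}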
SetRecognitionConfig
Allows the client to re-configure the recognition session.
Only the following fields can be set through a SetRecognitionConfig message:
max_delay
max_delay_mode
enable_partials
If you wish to alter any other parameters you must terminate the session and restart with the altered configuration. Attempting otherwise will result in an error.
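For example, an illustrative mid-session reconfiguration touching only those fields; the language value simply repeats the session's original language (transcription_config is a required object in this message):

socket.send(JSON.stringify({
  message: "SetRecognitionConfig",
  transcription_config: {
    language: "en",       // same language as the original StartRecognition
    max_delay: 3,
    enable_partials: false,
  },
}));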
SetRecognitionConfig
transcription_config object required
Contains configuration for this recognition session.
Language model to process the audio input, normally specified as an ISO language code. The value must be consistent with the language code used in the API endpoint URL.
en
Request a specialized model based on 'language' but optimized for a particular field, e.g. finance or medical.
Configure locale for outputted transcription. See output formatting.
Possible values: non-empty
additional_vocab object[]
Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab
, especially if you use a large word list. When initializing a session that uses additional_vocab
in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
- String
- Object
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
Set to speaker
to apply Speaker Diarization to the audio.
Possible values: [none, speaker]
none
This is the delay in seconds between the end of a spoken word and returning the Final transcript results. See Latency for more details
Possible values: >= 0.7 and <= 4
4
This allows some additional time for Smart Formatting.
Possible values: [flexible, fixed]
flexible
speaker_diarization_config object
Configure the maximum number of speakers to detect. See Max Speakers.
Possible values: >= 2 and <= 100
50
When set to true
, reduces the likelihood of incorrectly switching between similar sounding speakers.
See Prefer Current Speaker.
false
Possible values: >= 0 and <= 1
audio_filtering_config object
Puts a lower limit on the volume of processed audio by using the volume_threshold
setting. See Audio Filtering.
Possible values: >= 0 and <= 100
transcript_filtering_config object
When set to true
, removes disfluencies from the transcript. See Removing disfluencies
replacements object[]
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string. See Word replacement
Whether or not to send Partials (i.e. AddPartialTranscript messages) as well as Finals (i.e. AddTranscript messages). See Partial transcripts.
false
true
Which model you wish to use. See Operating points for more details.
Possible values: [standard, enhanced]
standard
punctuation_overrides object
Options for controlling punctuation in the output transcripts. See Punctuation Settings
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger
is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance
message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
0
Received messages
RecognitionStarted
Server response to StartRecognition, acknowledging that a recognition session has started.
RecognitionStarted
AudioAdded
Server response to AddAudio, indicating that audio has been added successfully.
When clients send audio faster than real-time, the server may read data slower than it's sent. If binary AddAudio messages exceed the server's internal buffer, the server will process other WebSocket messages until buffer space is available. Clients receive AudioAdded responses only after binary data is read. This can fill TCP buffers, potentially causing WebSocket write failures and connection closure with prejudice. Clients can monitor the WebSocket's bufferedAmount
attribute to prevent this.
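A minimal client-side throttling sketch based on bufferedAmount; the byte threshold and poll interval are arbitrary choices:

// Pause sending when too much data is queued locally, so the server-side
// audio buffer is not overwhelmed.
const MAX_BUFFERED_BYTES = 1_000_000;

async function sendWhenReady(socket: WebSocket, chunk: ArrayBuffer): Promise<void> {
  while (socket.bufferedAmount > MAX_BUFFERED_BYTES) {
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  socket.send(chunk);
}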
AudioAdded
AddPartialTranscript
A partial transcript is a transcript that can be changed in a future AddPartialTranscript
as more words are spoken until the AddTranscript
Final message is sent for that audio.
Partials will only be sent if transcription_config.enable_partials
is set to true
in the StartRecognition
message.
The message structure is the same as AddTranscript
, with a few limitations.
For AddPartialTranscript
messages the confidence
field for alternatives
has no meaning and should not be relied on.
AddPartialTranscript
Speechmatics JSON output format version number.
2.1
metadata object required
The entire transcript contained in the segment in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text.
However, this transcript lacks the detailed information contained in the results field of the message, such as the timings and confidences for each word.
results object[] required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
A word or punctuation mark.
A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).
The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition
message.
display object
Information about how the word/symbol should be displayed.
Either ltr for words that should be displayed left-to-right, or rtl for words that should be displayed right-to-left.
Possible values: [ltr, rtl]
Label indicating who said that word. Only set if diarization is enabled.
This is a set list of profanities and disfluencies that cannot be altered by the end user. [disfluency] is only present in English, and [profanity] is present in English, Spanish, and Italian.
Possible values: [disfluency, profanity]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
AddTranscript
Contains the final transcript of a part of the audio that the client has sent.
AddTranscript
Speechmatics JSON output format version number.
2.1
metadata object required
The entire transcript contained in the segment in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text.
However, this transcript lacks the detailed information contained in the results field of the message, such as the timings and confidences for each word.
results object[] required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
A word or punctuation mark.
A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).
The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition
message.
display object
Information about how the word/symbol should be displayed.
Either ltr for words that should be displayed left-to-right, or rtl for words that should be displayed right-to-left.
Possible values: [ltr, rtl]
Label indicating who said that word. Only set if diarization is enabled.
This is a set list of profanities and disfluencies that cannot be altered by the end user. [disfluency] is only present in English, and [profanity] is present in English, Spanish, and Italian.
Possible values: [disfluency, profanity]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
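One common consumption pattern is to keep a committed transcript built from Finals and overlay the latest Partial on top. A sketch, assuming metadata.transcript is the block-of-text field described above and socket is the open WebSocket:

let finalTranscript = "";  // built from AddTranscript messages
let currentPartial = "";   // replaced by each AddPartialTranscript

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);

  if (msg.message === "AddTranscript") {
    finalTranscript += msg.metadata.transcript; // committed, will not change
    currentPartial = "";
  } else if (msg.message === "AddPartialTranscript") {
    currentPartial = msg.metadata.transcript;   // may change until the Final arrives
  }

  render(finalTranscript + currentPartial);     // render() is a placeholder
});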
AddPartialTranslation
Contains a work-in-progress translation of a part of the audio that the client has sent.
AddPartialTranslation
Speechmatics JSON output format version number.
2.1
Language the translation relates to, given as an ISO language code.
results object[] required
The start time (in seconds) of the original transcribed audio segment
The end time (in seconds) of the original transcribed audio segment
The speaker that uttered the speech if speaker diarization is enabled
AddTranslation
Contains the final translation of a part of the audio that the client has sent.
AddTranslation
Speechmatics JSON output format version number.
2.1
Language the translation relates to, given as an ISO language code.
results object[] required
The start time (in seconds) of the original transcribed audio segment
The end time (in seconds) of the original transcribed audio segment
The speaker that uttered the speech if speaker diarization is enabled
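A sketch of consuming final translations for a single target language; the language field name and the per-result content field are assumptions based on the descriptions above:

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message !== "AddTranslation") return;

  for (const segment of msg.results) {
    // start_time / end_time refer to the original transcribed audio segment;
    // content (assumed field name) holds the translated text.
    console.log(`[${msg.language}] ${segment.start_time}-${segment.end_time}: ${segment.content}`);
  }
});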
EndOfTranscript
Server response to EndOfStream
, after the server has finished sending all AddTranscript messages.
EndOfTranscript
AudioEventStarted
Start of an audio event detected.
AudioEventStarted
event object required
The type of audio event that has started or ended. See our list of supported Audio Event types.
The time (in seconds) of the audio corresponding to the beginning of the audio event.
A confidence score assigned to the audio event. Ranges from 0.0 (least confident) to 1.0 (most confident).
Possible values: >= 0 and <= 1
AudioEventEnded
End of an audio event detected.
AudioEventEnded
event object required
The type of audio event that has started or ended. See our list of supported Audio Event types.
EndOfUtterance
Indicates the end of an utterance, triggered by a configurable period of non-speech.
The message is sent when no speech has been detected for a short period of time, configurable by the end_of_utterance_silence_trigger
parameter in conversation_config
(see End Of Utterance).
Like punctuation, an EndOfUtterance
has zero duration.
EndOfUtterance
metadata object required
The time (in seconds) that the end of utterance was detected.
The time (in seconds) that the end of utterance was detected.
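An illustrative way to enable and consume this message; the 0.8 second trigger is just an example value:

// In StartRecognition (shown here inside transcription_config, per the layout above):
//   "conversation_config": { "end_of_utterance_silence_trigger": 0.8 }

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "EndOfUtterance") {
    // The speaker has paused for at least the configured trigger; treat the
    // transcript received so far as a completed utterance.
    console.log("Utterance ended at", msg.metadata);
  }
});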
Info
Additional information sent from the server to the client.
Info
The following are the possible info types:
Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
Only set when type is recognition_quality. Quality-based model name, one of "telephony" or "broadcast". The model is selected automatically: for high-quality audio (12kHz+) the broadcast model is used; for lower-quality audio the telephony model is used.
Only set when type is concurrent_session_usage. Indicates the current usage (number of active concurrent sessions).
Only set when type is concurrent_session_usage. Indicates the current quota (maximum number of concurrent sessions allowed).
Only set when type is concurrent_session_usage. Indicates the timestamp of the most recent usage update, in the format YYYY-MM-DDTHH:MM:SSZ (UTC). This value is updated even when usage exceeds the quota, as it represents the most recent known data. In some cases, it may be empty or outdated due to internal errors preventing a successful update.
2025-03-25T08:45:31Z
Warning
Warning messages sent from the server to the client.
Warning
The following are the possible warning types:
Possible values: [duration_limit_exceeded, unsupported_translation_pair, idle_timeout, session_timeout, empty_translation_target_list, add_audio_after_eos]
Only set when type is duration_limit_exceeded. Indicates the limit that was exceeded (in seconds).
Error
Error messages sent from the server to the client. After any error, the transcription is terminated and the connection is closed.
Error
The following are the possible error types:
invalid_message, protocol_error and unknown_error can be triggered in response to any type of message.
Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, invalid_output_format, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, quota_exceeded, timelimit_exceeded, idle_timeout, session_timeout, session_transfer, unknown_error]
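Since any Error terminates the session, a client typically just records the type and then handles the subsequent close; a minimal sketch:

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "Error") {
    // The transcription is terminated and the connection will be closed.
    console.error("Realtime error:", msg.type);
  }
});

socket.addEventListener("close", (event) => {
  // See Client Retry above: 4005, 4013 and 1011 are worth retrying after 5-10 seconds.
  console.warn("Connection closed:", event.code, event.reason);
});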
Websocket errors
In the Realtime SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and associated error types. The error types are provided in the payload of the close message.