Realtime API Reference
GET wss://eu2.rt.speechmatics.com/v2/
Protocol overview
A basic Realtime session will have the following message exchanges:
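A sketch of that exchange, based on the message types documented in this reference (the field values here are illustrative only; see each message section below for the full schema):

```python
import json

# A minimal Realtime session, sketched as the ordered message exchange.
# AddAudio is the one binary (non-JSON) message in the protocol.
session_flow = [
    ("client", {"message": "StartRecognition"}),    # client opens the session
    ("server", {"message": "RecognitionStarted"}),  # server confirms
    ("client", "AddAudio (binary audio chunk)"),    # repeated for each chunk
    ("server", {"message": "AudioAdded"}),          # server confirms each chunk
    ("server", {"message": "AddPartialTranscript"}),
    ("server", {"message": "AddTranscript"}),
    ("client", {"message": "EndOfStream"}),         # client signals end of audio
    ("server", {"message": "EndOfTranscript"}),     # server finishes, then closes
]

for sender, msg in session_flow:
    payload = msg if isinstance(msg, str) else json.dumps(msg)
    print(f"{sender:>6}: {payload}")
```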
Browser based transcription
When starting a Real-Time transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.
To do so, you must provide the temporary key as a query parameter, because browsers cannot set custom headers (such as Authorization) on WebSocket connections. For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
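For example, building the connection URL in Python (the temporary key value here is a hypothetical placeholder):

```python
from urllib.parse import urlencode

def realtime_url(temporary_key: str) -> str:
    """Build a Realtime WebSocket URL with the temporary key as a query parameter."""
    base = "wss://eu2.rt.speechmatics.com/v2"
    return f"{base}?{urlencode({'jwt': temporary_key})}"

print(realtime_url("my-temporary-key"))  # wss://eu2.rt.speechmatics.com/v2?jwt=my-temporary-key
```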
Handshake Responses
Successful Response
101 Switching Protocols - Switch to WebSocket protocol
Here is an example for a successful WebSocket handshake:
GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
Malformed Request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:

message (String): The name of the message we are sending. Any other fields depend on the value of message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well. The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which will be referred to as AddAudio.

The following values of the message field are supported:
Sent messages
StartRecognition
audio_format object required
One of the following two variants:
- Raw
- File

Raw
Raw audio samples, described by the following additional mandatory fields:
type - Possible values: [raw]
encoding - Possible values: [pcm_f32le, pcm_s16le, mulaw]
sample_rate - The sample rate of the audio in Hz.
Example: {"type":"raw","encoding":"pcm_s16le","sample_rate":44100}
File
Choose this option to send audio encoded in a recognized format. The AddAudio messages have to provide all the file contents, including any headers. The file is usually not accepted all at once, but segmented into reasonably sized messages.
Note: Only the following formats are supported: wav, mp3, aac, ogg, mpeg, amr, m4a, mp4, flac
type - Possible values: [file]
transcription_config object required
Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
Possible values: non-empty
additional_vocab object[]
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
Possible values: [none, speaker]
Possible values: >= 0
Possible values: [flexible, fixed]
speaker_diarization_config object
Possible values: >= 2 and <= 100
Possible values: >= 0 and <= 1
audio_filtering_config object
Possible values: >= 0 and <= 100
transcript_filtering_config
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
false
true
Possible values: [standard, enhanced]
punctuation_overrides object
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
Default: 0
translation_config object
false
audio_events_config object
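Putting the pieces together, a StartRecognition message for raw PCM audio might be built like this. The audio_format fields match the documented raw-audio example; the transcription_config shown is a minimal assumption (a language field only) rather than the full set of fields listed above:

```python
import json

def start_recognition(language: str, sample_rate: int = 44100) -> str:
    """Build a StartRecognition message for raw 16-bit PCM audio."""
    msg = {
        "message": "StartRecognition",
        "audio_format": {  # matches the documented raw-audio example
            "type": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
        },
        "transcription_config": {
            "language": language,  # assumed minimal config; see the field list above
        },
    }
    return json.dumps(msg)

print(start_recognition("en"))
```

The message is sent as the first text frame after the WebSocket handshake; binary AddAudio frames follow once RecognitionStarted is received.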
AddAudio
EndOfStream
SetRecognitionConfig
transcription_config object required
Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
Possible values: non-empty
additional_vocab object[]
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
Possible values: [none, speaker]
Possible values: >= 0
Possible values: [flexible, fixed]
speaker_diarization_config object
Possible values: >= 2 and <= 100
Possible values: >= 0 and <= 1
audio_filtering_config object
Possible values: >= 0 and <= 100
transcript_filtering_config
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
false
true
Possible values: [standard, enhanced]
punctuation_overrides object
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
Default: 0
Received messages
RecognitionStarted
AudioAdded
AddPartialTranscript
Speechmatics JSON output format version number.
2.1
metadata object required
results object[] required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
display
Possible values: [ltr, rtl]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
AddTranscript
Speechmatics JSON output format version number.
2.1
metadata object required
results object[] required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
display
Possible values: [ltr, rtl]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
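As an illustration, the word tokens in an AddTranscript payload can be collected from results[].alternatives[]. Note that the content field on each alternative is an assumption for this sketch; it does not appear in the field list above:

```python
import json

def words(add_transcript_json: str) -> list[str]:
    """Extract word tokens from a stringified AddTranscript message."""
    msg = json.loads(add_transcript_json)
    out = []
    for result in msg.get("results", []):
        if result.get("type") != "word":
            continue  # skip punctuation results
        alternatives = result.get("alternatives", [])
        if alternatives:
            out.append(alternatives[0].get("content", ""))  # "content" is assumed
    return out

# Hypothetical example payload for demonstration.
example = json.dumps({
    "message": "AddTranscript",
    "results": [
        {"type": "word", "alternatives": [{"content": "hello"}]},
        {"type": "punctuation", "alternatives": [{"content": "."}]},
    ],
})
print(words(example))  # ['hello']
```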
AddPartialTranslation
Speechmatics JSON output format version number.
2.1
results object[] required
AddTranslation
Speechmatics JSON output format version number.
2.1
results object[] required
EndOfTranscript
AudioEventStarted
event object required
AudioEventEnded
event object required
EndOfUtterance
metadata object required
Info
The following are the possible info types:
Info Type | Description |
---|---|
recognition_quality | Informs the client what particular quality-based model is used to handle the recognition. Sent to the client immediately after the WebSocket handshake is completed. |
model_redirect | Informs the client that a deprecated language code has been specified, and will be handled with a different model. For example, if the model parameter is set to one of en-US , en-GB , or en-AU , then the request may be internally redirected to the Global English model (en ). |
deprecated | Informs about using a feature that is going to be removed in a future release. |
session_transfer | Informs that the session has been seamlessly transferred to another backend, with the reason: Session has been transferred to a new backend. This typically occurs due to backend maintenance operations or migration from a faulty backend. |
Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
Only set when type is recognition_quality. Quality-based model name; one of "telephony", "broadcast". The model is selected automatically: for high-quality audio (12kHz+) the broadcast model is used, for lower-quality audio the telephony model is used.
Only set when type is concurrent_session_usage. Indicates the current usage (number of active concurrent sessions).
Only set when type is concurrent_session_usage. Indicates the current quota (maximum number of concurrent sessions allowed).
Only set when type is concurrent_session_usage. Indicates the timestamp of the most recent usage update, in the format YYYY-MM-DDTHH:MM:SSZ (UTC). This value is updated even when usage exceeds the quota, as it represents the most recent known data. In some cases, it may be empty or outdated due to internal errors preventing a successful update.
Example: 2025-03-25T08:45:31Z
Warning
The following are the possible warning types:
Warning Type | Description |
---|---|
duration_limit_exceeded | The maximum allowed duration of a single utterance to process has been exceeded. Any AddAudio messages received that exceed this limit are confirmed with AudioAdded , but are ignored by the transcription engine. Exceeding the limit triggers the same mechanism as receiving an EndOfStream message, so the Server will eventually send an EndOfTranscript message and suspend. |
unsupported_translation_pair | One of the requested translation target languages is unsupported (given the source audio language). The error message specifies the unsupported language pair. |
idle_timeout | Informs that the session is approaching the idle duration limit (no audio data sent within the last hour), with a reason of the form: |
session_timeout | Informs that the session is approaching the max session duration limit (maximum session duration of 48 hours), with a reason of the form: |
empty_translation_target_list | No supported translation target languages specified. Translation will not run. |
add_audio_after_eos | The protocol specification doesn't allow adding audio after EndOfStream has been received. Any AddAudio messages after this will be ignored. |
Possible values: [duration_limit_exceeded, unsupported_translation_pair, idle_timeout, session_timeout, empty_translation_target_list, add_audio_after_eos]
Only set when type is duration_limit_exceeded. Indicates the limit that was exceeded (in seconds).
Error
The following are the possible error types:
Error Type | Description |
---|---|
invalid_message | The message received was not understood. |
invalid_model | Unable to use the model for the recognition. This can happen if the language is not supported at all, or is not available for the user. |
invalid_config | The config received contains some wrong or unsupported fields, or too many translation target languages were requested. |
invalid_audio_type | Audio type is not supported, is deprecated, or the audio_type is malformed. |
invalid_output_format | Output format is not supported, is deprecated, or the output_format is malformed. |
not_authorised | User was not recognised, or the API key provided is not valid. |
insufficient_funds | User doesn't have enough credits, or there is some other reason preventing the user from being charged for the job properly. |
not_allowed | User is not allowed to use this message (is not allowed to perform the action the message would invoke). |
job_error | Unable to do any work on this job, the server might have timed out etc. |
data_error | Unable to accept the data specified - usually because there is too much data being sent at once. |
buffer_error | Unable to fit the data in a corresponding buffer. This can happen for clients sending the input data faster than real-time. |
protocol_error | Message received was syntactically correct, but could not be accepted due to protocol limitations. This is usually caused by messages sent in the wrong order. |
quota_exceeded | Maximum number of concurrent connections allowed for the contract has been reached. |
timelimit_exceeded | Usage quota for the contract has been reached. |
idle_timeout | Idle duration limit was reached (no audio data sent within the last hour), a closing handshake with code 1008 follows this in-band error. |
session_timeout | Max session duration was reached (maximum session duration of 48 hours), a closing handshake with code 1008 follows this in-band error. |
session_transfer | An error while transferring session to another backend with the reason: Session transfer failed. This may occur when moving sessions due to backend maintenance operations or migration from a faulty backend. |
unknown_error | An error that did not fit any of the types above. |
invalid_message, protocol_error and unknown_error can be triggered as a response to any type of message.
Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, invalid_output_format, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, quota_exceeded, timelimit_exceeded, idle_timeout, session_timeout, session_transfer, unknown_error]
WebSocket errors
In the Real-time SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and associated error types. The error types are provided in the payload of the close message.
WebSocket Close Code | WebSocket Close Payload |
---|---|
1003 | protocol_error |
1008 | policy_violation |
1011 | internal_error |
4001 | not_authorised |
4003 | not_allowed |
4004 | invalid_model |
4005 | quota_exceeded |
4006 | timelimit_exceeded |
4013 | job_error |
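The table above can be kept client-side as a lookup, for example for logging or for deciding how to surface a closed session to the user (a sketch):

```python
# WebSocket close code -> close payload, per the table above.
CLOSE_PAYLOADS = {
    1003: "protocol_error",
    1008: "policy_violation",
    1011: "internal_error",
    4001: "not_authorised",
    4003: "not_allowed",
    4004: "invalid_model",
    4005: "quota_exceeded",
    4006: "timelimit_exceeded",
    4013: "job_error",
}

def describe_close(code: int) -> str:
    """Return the documented close payload for a WebSocket close code."""
    return CLOSE_PAYLOADS.get(code, "unknown close code")
```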