Realtime API Reference
GET wss://eu2.rt.speechmatics.com/v2/
Protocol overview
A basic Realtime session will have the following message exchanges:
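In outline (each message is defined later in this reference):
1. The client opens the WebSocket connection and sends StartRecognition.
2. The server replies with RecognitionStarted.
3. The client streams binary AddAudio chunks; the server acknowledges each one with AudioAdded.
4. The server sends AddPartialTranscript messages (if Partials are enabled) and AddTranscript messages as results become available.
5. The client sends EndOfStream when it has no more audio to send.
6. The server sends EndOfTranscript after the last AddTranscript, and the connection can then be closed.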
Browser based transcription
When starting a Realtime transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.
Browsers cannot set an Authorization header on a WebSocket connection, so you must provide the temporary key as a query parameter instead. For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
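A minimal browser-side connection using the native WebSocket API might look like the following sketch; temporaryKey and the message handling are placeholders:

// Assumes a temporary key has already been fetched from your own backend.
const temporaryKey = "<temporary-key>";

const socket = new WebSocket(
  `wss://eu2.rt.speechmatics.com/v2?jwt=${encodeURIComponent(temporaryKey)}`
);

socket.addEventListener("open", () => {
  // Send StartRecognition here (see Sent messages below).
});

socket.addEventListener("message", (event) => {
  // Messages from the server are stringified JSON.
  const msg = JSON.parse(event.data as string);
  console.log(msg.message, msg);
});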
Handshake responses
Successful Response
101 Switching Protocols - switch to the WebSocket protocol
Here is an example for a successful WebSocket handshake:
GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
Malformed request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
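A minimal reconnection sketch for these close codes; the connectWithRetry helper and the 7 second delay are illustrative choices within the recommended 5-10 second range:

const RETRYABLE_CLOSE_CODES = new Set([4005, 4013, 1011]);
const RETRY_DELAY_MS = 7_000;

function connectWithRetry(url: string): void {
  const socket = new WebSocket(url);

  socket.addEventListener("close", (event) => {
    if (RETRYABLE_CLOSE_CODES.has(event.code)) {
      // quota_exceeded, job_error or internal_error: retry after a pause.
      setTimeout(() => connectWithRetry(url), RETRY_DELAY_MS);
    }
    // Other close codes are not retried here.
  });
}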
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
message (String): The name of the message being sent. Any other fields depend on the value of message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which will be referred to as AddAudio.
The following values of the message field are supported:
Sent messages
StartRecognition
Initiates a new recognition session.
StartRecognition
audio_format object required
- Raw
- File
Raw audio samples, described by the following additional mandatory fields:
raw
Possible values: [pcm_f32le, pcm_s16le, mulaw]
The sample rate of the audio in Hz.
Example: {"type":"raw","encoding":"pcm_s16le","sample_rate":44100}
Choose this option to send audio encoded in a recognized format. The AddAudio messages have to provide all the file contents, including any headers. The file is usually not accepted all at once, but segmented into reasonably sized messages.
Note: Only the following formats are supported: wav, mp3, aac, ogg, mpeg, amr, m4a, mp4, flac
file
transcription_config object required
Contains configuration for this recognition session.
Language model to process the audio input, normally specified as an ISO language code. The value must be consistent with the language code used in the API endpoint URL.
en
Request a specialized model based on 'language' but optimized for a particular field, e.g. finance or medical.
Configure locale for outputted transcription. See output formatting.
Possible values: non-empty
additional_vocab object[]
Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab
, especially if you use a large word list. When initializing a session that uses additional_vocab
in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
- String
- Object
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
Set to speaker
to apply Speaker Diarization to the audio.
Possible values: [none, speaker]
none
This is the delay in seconds between the end of a spoken word and returning the Final transcript results. See Latency for more details
Possible values: >= 0.7 and <= 4
4
This allows some additional time for Smart Formatting.
Possible values: [flexible, fixed]
flexible
speaker_diarization_config object
Configure the maximum number of speakers to detect. See Max Speakers.
Possible values: >= 2 and <= 100
50
When set to true
, reduces the likelihood of incorrectly switching between similar sounding speakers.
See Prefer Current Speaker.
false
Possible values: >= 0 and <= 1
audio_filtering_config object
Puts a lower limit on the volume of processed audio by using the volume_threshold
setting. See Audio Filtering.
Possible values: >= 0 and <= 100
transcript_filtering_config object
When set to true
, removes disfluencies from the transcript. See Removing disfluencies
replacements object[]
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string. See Word replacement
Whether or not to send Partials (i.e. AddPartialTranscript messages) as well as Finals (i.e. AddTranscript messages). See Partial transcripts.
false
true
Which model you wish to use. See Operating points for more details.
Possible values: [standard, enhanced]
standard
punctuation_overrides object
Options for controlling punctuation in the output transcripts. See Punctuation Settings
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger
is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance
message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
0
translation_config object
Specifies various configuration values for translation. All fields except target_languages
are optional, using default values when omitted.
List of languages to translate to from the source transcription language. Specified as an ISO Language Code.
Whether or not to send Partials (i.e. AddPartialTranslation
messages) as well as Finals (i.e. AddTranslation
messages).
false
audio_events_config object
Contains configuration for Audio Events
List of Audio Event types to enable.
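Putting some of the options above together, an illustrative StartRecognition message might be sent as follows; all values are examples rather than defaults, only configuration blocks named in this reference are shown, and socket is an open WebSocket:

socket.send(JSON.stringify({
  message: "StartRecognition",
  audio_format: { type: "raw", encoding: "pcm_s16le", sample_rate: 44100 },
  transcription_config: {
    language: "en",
    enable_partials: true,
    max_delay: 2,
  },
  translation_config: { target_languages: ["es"] },
}));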
AddAudio
A binary chunk of audio. The server confirms receipt by sending an AudioAdded message.
EndOfStream
Declares that the client has no more audio to send.
EndOfStream
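A sketch of streaming audio and then ending the stream; the chunking is illustrative, and the last_seq_no field (assumed here to count the AddAudio chunks sent) should be checked against the EndOfStream schema:

// Stream raw PCM audio as binary AddAudio frames, then signal EndOfStream.
function streamAudio(socket: WebSocket, pcmChunks: ArrayBuffer[]): void {
  let seqNo = 0;

  for (const chunk of pcmChunks) {
    socket.send(chunk); // a binary frame is an AddAudio message
    seqNo += 1;         // the server acknowledges each chunk with AudioAdded
  }

  // last_seq_no is an assumption for illustration: the number of chunks sent.
  socket.send(JSON.stringify({ message: "EndOfStream", last_seq_no: seqNo }));
}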
SetRecognitionConfig
Allows the client to re-configure the recognition session.
Only the following fields can be set through a SetRecognitionConfig message:
max_delay
max_delay_mode
enable_partials
If you wish to alter any other parameters you must terminate the session and restart with the altered configuration. Attempting otherwise will result in an error.
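For example, an illustrative mid-session reconfiguration touching only those fields; the language value simply repeats the session's original language (transcription_config is a required object in this message):

socket.send(JSON.stringify({
  message: "SetRecognitionConfig",
  transcription_config: {
    language: "en",       // same language as the original StartRecognition
    max_delay: 3,
    enable_partials: false,
  },
}));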
SetRecognitionConfig
transcription_config object required
Contains configuration for this recognition session.
Language model to process the audio input, normally specified as an ISO language code. The value must be consistent with the language code used in the API endpoint URL.
en
Request a specialized model based on 'language' but optimized for a particular field, e.g. finance or medical.
Configure locale for outputted transcription. See output formatting.
Possible values: non-empty
additional_vocab object[]
Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab
, especially if you use a large word list. When initializing a session that uses additional_vocab
in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
- String
- Object
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
Set to speaker
to apply Speaker Diarization to the audio.
Possible values: [none, speaker]
none
This is the delay in seconds between the end of a spoken word and returning the Final transcript results. See Latency for more details
Possible values: >= 0.7 and <= 4
4
This allows some additional time for Smart Formatting.
Possible values: [flexible, fixed]
flexible
speaker_diarization_config object
Configure the maximum number of speakers to detect. See Max Speakers.
Possible values: >= 2 and <= 100
50
When set to true
, reduces the likelihood of incorrectly switching between similar sounding speakers.
See Prefer Current Speaker.
false
Possible values: >= 0 and <= 1
audio_filtering_config object
Puts a lower limit on the volume of processed audio by using the volume_threshold
setting. See Audio Filtering.
Possible values: >= 0 and <= 100
transcript_filtering_config object
When set to true
, removes disfluencies from the transcript. See Removing disfluencies
replacements object[]
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string. See Word replacement
Whether or not to send Partials (i.e. AddPartialTranscript messages) as well as Finals (i.e. AddTranscript messages). See Partial transcripts.
false
true
Which model you wish to use. See Operating points for more details.
Possible values: [standard, enhanced]
standard
punctuation_overrides object
Options for controlling punctuation in the output transcripts. See Punctuation Settings
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger
is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance
message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
0
Received messages
RecognitionStarted
Server response to StartRecognition, acknowledging that a recognition session has started.
RecognitionStarted
AudioAdded
Server response to AddAudio, indicating that audio has been added successfully.
When clients send audio faster than real-time, the server may read data slower than it's sent. If binary AddAudio messages exceed the server's internal buffer, the server will process other WebSocket messages until buffer space is available. Clients receive AudioAdded responses only after binary data is read. This can fill TCP buffers, potentially causing WebSocket write failures and connection closure with prejudice. Clients can monitor the WebSocket's bufferedAmount
attribute to prevent this.
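A minimal client-side throttling sketch based on bufferedAmount; the byte threshold and poll interval are arbitrary choices:

// Pause sending when too much data is queued locally, so the server-side
// audio buffer is not overwhelmed.
const MAX_BUFFERED_BYTES = 1_000_000;

async function sendWhenReady(socket: WebSocket, chunk: ArrayBuffer): Promise<void> {
  while (socket.bufferedAmount > MAX_BUFFERED_BYTES) {
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  socket.send(chunk);
}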
AudioAdded
AddPartialTranscript
A partial transcript is a transcript that can be changed in a future AddPartialTranscript
as more words are spoken until the AddTranscript
Final message is sent for that audio.
Partials will only be sent if transcription_config.enable_partials
is set to true
in the StartRecognition
message.
The message structure is the same as AddTranscript
, with a few limitations.
For AddPartialTranscript
messages the confidence
field for alternatives
has no meaning and should not be relied on.
AddPartialTranscript
Speechmatics JSON output format version number.
2.1
metadata object required
The entire transcript contained in the segment in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text.
However, this transcript lacks the detailed information contained in the results field of the message, such as the timings and confidences for each word.
results object[] required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
A word or punctuation mark.
A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).
The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition
message.
display object
Information about how the word/symbol should be displayed.
Either ltr for words that should be displayed left-to-right, or rtl for words that should be displayed right-to-left.
Possible values: [ltr, rtl]
Label indicating who said that word. Only set if diarization is enabled.
This is a set list of profanities and disfluencies that cannot be altered by the end user. [disfluency] is only present in English, and [profanity] is present in English, Spanish, and Italian.
Possible values: [disfluency, profanity]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
AddTranscript
Contains the final transcript of a part of the audio that the client has sent.
AddTranscript
Speechmatics JSON output format version number.
2.1
metadata object required
The entire transcript contained in the segment in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text.
However, this transcript lacks the detailed information contained in the results field of the message, such as the timings and confidences for each word.
results object[] required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
A word or punctuation mark.
A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).
The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition
message.
display object
Information about how the word/symbol should be displayed.
Either ltr for words that should be displayed left-to-right, or rtl for words that should be displayed right-to-left.
Possible values: [ltr, rtl]
Label indicating who said that word. Only set if diarization is enabled.
This is a set list of profanities and disfluencies that cannot be altered by the end user. [disfluency] is only present in English, and [profanity] is present in English, Spanish, and Italian.
Possible values: [disfluency, profanity]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
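One common consumption pattern is to keep a committed transcript built from Finals and overlay the latest Partial on top. A sketch, assuming metadata.transcript is the block-of-text field described above and socket is the open WebSocket:

let finalTranscript = "";  // built from AddTranscript messages
let currentPartial = "";   // replaced by each AddPartialTranscript

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);

  if (msg.message === "AddTranscript") {
    finalTranscript += msg.metadata.transcript; // committed, will not change
    currentPartial = "";
  } else if (msg.message === "AddPartialTranscript") {
    currentPartial = msg.metadata.transcript;   // may change until the Final arrives
  }

  render(finalTranscript + currentPartial);     // render() is a placeholder
});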
AddPartialTranslation
Contains a work-in-progress translation of a part of the audio that the client has sent.
AddPartialTranslation
Speechmatics JSON output format version number.
2.1
Language the translation relates to, given as an ISO language code.
results object[] required
The start time (in seconds) of the original transcribed audio segment
The end time (in seconds) of the original transcribed audio segment
The speaker that uttered the speech if speaker diarization is enabled
AddTranslation
Contains the final translation of a part of the audio that the client has sent.
AddTranslation
Speechmatics JSON output format version number.
2.1
Language the translation relates to, given as an ISO language code.
results object[] required
The start time (in seconds) of the original transcribed audio segment
The end time (in seconds) of the original transcribed audio segment
The speaker that uttered the speech if speaker diarization is enabled
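A sketch of consuming final translations for a single target language; the language field name and the per-result content field are assumptions based on the descriptions above:

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message !== "AddTranslation") return;

  for (const segment of msg.results) {
    // start_time / end_time refer to the original transcribed audio segment;
    // content (assumed field name) holds the translated text.
    console.log(`[${msg.language}] ${segment.start_time}-${segment.end_time}: ${segment.content}`);
  }
});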
EndOfTranscript
Server response to EndOfStream
, after the server has finished sending all AddTranscript messages.
EndOfTranscript
AudioEventStarted
Start of an audio event detected.
AudioEventStarted
event object required
The type of audio event that has started or ended. See our list of supported Audio Event types.
The time (in seconds) of the audio corresponding to the beginning of the audio event.
A confidence score assigned to the audio event. Ranges from 0.0 (least confident) to 1.0 (most confident).
Possible values: >= 0 and <= 1
AudioEventEnded
End of an audio event detected.
AudioEventEnded
event object required
The type of audio event that has started or ended. See our list of supported Audio Event types.
EndOfUtterance
Indicates the end of an utterance, triggered by a configurable period of non-speech.
The message is sent when no speech has been detected for a short period of time, configurable by the end_of_utterance_silence_trigger
parameter in conversation_config
(see End Of Utterance).
Like punctuation, an EndOfUtterance
has zero duration.
EndOfUtterance
metadata object required
The time (in seconds) that the end of utterance was detected.
The time (in seconds) that the end of utterance was detected.
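An illustrative way to enable and consume this message; the 0.8 second trigger is just an example value:

// In StartRecognition (shown here inside transcription_config, per the layout above):
//   "conversation_config": { "end_of_utterance_silence_trigger": 0.8 }

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "EndOfUtterance") {
    // The speaker has paused for at least the configured trigger; treat the
    // transcript received so far as a completed utterance.
    console.log("Utterance ended at", msg.metadata);
  }
});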
Info
Additional information sent from the server to the client.
Info
The following are the possible info types:
Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
Only set when type is recognition_quality. Quality-based model name, one of "telephony" or "broadcast". The model is selected automatically: for high-quality audio (12kHz+) the broadcast model is used; for lower-quality audio the telephony model is used.
Only set when type is concurrent_session_usage. Indicates the current usage (number of active concurrent sessions).
Only set when type is concurrent_session_usage. Indicates the current quota (maximum number of concurrent sessions allowed).
Only set when type is concurrent_session_usage. Indicates the timestamp of the most recent usage update, in the format YYYY-MM-DDTHH:MM:SSZ (UTC). This value is updated even when usage exceeds the quota, as it represents the most recent known data. In some cases, it may be empty or outdated due to internal errors preventing a successful update.
2025-03-25T08:45:31Z
Warning
Warning messages sent from the server to the client.
Warning
The following are the possible warning types:
Possible values: [duration_limit_exceeded, unsupported_translation_pair, idle_timeout, session_timeout, empty_translation_target_list, add_audio_after_eos]
Only set when type is duration_limit_exceeded. Indicates the limit that was exceeded (in seconds).
Error
Error messages sent from the server to the client. After any error, the transcription is terminated and the connection is closed.
Error
The following are the possible error types:
invalid_message, protocol_error and unknown_error can be triggered in response to any type of message.
Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, invalid_output_format, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, quota_exceeded, timelimit_exceeded, idle_timeout, session_timeout, session_transfer, unknown_error]
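Since any Error terminates the session, a client typically just records the type and then handles the subsequent close; a minimal sketch:

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.message === "Error") {
    // The transcription is terminated and the connection will be closed.
    console.error("Realtime error:", msg.type);
  }
});

socket.addEventListener("close", (event) => {
  // See Client Retry above: 4005, 4013 and 1011 are worth retrying after 5-10 seconds.
  console.warn("Connection closed:", event.code, event.reason);
});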
Websocket errors
In the Realtime SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and associated error types. The error types are provided in the payload of the close message.