Real-Time API Reference
WebSocket Handshake
Handshake Request
When starting a Real-Time transcription session on the server, your API key can be provided in the WebSocket connection request header. For browser based transcription, see the notes on Browser based transcription.
On-demand SaaS customers should use the following endpoint to open a WebSocket connection:
wss://eu2.rt.speechmatics.com/v2
Enterprise customers should use one of our Supported Endpoints.
When implementing your WebSocket client, we recommend using a ping/pong timeout of at least 60 seconds and a ping interval of 20 to 60 seconds. More details about ping/pong messages can be found in the WebSocket RFC.
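For reference, here is a minimal sketch of opening the connection with the recommended ping/pong settings, using the third-party Python websockets package (the header argument is named extra_headers in older releases and additional_headers in newer ones, so check your installed version; the API key value is a placeholder):

import asyncio
import websockets

API_KEY = "YOUR_API_KEY"  # placeholder
URL = "wss://eu2.rt.speechmatics.com/v2"

async def connect():
    # Authenticate with the API key and keep the connection alive with pings.
    async with websockets.connect(
        URL,
        extra_headers={"Authorization": f"Bearer {API_KEY}"},
        ping_interval=30,  # recommended: 20-60 seconds
        ping_timeout=60,   # recommended: at least 60 seconds
    ) as ws:
        print("WebSocket handshake completed")
        # The StartRecognition message would be sent here (see Message Handling).

asyncio.run(connect())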
Browser based transcription
When starting a Real-Time transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.
To do so, you must provide the temporary key as a query parameter; this is due to a browser limitation (browsers cannot set custom headers on WebSocket connections). For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
Handshake Responses
Successful Response
101 Switching Protocols
- Switch to WebSocket protocol
Here is an example for a successful WebSocket handshake:
GET /v2/en HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
Malformed Request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized
- when the API key is not valid
405 Method Not Allowed
- when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
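As an illustrative sketch of such a retry policy (the helper below is hypothetical; how the close code is surfaced depends on your WebSocket client library):

import asyncio
import random

RETRYABLE_CLOSE_CODES = {4005, 4013, 1011}  # quota_exceeded, job_error, internal_error

async def run_with_retry(start_session, max_attempts=3):
    # start_session is any coroutine that raises an exception carrying the
    # WebSocket close code in a `code` attribute when the server closes early.
    for attempt in range(1, max_attempts + 1):
        try:
            return await start_session()
        except Exception as exc:
            code = getattr(exc, "code", None)
            if code in RETRYABLE_CLOSE_CODES and attempt < max_attempts:
                await asyncio.sleep(5 + random.random() * 5)  # 5-10 second retry interval
                continue
            raise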
Supported Audio Types
An AudioType object always has one mandatory field, type, and potentially more mandatory fields based on the value of type. The following types are supported:
type: "raw"
Raw audio samples, described by the following additional mandatory fields:
encoding (String): Encoding used to store individual audio samples. Currently supported values:
- pcm_f32le - Corresponds to 32-bit float PCM used in the WAV audio format, little-endian architecture. 4 bytes per sample.
- pcm_s16le - Corresponds to 16-bit signed integer PCM used in the WAV audio format, little-endian architecture. 2 bytes per sample.
- mulaw - Corresponds to 8-bit μ-law (mu-law) encoding. 1 byte per sample.
sample_rate (Int): Sample rate of the audio.
Please ensure that raw audio samples are never split when sending in real-time. For example, if you are sending raw audio via pcm_f32le, every sample must contain the full 4 bytes; if a sample is incomplete when an EndOfStream message is sent, an error will occur. Required byte sizes per sample for each type of raw audio are listed above.
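A minimal sketch of chunking raw audio so that samples are never split across AddAudio messages (the chunk size is an arbitrary example value):

BYTES_PER_SAMPLE = 4       # pcm_f32le; use 2 for pcm_s16le, 1 for mulaw
SAMPLES_PER_CHUNK = 4096   # example value

def chunk_raw_audio(raw: bytes):
    # Yield chunks whose length is always a whole number of samples.
    if len(raw) % BYTES_PER_SAMPLE != 0:
        raise ValueError("audio does not contain a whole number of samples")
    chunk_size = BYTES_PER_SAMPLE * SAMPLES_PER_CHUNK
    for offset in range(0, len(raw), chunk_size):
        yield raw[offset:offset + chunk_size]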
type: "file"
Any audio/video format supported by GStreamer. The AddAudio messages have to provide all the file contents, including any headers. The file is usually not accepted all at once, but segmented into reasonably sized messages.
Example audio_format field value:
audio_format: {type: "raw", encoding: "pcm_s16le", sample_rate: 44100}
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
message (String): The name of the message we are sending. Any other fields depend on the value of message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which will be referred to as AddAudio.
The following values of the message field are supported:
StartRecognition
Initiates recognition, based on details provided in the following fields:
message: "StartRecognition"
audio_format (Object:AudioType): Required. Audio stream type you are going to send: see Supported audio types.
transcription_config (Object:TranscriptionConfig): Required. Contains configuration for this recognition session, see Transcription config.
translation_config (Object:TranslationConfig): Optional. Contains configuration for enabling Translation, see Translation config.
audio_events_config (Object:AudioEventsConfig): Optional. Contains configuration for Audio Events, see Audio Events config.
A StartRecognition
message must be sent exactly once after the WebSocket connection is opened. The client must wait for a RecognitionStarted
message before sending any audio.
An example of the StartRecognition
message:
{
"message": "StartRecognition",
"audio_format": {
"type": "raw",
"encoding": "pcm_f32le",
"sample_rate": 16000
},
"transcription_config": {
"language": "en",
"operating_point": "enhanced",
"output_locale": "en-US",
"additional_vocab": ["gnocchi", "bucatini", "bigoli"],
"diarization": "speaker",
"enable_partials": true
},
"translation_config": {
"target_languages": ["es", "de"],
"enable_partials": true
},
"audio_events_config": {
"types": ["applause", "music"]
}
}
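As an illustration, here is a minimal sketch of sending StartRecognition over an already-open connection (ws, as in the handshake sketch above) and blocking until RecognitionStarted arrives; error handling is kept minimal:

import json

async def start_recognition(ws):
    await ws.send(json.dumps({
        "message": "StartRecognition",
        "audio_format": {"type": "raw", "encoding": "pcm_f32le", "sample_rate": 16000},
        "transcription_config": {"language": "en", "enable_partials": True},
    }))
    # Do not send any audio until RecognitionStarted has been received.
    while True:
        reply = json.loads(await ws.recv())
        if reply["message"] == "RecognitionStarted":
            return reply["id"]
        if reply["message"] == "Error":
            raise RuntimeError(reply.get("reason", "recognition failed to start"))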
RecognitionStarted
In case of a successful StartRecognition
attempt, a message with the following format is sent as a response:
message: "RecognitionStarted"
id (String): Required. A randomly-generated GUID which acts as an identifier for the session, e.g. "807670e9-14af-4fa2-9e8f-5d525c22156e".
language_pack_info (Object:LanguagePackInfo): Required. Useful metadata about the language being used for transcription.
An example of the RecognitionStarted
message:
{
"message": "RecognitionStarted",
"id": "807670e9-14af-4fa2-9e8f-5d525c22156e",
"language_pack_info": {
"adapted": false,
"itn": true,
"language_description": "English",
"word_delimiter": " ",
"writing_direction": "left-to-right"
}
}
Language pack info currently contains the following information:
adapted: Whether the language pack is adapted with Language Model Adaptation (an upcoming feature)
itn: Whether Inverse Text Normalization (ITN) is available for this language. ITN improves the formatting of entities in the text such as numerals and dates.
language_description: The full name of the language
word_delimiter: The character to put between words
writing_direction: left-to-right or right-to-left
In case of failure, an error message is sent, with type
being one of the following:
invalid_model, invalid_audio_type, not_authorised, insufficient_funds, not_allowed, job_error
The example above starts a session with the Global English model, ready to consume raw PCM-encoded audio with float samples at 16kHz. It also includes an additional_vocab list containing the names of different types of pasta. Speaker diarization is enabled, and partials are enabled, meaning that AddPartialTranscript messages will be received as well as AddTranscript messages.
AddAudio
Adds more audio data to the recognition job started on the WebSocket using StartRecognition
. The server will only accept audio after it is initialized with a job, which is indicated by a RecognitionStarted
message. Only one audio stream in one format is currently supported per WebSocket (and hence one recognition job). AddAudio
is a binary message containing a chunk of audio data and no additional metadata.
AudioAdded
If the AddAudio
message is successfully received, an AudioAdded message is sent as a response. This message confirms that the Server has accepted the data and will start transcription. If the Client implementation holds the data in an internal buffer to resubmit in case of an error, it can safely discard the corresponding data after this message. The following fields are present in the response:
message: "AudioAdded"
seq_no (Int): Required. An incrementing number which is equal to the number of audio chunks that the server has processed so far in the session. The count begins at 1, meaning that the 5th AddAudio message sent by the client, for example, should be answered by an AudioAdded message with seq_no equal to 5.
Possible errors: data_error, job_error, buffer_error
When sending audio faster than real time (for instance when sending files), make sure you don't send too much audio ahead of time. For large files, this causes the audio to fill out networking buffers, which might lead to disconnects due to WebSocket ping/pong timeout. Use AudioAdded messages to keep track of which messages have been processed by the engine, and don't send more than 10s of audio data or 500 individual AddAudio messages ahead of time (whichever is lower).
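A minimal sketch of this kind of client-side flow control, tracking sent chunks against the seq_no values acknowledged in AudioAdded messages (the cap below only counts messages; a production client should also cap by seconds of audio):

MAX_IN_FLIGHT = 500  # also cap at ~10 s of audio, whichever is lower

class FlowControl:
    def __init__(self):
        self.sent = 0    # AddAudio chunks sent so far
        self.acked = 0   # highest seq_no confirmed by AudioAdded

    def on_chunk_sent(self):
        self.sent += 1

    def on_audio_added(self, msg):
        self.acked = msg["seq_no"]

    def can_send(self):
        return (self.sent - self.acked) < MAX_IN_FLIGHT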
Implementation Details
Under special circumstances, such as when the client is sending the audio data faster than real time, the Server might read the data slower than the Client is sending it. The Server will not read a binary AddAudio message if it is larger than the internal audio buffer on the Server. In that case, the Server will read any other messages coming in on the WebSocket until enough space is made in the buffer. The Client will only receive the corresponding AudioAdded response message once the binary data is read. The WebSocket might eventually fill all the TCP buffers on the way, causing writes on the WebSocket to fail and the connection to be closed. The Client can use the bufferedAmount attribute of the WebSocket to prevent this.
AddTranscript
This message is sent from the Server to the Client, and contains part of the transcript. Each message corresponds to the audio since the last AddTranscript
message. These messages are also referred to as Finals since the transcript will not change any further.
An AddTranscript
message is sent when we reach an endpoint (end of a sentence or a phrase in the audio), or after the max_delay
. Any further AddTranscript
or AddPartialTranscript
messages will only correspond to the newly processed audio.
message: "AddTranscript"
metadata (Object): Required.
  start_time (Number): Required. The time (in seconds) of the audio corresponding to the beginning of the first word in the segment.
  end_time (Number): Required. The time (in seconds) of the audio corresponding to the ending of the final word in the segment.
  transcript (String): Required. The entire transcript contained in the segment in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text. This transcript lacks the detailed information, however, which is contained in the results field of the message - such as the timings and confidences for each word.
results (List:Object):
  type (String): Required. One of 'word', 'punctuation'. 'word' results represent a single word. 'punctuation' results represent a single punctuation symbol. 'word' and 'punctuation' results will both have one or more alternatives representing the possible alternatives we think the word or punctuation symbol could be.
  start_time (Number): Required. The time (in seconds) of the audio corresponding to the beginning of the result.
  end_time (Number): Required. The time (in seconds) of the audio corresponding to the end of the result. Note that punctuation symbol results are considered to be zero-duration and thus for those results start_time is equal to end_time.
  is_eos (Boolean): Optional. Only present for 'punctuation' results. This indicates whether the punctuation mark is considered an end-of-sentence symbol. For example, full-stops are an end-of-sentence symbol in English, whereas commas are not. Other languages, such as Japanese, may use different end-of-sentence symbols.
  alternatives (List:Object): Optional. For 'word' and 'punctuation' results this contains a list of possible alternative options for the word/symbol.
    content (String): Required. A word or punctuation mark.
    confidence (Number): Required. A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).
    display (Object): Optional. Information about how the word/symbol should be displayed.
      direction (String): Required. Either 'ltr' for words that should be displayed left-to-right, or 'rtl' for words that should be displayed right-to-left.
    language (String): Optional. The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition message.
    speaker (String): Optional. Label indicating who said that word. Only set if Diarization is enabled.
    tags (Array): Optional. Only [disfluency] and [profanity] are displayed. This is a set list of profanities and disfluencies respectively that cannot be altered by the end user. [disfluency] is only present in English, and [profanity] is present in English, Spanish, and Italian.
  volume (Number): Optional. If the content is a word, and audio filtering is enabled, this field is the volume of that word. Ranges from 0 to 100 but the upper limits are unlikely to be reached.
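For example, a minimal sketch of consuming an AddTranscript message, reading the ready-made text from metadata and the per-word detail from results:

def on_add_transcript(msg):
    # Pre-formatted text for display:
    text = msg["metadata"]["transcript"]
    # Per-word detail (timings, confidence, optional speaker) from results:
    words = [
        {
            "content": r["alternatives"][0]["content"],
            "start_time": r["start_time"],
            "end_time": r["end_time"],
            "confidence": r["alternatives"][0]["confidence"],
            "speaker": r["alternatives"][0].get("speaker"),
        }
        for r in msg.get("results", [])
        if r["type"] == "word" and r.get("alternatives")
    ]
    return text, words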
AddPartialTranscript
This message is sent from the Server to the Client. A partial transcript is a transcript that can be changed and expanded by a future AddTranscript
or AddPartialTranscript
message and corresponds to the part of audio since the last AddTranscript
message. For AddPartialTranscript
messages the confidence
field for alternatives
has no meaning and should not be relied on.
Partials will only be sent if transcription_config.enable_partials
is set to true
in the StartRecognition
message.
The message structure is the same as AddTranscript
, with a few limitations.
AddTranslation
This message is sent from the Server to the Client, and contains part of the translation, if a translation has been requested.
Each message corresponds to the audio since the last AddTranslation message. These messages are also referred to as Finals since the translation will not change any further.
An AddTranslation
message is sent when we reach the end of a sentence in the transcription. Any further AddTranslation
or Partial
messages will only correspond to the newly processed audio.
message: "AddTranslation"
language (String): Required. Language that the translation relates to.
results (List:Object):
  start_time (Number): Required. The start time (in seconds) of the original transcribed audio segment.
  end_time (Number): Required. The end time (in seconds) of the original transcribed audio segment.
  content (String): Required. The translated segment of speech.
  speaker (String): Optional. The speaker that uttered the speech if speaker diarization is enabled. See Transcription config.
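As a small illustrative sketch, final translation segments can be grouped per target language like this:

from collections import defaultdict

translations = defaultdict(list)  # target language -> list of translated segments

def on_add_translation(msg):
    translations[msg["language"]].extend(seg["content"] for seg in msg["results"])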
AddPartialTranslation
This message is sent from the Server to the Client. A partial translation is a translation that can be changed and expanded by a future AddTranslation
or AddPartialTranslation
message and corresponds to the part of audio since the last AddTranslation
message.
Partials will only be sent if translation_config.enable_partials
is set to true
in the StartRecognition
message.
The structure is the same as AddTranslation
except speakers are not included and message
is AddPartialTranslation
.
AudioEventStarted
This message is sent from the Server to the Client, and contains information about the start of an audio event. The message is sent when the audio event is detected in the audio stream. The message contains the following fields:
message: "AudioEventStarted"
type (String): Required. The type of audio event that has started or ended. The possible values are defined in the audio_events_config object in the StartRecognition message.
start_time (Number): Required. The time (in seconds) of the audio corresponding to the beginning of the audio event.
confidence (Number): Required. A confidence score assigned to the audio event. Ranges from 0.0 (least confident) to 1.0 (most confident).
AudioEventEnded
This message is sent from the Server to the Client, and contains information about the end of an audio event. The message is sent when the audio event is no longer detected in the audio stream.
message: "AudioEventEnded"
type (String): Required. The type of audio event that has started or ended. The possible values are defined in the audio_events_config object in the StartRecognition message.
end_time (Number): Required. The time (in seconds) of the audio corresponding to the end of the audio event.
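A minimal sketch of tracking which audio events are currently active from these two messages:

active_events = {}  # event type -> start_time of the ongoing event

def on_audio_event(msg):
    if msg["message"] == "AudioEventStarted":
        active_events[msg["type"]] = msg["start_time"]
    elif msg["message"] == "AudioEventEnded":
        started = active_events.pop(msg["type"], msg["end_time"])
        print(f"{msg['type']} lasted {msg['end_time'] - started:.1f}s")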
SetRecognitionConfig
Allows the Client to configure the recognition session even after the initial StartRecognition
message without restarting the connection. This is only supported for certain parameters.
message: "SetRecognitionConfig"
transcription_config (Object:TranscriptionConfig): A TranscriptionConfig object containing the new configuration for the session, see Transcription config.
The following is an example of such a configuration message:
{
"message": "SetRecognitionConfig",
"transcription_config": {
"language": "en",
"operating_point": "enhanced",
"max_delay": 3.5,
"enable_partials": true
}
}
Note: The language
property is a mandatory element in the transcription_config
object; however it is not possible to change the language midway through the session (it will be ignored if you do). It is only possible to modify the following settings through a SetRecognitionConfig message after the initial StartRecognition
message:
max_delay
max_delay_mode
enable_partials
If you wish to alter any other parameters you must terminate the session and restart with the altered configuration. Attempting otherwise will result in an error.
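As a sketch, adjusting one of the permitted parameters mid-session looks like this (ws is the open connection from the earlier sketches; language is repeated because it is mandatory, but it cannot actually be changed):

import json

async def set_max_delay(ws, language, max_delay):
    await ws.send(json.dumps({
        "message": "SetRecognitionConfig",
        "transcription_config": {"language": language, "max_delay": max_delay},
    }))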
EndOfStream
This message is sent from the Client to the API to announce that it has finished sending all the audio that it intended to send. No more AddAudio messages are accepted after this message. The Server will finish processing the audio it has received already and then send an EndOfTranscript message. This message is usually sent at the end of a file or when the microphone input is stopped.
message: "EndOfStream"
last_seq_no (Int): Required. The total number of audio chunks sent (in the AddAudio messages).
EndOfTranscript
Sent from the API to the Client when the API has finished processing all the audio, as marked with the EndOfStream message. The API sends this only after it sends all the corresponding AddTranscript messages first. Upon receiving this message the Client can safely disconnect immediately because there will be no more messages coming from the API.
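A minimal sketch of this shutdown sequence, reusing the ws connection and json module from the earlier sketches:

async def finish(ws, chunks_sent):
    # Announce the number of AddAudio chunks sent, then drain remaining messages.
    await ws.send(json.dumps({"message": "EndOfStream", "last_seq_no": chunks_sent}))
    while True:
        msg = json.loads(await ws.recv())
        if msg["message"] == "AddTranscript":
            print(msg["metadata"]["transcript"], end="")
        elif msg["message"] == "EndOfTranscript":
            break  # safe to disconnect now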
Error Handling
Error, Warning and Info messages may be sent by the server to the client. When using the RT SaaS, there may also be other WebSockets error messages.
Error Messages
Error messages have the following fields:
message: "Error"
code (Int): Optional. A numerical code for the error.
type (String): Required. A code for the error message. See the list of possible errors below.
reason (String): Required. A human-readable reason for the error message.
Error Types
type: "invalid_message"
- The message received was not understood.
type: "invalid_model"
- Unable to use the model for the recognition. This can happen if the language is not supported at all, or is not available for the user.
type: "invalid_config"
- The config received contains some wrong/unsupported fields, or too many translation target languages were requested.
type: "invalid_audio_type"
- Audio type is not supported, is deprecated, or the audio_type is malformed.
type: "invalid_output_format"
- Output format is not supported, is deprecated, or the output_format is malformed.
type: "not_authorised"
- User was not recognised, or the API key provided is not valid.
type: "insufficient_funds"
- User doesn't have enough credits or any other reason preventing the user to be charged for the job properly.
type: "not_allowed"
- User is not allowed to use this message (is not allowed to perform the action the message would invoke).
type: "job_error"
- Unable to do any work on this job, the server might have timed out etc.
type: "data_error"
- Unable to accept the data specified - usually because there is too much data being sent at once
type: "buffer_error"
- Unable to fit the data in a corresponding buffer. This can happen for clients sending the input data faster than real-time.
type: "protocol_error"
- Message received was syntactically correct, but could not be accepted due to protocol limitations. This is usually caused by messages sent in the wrong order.
type: "quota_exceeded"
- Maximum number of concurrent connections allowed for the contract has been reached
type: "timelimit_exceeded"
- Usage quota for the contract has been reached
type: "unknown_error"
- An error that did not fit any of the types above.
Note that invalid_message, protocol_error and unknown_error can be triggered as a response to any type of message.
After any error, the transcription is terminated and the connection is closed.
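A small sketch of handling these status messages on the client (Error is fatal per the note above; Warning and Info can be logged and ignored):

def handle_status_message(msg):
    kind = msg.get("message")
    if kind == "Error":
        # The session is terminated by the server after any error.
        raise RuntimeError(f"{msg.get('type')}: {msg.get('reason')}")
    if kind in ("Warning", "Info"):
        print(f"{kind}: {msg.get('type')} - {msg.get('reason')}")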
WebSockets Errors
In the Real-time SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and associated error types. The error types are provided in the payload of the close message.
WebSocket Close Code | WebSocket Close Payload |
---|---|
1003 | protocol_error |
1011 | internal_error |
4001 | not_authorised |
4003 | not_allowed |
4004 | invalid_model |
4005 | quota_exceeded |
4006 | timelimit_exceeded |
4013 | job_error |
There may be other WebSockets close codes not mentioned here, such as code 1006 for abnormal_close. This may happen when the underlying TCP connection is unexpectedly terminated, for instance due to network issues.
A full list of WebSockets error codes can be downloaded here.
Warning
Warning messages have the following fields:
message: "Warning"
code (Int): Optional. A numerical code for the warning.
type (String): Required. A code for the warning message. See the list of possible warnings below.
reason (String): Required. A human-readable reason for the warning message.
Warning Types
type: "duration_limit_exceeded"
- The maximum allowed duration of a single utterance to process has been exceeded. Any AddAudio messages received that exceed this limit are confirmed with AudioAdded, but are ignored by the transcription engine. Exceeding the limit triggers the same mechanism as receiving an
EndOfStream
message, so the Server will eventually send anEndOfTranscript
message and suspend. - It has the following extra field:
duration_limit
(Number): The limit that was exceeded (in seconds).
- The maximum allowed duration of a single utterance to process has been exceeded. Any AddAudio messages received that exceed this limit are confirmed with AudioAdded, but are ignored by the transcription engine. Exceeding the limit triggers the same mechanism as receiving an
type: "unsupported_translation_pair"
- One of the requested translation target languages is unsupported (given the source audio language).
- The error message specifies the unsupported language pair.
Info
Info messages denote additional information sent from the Server to the Client. They are similar to Error and Warning messages in syntax, but don't actually denote any problem. The Client can safely ignore these messages or use them for additional client-side logging.
message: "Info"
code (Int): Optional. A numerical code for the informational message.
type (String): Required. A code for the info message. See the list of possible info messages below.
reason (String): Required. A human-readable reason for the informational message.
Info Message Types
type: "recognition_quality"
- Informs the client what particular quality-based model is used to handle the recognition.
- It has the following extra field:
quality (String): Quality-based model name. It is one of "telephony", "broadcast". The model is selected automatically: for high-quality audio (12kHz+) the broadcast model is used; for lower-quality audio the telephony model is used.
type: "model_redirect"
- Informs the client that a deprecated language code has been specified, and will be handled with a different model. For example, if the
model
parameter is set to one of en-US, en-GB, or en-AU, then the request may be internally redirected to the Global English model (en).
- Informs the client that a deprecated language code has been specified, and will be handled with a different model. For example, if the
type: "deprecated"
- Informs about using a feature that is going to be removed in a future release.
Configuration Settings
JSON config objects that are sent by the client
to the server
as part of the StartRecognition
or SetRecognitionConfig
messages.
Transcription Config
A TranscriptionConfig
object specifies various configuration values for the transcription. All fields except language
are optional, using default values when omitted.
language (String): Required. Language model to process the audio input, normally specified as an ISO language code, e.g., 'en'. The value must be consistent with the language code used in the API endpoint URL.
additional_vocab (List:AdditionalWord): Optional. Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab, especially if you use a large word list. When initializing a session that uses additional_vocab in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
diarization (String): Optional. The Speaker Diarization method to apply to the audio. The default is "none", indicating that no diarization will be performed.
speaker_diarization_config (Object): Optional. Allows you to prevent too many speakers from being detected by using the max_speakers setting. The default value for max_speakers is 50. The minimum and maximum values are 2 and 100 inclusive. See Max Speakers.
enable_partials (Boolean): Optional. Whether or not to send partials (i.e. AddPartialTranscript messages) as well as finals (i.e. AddTranscript messages). The default is false.
max_delay (Number): Optional. Maximum delay in seconds between receiving input audio and returning final transcription results. The default is 10. The minimum and maximum values are 0.7 and 20. See Real-Time Latency for more details.
max_delay_mode (String): Optional. Allowed values are fixed and flexible. The default is flexible. Where an entity is detected when using flexible, the latency of a transcript may exceed the threshold specified in max_delay to allow recognition of entities to be more accurate. When using fixed, the transcript will be returned in segments that will never exceed the max_delay threshold even if this results in inaccuracies in entity recognition.
output_locale (String): Optional. Configure output locale. See Output Locale. Default is an empty string.
punctuation_overrides (Object): Optional. Options for controlling punctuation in the output transcripts. See Punctuation Settings.
operating_point (String): Optional. Which model within the language pack you wish to use for transcription, with a choice between standard and enhanced. See Requesting an Enhanced Model for more details.
enable_entities (Boolean): Optional. Whether a user wishes for entities to be identified with additional spoken and written word format. Supported values are true or false. The default is false.
audio_filtering_config (Object): Optional. Puts a lower limit on the volume of processed audio by using the volume_threshold setting. See Audio Filtering.
transcript_filtering_config (Object): Optional. Removes disfluencies with the remove_disfluencies setting. See Disfluency Removal.
Translation Config
A TranslationConfig
object specifies various configuration values for translation. All fields except target_languages
are optional, using default values when omitted.
target_languages (Array): Required. List of languages to translate to from the source transcription language. Specified as an ISO Language Code.
enable_partials (Boolean): Optional. Whether or not to send Partials (i.e. AddPartialTranslation messages) as well as Finals (i.e. AddTranslation messages). The default is false.
WebSocket Call Flow
A basic WebSocket call flow looks like this:
1. The WebSocket client initiates a secure connection to the server with an HTTPS GET request (which is called an Upgrade request in WebSocket terminology). The Speechmatics Real-Time SaaS expects the API Key in an Authorization header to authenticate the client.
2. The client receives a 400 Bad Request or 405 Method Not Allowed HTTP response if the handshake request is malformed.
3. If the handshake is successful, the server sends a 101 Switching Protocols response to the client and upgrades the connection to the WebSocket protocol.
4. After a successful handshake, the server attempts to authenticate and connect to the transcription backend. The server could send one of the following errors (as an in-band message followed by a WebSocket close handshake):
   - 4001 not_authorised - authentication failed, usually because the API Key is invalid
   - 4003 not_allowed - forbidden, typically because the API Key is expired
   - 4004 invalid_model - an unsupported language was requested
   - 4005 quota_exceeded - the maximum number of concurrent connections allowed for the contract has been reached
   - 4006 timelimit_exceeded - the usage quota for the contract has been reached
   - 4013 job_error or 1011 internal_error - the service unexpectedly fails to start transcription
5. Once the WebSocket connection is established, the client starts sending the sequence of audio frames to the server. The server starts transcribing the audio and sends the transcription asynchronously to the client.
6. When the client does not want to send any more audio, it sends an EndOfStream message to the server.
7. The client waits for the final EndOfTranscript message from the server before disconnecting.
Example Communication
The communication consists of 3 stages - initialization, transcription and a disconnect handshake.
On initialization, the StartRecognition
message is sent from the Client to the API and the Client must block and wait until it receives a RecognitionStarted
message.
Afterwards, the transcription stage happens. The client keeps sending AddAudio messages. The API asynchronously replies with AudioAdded messages. The API also asynchronously sends AddPartialTranscript messages (if partials are enabled) and AddTranscript messages.
Once the Client doesn't want to send any more audio, the disconnect handshake is performed. The Client sends an EndOfStream
message as its last message. No more messages are handled by the API afterwards. The API processes whatever audio it has buffered at that point and sends all the AddTranscript
and AddPartialTranscript
messages accordingly. Once the API processes all the buffered audio, it sends an EndOfTranscript
message and the Client can then safely disconnect.
Note: In the example below, -> denotes a message sent by the Client to the API, <- denotes a message sent by the API to the Client. Any comments are denoted "[like this]".
-> {"message": "StartRecognition", "audio_format": {"type": "file"},
"transcription_config": {"language": "en", "enable_partials": true}}
<- {"message": "RecognitionStarted", "id": "807670e9-14af-4fa2-9e8f-5d525c22156e"}
-> "[binary message - AddAudio 1]"
-> "[binary message - AddAudio 2]"
<- {"message": "AudioAdded", "seq_no": 1}
<- {"message": "Info", "type": "recognition_quality", "quality": "broadcast", "reason": "Running recognition using a broadcast model quality."}
<- {"message": "AudioAdded", "seq_no": 2}
-> "[binary message - AddAudio 3]"
<- {"message": "AudioAdded", "seq_no": 3}
"[asynchronously received transcripts:]"
<- {"message": "AddPartialTranscript", "metadata": {"start_time": 0.0, "end_time": 0.5399999618530273, "transcript": "One"},
"results": [{"alternatives": [{"confidence": 0.0, "content": "One"}],
"start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"}
]}
<- {"message": "AddPartialTranscript", "metadata": {"start_time": 0.0, "end_time": 0.7498992613545260, "transcript": "One to"},
"results": [{"alternatives": [{"confidence": 0.0, "content": "One"}],
"start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"},
{"alternatives": [{"confidence": 0.0, "content": "to"}],
"start_time": 0.6091238623430891, "end_time": 0.7498992613545260, "type": "word"}
]}
<- {"message": "AddPartialTranscript", "metadata": {"start_time": 0.0, "end_time": 0.9488123643240011, "transcript": "One to three"},
"results": [{"alternatives": [{"confidence": 0.0, "content": "One"}],
"start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"},
{"alternatives": [{"confidence": 0.0, "content": "to"}],
"start_time": 0.6091238623430891, "end_time": 0.7498992613545260, "type": "word"}
{"alternatives": [{"confidence": 0.0, "content": "three"}],
"start_time": 0.8022338627780892, "end_time": 0.9488123643240011, "type": "word"}
]}
<- {"message": "AddTranscript", "metadata": {"start_time": 0.0, "end_time": 0.9488123643240011, "transcript": "One two three."},
"results": [{"alternatives": [{"confidence": 1.0, "content": "One"}],
"start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"},
{"alternatives": [{"confidence": 1.0, "content": "to"}],
"start_time": 0.6091238623430891, "end_time": 0.7498992613545260, "type": "word"}
{"alternatives": [{"confidence": 0.96, "content": "three"}],
"start_time": 0.8022338627780892, "end_time": 0.9488123643240011, "type": "word"}
{"alternatives": [{"confidence": 1.0, "content": "."}],
"start_time": 0.9488123643240011, "end_time": 0.9488123643240011, "type": "punctuation", "is_eos": true}
]}
"[closing handshake]"
-> {"message":"EndOfStream","last_seq_no":3}
<- {"message": "EndOfTranscript"}