Real-Time API Reference

WebSocket Handshake

Handshake Request

When starting a Real-Time transcription session on the server, your API key can be provided in the Authorization header of the WebSocket connection request. For browser-based transcription, see the notes on Temporary Tokens.

Here is the format of the URI to establish a WebSocket connection to the Real-Time API:

On-demand SaaS customers should use the following endpoint to open a WebSocket connection:

wss://eu2.rt.speechmatics.com/v2

Enterprise customers should use one of our Supported Endpoints.

When implementing your WebSocket client, we recommend using a ping/pong timeout of at least 60 seconds and a ping interval of 20 to 60 seconds. More details about ping/pong messages can be found in the WebSocket RFC.

Handshake Responses

Successful Response

101 Switching Protocols - Switch to WebSocket protocol

Here is an example for a successful WebSocket handshake:

GET /v2/en HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1

A successful response should look like:

HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=

Malformed Request

A malformed handshake request will result in one of the following HTTP responses:

400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET

Client Retry

Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:

4005 quota_exceeded
4013 job_error
1011 internal_error
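
The retry guidance above can be sketched as a small helper. This is a minimal sketch rather than part of the API: the names RETRYABLE_CLOSE_CODES, should_retry and retry_delay_seconds are hypothetical, and picking a random delay between 5 and 10 seconds is only one possible reading of the recommended interval.

```python
import random

# Close codes for which the text above recommends a client retry.
RETRYABLE_CLOSE_CODES = {
    4005: "quota_exceeded",
    4013: "job_error",
    1011: "internal_error",
}


def should_retry(close_code: int) -> bool:
    """Return True only for the close codes listed as retryable."""
    return close_code in RETRYABLE_CLOSE_CODES


def retry_delay_seconds(min_delay: float = 5.0, max_delay: float = 10.0) -> float:
    """Pick a random delay in [min_delay, max_delay]; the jitter is an assumption."""
    return random.uniform(min_delay, max_delay)
```

A client would typically check should_retry on the close code it receives, sleep for retry_delay_seconds(), and then repeat the handshake.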

Temporary Tokens

Speechmatics also allows you to generate a temporary token which can be used for authentication.

This is particularly useful for real-time transcription happening on an end-user's browser. It allows for lower latency transcription, with lower implementation effort and without exposing your long-lived API key.

You can generate a temporary token as shown below:

curl -L -X POST "https://mp.speechmatics.com/v1/api_keys?type=rt" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $API_KEY" \
     -d '{"ttl": 60}'

The temporary token's time to live can be between 60 and 3600 seconds. This is configured using the ttl property in the request above.

The temporary token can be used in place of an API key when making the Handshake Request.

Note that when starting a Real-Time transcription session in the browser, you must provide the temporary token as a query parameter, because the browser WebSocket API does not allow custom request headers to be set. For example:

 wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
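
For server-side code that hands a token to the browser, the curl command and the URL format above might translate to Python roughly as follows. This is a sketch under stated assumptions: build_token_request and browser_session_url are hypothetical helper names, the ttl bounds are treated as inclusive, and the request is only built, not sent, because the response format is not described here.

```python
import json
import urllib.request
from urllib.parse import urlencode

TOKEN_URL = "https://mp.speechmatics.com/v1/api_keys?type=rt"
RT_ENDPOINT = "wss://eu2.rt.speechmatics.com/v2"


def build_token_request(api_key: str, ttl: int = 60) -> urllib.request.Request:
    """Build (but do not send) the POST request for a temporary token."""
    if not 60 <= ttl <= 3600:
        raise ValueError("ttl must be between 60 and 3600 seconds")
    return urllib.request.Request(
        TOKEN_URL,
        data=json.dumps({"ttl": ttl}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def browser_session_url(temporary_token: str, endpoint: str = RT_ENDPOINT) -> str:
    """Append the temporary token as the jwt query parameter."""
    return f"{endpoint}?{urlencode({'jwt': temporary_token})}"
```

Passing the returned request to urllib.request.urlopen would send it; parsing the response is left out on purpose.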

Note: If you are an enterprise customer and would like to use temporary tokens, reach out to Support or speak to your Account Manager.

Supported Audio Types

An AudioType object always has one mandatory field type, and potentially more mandatory fields based on the value of type. The following types are supported:

type: "raw"

Raw audio samples, described by the following additional mandatory fields:

encoding (String): Encoding used to store individual audio samples. Currently supported values:
- pcm_f32le - Corresponds to 32-bit float PCM used in the WAV audio format, little-endian architecture. 4 bytes per sample.
- pcm_s16le - Corresponds to 16-bit signed integer PCM used in the WAV audio format, little-endian architecture. 2 bytes per sample.
- mulaw - Corresponds to 8-bit μ-law (mu-law) encoding. 1 byte per sample.
sample_rate (Int): Sample rate of the audio

When sending raw audio samples in real time, please ensure that the samples are undivided. For example, if you are sending raw audio as pcm_f32le, every sample must contain exactly 4 bytes. If a sample did not contain 4 bytes and an EndOfStream message were then sent, this would cause an error. The required number of bytes per sample for each raw encoding is listed above.
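
One way to keep samples undivided is to cut the audio only on sample boundaries. The following is a minimal sketch, assuming the audio is already available as a bytes object; BYTES_PER_SAMPLE and split_into_chunks are hypothetical names, and the byte sizes are the ones listed above.

```python
# Bytes per sample for each supported raw encoding (from the list above).
BYTES_PER_SAMPLE = {"pcm_f32le": 4, "pcm_s16le": 2, "mulaw": 1}


def split_into_chunks(audio: bytes, encoding: str, samples_per_chunk: int) -> list[bytes]:
    """Split raw audio into chunks that each hold whole samples only."""
    sample_size = BYTES_PER_SAMPLE[encoding]
    if len(audio) % sample_size != 0:
        raise ValueError("audio ends with a partial sample")
    chunk_bytes = sample_size * samples_per_chunk
    # Every slice starts and ends on a multiple of sample_size.
    return [audio[i:i + chunk_bytes] for i in range(0, len(audio), chunk_bytes)]
```

Each returned chunk could then be sent as one binary AddAudio message.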

type: "file"

Any audio/video format supported by GStreamer. The AddAudio messages have to provide all the file contents, including any headers. The file is usually not sent all at once, but split into reasonably sized messages.

Example audio_format field value: audio_format: {type: "raw", encoding: "pcm_s16le", sample_rate: 44100}

Message Handling

Each message that the Server accepts is a stringified JSON object with the following fields:

message (String): The name of the message we are sending. Any other fields depend on the value of the message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is a binary message sent from the Client to the Server containing a chunk of audio which will be referred to as AddAudio.

The following values of the message field are supported:

StartRecognition

Initiates recognition, based on details provided in the following fields:

message: "StartRecognition"
audio_format (Object:AudioType): Required. Audio stream type you are going to send: see Supported audio types.
transcription_config (Object:TranscriptionConfig): Required. Contains configuration for this recognition session, see Transcription config.
translation_config (Object:TranslationConfig): Optional. Contains configuration for enabling Translation, see Translation config.

A StartRecognition message must be sent exactly once after the WebSocket connection is opened. The client must wait for a RecognitionStarted message before sending any audio.

An example of the StartRecognition message:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "output_locale": "en-US",
    "additional_vocab": ["gnocchi", "bucatini", "bigoli"],
    "diarization": "speaker",
    "enable_partials": true
  },
  "translation_config": {
    "target_languages": ["es", "de"],
    "enable_partials": true
  }
}

RecognitionStarted

In case of a successful StartRecognition attempt, a message with the following format is sent as a response:

message: "RecognitionStarted"
id (String): Required. A randomly-generated GUID which acts as an identifier for the session. e.g. "807670e9-14af-4fa2-9e8f-5d525c22156e".
language_pack_info (Object:LanguagePackInfo): Required. Useful metadata about the language being used for transcription.

An example of the RecognitionStarted message:

{
  "message": "RecognitionStarted",
  "id": "807670e9-14af-4fa2-9e8f-5d525c22156e",
  "language_pack_info": {
    "adapted": false,
    "itn": true,
    "language_description": "English",
    "word_delimiter": " ",
    "writing_direction": "left-to-right"
  }
}

Language pack info currently contains the following information:

adapted: Whether the language pack is adapted with Language Model Adaptation (an upcoming feature)
itn: Whether Inverse Text Normalization (ITN) is available for this language. ITN improves the formatting of entities in the text such as numerals and dates.
language_description: The full name of the language
word_delimiter: The character to put between words
writing_direction: left-to-right or right-to-left

In case of failure, an error message is sent, with type being one of the following: invalid_model, invalid_audio_type, not_authorised, insufficient_funds, not_allowed, job_error.

The StartRecognition example above starts a session with the Global English model, ready to consume raw PCM encoded audio with float samples at 16kHz. It also includes an additional_vocab list containing the names of different types of pasta. Speaker diarization is enabled, and partials are enabled, meaning that AddPartialTranscript messages will be received as well as AddTranscript messages. The example does not set punctuation_overrides, so the default punctuation settings apply.

AddAudio

Adds more audio data to the recognition job started on the WebSocket using StartRecognition. The server will only accept audio after it is initialized with a job, which is indicated by a RecognitionStarted message. Only one audio stream in one format is currently supported per WebSocket (and hence one recognition job). AddAudio is a binary message containing a chunk of audio data and no additional metadata.

AudioAdded

If the AddAudio message is successfully received, an AudioAdded message is sent as a response. This message confirms that the Server has accepted the data and will start transcription. If the Client implementation holds the data in an internal buffer to resubmit in case of an error, it can safely discard the corresponding data after this message. The following fields are present in the response:

message: "AudioAdded"
seq_no (Int): Required. An incrementing number which is equal to the number of audio chunks that the server has processed so far in the session. The count begins at 1 meaning that the 5th AddAudio message sent by the client, for example, should be answered by an AudioAdded message with seq_no equal to 5.

Possible errors:

data_error, job_error, buffer_error

When sending audio faster than real time (for instance when sending files), make sure you don't send too much audio ahead of time. For large files, this causes the audio to fill up network buffers, which might lead to disconnects due to WebSocket ping/pong timeouts. Use AudioAdded messages to keep track of which messages have been processed by the engine, and don't send more than 10s of audio data or 500 individual AddAudio messages ahead of time (whichever is lower).
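
The limits above can be enforced with a small bookkeeping object on the client. This is a sketch, not an official client: SendWindow and its method names are hypothetical, and it assumes the caller knows the duration in seconds of each chunk it sends.

```python
from collections import deque

MAX_AHEAD_SECONDS = 10.0   # limit stated above
MAX_AHEAD_MESSAGES = 500   # limit stated above


class SendWindow:
    """Tracks AddAudio chunks that have been sent but not yet acknowledged."""

    def __init__(self) -> None:
        self._pending = deque()  # durations (seconds) of unacknowledged chunks
        self.sent = 0            # number of AddAudio messages sent so far
        self.acked = 0           # number of chunks confirmed by AudioAdded

    def can_send(self, chunk_seconds: float) -> bool:
        """True if sending one more chunk stays within both limits."""
        ahead_seconds = sum(self._pending) + chunk_seconds
        ahead_messages = len(self._pending) + 1
        return ahead_seconds <= MAX_AHEAD_SECONDS and ahead_messages <= MAX_AHEAD_MESSAGES

    def on_sent(self, chunk_seconds: float) -> None:
        """Record a chunk that has just been sent."""
        self._pending.append(chunk_seconds)
        self.sent += 1

    def on_audio_added(self, seq_no: int) -> None:
        """Drop every chunk up to and including seq_no from the pending queue."""
        while self.acked < seq_no and self._pending:
            self._pending.popleft()
            self.acked += 1
```

A sender would call can_send before each AddAudio message and wait for further AudioAdded messages whenever it returns False.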

Implementation Details

Under special circumstances, such as when the Client is sending audio data faster than real time, the Server might read the data more slowly than the Client sends it. The Server will not read the binary AddAudio message if it is larger than the internal audio buffer on the Server. In that case, the server will read any messages coming in on the WebSocket, until enough space is made in the buffer. The Client will only receive the corresponding AudioAdded response message once the binary data is read. The unread data might eventually fill all the TCP buffers on the way, at which point a write on the WebSocket can fail and the connection is closed abruptly. The Client can use the bufferedAmount attribute of the WebSocket to prevent this.

AddTranscript

This message is sent from the Server to the Client, and contains part of the transcript. Each message corresponds to the audio since the last AddTranscript message. These messages are also referred to as Finals since the transcript will not change any further. An AddTranscript message is sent when we reach an endpoint (end of a sentence or a phrase in the audio), or after the max_delay. Any further AddTranscript or AddPartialTranscript messages will only correspond to the newly processed audio.

message: "AddTranscript"
metadata (Object): Required.
- start_time (Number): Required. The time (in seconds) of the audio corresponding to the beginning of the first word in the segment.
- end_time (Number): Required. The time (in seconds) of the audio corresponding to the ending of the final word in the segment.
- transcript (String): Required. The entire transcript contained in the segment in text format. Providing the entire transcript here is designed for ease of consumption; we have taken care of all the necessary formatting required to concatenate the transcription results into a block of text. This transcript lacks the detailed information however which is contained in the results field of the message - such as the timings and confidences for each word.
results (List:Object):
- type (String): Required. One of 'word', 'punctuation'. 'word' results represent a single word. 'punctuation' results represent a single punctuation symbol. 'word' and 'punctuation' results will both have one or more alternatives representing the possible alternatives we think the word or punctuation symbol could be.
- start_time (Number): Required. The time (in seconds) of the audio corresponding to the beginning of the result.
- end_time (Number): Required. The time (in seconds) of the audio corresponding to the end of the result. Note that punctuation symbols results are considered to be zero-duration and thus for those results start_time is equal to end_time.
- is_eos (Boolean): Optional. Only present for 'punctuation' results. This indicates whether the punctuation mark is considered an end-of-sentence symbol. For example full-stops are an end-of-sentence symbol in English, whereas commas are not. Other languages, such as Japanese, may use different end-of-sentence symbols.
- alternatives (List:Object): Optional. For 'word' and 'punctuation' results this contains a list of possible alternative options for the word/symbol.
  - content (String): Required. A word or punctuation mark.
  - confidence (Number): Required. A confidence score assigned to the alternative. Ranges from 0.0 (least confident) to 1.0 (most confident).
  - display (Object): Optional. Information about how the word/symbol should be displayed.
    - direction (String): Required. Either 'ltr' for words that should be displayed left-to-right, or 'rtl' vice versa.
  - language (String): Optional. The language that the alternative word is assumed to be spoken in. Currently, this will always be equal to the language that was requested in the initial StartRecognition message.
  - speaker (String): Optional. Label indicating who said that word. Only set if Diarization is enabled.
  - tags (array): Optional. Only the [disfluency] and [profanity] tags are currently used. They are assigned from fixed lists of disfluencies and profanities respectively, which cannot be altered by the end user. [disfluency] is only available for English, and [profanity] is available for English, Spanish, and Italian.
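
To illustrate the structure described above, the following sketch pulls one alternative out of each result. It assumes the message has already been parsed from JSON into a dict, and it treats the first entry of alternatives as the preferred one, which is an assumption rather than something stated above; best_alternatives is a hypothetical name.

```python
def best_alternatives(message: dict) -> list[tuple[str, str, float]]:
    """Return (type, content, confidence) for the first alternative of each result."""
    items = []
    for result in message.get("results", []):
        alternatives = result.get("alternatives") or []
        if not alternatives:
            continue  # alternatives is optional, so skip results without it
        top = alternatives[0]
        items.append((result["type"], top["content"], top["confidence"]))
    return items
```

For plain text output, metadata.transcript is usually simpler to use; walking results is only needed for per-word timings and confidences.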

AddPartialTranscript

This message is sent from the Server to the Client. A partial transcript is a transcript that can be changed and expanded by a future AddTranscript or AddPartialTranscript message and corresponds to the part of audio since the last AddTranscript message. For AddPartialTranscript messages the confidence field for alternatives has no meaning and should not be relied on.

Partials will only be sent if transcription_config.enable_partials is set to true in the StartRecognition message.

The message structure is the same as AddTranscript.
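
Because a partial covers all audio since the last final, a client can keep one growing string of finals plus a single replaceable partial. The class below is a minimal sketch of that idea; TranscriptView is a hypothetical name, and messages are assumed to be already parsed from JSON.

```python
class TranscriptView:
    """Keeps finalized text plus the latest partial for display."""

    def __init__(self) -> None:
        self.final_text = ""
        self.partial_text = ""

    def handle(self, message: dict) -> None:
        kind = message.get("message")
        if kind == "AddTranscript":
            # Finals never change, so append them and clear the stale partial.
            self.final_text += message["metadata"]["transcript"]
            self.partial_text = ""
        elif kind == "AddPartialTranscript":
            # A partial covers all audio since the last final, so replace it.
            self.partial_text = message["metadata"]["transcript"]

    def text(self) -> str:
        """Return the text to display right now."""
        return self.final_text + self.partial_text
```

Finals are concatenated directly because metadata.transcript is described above as already formatted for concatenation.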

AddTranslation

This message is sent from the Server to the Client, and contains part of the translation, if a translation has been requested. Each message corresponds to the audio since the last AddTranslation message. These messages are also referred to as Finals since the translation will not change any further. An AddTranslation message is sent when we reach the end of a sentence in the transcription. Any further AddTranslation or AddPartialTranslation messages will only correspond to the newly processed audio.

message: "AddTranslation"
language (String): Required. Language translation relates to.
results (List:Object):
- start_time (Number): Required. The start time (in seconds) of the original transcribed audio segment.
- end_time (Number): Required. The end time (in seconds) of the original transcribed audio segment.
- content (String): Required. The translated segment of speech.
- speaker (String): Optional. The speaker that uttered the speech if speaker diarization is enabled. See Transcription config.

AddPartialTranslation

This message is sent from the Server to the Client. A partial translation is a translation that can be changed and expanded by a future AddTranslation or AddPartialTranslation message and corresponds to the part of audio since the last AddTranslation message.

Partials will only be sent if translation_config.enable_partials is set to true in the StartRecognition message.

The structure is the same as AddTranslation except speakers are not included and message is AddPartialTranslation.

SetRecognitionConfig

Allows the Client to configure the recognition session even after the initial StartRecognition message without restarting the connection. This is only supported for certain parameters.

message: "SetRecognitionConfig"
transcription_config (Object:TranscriptionConfig): A TranscriptionConfig object containing the new configuration for the session, see Transcription config.

The following is an example of such a configuration message:

{
  "message": "SetRecognitionConfig",
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "max_delay": 3.5,
    "enable_partials": true
  }
}

Note: The language property is a mandatory element in the transcription_config object; however it is not possible to change the language midway through the session (it will be ignored if you do). It is only possible to modify the following settings through a SetRecognitionConfig message after the initial StartRecognition message:

max_delay
max_delay_mode
enable_partials

If you wish to alter any other parameters you must terminate the session and restart with the altered configuration. Attempting otherwise will result in an error.
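
A client can guard against sending unsupported changes before the message leaves the process. The helper below is a sketch: set_recognition_config and MUTABLE_SETTINGS are hypothetical names, and it is deliberately stricter than the example above in that it only accepts language plus the three modifiable settings.

```python
import json

# Settings that may be changed after StartRecognition, as listed above.
MUTABLE_SETTINGS = {"max_delay", "max_delay_mode", "enable_partials"}


def set_recognition_config(language: str, **changes) -> str:
    """Build a SetRecognitionConfig message, rejecting settings that cannot change."""
    unsupported = set(changes) - MUTABLE_SETTINGS
    if unsupported:
        raise ValueError(f"cannot change mid-session: {sorted(unsupported)}")
    # language is mandatory in transcription_config, so it is always included.
    config = {"language": language, **changes}
    return json.dumps({"message": "SetRecognitionConfig", "transcription_config": config})
```

For example, set_recognition_config("en", max_delay=3.5) produces a message that changes only max_delay.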

EndOfStream

This message is sent from the Client to the API to announce that it has finished sending all the audio that it intended to send. No more AddAudio messages are accepted after this message. The Server will finish processing the audio it has received already and then send an EndOfTranscript message. This message is usually sent at the end of a file or when the microphone input is stopped.

message: "EndOfStream"
last_seq_no (Int): Required. The total number of audio chunks sent (in the AddAudio messages).

EndOfTranscript

Sent from the API to the Client when the API has finished processing all the audio, as marked with the EndOfStream message. The API sends this only after it has sent all the corresponding AddTranscript messages. Upon receiving this message the Client can safely disconnect immediately because there will be no more messages coming from the API.

Error Handling

Error, Warning and Info messages may be sent by the server to the client. When using the RT SaaS, there may also be other WebSockets error messages.

Error Messages

Error messages have the following fields:

message: "Error"
code (Int): Optional. A numerical code for the error.
type (String): Required. A code for the error message. See the list of possible errors below.
reason (String): Required. A human-readable reason for the error message.

Error Types

type: "invalid_message"
- The message received was not understood.
type: "invalid_model"
- Unable to use the model for the recognition. This can happen if the language is not supported at all, or is not available for the user.
type: "invalid_config"
- The config received contains some wrong/unsupported fields, or too many translation target languages were requested.
type: "invalid_audio_type"
- Audio type is not supported, is deprecated, or the audio_type is malformed.
type: "invalid_output_format"
- Output format is not supported, is deprecated, or the output_format is malformed.
type: "not_authorised"
- User was not recognised, or the API key provided is not valid.
type: "insufficient_funds"
- User doesn't have enough credits, or there is another reason preventing the user from being charged for the job properly.
type: "not_allowed"
- User is not allowed to use this message (is not allowed to perform the action the message would invoke).
type: "job_error"
- Unable to do any work on this job, the server might have timed out etc.
type: "data_error"
- Unable to accept the data specified - usually because there is too much data being sent at once
type: "buffer_error"
- Unable to fit the data in a corresponding buffer. This can happen for clients sending the input data faster than real-time.
type: "protocol_error"
- Message received was syntactically correct, but could not be accepted due to protocol limitations. This is usually caused by messages sent in the wrong order.
type: "quota_exceeded"
- Maximum number of concurrent connections allowed for the contract has been reached
type: "timelimit_exceeded"
- Usage quota for the contract has been reached
type: "unknown_error"
- An error that did not fit any of the types above.

Note that invalid_message, protocol_error and unknown_error can be triggered as a response to any type of message.

After any error, the transcription is terminated and the connection is closed.

WebSockets Errors

In the Real-time SaaS, an in-band error message can be followed by a WebSocket close message. The table below shows the possible WebSocket close codes and associated error types. The error types are provided in the payload of the close message.

WebSocket Close Code	WebSocket Close Payload
1003	`protocol_error`
1011	`internal_error`
4001	`not_authorised`
4003	`not_allowed`
4004	`invalid_model`
4005	`quota_exceeded`
4006	`timelimit_exceeded`
4013	`job_error`

There may be other WebSocket close codes not mentioned here, such as code 1006 for abnormal_close. This may happen when the underlying TCP connection is unexpectedly terminated, for instance because of network issues.

A full list of WebSockets error codes can be downloaded here.

Warning

Warning messages have the following fields:

message: "Warning"
code (Int): Optional. A numerical code for the warning.
type (String): Required. A code for the warning message. See the list of possible warnings below.
reason (String): Required. A human-readable reason for the warning message.

Warning Types

type: "duration_limit_exceeded"
- The maximum allowed duration of a single utterance to process has been exceeded. Any AddAudio messages received that exceed this limit are confirmed with AudioAdded, but are ignored by the transcription engine. Exceeding the limit triggers the same mechanism as receiving an EndOfStream message, so the Server will eventually send an EndOfTranscript message and suspend.
- It has the following extra field:
  - duration_limit (Number): The limit that was exceeded (in seconds).
type: "unsupported_translation_pair"
- One of the requested translation target languages is unsupported (given the source audio language).
- The warning message specifies the unsupported language pair.

Info

Info messages denote additional information sent from the Server to the Client. They are similar to Error and Warning messages in syntax, but don't actually denote any problem. The Client can safely ignore these messages or use them for additional client-side logging.

message: "Info"
code (Int): Optional. A numerical code for the informational message.
type (String): Required. A code for the info message. See the list of possible info messages below.
reason (String): Required. A human-readable reason for the informational message.

Info Message Types

type: "recognition_quality"
- Informs the client what particular quality-based model is used to handle the recognition.
- It has the following extra field:
  - quality (String): Quality-based model name. It is one of "telephony", "broadcast". The model is selected automatically: the broadcast model is used for high-quality audio (12kHz+), and the telephony model is used for lower-quality audio.
type: "model_redirect"
- Informs the client that a deprecated language code has been specified, and will be handled with a different model. For example, if the model parameter is set to one of en-US, en-GB, or en-AU, then the request may be internally redirected to the Global English model (en).
type: "deprecated"
- Informs about using a feature that is going to be removed in a future release.
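
The three status message kinds share the same type and reason fields, so one handler can cover them. The following is a sketch of one possible policy (raise on Error, log Warning and Info); TranscriptionError, handle_status_message and the logger name are hypothetical, and messages are assumed to be already parsed from JSON.

```python
import logging

logger = logging.getLogger("rt_client")


class TranscriptionError(Exception):
    """Raised when the server sends an Error message (hypothetical name)."""

    def __init__(self, error_type: str, reason: str) -> None:
        super().__init__(f"{error_type}: {reason}")
        self.error_type = error_type
        self.reason = reason


def handle_status_message(message: dict) -> bool:
    """Handle Error, Warning and Info; return False for other message kinds."""
    kind = message.get("message")
    if kind == "Error":
        # After any error the session is terminated, so surface it to the caller.
        raise TranscriptionError(message["type"], message["reason"])
    if kind == "Warning":
        logger.warning("%s: %s", message["type"], message["reason"])
        return True
    if kind == "Info":
        logger.info("%s: %s", message["type"], message["reason"])
        return True
    return False
```

Returning False for everything else lets the caller pass transcript messages on to its own handling.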

Configuration Settings

JSON config objects that are sent by the client to the server as part of the StartRecognition or SetRecognitionConfig messages.

Transcription Config

A TranscriptionConfig object specifies various configuration values for the transcription. All fields except language are optional, using default values when omitted.

language (String): Required. Language model to process the audio input, normally specified as an ISO language code, e.g., 'en'. The value must be consistent with the language code used in the API endpoint URL.
additional_vocab (List:AdditionalWord): Optional. Configure custom dictionary. Default is an empty list. You should be aware that there is a performance penalty (latency degradation and memory increase) from using additional_vocab, especially if you use a large word list. When initializing a session that uses additional_vocab in the config, you should expect a delay of up to 15 seconds (depending on the size of the list).
diarization (String): Optional. The Speaker Diarization method to apply to the audio. The default is "none" indicating that no diarization will be performed.
speaker_diarization_config (Object): Optional. Allows you to prevent too many speakers from being detected by using the max_speakers setting. The default value for max_speakers is 50. The minimum and maximum values are 2 and 100 inclusive. See Max Speakers.
enable_partials (Boolean): Optional. Whether or not to send partials (i.e. AddPartialTranscript messages) as well as finals (i.e. AddTranscript messages). The default is false.
max_delay (Number): Optional. Maximum delay in seconds between receiving input audio and returning final transcription results. The default is 10. The minimum and maximum values are 2 and 20. See Real-Time Latency for more details.
max_delay_mode (String): Optional. Allowed values are fixed and flexible. The default is flexible. Where an entity is detected when using flexible, the latency of a transcript may exceed the threshold specified in max_delay to allow recognition of entities to be more accurate. When using fixed, the transcript will be returned in segments that will never exceed the max_delay threshold even if this results in inaccuracies in entity recognition.
output_locale (String): Optional. Configure output locale. See Output Locale. Default is an empty string.
punctuation_overrides (Object): Optional. Options for controlling punctuation in the output transcripts. See Punctuation Settings.
operating_point (String): Optional. Which model within the language pack you wish to use for transcription with a choice between standard and enhanced. See Requesting an Enhanced Model for more details.
enable_entities (Boolean): Optional. Whether a user wishes for entities to be identified with additional spoken and written word format. Supported values true or false. The default is false.
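
Some of the ranges and defaults above can be checked on the client before a session is started. This is a partial sketch under stated assumptions: validate_transcription_config is a hypothetical name, it covers only language, max_delay, max_delay_mode and max_speakers, and it treats the max_delay bounds as inclusive.

```python
def validate_transcription_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not config.get("language"):
        problems.append("language is required")
    max_delay = config.get("max_delay", 10)  # documented default is 10
    if not 2 <= max_delay <= 20:
        problems.append("max_delay must be between 2 and 20")
    if config.get("max_delay_mode", "flexible") not in ("fixed", "flexible"):
        problems.append("max_delay_mode must be fixed or flexible")
    max_speakers = config.get("speaker_diarization_config", {}).get("max_speakers", 50)
    if not 2 <= max_speakers <= 100:
        problems.append("max_speakers must be between 2 and 100")
    return problems
```

The server remains the authority on what is valid; a check like this only catches obvious mistakes early.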

Translation Config

A TranslationConfig object specifies various configuration values for translation. All fields except target_languages are optional, using default values when omitted.

target_languages (Array): Required. List of languages to translate to from the source transcription language, each specified as an ISO language code.
enable_partials (Boolean): Optional. Whether or not to send Partials (i.e. AddPartialTranslation messages) as well as Finals (i.e. AddTranslation messages). The default is false.

WebSocket Call Flow

A basic WebSocket call flow looks like this:

The WebSocket client initiates a secure connection to the server with an HTTPS GET request (which is called an Upgrade request in WebSocket terminology). The Speechmatics Real-Time SaaS expects the API Key in an Authorization header to authenticate the client.
The client receives a 400 Bad Request or 405 Method Not Allowed HTTP response if the handshake request is malformed.
If the handshake is successful, the server sends a 101 Switching Protocols response to the client and upgrades the connection to the WebSocket protocol.
After a successful handshake, the server attempts to authenticate and connect to the transcription backend. The server could send one of the following errors (as an in-band message followed by a WebSocket close handshake):
- 4001 not_authorised - authentication failed, usually because the API Key is invalid
- 4003 not_allowed - forbidden, typically because the API Key is expired
- 4004 invalid_model - an unsupported language was requested
- 4005 quota_exceeded - the maximum number of concurrent connections allowed for the contract has been reached
- 4006 timelimit_exceeded - the usage quota for the contract has been reached
- 4013 job_error or 1011 internal_error - the service unexpectedly fails to start transcription
Once the WebSocket connection is established, the client starts sending the sequence of audio frames to the server. The server starts transcribing the audio and sends the transcription asynchronously to the client.
When the client does not want to send any more audio, it sends an EndOfStream message to the server.
The client waits for the final EndOfTranscript message from the server before disconnecting.
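
The message ordering in the call flow above can be modelled as a small state machine, which helps a client avoid protocol_error responses caused by messages sent in the wrong order. This is a sketch only: the Stage names, the "send:" and "recv:" event labels and the advance function are hypothetical and not part of the API.

```python
from enum import Enum, auto


class Stage(Enum):
    CONNECTED = auto()   # WebSocket open, nothing sent yet
    STARTING = auto()    # StartRecognition sent, waiting for RecognitionStarted
    STREAMING = auto()   # AddAudio may be sent
    ENDING = auto()      # EndOfStream sent, waiting for EndOfTranscript
    DONE = auto()        # safe to disconnect


# (current stage, event) -> next stage; anything else is out of order.
TRANSITIONS = {
    (Stage.CONNECTED, "send:StartRecognition"): Stage.STARTING,
    (Stage.STARTING, "recv:RecognitionStarted"): Stage.STREAMING,
    (Stage.STREAMING, "send:AddAudio"): Stage.STREAMING,
    (Stage.STREAMING, "send:EndOfStream"): Stage.ENDING,
    (Stage.ENDING, "recv:EndOfTranscript"): Stage.DONE,
}


def advance(stage: Stage, event: str) -> Stage:
    """Return the next stage, or raise if the event is not allowed yet."""
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise RuntimeError(f"{event} is not allowed in stage {stage.name}") from None
```

Transcript, AudioAdded, Info and Warning messages do not change the stage, so they are left out of the table.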

Example Communication

The communication consists of 3 stages - initialization, transcription and a disconnect handshake.

On initialization, the StartRecognition message is sent from the Client to the API and the Client must block and wait until it receives a RecognitionStarted message.

Afterwards, the transcription stage happens. The Client keeps sending AddAudio messages. The API asynchronously replies with AudioAdded messages. The API also asynchronously sends AddTranscript messages and, if Partials are enabled, AddPartialTranscript messages.

Once the Client doesn't want to send any more audio, the disconnect handshake is performed. The Client sends an EndOfStream message as its last message. No more messages are handled by the API afterwards. The API processes whatever audio it has buffered at that point and sends all the AddTranscript and AddPartialTranscript messages accordingly. Once the API processes all the buffered audio, it sends an EndOfTranscript message and the Client can then safely disconnect.

Note: In the example below, -> denotes a message sent by the Client to the API, and <- denotes a message sent by the API to the Client. Any comments are denoted "[like this]".

-> {"message": "StartRecognition", "audio_format": {"type": "file"},
    "transcription_config": {"language": "en", "enable_partials": true}}

 <- {"message": "RecognitionStarted", "id": "807670e9-14af-4fa2-9e8f-5d525c22156e"}

->  "[binary message - AddAudio 1]"
->  "[binary message - AddAudio 2]"

 <- {"message": "AudioAdded", "seq_no": 1}
 <- {"message": "Info", "type": "recognition_quality", "quality": "broadcast", "reason": "Running recognition using a broadcast model quality."}
 <- {"message": "AudioAdded", "seq_no": 2}

->  "[binary message - AddAudio 3]"

 <- {"message": "AudioAdded", "seq_no": 3}

"[asynchronously received transcripts:]"

 <- {"message": "AddPartialTranscript", "metadata": {"start_time": 0.0, "end_time": 0.5399999618530273, "transcript": "One"},
     "results": [{"alternatives": [{"confidence": 0.0, "content": "One"}],
                  "start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"}
                ]}
 <- {"message": "AddPartialTranscript", "metadata": {"start_time": 0.0, "end_time": 0.7498992613545260, "transcript": "One to"},
     "results": [{"alternatives": [{"confidence": 0.0, "content": "One"}],
                  "start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"},
                 {"alternatives": [{"confidence": 0.0, "content": "to"}],
                  "start_time": 0.6091238623430891, "end_time": 0.7498992613545260, "type": "word"}
                ]}
 <- {"message": "AddPartialTranscript", "metadata": {"start_time": 0.0, "end_time": 0.9488123643240011, "transcript": "One to three"},
     "results": [{"alternatives": [{"confidence": 0.0, "content": "One"}],
                  "start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"},
                 {"alternatives": [{"confidence": 0.0, "content": "to"}],
                  "start_time": 0.6091238623430891, "end_time": 0.7498992613545260, "type": "word"},
                 {"alternatives": [{"confidence": 0.0, "content": "three"}],
                  "start_time": 0.8022338627780892, "end_time": 0.9488123643240011, "type": "word"}
                ]}
 <- {"message": "AddTranscript", "metadata": {"start_time": 0.0, "end_time": 0.9488123643240011, "transcript": "One two three."},
     "results": [{"alternatives": [{"confidence": 1.0, "content": "One"}],
                  "start_time": 0.47999998927116394, "end_time": 0.5399999618530273, "type": "word"},
                 {"alternatives": [{"confidence": 1.0, "content": "two"}],
                  "start_time": 0.6091238623430891, "end_time": 0.7498992613545260, "type": "word"},
                 {"alternatives": [{"confidence": 0.96, "content": "three"}],
                  "start_time": 0.8022338627780892, "end_time": 0.9488123643240011, "type": "word"},
                 {"alternatives": [{"confidence": 1.0, "content": "."}],
                  "start_time": 0.9488123643240011, "end_time": 0.9488123643240011, "type": "punctuation", "is_eos": true}
                ]}

"[closing handshake]"

->  {"message":"EndOfStream","last_seq_no":3}

 <- {"message": "EndOfTranscript"}
