Output

Learn about latency in the Speechmatics Real-Time server

The Realtime API returns transcription output and other information through a variety of messages; see Received Messages.

Which messages are returned, and how frequently, depends on how the session is configured in the StartRecognition message.

Latency

When transcribing in real-time, you can control the maximum time to wait for the final transcript. This could be as fast as 0.7 seconds, though allowing a longer time will give a slight accuracy improvement.

For even faster output, use Partial transcripts to receive transcription output before higher-accuracy final transcripts are returned.

Configuration example

The following example shows a typical configuration for low latency applications. Include this in the StartRecognition message.

{
  "type": "transcription",
  "transcription_config": {
    "max_delay": 0.7,
    "max_delay_mode": "flexible",
    "enable_partials": true,
    "language": "en",
    "operating_point": "enhanced"
  }
}

max_delay (Number): Optional. Allowed between 0.7 and 4 seconds. Default is 4 seconds. This is the delay in seconds between the end of a spoken word and returning the Final transcript results. Note that there is a very small amount of additional latency while the server is sending the transcript to the client.
max_delay_mode (String): Optional. Allowed values are fixed and flexible. Default is flexible. The flexible mode allows some additional time for Numeral Formatting.
enable_partials (Boolean): Optional. Default is false. Whether or not to receive Partial transcripts before the Final transcripts are received.
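
As a rough illustration of the limits listed above, the following Python sketch builds a transcription_config and rejects values outside the documented ranges. The function name build_transcription_config and its argument defaults are hypothetical helpers for this example, not part of any Speechmatics SDK, and the sketch assumes that language is the only other field you want to set.

```python
import json

ALLOWED_DELAY_MODES = {"fixed", "flexible"}


def build_transcription_config(max_delay=4.0, max_delay_mode="flexible",
                               enable_partials=False, language="en"):
    """Return a transcription_config dict (hypothetical helper).

    Defaults mirror the documented ones: max_delay 4 seconds,
    max_delay_mode flexible, enable_partials false.
    """
    if not 0.7 <= max_delay <= 4:
        raise ValueError("max_delay must be between 0.7 and 4 seconds")
    if max_delay_mode not in ALLOWED_DELAY_MODES:
        raise ValueError("max_delay_mode must be 'fixed' or 'flexible'")
    return {
        "language": language,
        "max_delay": max_delay,
        "max_delay_mode": max_delay_mode,
        "enable_partials": bool(enable_partials),
    }


# Low latency settings: shortest allowed delay, Partials switched on.
low_latency = build_transcription_config(max_delay=0.7, enable_partials=True)
print(json.dumps({"transcription_config": low_latency}, indent=2))
```

Validating on the client side is a design choice for this sketch; it simply surfaces an out-of-range value before the session is started.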

Accuracy/Latency trade-offs

We recommend experimenting with different settings for the max_delay to find the right trade-off between accuracy and latency for your application. Based on our own testing and experience, we can offer a few guidelines to get you started.

Setting max_delay to between 0.7 and 1.5 seconds gives an accuracy degradation of less than 5% relative when compared to the Batch transcription service. This trade-off is worthwhile for use cases that need ultra-fast responses, such as real-time conversational AI.

At 2 seconds max_delay, there is around 1% relative accuracy degradation when compared to the Batch transcription service. This is the recommended setting for most use cases, such as broadcast captioning.

For the best accuracy, we recommend using a max_delay of 4 seconds, which is equivalent to our Batch transcription service. This can be combined with Partial transcripts to give users early feedback on the recognized text.
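
The guidelines above could be captured as a small lookup table when an application needs to pick a starting value. This is only a sketch: the use-case keys and the suggest_max_delay function are hypothetical names, and 0.7 is just one value from the 0.7 to 1.5 second range mentioned for conversational AI.

```python
# Starting points taken from the guidelines above (seconds).
# The dictionary keys are illustrative labels, not API values.
GUIDELINE_MAX_DELAY = {
    "conversational_ai": 0.7,     # 0.7 to 1.5 s: under 5% relative degradation
    "broadcast_captioning": 2.0,  # around 1% relative degradation
    "best_accuracy": 4.0,         # equivalent to Batch transcription
}


def suggest_max_delay(use_case):
    """Return a starting max_delay for a use case, defaulting to 2 seconds."""
    return GUIDELINE_MAX_DELAY.get(use_case, 2.0)


print(suggest_max_delay("conversational_ai"))
print(suggest_max_delay("broadcast_captioning"))
```

Treat the returned number as a first guess to experiment from rather than a fixed recommendation.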

Partial transcripts

Partial transcripts let you receive preliminary transcription that updates as more context becomes available, until the higher-accuracy Finals are returned. Partials are typically returned in less than 500 milliseconds. Enable them with the enable_partials config option.

After each Final transcript, you immediately receive a Partial transcript containing any remaining words that have not yet been finalized.

Note that Partial transcripts have some limitations:

Accuracy is usually 10-25% lower than the Final transcript. This includes punctuation and capitalization of words.
The confidence field for Partial transcripts has no meaning and should not be relied on.
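
One way a client might combine the two message kinds for a live display is sketched below: Final text is appended and never revised, while the latest Partial text replaces whatever Partial was shown before. The LiveTranscript class is a hypothetical helper, and the sketch assumes the client has already extracted the transcript text from each received message.

```python
class LiveTranscript:
    """Tracks finalized text plus the most recent Partial (hypothetical helper)."""

    def __init__(self):
        self.finals = []   # finalized segments, never revised
        self.pending = ""  # latest Partial text, replaced on every update

    def on_partial(self, text):
        # A new Partial supersedes the previous one entirely.
        self.pending = text

    def on_final(self, text):
        # Finalized words are committed; any words still pending arrive
        # in the Partial that follows the Final.
        self.finals.append(text)
        self.pending = ""

    def display(self):
        parts = self.finals + ([self.pending] if self.pending else [])
        return " ".join(parts)


view = LiveTranscript()
view.on_partial("I")
view.on_partial("I am")
print(view.display())
view.on_final("I am 35.")
print(view.display())
```

Because the confidence field on Partials carries no meaning, the sketch deliberately keeps only the text.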

Smart formatting

Smart Formatting ensures readability of your transcripts by formatting numbers, dates, currencies and other important entities into their written form.

When max_delay_mode is set to flexible and an entity is being spoken, the Final transcript is delayed until the entity is fully spoken, to enable proper formatting. This option should be used in most use cases for improved accuracy and readability of numbers, currencies, and dates.

If you have strict latency requirements, and prefer not to wait for entity formatting to complete, set max_delay_mode to fixed. Note that in this mode, there will be some reduction in accuracy and readability for numbers, currencies, and dates.

Example outputs (partials and finals)

With only Finals and default max_delay_mode, messages received could look like the following:

(Final): I am 35.

Final output: I am 35.

With Partials enabled and default max_delay_mode, messages received could look like the following:

(Partial): I
(Partial): I am
(Partial): I am third
(Partial): I am 30
(Final): I am 35.

Final output: I am 35.

With Partials enabled and max_delay_mode as fixed, messages received could look like the following:

(Partial): I
(Final): I am
(Partial): third
(Final): 30
(Partial): five
(Final): five.

Final output: I am 30 five.
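
The sequences above can be replayed in a few lines of Python to see how the final output is assembled. This sketch assumes each message has been reduced to a (kind, text) pair and that Final segments are joined with a single space; final_output is a hypothetical helper name.

```python
def final_output(messages):
    """Join the text of Final messages; Partials never reach the final output."""
    return " ".join(text for kind, text in messages if kind == "Final")


# Partials enabled, default (flexible) max_delay_mode.
flexible = [
    ("Partial", "I"),
    ("Partial", "I am"),
    ("Partial", "I am third"),
    ("Partial", "I am 30"),
    ("Final", "I am 35."),
]

# Partials enabled, max_delay_mode set to fixed.
fixed = [
    ("Partial", "I"),
    ("Final", "I am"),
    ("Partial", "third"),
    ("Final", "30"),
    ("Partial", "five"),
    ("Final", "five."),
]

print(final_output(flexible))  # I am 35.
print(final_output(fixed))     # I am 30 five.
```

In fixed mode the number is finalized before it has been fully spoken, which is why it ends up split across two Final segments.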
