Real-Time Latency

Transcription: Real-Time | Deployments: All

When transcribing in real-time, you can control the maximum time to wait for the final transcript using the max_delay and max_delay_mode transcription config options. You can also use enable_partials to receive Partial transcripts in just a few hundred milliseconds.

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "max_delay": 2.0,
    "max_delay_mode": "fixed",
    "enable_partials": true
  }
}

The max_delay parameter controls the maximum latency of Finals in the real-time transcription engine. Finals latency is the delay, in seconds, between the engine receiving input audio and returning the Final transcription results for it. The default value of max_delay is 10 seconds; the minimum is 0.7 seconds and the maximum is 20 seconds. Note that max_delay has no impact on how Partials are returned.
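As a rough illustration of the limits above, the following Python sketch builds a transcription_config and rejects a max_delay outside the documented range before anything is sent. The function name build_transcription_config and the constants are hypothetical helpers, not part of any SDK; only the field names and the numeric limits come from this page.

```python
import json

# Limits quoted in the text above (seconds).
MAX_DELAY_MIN = 0.7
MAX_DELAY_MAX = 20.0
MAX_DELAY_DEFAULT = 10.0


def build_transcription_config(language, max_delay=MAX_DELAY_DEFAULT, enable_partials=None):
    """Hypothetical helper: return a transcription_config dict.

    Raises ValueError when max_delay is outside the documented range,
    so the mistake surfaces client-side rather than at session start.
    """
    if not MAX_DELAY_MIN <= max_delay <= MAX_DELAY_MAX:
        raise ValueError(
            f"max_delay must be between {MAX_DELAY_MIN} and {MAX_DELAY_MAX} seconds, "
            f"got {max_delay}"
        )
    config = {"language": language, "max_delay": max_delay}
    # Only include enable_partials when the caller set it explicitly.
    if enable_partials is not None:
        config["enable_partials"] = enable_partials
    return config


config = build_transcription_config("en", max_delay=2.0, enable_partials=True)
print(json.dumps({"transcription_config": config}, indent=2))
```

Validating locally is a design choice for faster feedback; the service remains the authority on which values it accepts.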

Max Delay Mode

Holding Finals strictly to the max_delay threshold can increase the potential for inaccuracies in the transcript, especially around entities such as numerals, currencies, and dates.

Setting max_delay_mode to flexible lets the latency exceed max_delay, but only when a potential entity has been detected. Entities are common concepts such as numbers, currencies and dates, and are discussed in more detail here.

There are two options for max_delay_mode: fixed and flexible. The default is flexible.

  • flexible improves accuracy in entity recognition by allowing the latency to exceed the max_delay threshold when a potential entity is detected
  • fixed ensures that processing of final transcripts is constrained by the max_delay threshold, even if this results in less accurate transcription of entities
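One practical consequence of the two modes is how a client might set a watchdog for Finals. The sketch below is a simplified client-side model, not a description of the engine: the function finals_deadline is hypothetical, it ignores network transit time, and it treats max_delay as a hard bound only in fixed mode because flexible mode may exceed it when an entity is detected.

```python
from typing import Optional

# The two modes named in the text above.
VALID_MODES = ("fixed", "flexible")


def finals_deadline(audio_received_at: float, max_delay: float,
                    mode: str = "flexible") -> Optional[float]:
    """Hypothetical helper: latest time (seconds) by which a Final is expected.

    Returns audio_received_at + max_delay in fixed mode. Returns None in
    flexible mode, where the latency may exceed max_delay around entities
    and so no hard deadline can be assumed from max_delay alone.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"max_delay_mode must be one of {VALID_MODES}, got {mode!r}")
    if mode == "fixed":
        return audio_received_at + max_delay
    return None


print(finals_deadline(12.0, 2.0, mode="fixed"))
print(finals_deadline(12.0, 2.0, mode="flexible"))
```

An application that needs a strict latency bound, such as live captioning with a fixed on-screen delay, would lean towards fixed; one that values correctly formatted numbers and dates would keep the flexible default.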

Partial Transcripts

Partial transcripts are enabled using the enable_partials config option. Partials allow users to receive transcription output before the higher-accuracy Finals are returned. Partials are typically returned within 500-800 milliseconds.

When Partial transcripts are enabled, Final transcripts are still returned. Partials are updated as more audio is received and further context is understood. This improves the accuracy until a Final transcript is generated for that section of audio. Once a Final is received, the partials are reset to empty.
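The update-then-reset behaviour described above can be sketched as a small display buffer. This is a minimal illustration under stated assumptions: it assumes each server message is a dict whose "message" field is "AddPartialTranscript" or "AddTranscript" and whose text sits under metadata.transcript, and that each Partial covers only audio not yet finalised. The class name TranscriptBuffer is hypothetical.

```python
class TranscriptBuffer:
    """Hypothetical helper: keep finalised text plus the latest Partial."""

    def __init__(self):
        self.final_text = ""    # text confirmed by Finals, never revised
        self.partial_text = ""  # latest Partial, replaced on every update

    def handle(self, message: dict) -> str:
        """Apply one message and return the text to display."""
        kind = message.get("message")
        text = message.get("metadata", {}).get("transcript", "")
        if kind == "AddPartialTranscript":
            # A new Partial supersedes the previous one for the same audio.
            self.partial_text = text
        elif kind == "AddTranscript":
            # A Final commits the text and resets the Partial to empty.
            self.final_text += text
            self.partial_text = ""
        return self.final_text + self.partial_text


buffer = TranscriptBuffer()
messages = [
    {"message": "AddPartialTranscript", "metadata": {"transcript": "hello wor"}},
    {"message": "AddPartialTranscript", "metadata": {"transcript": "hello world"}},
    {"message": "AddTranscript", "metadata": {"transcript": "Hello world. "}},
    {"message": "AddPartialTranscript", "metadata": {"transcript": "how are"}},
]
for msg in messages:
    print(repr(buffer.handle(msg)))
```

Keeping the two strings separate means the display can style unconfirmed text differently, for example in a lighter colour, without ever rewriting text that a Final has already committed.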

Note that Partial transcripts have some limitations:

  • Accuracy is usually 10-25% lower than the Final transcript. This includes lower accuracy of punctuation and capitalisation of words.
  • Numeral formatting is not returned in Partial transcripts.
  • Diarization is not returned in Partial transcripts.
  • The confidence field for Partial transcripts has no meaning and should not be relied on.
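Because the confidence field carries no meaning for Partials, a client that aggregates confidence scores may want to read them from Finals only. The sketch below assumes a message layout in which each entry of "results" has an "alternatives" list whose first item holds "content" and "confidence"; that layout and the function name word_confidences are assumptions made for illustration.

```python
def word_confidences(message: dict) -> list:
    """Hypothetical helper: return (word, confidence) pairs from a Final.

    Partials and any other message types yield an empty list, so their
    confidence values can never leak into downstream statistics.
    """
    if message.get("message") != "AddTranscript":
        return []
    pairs = []
    for result in message.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            best = alternatives[0]  # assumed to be the top hypothesis
            pairs.append((best["content"], best["confidence"]))
    return pairs


final = {
    "message": "AddTranscript",
    "results": [
        {"alternatives": [{"content": "Hello", "confidence": 0.98}]},
        {"alternatives": [{"content": "world", "confidence": 0.91}]},
    ],
}
partial = dict(final, message="AddPartialTranscript")

print(word_confidences(final))
print(word_confidences(partial))
```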