Realtime diarization

Learn how to use the Speechmatics API to separate speakers in real-time

To learn more about diarization as a feature, check out the diarization page.

Overview

Real-time diarization offers the following ways to separate speakers in audio:

  • Speaker diarization — Identifies each speaker by their voice.
    Useful when there are multiple speakers in the same audio stream.

  • Channel diarization — Transcribes each audio channel separately.
    Useful when each speaker is recorded on their own channel.

  • Channel & speaker diarization — Combines both methods.
    Each channel is transcribed separately, with unique speakers identified within each channel.
    Useful when multiple speakers are present across multiple channels.

Speaker diarization

Speaker diarization picks out different speakers from the audio stream based on acoustic matching.

To enable speaker diarization, set diarization to speaker in the transcription config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}
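This config is carried in the StartRecognition message that opens the WebSocket session. Below is a minimal Python sketch of starting such a session; the endpoint URL, credential, and header-passing style are illustrative placeholders rather than values confirmed on this page:

# Minimal sketch: open a real-time session with speaker diarization enabled.
# The URL and API key are placeholders; use the values for your deployment.
import asyncio
import json

import websockets  # pip install websockets

URL = "wss://eu2.rt.speechmatics.com/v2"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                  # placeholder credential

START_RECOGNITION = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_f32le", "sample_rate": 48000},
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
    },
}

async def main():
    # Note: older versions of the websockets package call this argument
    # extra_headers; newer ones (>= 14) call it additional_headers.
    async with websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        await ws.send(json.dumps(START_RECOGNITION))
        # The server replies with RecognitionStarted if the config is accepted.
        print(json.loads(await ws.recv()))

asyncio.run(main())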

When diarization is enabled, each word and punctuation object in the transcript includes a speaker property that identifies who spoke it. There are two types of labels:

  • S# – S stands for speaker, and # is a sequential number identifying each speaker. S1 appears first in the results, followed by S2, S3, and so on.
  • UU – Used when the speaker cannot be identified or diarization is not applied, for example, if background noise is transcribed as speech but no speaker can be determined.
  "results": [
{
"alternatives": [
{
"confidence": 0.93,
"content": "hello",
"language": "en",
"speaker": "S1"
}
],
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "hi",
"language": "en",
"speaker": "S2"
}
],
}]
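Because every word carries its own speaker property, a client can rebuild speaker turns by grouping consecutive results with the same label. A minimal sketch over the results shape shown above:

# Minimal sketch: collapse per-word speaker labels into speaker turns.
# `results` has the shape shown above; punctuation spacing is simplified.
def to_turns(results):
    turns = []  # list of [speaker, text] pairs
    for result in results:
        best = result["alternatives"][0]
        speaker, content = best["speaker"], best["content"]
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + content
        else:
            turns.append([speaker, content])
    return turns

results = [
    {"alternatives": [{"content": "hello", "speaker": "S1"}]},
    {"alternatives": [{"content": "hi", "speaker": "S2"}]},
]
print(to_turns(results))  # [['S1', 'hello'], ['S2', 'hi']]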

Channel diarization

This feature is coming soon!

Subscribe to our release notes to be notified when it's available.

Channel diarization processes audio with multiple channels and returns a separate transcript for each one. This gives you perfect speaker separation at the channel level and more accurate handling of cross-talk.

To enable channel diarization, set diarization to channel and provide a label for each channel in channel_diarization_labels in the transcription_config of the StartRecognition message:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["New_York", "Shanghai", "Paris"]
  }
}

You should see a channels field in the RecognitionStarted message which lists all the channels you requested:

{
  "message": "RecognitionStarted",
  ...
  "channels": ["New_York", "Shanghai", "Paris"]
}

Send audio to a channel

To send audio for a specific channel, you can use the AddChannelAudio message. You'll need to encode the data in base64 format:

{
  "message": "AddChannelAudio",
  "channel": "New_York",
  "data": <base_64_encoded_data>
}

You should get an acknowledgement in the form of a ChannelAudioAdded message from the server, with a corresponding sequence number for the channel:

{
  "message": "ChannelAudioAdded",
  "channel": "New_York",
  "seq_no": 10
}
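Putting the two messages together, the sketch below base64-encodes one chunk of audio, sends it as AddChannelAudio, and reads the acknowledgement. It assumes an open WebSocket ws from the earlier StartRecognition sketch, and for simplicity assumes the next message received is the ack; a real client should dispatch on the message field, since transcripts arrive on the same connection:

# Minimal sketch: send one audio chunk (bytes) to a named channel and
# await the ack. Assumes `ws` is an open session (see the earlier sketch).
import base64
import json

async def send_channel_chunk(ws, channel, chunk):
    await ws.send(json.dumps({
        "message": "AddChannelAudio",
        "channel": channel,
        "data": base64.b64encode(chunk).decode("ascii"),
    }))
    # Simplification: assumes the ack is the next message received.
    ack = json.loads(await ws.recv())
    if ack.get("message") == "ChannelAudioAdded":
        print(f"{ack['channel']}: acknowledged seq_no {ack['seq_no']}")
    return ack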

Transcript response

Transcripts are returned independently for each channel, with the channel property identifying the channel.

{
  "message": "AddTranscript",
  "channel": "New_York",
  ...
  "results": [
    {
      "type": "word",
      "start_time": 1.45,
      "end_time": 1.8,
      "alternatives": [{
        "language": "en",
        "content": "Hello,",
        "confidence": 0.98
      }]
    }
  ]
}
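Since messages for all channels arrive interleaved on one connection, a receiving loop typically routes each AddTranscript by its channel field. A minimal sketch:

# Minimal sketch: route incoming AddTranscript messages by channel.
import json
from collections import defaultdict

transcripts = defaultdict(list)  # channel label -> list of words

def handle_message(raw):
    msg = json.loads(raw)
    if msg.get("message") == "AddTranscript":
        for result in msg.get("results", []):
            word = result["alternatives"][0]["content"]
            transcripts[msg["channel"]].append(word)

handle_message('{"message": "AddTranscript", "channel": "New_York",'
               ' "results": [{"type": "word", "alternatives":'
               ' [{"content": "Hello,"}]}]}')
print(dict(transcripts))  # {'New_York': ['Hello,']}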

Channel and speaker diarization

Channel and speaker diarization combines speaker diarization and channel diarization, splitting transcripts per channel whilst also separating individual speakers in each channel.

To enable this mode, follow the steps in speaker diarization and set the diarization mode to channel_and_speaker.

To send audio to a specific channel, follow the instructions in send audio to a channel.

Transcripts are returned in the same way as channel diarization, but with individual speakers identified:

{
  "message": "AddTranscript",
  "channel": "New_York",
  "results": [
    {
      "alternatives": [{
        "content": "Hello",
        "confidence": 0.98,
        "speaker": "S1"
      }]
    },
    ...
    {
      "alternatives": [{
        "content": "Hi",
        "confidence": 0.98,
        "speaker": "S2"
      }]
    }
  ]
}
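Note that speakers are identified within each channel, so a client that needs one identity per person across the whole session can key on the (channel, speaker) pair. A short sketch:

# Minimal sketch: derive a session-wide speaker identity in
# channel_and_speaker mode by combining the channel and speaker labels.
def speaker_id(msg, result):
    return f"{msg['channel']}/{result['alternatives'][0]['speaker']}"

msg = {
    "message": "AddTranscript",
    "channel": "New_York",
    "results": [{"alternatives": [{"content": "Hello", "speaker": "S1"}]}],
}
print(speaker_id(msg, msg["results"][0]))  # New_York/S1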

Limits

For SaaS customers, the maximum number of channels is 2.

For On-prem Container customers, the maximum number of channels depends on your Multi-session container's maximum number of connections.

Configuration

You can customize diarization to match your use case by adjusting settings for sensitivity, limiting the maximum number of speakers, preferring the current speaker to reduce false switches, and controlling how punctuation influences accuracy.

Speaker sensitivity

You can configure the sensitivity of speaker detection by using the speaker_sensitivity setting in the speaker_diarization_config section of the transcription config, as shown below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "speaker_sensitivity": 0.6
    }
  }
}

This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood of detecting more unique speakers.

Prefer Current Speaker

You can reduce the likelihood of incorrectly switching between similar sounding speakers by setting the prefer_current_speaker flag in the speaker_diarization_config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "prefer_current_speaker": true
    }
  }
}

By default this is false. When set to true, the system stays with the speaker of the previous word if they closely match the speaker of the new word.

This may result in some shorter speaker turn changes between similar speakers being missed.

Max. Speakers

You can prevent too many speakers from being detected by using the max_speakers setting in the StartRecognition message as shown below:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 48000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "max_speakers": 10
    }
  }
}

The default value is 50, but it can take any integer value between 2 and 100 inclusive.
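The documented ranges for these settings can be checked client-side before the config is sent. The helper below is an illustrative convenience, not part of the API; its defaults mirror those documented above:

# Minimal sketch: assemble a speaker_diarization_config, enforcing the
# documented ranges (sensitivity 0-1, max_speakers 2-100, inclusive).
# Illustrative only; the server performs its own validation.
def speaker_diarization_config(sensitivity=0.5, prefer_current=False,
                               max_speakers=50):
    if not 0 <= sensitivity <= 1:
        raise ValueError("speaker_sensitivity must be between 0 and 1")
    if not 2 <= max_speakers <= 100:
        raise ValueError("max_speakers must be an integer from 2 to 100")
    return {
        "speaker_sensitivity": sensitivity,
        "prefer_current_speaker": prefer_current,
        "max_speakers": max_speakers,
    }

print(speaker_diarization_config(sensitivity=0.6, max_speakers=10))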

Punctuation

Speaker diarization uses punctuation to improve accuracy. Small corrections are applied to speaker labels based on sentence boundaries.

For example, if the system initially assigns 9 words in a sentence to S1 and 1 word to S2, the lone S2 word may be corrected to S1.

This adjustment only works when punctuation is enabled. Disabling punctuation via the permitted_marks setting in punctuation_overrides can reduce diarization accuracy.

Adjusting punctuation sensitivity can also affect how accurately speakers are identified.
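For reference, a config that restricts punctuation via permitted_marks looks like the sketch below (expressed as a Python dict; the assumption here is that an empty list permits no marks, which is the case that degrades diarization accuracy):

# Illustrative config only: an empty permitted_marks list is assumed to
# leave no punctuation available, removing the sentence-boundary signal
# that speaker diarization uses for label corrections.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "punctuation_overrides": {"permitted_marks": []},
    },
}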

Speaker diarization timeout

Speaker diarization will time out if it takes too long to run for a particular audio file. Currently, the timeout is set to 5 minutes or 0.5 * the audio duration, whichever is longer. For example, with a 2 hour audio file, the timeout is 1 hour. If a timeout happens, the transcript will still be returned and all speaker labels in the output will be labelled as UU.
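Expressed as a formula, the timeout is max(5 minutes, 0.5 × audio duration):

# The documented timeout rule: the longer of 5 minutes and half the
# audio duration.
def diarization_timeout_seconds(audio_duration_seconds):
    return max(5 * 60, 0.5 * audio_duration_seconds)

print(diarization_timeout_seconds(2 * 60 * 60))  # 2 h of audio -> 3600.0 s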

Speaker change (legacy)

The Speaker Change Detection feature was removed in July 2024. The speaker_change and channel_and_speaker_change parameters are no longer supported. Use the Speaker diarization feature for speaker labeling.

For API-related questions, contact support.

On-prem

To run channel or channel_and_speaker diarization with an on-prem deployment, configure your environment as follows:

  • Use a GPU Speech-to-Text container. Handling multiple audio streams is computationally intensive and benefits from GPU acceleration.
  • Set the SM_MAX_CONCURRENT_CONNECTIONS environment variable to match the number of channels you want to process.

For more details on container setup, see the on-prem deployment docs.