Diarization

Learn how to use the Speechmatics diarization offering

Speechmatics offers two different modes for separating out different speakers in the audio:

Speaker diarization: Each speaker will be identified by their voice.
Used in cases where there are multiple speakers in the same audio recording.
Channel diarization: Each audio channel will be transcribed separately.
Used when it's possible to record each speaker on a separate audio channel.

By default, the transcript will not be diarized. For details on configuring each mode, see the relevant section below.

Speaker diarization

Overview

Speaker diarization aggregates all audio channels into a single stream for processing, and picks out different speakers based on acoustic matching.

The feature is disabled by default. To enable Speaker diarization, diarization must be set to speaker in the transcription config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}

When enabled, every word and punctuation object in the output results will be given a "speaker" property, which is a label indicating who said that word. There are two kinds of labels you will see:

S# - S stands for speaker and the # will be an incrementing integer identifying an individual speaker. S1 will appear first in the results, followed by S2, S3, etc.
UU - Diarization is disabled or individual speakers cannot be identified (which only applies when running batch mode on CPU operating points). UU can appear for example if some background noise is transcribed as speech, but the diarization system does not recognize it as a speaker.

Considerations

Enabling diarization increases the time taken to transcribe an audio file. In general, we expect diarization to increase the overall processing time by 10-50%.

The example below shows the relevant parts of a transcript with two speakers:

  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.93,
          "content": "hello",
          "language": "en",
          "speaker": "S1"
        }
      ],
      "end_time": 0.51,
      "start_time": 0.36,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "hi",
          "language": "en",
          "speaker": "S2"
        }
      ],
      "end_time": 12.6,
      "start_time": 12.27,
      "type": "word"
    }]
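
As a minimal sketch of how these labels might be consumed, the Python below collapses consecutive results with the same speaker label into turns. The helper name speaker_turns is hypothetical, and falling back to UU when a result has no speaker property is an assumption made for illustration. Punctuation results, if present, are joined with a space like any other token.

```python
from itertools import groupby


def speaker_turns(results):
    """Collapse consecutive results with the same speaker label into turns."""
    # The speaker label sits on each alternative; only the first one is used here.
    labelled = [
        (r["alternatives"][0].get("speaker", "UU"), r["alternatives"][0]["content"])
        for r in results
        if r.get("alternatives")
    ]
    # groupby only merges adjacent items, so a speaker who returns later
    # starts a new turn.
    return [
        (speaker, " ".join(content for _, content in group))
        for speaker, group in groupby(labelled, key=lambda item: item[0])
    ]


results = [
    {"alternatives": [{"content": "hello", "speaker": "S1"}], "type": "word"},
    {"alternatives": [{"content": "there", "speaker": "S1"}], "type": "word"},
    {"alternatives": [{"content": "hi", "speaker": "S2"}], "type": "word"},
]
print(speaker_turns(results))
```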

Speaker sensitivity

This feature is only available for Batch Transcription.

For batch transcription, you can configure the sensitivity of speaker detection by using the speaker_sensitivity setting in the speaker_diarization_config section of the job config object as shown below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "speaker_sensitivity": 0.6
    }
  }
}

This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood of more unique speakers being returned.
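
If the job config is built in code, the range can be checked before the job is submitted. The Python below is a sketch only: the helper name batch_diarization_config is hypothetical, and treating both 0 and 1 as valid endpoints is an assumption, since the text only says "between 0 and 1".

```python
import json


def batch_diarization_config(speaker_sensitivity=0.5, language="en"):
    """Build a batch job config with speaker diarization enabled."""
    # Assumption: the range is inclusive at both ends.
    if not 0 <= speaker_sensitivity <= 1:
        raise ValueError("speaker_sensitivity must be between 0 and 1")
    return {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            "diarization": "speaker",
            "speaker_diarization_config": {
                "speaker_sensitivity": speaker_sensitivity,
            },
        },
    }


print(json.dumps(batch_diarization_config(0.6), indent=2))
```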

For Real-Time Transcription, you can configure the maximum number of speakers.

Prefer Current Speaker

You can reduce the likelihood of incorrectly switching between similar sounding speakers by setting the prefer_current_speaker flag in the speaker_diarization_config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "prefer_current_speaker": true
    }
  }
}

By default this flag is false. When this flag is set to true, the system will stay with the speaker of the previous word, if they closely match the speaker of the new word.

This may result in some shorter speaker turn changes between similar speakers being missed.

Max. Speakers

This feature is only available for Real-Time Transcription.

For Real-Time Transcription, you can prevent too many speakers from being detected by using the max_speakers setting in the StartRecognition message as shown below:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 48000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "max_speakers": 10
    }
  }
}

The default value is 50, but it can take any integer value between 2 and 100 inclusive.
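
The documented range can be enforced when the StartRecognition message is assembled. The following Python is a sketch under stated assumptions: the function name start_recognition_message is hypothetical, and the audio_format and operating_point values are simply copied from the example above.

```python
import json


def start_recognition_message(max_speakers=50, language="en"):
    """Build a StartRecognition message with a validated max_speakers value."""
    # Range from the text above: any integer between 2 and 100 inclusive.
    # bool is rejected explicitly because it is a subclass of int in Python.
    if (
        isinstance(max_speakers, bool)
        or not isinstance(max_speakers, int)
        or not 2 <= max_speakers <= 100
    ):
        raise ValueError("max_speakers must be an integer between 2 and 100")
    return {
        "message": "StartRecognition",
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 48000,
        },
        "transcription_config": {
            "language": language,
            "operating_point": "enhanced",
            "diarization": "speaker",
            "speaker_diarization_config": {"max_speakers": max_speakers},
        },
    }


print(json.dumps(start_recognition_message(10), indent=2))
```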

For batch transcription, you can configure the Speaker Sensitivity.

Speaker diarization and Punctuation

To enhance the accuracy of our Speaker diarization, we make small corrections to the speaker labels based on the punctuation in the transcript. For example, if our system originally thought that 9 words in a sentence were spoken by speaker S1, and only 1 word by speaker S2, we will correct the incongruous S2 label to be S1. This only works if punctuation is enabled in the transcript.
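
The idea can be illustrated with a simple per-sentence majority vote. This is not the Speechmatics implementation, which is not described here; it is a stand-in technique chosen for illustration, and the function names and the (content, speaker, is_sentence_end) tuple layout are hypothetical. A plain majority vote is also more aggressive than the "small corrections" described above, since it relabels every minority word in a sentence.

```python
from collections import Counter


def _relabel(sentence):
    """Give every token in one sentence the most common speaker label."""
    if not sentence:
        return []
    majority = Counter(speaker for _, speaker in sentence).most_common(1)[0][0]
    return [(content, majority) for content, _ in sentence]


def smooth_labels(tokens):
    """tokens: list of (content, speaker, is_sentence_end) tuples."""
    smoothed, sentence = [], []
    for content, speaker, is_sentence_end in tokens:
        sentence.append((content, speaker))
        if is_sentence_end:
            # Sentence boundaries come from punctuation, which is why this
            # kind of correction needs punctuation to be enabled.
            smoothed.extend(_relabel(sentence))
            sentence = []
    # Trailing words with no closing punctuation are treated as one sentence.
    smoothed.extend(_relabel(sentence))
    return smoothed


tokens = (
    [("word", "S1", False)] * 4
    + [("odd", "S2", False)]
    + [("word", "S1", False)] * 4
    + [(".", "S1", True)]
)
print(smooth_labels(tokens))
```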

If you disable punctuation by removing end-of-sentence punctuation through permitted_marks in the punctuation_overrides section, diarization will not work correctly.

Changing the punctuation sensitivity will also affect the accuracy of Speaker diarization.

Speaker diarization Timeout

Speaker diarization will time out if it takes too long to run for a particular audio file. Currently, the timeout is set to 5 minutes or 0.5 * the audio duration, whichever is longer. For example, with a 2 hour audio file, the timeout is 1 hour. If a timeout happens, the transcript will still be returned and all speaker labels in the output will be labelled as UU.
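
The timeout rule above can be written as a one-line calculation. The Python below is a sketch of that arithmetic only; the function name is hypothetical, and the values reflect the current timeout stated above, which may change.

```python
def diarization_timeout_seconds(audio_duration_seconds):
    """Return the longer of 5 minutes and half the audio duration, in seconds."""
    return max(5 * 60, 0.5 * audio_duration_seconds)


# A 2 hour file: half the duration (1 hour) exceeds 5 minutes.
print(diarization_timeout_seconds(2 * 60 * 60))
# A 1 minute file: the 5 minute floor applies.
print(diarization_timeout_seconds(60))
```

The two terms are equal for a 10 minute file, so the 5 minute floor applies to any audio shorter than that.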

Speaker Change (Legacy)

The legacy Speaker Change Detection feature was removed on 1st July 2024. From this point on, both the speaker_change and channel_and_speaker_change parameters are not supported. Our existing Speaker diarization feature provides superior accuracy for speaker change use cases, as well as additional speaker labelling functionality. Existing users should reach out to Support for API-related questions.

Channel diarization

Overview

Channel diarization enables each channel in multi-channel audio to be transcribed separately and collated into a single transcript. This provides perfect diarization at the channel level as well as better handling of cross-talk between channels. Using Channel diarization, files with up to 100 separate input channels are supported.

This is particularly useful for the Contact Centre use case, where audio is often recorded in stereo with separate channels for the agent and the caller.

To use this feature, set the diarization property to channel. You can optionally name the channels by using channel_diarization_labels in the configuration:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Agent", "Caller"]
  }
}

If you do not specify any labels, defaults will be used (e.g. Channel 1). The number of labels should match the number of channels in your audio. Additional labels are ignored. When the transcript is returned, a channel property on each word indicates which channel, and therefore which speaker, it came from. For example:

"results": [
  {
    "type": "word",
    "end_time": 1.8,
    "start_time": 1.45,
    "channel": "Agent",
    "alternatives": [
      {
        "language": "en",
        "content": "Hello",
        "confidence": 0.76
      }
    ]
  }
]
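
As a minimal sketch of reading this output, the Python below gathers the words from each channel into a single string per channel. The helper name words_by_channel is hypothetical, and the sample data is made up to match the shape of the example above.

```python
from collections import defaultdict


def words_by_channel(results):
    """Join the first alternative of each word result, grouped by channel."""
    channels = defaultdict(list)
    for result in results:
        if result.get("type") == "word" and result.get("alternatives"):
            channels[result["channel"]].append(result["alternatives"][0]["content"])
    return {channel: " ".join(words) for channel, words in channels.items()}


results = [
    {"type": "word", "channel": "Agent", "alternatives": [{"content": "Hello"}]},
    {"type": "word", "channel": "Caller", "alternatives": [{"content": "Hi"}]},
    {"type": "word", "channel": "Agent", "alternatives": [{"content": "there"}]},
]
print(words_by_channel(results))
```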
