
Diarization

Transcription: Batch, Real-Time | Deployments: All


Speechmatics offers two different modes for separating out different speakers in the audio:

| Type | Description | Use Case |
| --- | --- | --- |
| Speaker Diarization | Each speaker will be identified by their voice. | Used in cases where there are multiple speakers in the same audio recording. |
| Channel Diarization | Each audio channel will be transcribed separately. Available for batch transcription only. | Used when it's possible to record each speaker on a separate audio channel. |

By default, the transcript will not be diarized. For details on configuring Diarization, see the relevant section below.

Speaker Diarization

Transcription: Batch, Real-Time | Deployments: All

Overview

Speaker Diarization aggregates all audio channels into a single stream for processing, and picks out different speakers based on acoustic matching.

The feature is disabled by default. To enable Speaker Diarization, diarization must be set to speaker in the transcription config:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}

When enabled, every word and punctuation object in the output results will be given a "speaker" property, which is a label indicating who said that word. There are two kinds of labels you will see:

  • S# - S stands for speaker and the # will be an incrementing integer identifying an individual speaker. S1 will appear first in the results, followed by S2, S3, etc.
  • UU - Diarization is disabled or individual speakers cannot be identified. UU can appear for example if some background noise is transcribed as speech, but the diarization system does not recognise it as a speaker.

Considerations

  • Enabling diarization increases the time taken to transcribe an audio file. In general, we expect the use of Diarization to increase the overall processing time by 10-50%.
  • When transcribing in Real-Time, Partial transcripts will not include speaker information.

The example below shows relevant parts of a transcript with 2 speakers:

  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.93,
          "content": "hello",
          "language": "en",
          "speaker": "S1"
        }
      ],
      "end_time": 0.51,
      "start_time": 0.36,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "hi",
          "language": "en",
          "speaker": "S2"
        }
      ],
      "end_time": 12.6,
      "start_time": 12.27,
      "type": "word"
    }]
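
As a minimal sketch of how the "speaker" labels might be consumed, the Python below collapses word-level results into consecutive speaker turns. The function name speaker_turns is hypothetical, and the code assumes each result carries at least one alternative, as in the example above.

```python
def speaker_turns(results):
    """Collapse word-level results into consecutive (speaker, text) turns.

    Hypothetical helper: assumes each item has at least one alternative
    with "content" and (when diarization is enabled) a "speaker" label.
    """
    turns = []
    for item in results:
        best = item["alternatives"][0]
        # Fall back to UU when no speaker label is present.
        speaker = best.get("speaker", "UU")
        content = best["content"]
        if turns and turns[-1][0] == speaker:
            # Attach punctuation directly; separate words with a space.
            sep = "" if item.get("type") == "punctuation" else " "
            turns[-1] = (speaker, turns[-1][1] + sep + content)
        else:
            turns.append((speaker, content))
    return turns
```

Applied to the two-word transcript above, this yields one turn for S1 ("hello") followed by one turn for S2 ("hi").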

Speaker Sensitivity

Transcription: Batch | Deployments: All

For batch transcription, you can configure the sensitivity of speaker detection by using the speaker_sensitivity setting in the speaker_diarization_config section of the job config object as shown below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "speaker_sensitivity": 0.6
    }
  }
}

This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood that more unique speakers are returned.
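
A job config like the one above could be assembled and checked in Python before submission. This is only a sketch: batch_speaker_config is a hypothetical helper, and treating the 0 to 1 range as inclusive is an assumption, since the text does not say whether the bounds themselves are accepted.

```python
def batch_speaker_config(language="en", speaker_sensitivity=None):
    """Build a batch job config with Speaker Diarization enabled.

    Hypothetical helper. The inclusive 0..1 bounds are an assumption.
    """
    config = {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            "diarization": "speaker",
        },
    }
    if speaker_sensitivity is not None:
        if not 0 <= speaker_sensitivity <= 1:
            raise ValueError("speaker_sensitivity must be between 0 and 1")
        config["transcription_config"]["speaker_diarization_config"] = {
            "speaker_sensitivity": speaker_sensitivity
        }
    return config
```

Leaving speaker_sensitivity unset omits the speaker_diarization_config section, so the documented default of 0.5 applies.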

For Real-Time transcription, you can configure Max Speakers instead.

Max Speakers

Transcription: Real-Time | Deployments: All

For Real-Time Transcription, you can prevent too many speakers from being detected by using the max_speakers setting in the StartRecognition message as shown below:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 48000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "max_speakers": 10
    }
  }
}

The default value is 50, but it can take any integer value between 2 and 100 inclusive.
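
The range check can be applied client-side before the message is sent. The sketch below builds the StartRecognition payload as a Python dict; start_recognition is a hypothetical helper, and the audio format values are simply copied from the example above.

```python
def start_recognition(max_speakers=None, language="en"):
    """Build a StartRecognition message with Speaker Diarization enabled.

    Hypothetical helper; max_speakers must be an integer from 2 to 100
    inclusive, as stated in the text.
    """
    message = {
        "message": "StartRecognition",
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 48000,
        },
        "transcription_config": {
            "language": language,
            "diarization": "speaker",
        },
    }
    if max_speakers is not None:
        if not isinstance(max_speakers, int) or not 2 <= max_speakers <= 100:
            raise ValueError("max_speakers must be an integer from 2 to 100")
        message["transcription_config"]["speaker_diarization_config"] = {
            "max_speakers": max_speakers
        }
    return message
```

The resulting dict can be serialised with json.dumps before being sent; how the message is transmitted is outside the scope of this sketch.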

For batch transcription, you can configure the Speaker Sensitivity.

Speaker Diarization and Punctuation

To enhance the accuracy of our Speaker Diarization, we make small corrections to the speaker labels based on the punctuation in the transcript. For example, if our system originally thought that 9 words in a sentence were spoken by speaker S1, and only 1 word by speaker S2, we will correct the incongruous S2 label to be S1. This only works if punctuation is enabled in the transcript.

If you disable punctuation by removing end-of-sentence punctuation through permitted_marks in the punctuation_overrides section, diarization will not work correctly.

Changing the punctuation sensitivity will also affect the accuracy of Speaker Diarization.
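
To illustrate why this correction depends on punctuation, here is a simplified sketch of a sentence-level majority vote. It is not the actual Speechmatics algorithm, which is not described here; the function name smooth_labels, the set of end-of-sentence marks, and the plain majority rule are all assumptions made for illustration.

```python
from collections import Counter
from copy import deepcopy

# Assumed end-of-sentence marks; the real set is not given in the text.
END_OF_SENTENCE = {".", "?", "!"}


def smooth_labels(results):
    """Relabel each sentence with its most common speaker (illustration only)."""
    results = deepcopy(results)  # leave the caller's data untouched
    sentence = []

    def flush():
        words = [r for r in sentence if r["type"] == "word"]
        if words:
            counts = Counter(r["alternatives"][0]["speaker"] for r in words)
            majority = counts.most_common(1)[0][0]
            for r in sentence:
                r["alternatives"][0]["speaker"] = majority
        sentence.clear()

    for item in results:
        sentence.append(item)
        is_end = (
            item["type"] == "punctuation"
            and item["alternatives"][0]["content"] in END_OF_SENTENCE
        )
        if is_end:
            flush()
    flush()  # handle a trailing sentence with no closing mark
    return results
```

Without end-of-sentence marks, the whole transcript would be treated as a single sentence in this sketch, which shows why removing them undermines any correction of this kind.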

Speaker Diarization Timeout

Speaker Diarization will time out if it takes too long to run for a particular audio file. Currently, the timeout is set to 5 minutes or 0.5 * the audio duration, whichever is longer. For example, with a 2 hour audio file, the timeout is 1 hour. If a timeout happens, the transcript will still be returned and all speaker labels in the output will be UU.
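
The timeout rule can be written as a one-line calculation. The function name below is hypothetical, and durations are assumed to be in seconds.

```python
def diarization_timeout_seconds(audio_duration_seconds):
    """Return the longer of 5 minutes and half the audio duration, in seconds."""
    return max(5 * 60, 0.5 * audio_duration_seconds)
```

For a 2 hour file (7200 seconds) this gives 3600 seconds, matching the 1 hour figure above. Any file shorter than 10 minutes gets the 5 minute floor.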

Speaker Change (Legacy)

The legacy Speaker Change Detection feature is now deprecated and will be removed in all releases from 1st July 2024 onwards. From this point on, both the speaker_change and channel_and_speaker_change parameters will be unsupported. Our existing Speaker Diarization feature provides superior accuracy for speaker change use cases, as well as additional speaker labelling functionality. Existing users should reach out to Support for API-related questions.

Channel Diarization

Transcription: Batch | Deployments: All

Channel Diarization enables each channel in multi-channel audio to be transcribed separately and collated into a single transcript. This provides perfect diarization at the channel level as well as better handling of cross-talk between channels. Using Channel Diarization, files with up to 100 separate input channels are supported.

This is particularly useful for the Contact Centre use case, where audio is often recorded in stereo with separate channels for the agent and the caller.

To use this feature, set the diarization property to channel. You can optionally name the channels using channel_diarization_labels in the configuration:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Agent", "Caller"]
  }
}

If you do not specify any labels then defaults will be used (e.g. Channel 1). The number of labels you use should be the same as the number of channels in your audio. Additional labels are ignored. When the transcript is returned, a channel property on each word indicates the speaker, for example:

"results": [
  {
    "type": "word",
    "end_time": 1.8,
    "start_time": 1.45,
    "channel": "Agent",
    "alternatives": [
      {
        "language": "en",
        "content": "Hello",
        "confidence": 0.76
      }
    ]
  }
]
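
As a rough sketch of how the channel property might be used, the Python below gathers the words of a channel-diarized transcript into one string per channel. The helper name text_by_channel is hypothetical, and the code assumes every word result carries a channel property and at least one alternative, as in the example above.

```python
from collections import defaultdict


def text_by_channel(results):
    """Join the words spoken on each channel into a single string.

    Hypothetical helper: only items of type "word" are included.
    """
    channels = defaultdict(list)
    for item in results:
        if item["type"] == "word":
            channels[item["channel"]].append(item["alternatives"][0]["content"])
    return {channel: " ".join(words) for channel, words in channels.items()}
```

For the single-word example above, the result maps "Agent" to "Hello".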