Diarization
Transcription: Batch, Real-Time | Deployments: All

Speechmatics offers two different modes for separating out different speakers in the audio:
| Type | Description | Use Case |
|---|---|---|
| Speaker Diarization | Each speaker is identified by their voice. | Use when there are multiple speakers in the same audio recording. |
| Channel Diarization | Each audio channel is transcribed separately. Available for batch transcription only. | Use when it's possible to record each speaker on a separate audio channel. |
By default, the transcript will not be diarized. For details on configuring Diarization, please see the relevant page linked below.
Speaker Diarization
Transcription: Batch, Real-Time | Deployments: All

Overview
Speaker Diarization aggregates all audio channels into a single stream for processing, and picks out different speakers based on acoustic matching.
The feature is disabled by default. To enable Speaker Diarization, `diarization` must be set to `speaker` in the transcription config:
```json
{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}
```
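As a minimal sketch of how this config might be submitted as a batch job using Python's `requests` package: the endpoint URL, the `YOUR_API_KEY` placeholder, and the `example.wav` filename are illustrative assumptions, so substitute the details of your own deployment.

```python
import json

import requests  # third-party: pip install requests

# Assumed endpoint and credentials; substitute your own deployment's details.
API_URL = "https://asr.api.speechmatics.com/v2/jobs"
API_KEY = "YOUR_API_KEY"

config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
    },
}

# The batch API accepts the job config and the audio file as multipart form data.
with open("example.wav", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={
            "config": (None, json.dumps(config)),
            "data_file": audio,
        },
    )
response.raise_for_status()
print(response.json())  # details of the created job, including its id
```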
When enabled, every `word` and `punctuation` object in the output results will be given a "speaker" property, which is a label indicating who said that word. There are two kinds of labels you will see:

- `S#` - S stands for speaker and the # is an incrementing integer identifying an individual speaker. S1 will appear first in the results, followed by S2, S3, and so on.
- `UU` - Diarization is disabled or an individual speaker cannot be identified. `UU` can appear, for example, if some background noise is transcribed as speech but the diarization system does not recognise it as a speaker.
Considerations
- Enabling diarization increases the time taken to transcribe an audio file. In general, we expect it to increase the overall processing time by 10-50%.
- When transcribing in Real-Time, Partial transcripts will not include speaker information.
The example below shows relevant parts of a transcript with 2 speakers:
"results": [
{
"alternatives": [
{
"confidence": 0.93,
"content": "hello",
"language": "en",
"speaker": "S1"
}
],
"end_time": 0.51,
"start_time": 0.36,
"type": "word"
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "hi",
"language": "en",
"speaker": "S2"
}
],
"end_time": 12.6,
"start_time": 12.27,
"type": "word"
}]
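Because every word (and punctuation mark) carries a speaker label, collating the flat results list into speaker turns is a small post-processing step. A minimal sketch, assuming `results` has already been parsed from the returned JSON; the helper name is hypothetical:

```python
def speaker_turns(results):
    """Collate diarized words into (speaker, text) turns.

    Words labelled "UU" (no identifiable speaker) start their own
    turn, so they are easy to filter out later if desired.
    """
    turns = []
    for item in results:
        # Take the top alternative; punctuation objects carry a speaker label too.
        alt = item["alternatives"][0]
        speaker = alt.get("speaker", "UU")
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(alt["content"])
        else:
            turns.append((speaker, [alt["content"]]))
    return [(speaker, " ".join(words)) for speaker, words in turns]

# With the example above this yields [("S1", "hello"), ("S2", "hi")].
```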
Speaker Sensitivity
Transcription: Batch | Deployments: All

For batch transcription, you can configure the sensitivity of speaker detection using the `speaker_sensitivity` setting in the `speaker_diarization_config` section of the job config object, as shown below:
```json
{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "speaker_sensitivity": 0.6
    }
  }
}
```
This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood that more unique speakers are detected.
For Real-Time Transcription, you can configure the maximum number of speakers instead; see Max Speakers below.
Max Speakers
Transcription: Real-Time | Deployments: All

For Real-Time Transcription, you can prevent too many speakers from being detected by using the `max_speakers` setting in the `StartRecognition` message, as shown below:
```json
{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 48000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced",
    "diarization": "speaker",
    "speaker_diarization_config": {
      "max_speakers": 10
    }
  }
}
```
The default value is 50, but it can take any integer value between 2 and 100 inclusive.
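As a minimal sketch of opening a Real-Time session and sending this `StartRecognition` message with the third-party `websockets` package (version 14+): the endpoint URL and bearer-token header are illustrative assumptions, and the follow-up audio streaming and `EndOfStream` handling are omitted.

```python
import asyncio
import json

import websockets  # third-party: pip install websockets (v14+)

# Assumed endpoint and token; substitute your own deployment's details.
RT_URL = "wss://eu2.rt.speechmatics.com/v2"
API_KEY = "YOUR_API_KEY"

START_RECOGNITION = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_f32le", "sample_rate": 48000},
    "transcription_config": {
        "language": "en",
        "operating_point": "enhanced",
        "diarization": "speaker",
        "speaker_diarization_config": {"max_speakers": 10},
    },
}

async def main():
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with websockets.connect(RT_URL, additional_headers=headers) as ws:
        await ws.send(json.dumps(START_RECOGNITION))
        # The server should answer with a RecognitionStarted message.
        print(await ws.recv())
        # ...then stream audio and finally send an EndOfStream message.

asyncio.run(main())
```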
For batch transcription, you can configure Speaker Sensitivity instead; see above.
Speaker Diarization and Punctuation
To enhance the accuracy of our Speaker Diarization, we make small corrections to the speaker labels based on the punctuation in the transcript. For example, if our system originally thought that 9 words in a sentence were spoken by speaker S1, and only 1 word by speaker S2, we will correct the incongruous S2 label to be S1. This only works if punctuation is enabled in the transcript.
If you disable punctuation by removing end-of-sentence punctuation through `permitted_marks` in the `punctuation_overrides` section, then diarization will not work correctly.
Changing the punctuation sensitivity will also affect the accuracy of Speaker Diarization.
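The correction described above can be pictured as a per-sentence majority vote over speaker labels. The sketch below is only an illustration of that idea, not the actual implementation:

```python
END_OF_SENTENCE = {".", "!", "?"}

def smooth_speaker_labels(words):
    """Relabel minority speakers within each sentence to the majority speaker.

    `words` is a list of (content, speaker) pairs in transcript order, with
    end-of-sentence punctuation included as its own entry.
    """
    smoothed, sentence = [], []
    for content, speaker in words:
        sentence.append((content, speaker))
        if content in END_OF_SENTENCE:
            speakers = [s for _, s in sentence]
            majority = max(set(speakers), key=speakers.count)
            smoothed.extend((c, majority) for c, _ in sentence)
            sentence = []
    smoothed.extend(sentence)  # any trailing words after the last sentence end
    return smoothed

# Nine words by S1 and one stray S2 in the same sentence all come back as S1.
```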
Speaker Diarization Timeout
Speaker Diarization will time out if it takes too long to run for a particular audio file. Currently, the timeout is set to 5 minutes or 0.5 times the audio duration, whichever is longer. For example, for a 2 hour audio file the timeout is 1 hour. If a timeout happens, the transcript will still be returned, but all speaker labels in the output will be UU.
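Expressed as a formula, the timeout is the larger of the two quantities; a hypothetical helper makes the arithmetic concrete:

```python
def diarization_timeout_seconds(audio_duration_seconds: float) -> float:
    """The timeout is the longer of 5 minutes or half the audio duration."""
    return max(5 * 60, 0.5 * audio_duration_seconds)

# For a 2 hour file: max(300, 0.5 * 7200) = 3600 seconds, i.e. 1 hour.
```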
Speaker Change (Legacy)
The legacy Speaker Change Detection feature was removed on 1st July 2024. From this point on, the `speaker_change` and `channel_and_speaker_change` parameters are no longer supported. Our existing Speaker Diarization feature provides superior accuracy for speaker change use cases, as well as additional speaker labelling functionality. Existing users should reach out to Support for API-related questions.
Channel Diarization
Transcription: Batch | Deployments: All

Channel Diarization enables each channel in multi-channel audio to be transcribed separately and collated into a single transcript. This provides perfect diarization at the channel level, as well as better handling of cross-talk between channels. Files with up to 100 separate input channels are supported.
This is particularly useful for the Contact Centre use case, where audio is often recorded in stereo with separate channels for the agent and the caller.
To use this feature, set the `diarization` property to `channel`. You can optionally name these channels using `channel_diarization_labels` in the configuration:
```json
{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Agent", "Caller"]
  }
}
```
If you do not specify any labels, defaults will be used (e.g. Channel 1). The number of labels should match the number of channels in your audio; additional labels are ignored. When the transcript is returned, a `channel` property on each word indicates the speaker, for example:
"results": [
{
"type": "word",
"end_time": 1.8,
"start_time": 1.45,
"channel": "Agent",
"alternatives": [
{
"language": "en",
"content": "Hello",
"confidence": 0.76
}
]
}
]
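Since each word carries its `channel` label, splitting the collated transcript back out per channel is a simple grouping step. A minimal sketch over the documented output shape; the function name is hypothetical:

```python
from collections import defaultdict

def transcript_by_channel(results):
    """Group word contents by their channel label, preserving word order."""
    channels = defaultdict(list)
    for item in results:
        channels[item["channel"]].append(item["alternatives"][0]["content"])
    return {channel: " ".join(words) for channel, words in channels.items()}

# With the example above this yields {"Agent": "Hello"}.
```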