Realtime diarization
Learn how to use the Speechmatics API to separate speakers in real-time. To learn more about diarization as a feature, check out the diarization page.
Overview
Real-time diarization offers the following ways to separate speakers in audio:
- Speaker diarization — Identifies each speaker by their voice. Useful when there are multiple speakers in the same audio stream.
- Channel diarization — Transcribes each audio channel separately. Useful when each speaker is recorded on their own channel.
- Channel & speaker diarization — Combines both methods. Each channel is transcribed separately, with unique speakers identified within each channel. Useful when multiple speakers are present across multiple channels.
Speaker diarization
Speaker diarization picks out different speakers from the audio stream based on acoustic matching.
To enable speaker diarization, diarization must be set to speaker in the transcription config:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
}
}
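For illustration, here's a minimal Python sketch that opens a session and sends this configuration in a StartRecognition message. It uses the third-party websockets library; the endpoint URL and API-key handling are assumptions for the example, not prescribed values.

import asyncio
import json

import websockets

async def start_session():
    # Assumed endpoint and auth header; substitute your own values.
    # Note: newer websockets versions rename extra_headers to additional_headers.
    url = "wss://eu2.rt.speechmatics.com/v2"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}
    async with websockets.connect(url, extra_headers=headers) as ws:
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {
                "type": "raw",
                "encoding": "pcm_f32le",
                "sample_rate": 48000
            },
            "transcription_config": {
                "language": "en",
                "diarization": "speaker"
            }
        }))
        # The server replies with RecognitionStarted on success.
        print(json.loads(await ws.recv()))

asyncio.run(start_session())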
When diarization is enabled, each word and punctuation object in the transcript includes a speaker property that identifies who spoke it. There are two types of labels:
- S# – S stands for speaker, and # is a sequential number identifying each speaker. S1 appears first in the results, followed by S2, S3, and so on.
- UU – Used when the speaker cannot be identified or diarization is not applied, for example, if background noise is transcribed as speech but no speaker can be determined.
"results": [
{
"alternatives": [
{
"confidence": 0.93,
"content": "hello",
"language": "en",
"speaker": "S1"
}
],
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "hi",
"language": "en",
"speaker": "S2"
}
],
}]
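As an illustration of consuming these labels, the following hypothetical Python sketch groups consecutive words into speaker turns; the results list is assumed to have been parsed from an AddTranscript message.

def speaker_turns(results):
    # Merge consecutive words with the same speaker label into turns.
    turns = []  # list of (speaker, text) pairs
    for result in results:
        best = result["alternatives"][0]
        speaker, word = best["speaker"], best["content"]
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + word)
        else:
            turns.append((speaker, word))
    return turns

results = [
    {"alternatives": [{"content": "hello", "speaker": "S1"}]},
    {"alternatives": [{"content": "hi", "speaker": "S2"}]},
]
print(speaker_turns(results))  # [('S1', 'hello'), ('S2', 'hi')]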
Channel diarization
This feature is coming soon!
Subscribe to our release notes to be notified when it's available.
Channel diarization processes audio with multiple channels and returns a separate transcript for each one. This gives you perfect speaker separation at the channel level and more accurate handling of cross-talk.
To enable channel diarization, diarization must be set to channel and labels for each channel provided in channel_diarization_labels in the transcription config of the StartRecognition message:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "channel",
"channel_diarization_labels": ["New_York", "Shanghai", "Paris"]
}
}
You should see a channels field in the RecognitionStarted message which lists all the channels you requested:
{
"message": "RecognitionStarted",
...
"channels": ["New_York", "Shanghai", "Paris"]
}
Send audio to a channel
To send audio for a specific channel, you can use the AddChannelAudio message. You'll need to encode the data in base64 format:
{
"message": "AddChannelAudio",
"channel": "New_York",
"data": <base_64_encoded_data>
}
You should get an acknowledgement in the form of a ChannelAudioAdded message from the server, with a corresponding sequence number for the channel:
{
"message": "ChannelAudioAdded",
"channel": "New_York",
"seq_no": <10>
}
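Putting the two messages together, here's a hedged Python sketch of the sending side; ws is assumed to be an already-open WebSocket connection on which StartRecognition (with channel diarization) has been sent.

import base64
import json

async def send_channel_chunk(ws, channel, pcm_bytes):
    # Base64-encode the raw audio bytes and target a named channel.
    await ws.send(json.dumps({
        "message": "AddChannelAudio",
        "channel": channel,
        "data": base64.b64encode(pcm_bytes).decode("ascii")
    }))
    # A real client should dispatch on the "message" field, since
    # transcripts can arrive interleaved with acks; this sketch assumes
    # the next message is the ChannelAudioAdded acknowledgement.
    ack = json.loads(await ws.recv())
    if ack.get("message") == "ChannelAudioAdded":
        return ack["seq_no"]
    return None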
Transcript response
Transcripts are returned independently for each channel, with the channel property identifying the channel.
{
  "message": "AddTranscript",
  "channel": "New_York",
  ...
  "results": [
    {
      "type": "word",
      "start_time": 1.45,
      "end_time": 1.8,
      "alternatives": [{
        "language": "en",
        "content": "Hello,",
        "confidence": 0.98
      }]
    }
  ]
}
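One way to consume these responses is to keep a separate buffer per channel, as in this small illustrative sketch:

from collections import defaultdict

transcripts = defaultdict(list)  # channel label -> list of words

def handle_transcript(message):
    # Route a parsed AddTranscript payload into its channel's buffer.
    if message.get("message") != "AddTranscript":
        return
    channel = message.get("channel", "default")
    for result in message.get("results", []):
        transcripts[channel].append(result["alternatives"][0]["content"])

handle_transcript({
    "message": "AddTranscript",
    "channel": "New_York",
    "results": [{"alternatives": [{"content": "Hello,"}]}],
})
print(dict(transcripts))  # {'New_York': ['Hello,']}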
Channel and speaker diarization
Channel and speaker diarization combines speaker diarization and channel diarization, splitting transcripts per channel whilst also separating individual speakers in each channel.
To enable this mode, follow the steps in speaker diarization and set the diarization mode to channel_and_speaker.
To send audio to a specific channel, follow the instructions in send audio to a channel.
Transcripts are returned in the same way as channel diarization, but with individual speakers identified:
{
  "message": "AddTranscript",
  "channel": "New_York",
  "results": [
    {
      "alternatives": [{
        "content": "Hello",
        "confidence": 0.98,
        "speaker": "S1"
      }]
    },
    ...
    {
      "alternatives": [{
        "content": "Hi",
        "confidence": 0.98,
        "speaker": "S2"
      }]
    }
  ]
}
Limits
For SaaS customers, the maximum number of channels is 2.
For On-prem Container customers, the maximum number of channels depends on your Multi-session container's maximum number of connections.
Configuration
You can customize diarization to match your use case by adjusting settings for sensitivity, limiting the maximum number of speakers, preferring the current speaker to reduce false switches, and controlling how punctuation influences accuracy.
Speaker sensitivity
You can configure the sensitivity of speaker detection by using the speaker_sensitivity setting in the speaker_diarization_config section of the transcription config, as shown below:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker",
"speaker_diarization_config": {
"speaker_sensitivity": 0.6
}
}
}
This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood of more unique speakers being detected.
Prefer Current Speaker
You can reduce the likelihood of incorrectly switching between similar-sounding speakers by setting the prefer_current_speaker flag in the speaker_diarization_config:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker",
"speaker_diarization_config": {
"prefer_current_speaker": true
}
}
}
By default this is false. When set to true, the system will stay with the speaker of the previous word if they closely match the speaker of the new word.
This may result in some shorter speaker turn changes between similar speakers being missed.
Max. Speakers
You can prevent too many speakers from being detected by using the max_speakers setting in the StartRecognition message as shown below:
{
"message": "StartRecognition",
"audio_format": {
"type": "raw",
"encoding": "pcm_f32le",
"sample_rate": 48000
},
"transcription_config": {
"language": "en",
"operating_point": "enhanced",
"diarization": "speaker",
"speaker_diarization_config": {
"max_speakers": 10
}
}
}
The default value is 50, but it can take any integer value between 2 and 100 inclusive.
Punctuation
Speaker diarization uses punctuation to improve accuracy. Small corrections are applied to speaker labels based on sentence boundaries.
For example, if the system initially assigns 9 words in a sentence to S1 and 1 word to S2, the lone S2 word may be corrected to S1.
This adjustment only works when punctuation is enabled. Disabling punctuation via the permitted_marks setting in punctuation_overrides can reduce diarization accuracy.
Adjusting punctuation sensitivity can also affect how accurately speakers are identified.
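For illustration, a config that restricts punctuation might look like the sketch below; the particular marks chosen are an assumption for the example.

import json

config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "punctuation_overrides": {
            # Fewer permitted marks means fewer sentence boundaries for
            # the speaker-label correction described above.
            "permitted_marks": [".", ","]
        }
    }
}
print(json.dumps(config, indent=2))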
Speaker diarization timeout
Speaker diarization will time out if it takes too long to run for a particular audio file. Currently, the timeout is set to 5 minutes or 0.5 * the audio duration, whichever is longer. For example, with a 2 hour audio file, the timeout is 1 hour. If a timeout happens, the transcript will still be returned and all speaker labels in the output will be labelled as UU.
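In other words, the timeout is the larger of 5 minutes and half the audio duration. A small sketch of that rule:

def diarization_timeout_seconds(audio_duration_seconds):
    # The longer of 5 minutes or half the audio duration, in seconds.
    return max(5 * 60, 0.5 * audio_duration_seconds)

print(diarization_timeout_seconds(2 * 60 * 60))  # 2h audio -> 3600.0 (1 hour)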
Speaker change (legacy)
The Speaker Change Detection feature was removed in July 2024. The speaker_change
and channel_and_speaker_change
parameters are no longer supported. Use the Speaker diarization feature for speaker labeling.
For API-related questions, contact support.
On-prem
To run channel
or channel_and_speaker
diarization with an on-prem deployment, configure your environment as follows:
- Use a GPU Speech-to-Text container. Handling multiple audio streams is computationally intensive and benefits from GPU acceleration.
- Set the SM_MAX_CONCURRENT_CONNECTIONS environment variable to match the number of channels you want to process.
For more details on container setup, see the on-prem deployment docs.