End of Turn Detection

Learn how Speechmatics detects end of utterances

To improve the user experience in responsive real-time scenarios, it is important to know when a person has finished speaking. This is especially important for voice AI, translation, and dictation use cases. Detecting an 'End of Turn' can be used to trigger actions such as generating a response in a Voice AI agent.

To get started, check out the Configuration Example below.

Use Cases

Voice AI & Conversational Systems: Enable voice assistants and chatbots to detect when the user has finished speaking, allowing the system to respond promptly without awkward delays.

Real-time Translation: Critical for live interpretation services where translations need to be delivered as soon as the speaker completes their thought, maintaining the flow of conversation.

Dictation & Transcription: Helps dictation software determine when users have completed their input, improving speed of final transcription and user experience.

End of Utterance Configuration

Speechmatics' Speech-To-Text allows you to use a period of silence to determine when a user has finished speaking. This is known as End of Utterance detection and is one way to detect End of Turn.

To enable End of Utterance detection, include the following in the StartRecognition message:

{
  "type": "transcription",
  "transcription_config": {
    "conversation_config": {
      "end_of_utterance_silence_trigger": 0.5
    },
    "language": "en"
  }
}
  • end_of_utterance_silence_trigger (Number): Allowed between 0 and 2 seconds; setting it to 0 disables detection. This is the number of seconds of non-speech (silence) to wait before an End of Utterance is identified. When this happens, Speechmatics will send a Final transcript message, followed by an extra EndOfUtterance message.
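
If you are configuring the session through the Speechmatics Python SDK rather than raw JSON, the same setting is expressed with the ConversationConfig model, as in the full example at the end of this page:

import speechmatics

# 0.5 s of trailing silence triggers a Final transcript followed by an
# EndOfUtterance message
conversation_config = speechmatics.models.ConversationConfig(
    end_of_utterance_silence_trigger=0.5
)

conf = speechmatics.models.TranscriptionConfig(
    language="en",
    conversation_config=conversation_config,
)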

Notes

  • We recommend 0.5-0.8 seconds for most voice AI applications. Longer values (0.8-1.2s) may be better for dictation applications.
  • Keep the end_of_utterance_silence_trigger lower than the max_delay value (see the sketch after these notes).
  • EndOfUtterance messages are only sent after some speech is recognised, and duplicate EndOfUtterance messages will never be sent for the same period of silence.
  • The EndOfUtterance message is not related to any specific individual identified by Diarization and will not contain speaker information.
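
As a quick sketch of the max_delay constraint above, the configuration below keeps the 0.75 s silence trigger safely below a 1 s max_delay:

# end_of_utterance_silence_trigger (0.75 s) stays below max_delay (1 s),
# per the note above
conf = speechmatics.models.TranscriptionConfig(
    language="en",
    max_delay=1.0,
    conversation_config=speechmatics.models.ConversationConfig(
        end_of_utterance_silence_trigger=0.75
    ),
)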

Example End of Utterance Message

{
  "message": "EndOfUtterance",
  "format": "2.9",
  "metadata": {
    "start_time": 1.07,
    "end_time": 1.07
  }
}
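
A common voice AI pattern - sketched here with hypothetical handler names, not part of the API - is to buffer Final transcripts and hand the complete utterance to your agent when EndOfUtterance arrives:

utterance_parts = []

def handle_final_transcript(msg):
    # Final transcripts carry the confirmed text for each segment
    utterance_parts.append(msg["metadata"]["transcript"])

def handle_end_of_utterance(msg):
    # The user has stopped speaking: flush the buffered utterance
    full_utterance = " ".join(utterance_parts).strip()
    utterance_parts.clear()
    if full_utterance:
        print(f"User said: {full_utterance}")
        # hand off to your agent / LLM here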

Semantic End of Turn

While silence-based End of Utterance detection is enough for many use cases, it is often improved by combining it with the context of the conversation. This is known as 'Semantic End of Turn Detection'. You can try Semantic End of Turn right away with our free Flow service demo!

Semantic End of Turn is included in Flow out of the box to provide the best experience for your users. You can also check out our Semantic End-of-Turn detection "how to" guide for more details on how to implement it in your own application.
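
Purely as an illustration of the concept - not Flow's implementation - a silence-triggered EndOfUtterance can be gated by a semantic completeness check on the transcript so far. is_complete_thought below is a hypothetical placeholder:

def is_complete_thought(transcript: str) -> bool:
    # Hypothetical placeholder: a real system might prompt a fast LLM or a
    # lightweight classifier; here we just look for sentence-final punctuation
    return transcript.rstrip().endswith((".", "?", "!"))

def on_end_of_utterance(transcript: str) -> None:
    if is_complete_thought(transcript):
        print(f"Turn complete - respond to: {transcript}")
    else:
        print("Pause mid-thought - keep listening...")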

Code Examples

Real-time streaming from a microphone - ideal for voice AI applications.

import asyncio

import pyaudio
import speechmatics

API_KEY = "YOUR_API_KEY"
LANGUAGE = "en"
CONNECTION_URL = "wss://eu2.rt.speechmatics.com/v2"

# Audio recording parameters
SAMPLE_RATE = 16000
CHUNK_SIZE = 1024
FORMAT = pyaudio.paFloat32


class AudioProcessor:
    """Buffers microphone audio and exposes the async read() interface
    that the Speechmatics client streams from."""

    def __init__(self):
        self.wave_data = bytearray()
        self.read_offset = 0

    async def read(self, chunk_size):
        # Wait until the PyAudio callback has buffered enough audio
        while self.read_offset + chunk_size > len(self.wave_data):
            await asyncio.sleep(0.001)

        new_offset = self.read_offset + chunk_size
        data = self.wave_data[self.read_offset : new_offset]
        self.read_offset = new_offset
        return data

    def write_audio(self, data):
        self.wave_data.extend(data)


class VoiceAITranscriber:
    def __init__(self):
        self.ws = speechmatics.client.WebsocketClient(
            speechmatics.models.ConnectionSettings(
                url=CONNECTION_URL,
                auth_token=API_KEY,
            )
        )
        self.audio = pyaudio.PyAudio()
        self.stream = None
        self.audio_processor = AudioProcessor()

        # Set up event handlers
        self.ws.add_event_handler(
            event_name=speechmatics.models.ServerMessageType.AddPartialTranscript,
            event_handler=self.handle_partial_transcript,
        )

        self.ws.add_event_handler(
            event_name=speechmatics.models.ServerMessageType.AddTranscript,
            event_handler=self.handle_final_transcript,
        )

        self.ws.add_event_handler(
            event_name=speechmatics.models.ServerMessageType.EndOfUtterance,
            event_handler=self.handle_end_of_utterance,
        )

    def handle_partial_transcript(self, msg):
        transcript = msg["metadata"]["transcript"]
        print(f"[Listening...] {transcript}")

    def handle_final_transcript(self, msg):
        transcript = msg["metadata"]["transcript"]
        print(f"[Complete] {transcript}")

    def handle_end_of_utterance(self, msg):
        print("🔚 End of utterance detected - ready for AI response!")
        # This is where your voice AI would process the complete utterance
        # and generate a response

    def stream_callback(self, in_data, frame_count, time_info, status):
        self.audio_processor.write_audio(in_data)
        return in_data, pyaudio.paContinue

    def start_streaming(self):
        try:
            # Set up the PyAudio stream; the callback feeds the audio buffer
            self.stream = self.audio.open(
                format=FORMAT,
                channels=1,
                rate=SAMPLE_RATE,
                input=True,
                frames_per_buffer=CHUNK_SIZE,
                stream_callback=self.stream_callback,
            )

            # Audio settings must match the PyAudio stream above
            settings = speechmatics.models.AudioSettings()
            settings.encoding = "pcm_f32le"
            settings.sample_rate = SAMPLE_RATE
            settings.chunk_size = CHUNK_SIZE

            # Configure transcription with End of Utterance detection
            conversation_config = speechmatics.models.ConversationConfig(
                end_of_utterance_silence_trigger=0.75  # Adjust as needed
            )

            conf = speechmatics.models.TranscriptionConfig(
                operating_point="enhanced",
                language=LANGUAGE,
                enable_partials=True,
                max_delay=1,
                conversation_config=conversation_config,
            )

            print("🎤 Voice AI ready - start speaking!")
            print("Press Ctrl+C to stop...")

            # Start transcription; this blocks until the session ends
            self.ws.run_synchronously(
                transcription_config=conf,
                stream=self.audio_processor,
                audio_settings=settings,
            )

        except KeyboardInterrupt:
            print("\n🛑 Stopping voice AI transcriber...")
        except Exception as e:
            print(f"Error in transcription: {e}")
        finally:
            self.stop_streaming()

    def stop_streaming(self):
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()


# Usage
if __name__ == "__main__":
    transcriber = VoiceAITranscriber()
    transcriber.start_streaming()