# LiveKit speech to text

Use the Speechmatics STT plugin to transcribe live audio in your LiveKit voice agents.

## Features
- Real-time transcription — Low-latency streaming with partial (interim) results
- Turn detection — Adaptive, fixed, ML-based, or external control modes
- Speaker diarization — Identify and attribute speech to different speakers
- Speaker filtering — Focus on specific speakers or ignore others (like the assistant)
- Custom vocabulary — Boost recognition for domain-specific terms and proper nouns
- Output formatting — Configurable templates for multi-speaker transcripts
## Installation

```shell
uv add "livekit-agents[speechmatics]~=1.4"
```
## Basic configuration

### Authentication

By default, the plugin reads your API key from the `SPEECHMATICS_API_KEY` environment variable.
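For local development you can export the key in your shell before starting the agent, or set it from Python before constructing the session. A minimal sketch (the literal key below is a placeholder; some LiveKit plugins also accept an explicit `api_key` argument, but treat that parameter name as an assumption and check the plugin reference):

```python
import os

# The plugin looks the key up in the environment by default.
# setdefault avoids clobbering a key that is already configured.
os.environ.setdefault("SPEECHMATICS_API_KEY", "your-api-key")
```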
### Service options

### Example
```python
from livekit.agents import AgentSession
from livekit.plugins import speechmatics

session = AgentSession(
    stt=speechmatics.STT(
        language="en",
        output_locale="en-GB",
    ),
    # ... llm, tts, vad, etc.
)
```
## Advanced configuration

### Turn detection

The Speechmatics STT plugin uses the Speechmatics Voice SDK for endpointing and turn detection.

Turn detection determines when a user has finished their complete thought, while the Realtime API's `EndOfUtterance` message indicates a pause in speech. The plugin handles this distinction automatically.
#### Modes

Set `turn_detection_mode` to control how end of speech is detected:
```python
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import TurnDetectionMode

# Adaptive mode (default) - adjusts to speech patterns
# Requires: pip install speechmatics-voice[smart]
stt = speechmatics.STT(
    turn_detection_mode=TurnDetectionMode.ADAPTIVE,
)

# Fixed mode - consistent silence threshold
stt = speechmatics.STT(
    turn_detection_mode=TurnDetectionMode.FIXED,
    end_of_utterance_silence_trigger=0.8,  # 800 ms of silence
    end_of_utterance_max_delay=5.0,  # Force end after 5 s
)

# Smart turn mode - ML-based natural turn-taking
# Requires: pip install speechmatics-voice[smart]
stt = speechmatics.STT(
    turn_detection_mode=TurnDetectionMode.SMART_TURN,
)

# External mode - manual control via finalize()
stt = speechmatics.STT(
    turn_detection_mode=TurnDetectionMode.EXTERNAL,
)
```
#### Manual turn finalization

When using `TurnDetectionMode.EXTERNAL`, you control when a turn ends by calling `finalize()` on the STT instance. This is useful when you have your own VAD or want to integrate with external signals.
```python
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import TurnDetectionMode

stt = speechmatics.STT(
    turn_detection_mode=TurnDetectionMode.EXTERNAL,
)

# Later, when you detect the user has finished speaking:
stt.finalize()
```
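In external mode the endpointing decision is entirely yours. The usual pattern is to finalize after a period of silence reported by your own VAD. A minimal, deterministic sketch of that pattern (`SilenceEndpointer` is a hypothetical helper, not part of the plugin; in a real agent the callback would be `stt.finalize`):

```python
class SilenceEndpointer:
    """Fire a finalize callback once no speech has been seen for `silence_s` seconds.

    Hypothetical helper for TurnDetectionMode.EXTERNAL: feed it timestamps
    from your own VAD and pass `stt.finalize` as the callback.
    """

    def __init__(self, finalize_cb, silence_s: float = 0.8):
        self.finalize_cb = finalize_cb
        self.silence_s = silence_s
        self.last_speech = 0.0
        self.finalized = True  # nothing to finalize until speech is seen

    def on_speech(self, t: float) -> None:
        # VAD reported speech at time t: the turn is open (again).
        self.last_speech = t
        self.finalized = False

    def on_tick(self, t: float) -> None:
        # Periodic timer: end the turn after enough silence, exactly once.
        if not self.finalized and t - self.last_speech >= self.silence_s:
            self.finalized = True
            self.finalize_cb()


turns = []
ep = SilenceEndpointer(lambda: turns.append("end"), silence_s=0.8)
ep.on_speech(0.0)
ep.on_tick(0.5)  # only 0.5 s of silence: turn still open
ep.on_tick(0.9)  # 0.9 s of silence: callback fires once
```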
### Advanced diarization

The plugin can attribute words to speakers and lets you decide which speakers are treated as active (foreground) vs passive (background).

#### Configuration
```python
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import AdditionalVocabEntry

stt = speechmatics.STT(
    enable_diarization=True,
    speaker_sensitivity=0.7,
    max_speakers=3,
    prefer_current_speaker=True,
    additional_vocab=[
        AdditionalVocabEntry(content="Speechmatics"),
        AdditionalVocabEntry(content="API", sounds_like=["A P I"]),
    ],
)
```
#### Known speakers

Use `known_speakers` to attribute words to specific speakers across sessions. This is useful when you want consistent speaker identification for known participants.
```python
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import SpeakerIdentifier

stt = speechmatics.STT(
    enable_diarization=True,
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["speaker_abc123"]),
        SpeakerIdentifier(label="Bob", speaker_identifiers=["speaker_def456"]),
    ],
)
```
Speaker identifiers are unique to each Speechmatics account and can be obtained from a previous transcription session.
#### Speaker focus

Control which speakers are treated as active (foreground) vs passive (background):

- Active speakers are the speakers you care about in your application. They generate `FINAL_TRANSCRIPT` events.
- Passive speakers are still transcribed, but their words are buffered and only included in the output alongside new words from active speakers.
#### Focus modes

- `SpeakerFocusMode.RETAIN` keeps non-focused speakers as passive.
- `SpeakerFocusMode.IGNORE` discards non-focused speaker words entirely.
`ignore_speakers` always excludes the listed speakers from transcription; their speech will not trigger VAD or end-of-utterance detection.

By default, any speaker label wrapped in double underscores (for example `__ASSISTANT__`) is automatically excluded. This convention lets you filter out assistant audio without explicitly adding it to `ignore_speakers`.
```python
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import SpeakerFocusMode

stt = speechmatics.STT(
    focus_speakers=["S1"],
    focus_mode=SpeakerFocusMode.RETAIN,
    ignore_speakers=["S3"],
)
```
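The double-underscore convention mentioned above amounts to a simple label check, which is worth mirroring when you generate participant labels yourself. An illustrative predicate (the plugin applies this rule internally; `is_auto_ignored` is a hypothetical name):

```python
def is_auto_ignored(label: str) -> bool:
    """True for labels wrapped in double underscores, e.g. "__ASSISTANT__"."""
    return len(label) > 4 and label.startswith("__") and label.endswith("__")


print(is_auto_ignored("__ASSISTANT__"))  # True
print(is_auto_ignored("S1"))             # False
```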
#### Speaker formatting

Use `speaker_active_format` and `speaker_passive_format` to format transcripts for your LLM. The templates support the `{speaker_id}` and `{text}` placeholders.
```python
from livekit.plugins import speechmatics

stt = speechmatics.STT(
    speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
    speaker_passive_format="<{speaker_id} background>{text}</{speaker_id}>",
)
```
When you use a custom format, include it in your agent instructions so the LLM can interpret speaker tags consistently.
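Since the placeholders follow `str.format` syntax, you can preview what the LLM will see when writing your instructions (illustrative only; the plugin renders segments internally):

```python
active_format = "<{speaker_id}>{text}</{speaker_id}>"
passive_format = "<{speaker_id} background>{text}</{speaker_id}>"

active_line = active_format.format(speaker_id="S1", text="Hello there.")
passive_line = passive_format.format(speaker_id="S2", text="(chatter)")

print(active_line)   # <S1>Hello there.</S1>
print(passive_line)  # <S2 background>(chatter)</S2>
```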
#### Updating speakers during transcription

You can dynamically change which speakers to focus on or ignore during an active transcription session using the `update_speakers()` method.
```python
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import SpeakerFocusMode

stt = speechmatics.STT(enable_diarization=True)

# Later, during transcription:
stt.update_speakers(
    focus_speakers=["S1", "S2"],
    ignore_speakers=["S3"],
    focus_mode=SpeakerFocusMode.RETAIN,
)
```
This is useful when you need to adjust speaker filtering based on runtime conditions, such as when a new participant joins or leaves a conversation.
## Example

```python
from livekit.agents import AgentSession
from livekit.plugins import speechmatics
from livekit.plugins.speechmatics import (
    AdditionalVocabEntry,
    OperatingPoint,
    SpeakerFocusMode,
    TurnDetectionMode,
)

stt = speechmatics.STT(
    # Service options
    language="en",
    output_locale="en-US",
    operating_point=OperatingPoint.ENHANCED,
    # Turn detection
    turn_detection_mode=TurnDetectionMode.ADAPTIVE,
    max_delay=1.5,
    include_partials=True,
    # Diarization
    enable_diarization=True,
    speaker_sensitivity=0.6,
    max_speakers=4,
    prefer_current_speaker=True,
    # Speaker focus
    focus_speakers=["S1", "S2"],
    focus_mode=SpeakerFocusMode.RETAIN,
    ignore_speakers=["__ASSISTANT__"],
    # Output formatting
    speaker_active_format="[{speaker_id}]: {text}",
    speaker_passive_format="[{speaker_id} (background)]: {text}",
    # Custom vocabulary
    additional_vocab=[
        AdditionalVocabEntry(content="Speechmatics"),
        AdditionalVocabEntry(content="LiveKit", sounds_like=["live kit", "livekit"]),
    ],
)

session = AgentSession(
    stt=stt,
    # ... llm, tts, vad, etc.
)
```
## Next steps
- Quickstart — Build a complete voice agent
- Text to speech — Use Speechmatics voices in your agent