Pipecat speech to text

Transcribe live audio in your Pipecat voice bots with Speechmatics STT.

Features

  • Real-time transcription — Low-latency streaming with partial (interim) results
  • Turn detection — Adaptive, fixed, ML-based, or external control modes
  • Speaker diarization — Identify and attribute speech to different speakers
  • Speaker filtering — Focus on specific speakers or ignore others (like the assistant)
  • Custom vocabulary — Boost recognition for domain-specific terms and proper nouns
  • Output formatting — Configurable templates for multi-speaker transcripts

Installation

pip install "pipecat-ai[speechmatics]"

Basic configuration

Authentication

By default, the service reads your API key from the SPEECHMATICS_API_KEY environment variable.

Service options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | string | env var | Speechmatics API key (defaults to SPEECHMATICS_API_KEY) |
| base_url | string | env var | Realtime base URL (defaults to SPEECHMATICS_RT_URL, or wss://eu2.rt.speechmatics.com/v2) |
| sample_rate | number | pipeline default | Audio sample rate in Hz |
| should_interrupt | boolean | true | Enable interruption on detected speech |

Input parameters

These are passed via params=SpeechmaticsSTTService.InputParams(...):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | Language \| string | Language.EN | Language code for transcription |
| domain | string \| null | null | Domain-specific model (for example "finance") |
| operating_point | OperatingPoint \| null | null | Transcription accuracy. Use OperatingPoint.ENHANCED (higher accuracy) or OperatingPoint.STANDARD (lower latency) |
| audio_encoding | AudioEncoding | PCM_S16LE | Audio encoding format: AudioEncoding.PCM_S16LE, AudioEncoding.PCM_F32LE, or AudioEncoding.MULAW |
| punctuation_overrides | object \| null | null | Custom punctuation rules |
| extra_params | object \| null | null | Additional parameters to pass to the API |

Example

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        language="en",
        operating_point=SpeechmaticsSTTService.OperatingPoint.ENHANCED,
    ),
)
```

Advanced configuration

Turn detection

Turn detection determines when a user has finished their complete thought, while the Realtime API's EndOfUtterance message indicates a pause in speech. The service handles this distinction automatically.

Modes

Set turn_detection_mode to control how end of speech is detected:

| Mode | When to use |
| --- | --- |
| TurnDetectionMode.EXTERNAL | Default and recommended. Delegates turn detection to Pipecat's pipeline (VAD, Smart Turn, etc.). Try this first |
| TurnDetectionMode.ADAPTIVE | Speechmatics analyzes speech content and acoustic patterns for end-of-turn detection |
| TurnDetectionMode.FIXED | Fixed silence threshold using end_of_utterance_silence_trigger |
| TurnDetectionMode.SMART_TURN | Speechmatics Smart Turn for ML-based turn detection |

Start with EXTERNAL mode. This lets you use Pipecat's turn detection features (like LocalSmartTurnAnalyzerV3) which are well-integrated with the pipeline. Only switch to other modes if you need Speechmatics to handle turn detection directly.

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

# External mode (default, recommended) - use Pipecat's turn detection
stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.EXTERNAL,
    ),
)

# Adaptive mode - Speechmatics determines end-of-turn
stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
    ),
)

# Fixed mode - consistent silence threshold
stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.FIXED,
        end_of_utterance_silence_trigger=0.8,  # 800ms of silence
        end_of_utterance_max_delay=5.0,  # Force end after 5s
    ),
)

# Smart turn mode - Speechmatics ML-based turn detection
stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.SMART_TURN,
    ),
)
```

When using ADAPTIVE or SMART_TURN modes, remove any competing VAD or turn-detection features from your pipeline to avoid conflicts.

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| end_of_utterance_silence_trigger | number \| null | null | Silence duration (seconds) that triggers end of utterance. Used primarily in FIXED mode. Valid range: >0 to <2 seconds (exclusive) |
| end_of_utterance_max_delay | number \| null | null | Maximum delay (seconds) before forcing an end of utterance. Must be greater than end_of_utterance_silence_trigger |
| max_delay | number \| null | null | Maximum transcription delay (seconds). Lower values reduce latency at the cost of accuracy. Valid range: 0.7–4.0 seconds |
| include_partials | boolean \| null | null | Enable partial (interim) transcription results |
| split_sentences | boolean \| null | null | Split transcription into sentences |

Advanced diarization

The service can attribute words to speakers and lets you decide which speakers are treated as active (foreground) vs passive (background).

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_diarization | boolean \| null | null | Enable speaker diarization |
| speaker_sensitivity | number \| null | null | Speaker detection sensitivity. Valid range: >0.0 to <1.0 (exclusive) |
| max_speakers | number \| null | null | Maximum number of speakers to detect. Valid range: 2–100 |
| prefer_current_speaker | boolean \| null | null | Reduce speaker switching for similar voices |
| known_speakers | array \| null | null | Pre-define speaker identifiers with labels (SpeakerIdentifier objects) |
| additional_vocab | array \| null | null | Custom vocabulary entries (AdditionalVocabEntry objects) for improved recognition |

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        enable_diarization=True,
        speaker_sensitivity=0.7,
        max_speakers=3,
        prefer_current_speaker=True,
        additional_vocab=[
            SpeechmaticsSTTService.AdditionalVocabEntry(content="Speechmatics"),
            SpeechmaticsSTTService.AdditionalVocabEntry(content="API", sounds_like=["A P I"]),
        ],
    ),
)
```

Known speakers

Use known_speakers to attribute words to specific speakers across sessions. This is useful when you want consistent speaker identification for known participants.

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        enable_diarization=True,
        known_speakers=[
            SpeechmaticsSTTService.SpeakerIdentifier(label="Alice", speaker_identifiers=["speaker_abc123"]),
            SpeechmaticsSTTService.SpeakerIdentifier(label="Bob", speaker_identifiers=["speaker_def456"]),
        ],
    ),
)
```

Speaker identifiers are unique to each Speechmatics account and can be obtained from a previous transcription session.

Speaker focus

Control which speakers are treated as active (foreground) vs passive (background):

  • Active speakers are the speakers you care about in your application. They generate FINAL_TRANSCRIPT events.
  • Passive speakers are still transcribed, but their words are buffered and only included in the output alongside new words from active speakers.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| focus_speakers | array | [] | Speaker IDs to treat as active |
| ignore_speakers | array | [] | Speaker IDs to exclude entirely |
| focus_mode | SpeakerFocusMode | RETAIN | How to handle non-focused speakers |

Focus modes

  • SpeakerFocusMode.RETAIN keeps non-focused speakers as passive.
  • SpeakerFocusMode.IGNORE discards non-focused speaker words entirely.

ignore_speakers always excludes those speakers from transcription, and their speech will not trigger VAD or end-of-utterance detection.
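The active/passive behaviour described above can be illustrated with a toy model (purely illustrative; the service's actual implementation differs):

```python
def merge_words(words, focus_speakers, ignore_speakers):
    """Toy model of RETAIN focus mode: words from ignored speakers are
    dropped; passive words are buffered and only emitted together with
    the next word from an active (focused) speaker."""
    buffered, output = [], []
    for speaker, text in words:
        if speaker in ignore_speakers:
            continue  # excluded entirely, never transcribed
        if speaker in focus_speakers:
            output.extend(buffered)  # flush passive words alongside active speech
            buffered = []
            output.append((speaker, text))
        else:
            buffered.append((speaker, text))  # passive: held back for now
    return output

words = [("S2", "uh-huh"), ("S1", "hello"), ("S3", "noise"), ("S1", "world")]
print(merge_words(words, focus_speakers={"S1"}, ignore_speakers={"S3"}))
# [('S2', 'uh-huh'), ('S1', 'hello'), ('S1', 'world')]
```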

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        focus_speakers=["S1"],
        focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
        ignore_speakers=["S3"],
    ),
)
```

Speaker formatting

Use speaker_active_format and speaker_passive_format to format transcripts for your LLM. The templates support {speaker_id}, {text}, {ts}, {start_time}, {end_time}, and {lang}.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| speaker_active_format | string \| null | null | Format template for active speaker output |
| speaker_passive_format | string \| null | null | Format template for passive speaker output |

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
        speaker_passive_format="<{speaker_id} background>{text}</{speaker_id}>",
    ),
)
```

When you use a custom format, include it in your bot's system prompt so the LLM can interpret speaker tags consistently.
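To see what the LLM will receive, you can substitute the placeholder names listed above with plain Python string formatting, and pair the result with a matching prompt fragment (the prompt wording below is an example only, not prescribed by the service):

```python
active_format = "<{speaker_id}>{text}</{speaker_id}>"
passive_format = "<{speaker_id} background>{text}</{speaker_id}>"

# What a formatted active-speaker line looks like:
line = active_format.format(speaker_id="S1", text="Hello there!")
print(line)  # <S1>Hello there!</S1>

# A system-prompt fragment that explains the tags to the LLM:
system_prompt = (
    "Transcribed speech is wrapped in speaker tags such as <S1>...</S1>. "
    "Tags marked 'background' are passive speakers; respond only to the "
    "active speaker."
)
```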

Updating speakers during transcription

You can dynamically change which speakers to focus on or ignore during an active transcription session using the update_params() method.

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(enable_diarization=True),
)

# Later, during transcription:
stt.update_params(
    SpeechmaticsSTTService.UpdateParams(
        focus_speakers=["S1", "S2"],
        ignore_speakers=["S3"],
        focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
    )
)
```

This is useful when you need to adjust speaker filtering based on runtime conditions, such as when a new participant joins or leaves a conversation.

Example

```python
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService

stt = SpeechmaticsSTTService(
    params=SpeechmaticsSTTService.InputParams(
        # Service options
        language="en",
        operating_point=SpeechmaticsSTTService.OperatingPoint.ENHANCED,

        # Turn detection
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.EXTERNAL,
        max_delay=1.5,
        include_partials=True,

        # Diarization
        enable_diarization=True,
        speaker_sensitivity=0.6,
        max_speakers=4,
        prefer_current_speaker=True,

        # Speaker focus
        focus_speakers=["S1", "S2"],
        focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
        ignore_speakers=[],

        # Output formatting
        speaker_active_format="[{speaker_id}]: {text}",
        speaker_passive_format="[{speaker_id} (background)]: {text}",

        # Custom vocabulary
        additional_vocab=[
            SpeechmaticsSTTService.AdditionalVocabEntry(content="Speechmatics"),
            SpeechmaticsSTTService.AdditionalVocabEntry(content="Pipecat", sounds_like=["pipe cat"]),
        ],
    ),
)
```

Next steps