Voice Agent API
- The Voice Agent API is a preview offering and should not be used for live production traffic. The system will be less stable than our production endpoints and features may change.
- There are no uptime or performance SLAs.
- There are no data residency guarantees. Data processing may occur in both US and EU regions.
- Preview features may be cancelled at any time or never be released publicly.
Introduction
The Voice Agent API is a WebSocket API for building voice agents. Stream audio in and receive speaker-labelled, turn-based transcription back — clean, punctuated, and ready to pass directly to an LLM.
Turn detection runs server-side. Choose a profile based on your use case and the API handles when to finalise each speaker's turn.
Looking for code examples? See working examples in Speechmatics Academy for Python and JavaScript.
Profiles
Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, include it in your endpoint URL, and the server handles the rest.
adaptive
Endpoint: /v2/agent/adaptive
Adapts to each speaker's pace over the course of a conversation. It adjusts the turn-end threshold based on speech rate and disfluencies (e.g. hesitations, filler words), waiting longer for speakers who tend to pause mid-thought.
Best for: General conversational voice agents.
Languages: All supported languages. Disfluency detection is English-only — other languages fall back to speech-rate adaptation.
Trade-off: Latency varies by speaker.
agile
Endpoint: /v2/agent/agile
Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile.
Best for: Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable.
Languages: All supported languages.
Trade-off: Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls.
smart
Endpoint: /v2/agent/smart
Builds on adaptive with an additional ML model that analyses acoustic cues to predict whether a speaker has genuinely finished their turn. The most conservative profile — least likely to interrupt.
Best for: High-stakes conversations where cutting off the user is costly — finance, healthcare, legal.
Languages: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese.
Trade-off: Higher latency than adaptive.
external
Endpoint: /v2/agent/external
Turn detection is fully manual. The server accumulates audio and transcript until you send a ForceEndOfUtterance message, at which point it finalises everything spoken up to that point and emits an AddSegment.
Best for: Push-to-talk interfaces, custom VAD pipelines, or setups where an LLM decides when to respond.
Languages: All supported languages.
Trade-off: You are responsible for all turn detection logic.
Session Flow
Every session follows the same structure: connect, start recognition, stream audio, receive turn events, close.
SessionMetrics is emitted every 5 seconds independently of turn boundaries.
For a full reference of all messages, see Messages Overview.
Getting Started
1. Connect
Open a WebSocket connection to the preview endpoint, specifying the profile in the URL:
wss://preview.rt.speechmatics.com/v2/agent/<profile>
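The profile slots directly into the URL path. A small illustrative helper (not part of any SDK) for building it; the commented connection snippet assumes the third-party websockets package and a bearer-token Authorization header, which this section doesn't prescribe (see Authentication):

```python
def agent_url(profile: str) -> str:
    """Build the preview endpoint URL for one of the four profiles."""
    assert profile in {"adaptive", "agile", "smart", "external"}
    return f"wss://preview.rt.speechmatics.com/v2/agent/{profile}"

# With the third-party `websockets` package (version 14+; older releases
# use extra_headers instead of additional_headers), connecting looks
# roughly like:
#
#   import os, websockets
#   headers = {"Authorization": f"Bearer {os.environ['SPEECHMATICS_API_KEY']}"}
#   async with websockets.connect(agent_url("adaptive"),
#                                 additional_headers=headers) as ws:
#       ...  # send StartRecognition next
```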
2. Authenticate
Authenticate every connection. See Authentication for the supported methods, including temporary keys.
3. Start the session
Send StartRecognition as your first message:
{
"message": "StartRecognition",
"transcription_config": {
"language": "en"
}
}
For all configuration options, see Configuration.
The server responds with RecognitionStarted when the session is ready. You should wait for this message before sending audio.
4. Stream audio and handle responses
Send audio as binary WebSocket frames. Turn events will arrive in real time as the API processes speech — see Session Flow for the full message sequence.
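Putting steps 1 to 4 together, a minimal session might look like the sketch below. The start_recognition helper is illustrative; the run coroutine assumes the third-party websockets package (version 14+, where the keyword is additional_headers; older releases call it extra_headers):

```python
import json

def start_recognition(language: str = "en", sample_rate: int = 16000) -> str:
    """Build the StartRecognition payload sent as the first message."""
    return json.dumps({
        "message": "StartRecognition",
        "audio_format": {"type": "raw", "encoding": "pcm_s16le",
                         "sample_rate": sample_rate},
        "transcription_config": {"language": language},
    })

async def run(url: str, headers: dict, audio_chunks):
    """One full session: start, stream, end, then read events until done."""
    import websockets  # third-party; pip install websockets
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(start_recognition())
        # Wait for RecognitionStarted before sending any audio.
        while json.loads(await ws.recv()).get("message") != "RecognitionStarted":
            pass
        seq = 0
        for chunk in audio_chunks:  # binary frames of raw pcm_s16le audio
            await ws.send(chunk)
            seq += 1
        await ws.send(json.dumps({"message": "EndOfStream", "last_seq_no": seq}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["message"] == "AddSegment":        # final turn transcript
                print(" ".join(s["text"] for s in msg["segments"]))
            elif msg["message"] == "EndOfTranscript":  # server is done
                break

# Usage (illustrative):
#   asyncio.run(run(
#       "wss://preview.rt.speechmatics.com/v2/agent/adaptive",
#       {"Authorization": f"Bearer {os.environ['SPEECHMATICS_API_KEY']}"},
#       chunks_of_pcm_audio,
#   ))
```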
Configuration
Configuration is passed in StartRecognition and is split across two levels of the payload: audio_format (top-level) and transcription_config.
audio_format
Only pcm_s16le at 8000 or 16000 Hz is supported. Other encodings (e.g. pcm_f32le, mulaw) and sample rates (e.g. 44100) may be silently accepted by the API but will not produce correct output.
Example: {"type":"raw","encoding":"pcm_s16le","sample_rate":16000}
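If your capture pipeline produces float samples, convert them before sending. An illustrative stdlib-only converter (not part of any SDK):

```python
import struct

def to_pcm_s16le(samples):
    """Pack float samples in [-1.0, 1.0] into little-endian 16-bit PCM bytes,
    the only encoding the API supports."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to avoid integer overflow
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)
```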
transcription_config
transcription_config.speaker_diarization_config
Note: The following options require diarization to be set to speaker.
Not supported — will be rejected if present
Messages Overview
All messages exchanged during a Voice Agent API session. For payload details, see the API Reference sections.
Client → Server
Server → Client
Core turn events — the messages your agent logic acts on
Turn prediction — early signals you can use to prepare a response
Speech and speaker activity
Session lifecycle
Metrics and diagnostics
Shared messages with the RT API — see the RT API Reference for full payload details.
API Reference - Client Messages
StartRecognition
The first message you send after connecting. Starts the recognition session and passes configuration.
The server responds with RecognitionStarted.
{
"message": "StartRecognition",
"audio_format": {
"type": "raw",
"encoding": "pcm_s16le",
"sample_rate": 16000
},
"transcription_config": {
"language": "en"
}
}
For all configuration options, see Configuration.
EndOfStream
Send when you have finished streaming audio. The server finalises any remaining transcript and then emits EndOfTranscript.
last_seq_no is the sequence number of the last audio frame you sent.
{
"message": "EndOfStream",
"last_seq_no": 1234
}
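If you count every binary audio frame as you send it, last_seq_no falls out naturally. A small illustrative generator (not part of the API) that yields the frames followed by the matching EndOfStream:

```python
import json

def audio_frames_then_eos(chunks):
    """Yield each binary audio frame, counting them, then the EndOfStream
    message with last_seq_no set to the number of frames sent."""
    seq = 0
    for chunk in chunks:
        seq += 1
        yield chunk
    yield json.dumps({"message": "EndOfStream", "last_seq_no": seq})
```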
ForceEndOfUtterance
Only applies to the external profile. Immediately ends the current turn — the server finalises all audio received so far and emits a single AddSegment containing the complete transcript for that turn, followed by EndOfTurn.
Use this wherever your application decides a turn is complete: on button release (push-to-talk), on VAD silence, or on an LLM signal.
{
"message": "ForceEndOfUtterance"
}
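For example, a push-to-talk handler might send the message when the user releases the talk button. The handler name and ws interface below are illustrative:

```python
import json

FORCE_EOU = json.dumps({"message": "ForceEndOfUtterance"})

async def on_talk_button_release(ws):
    """Push-to-talk on the external profile: the user released the button,
    so finalise everything spoken since the last turn ended."""
    await ws.send(FORCE_EOU)
    # The server replies with one AddSegment for the whole turn, then EndOfTurn.
```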
UpdateSpeakerFocus
Updates which speakers are in focus, mid-session. Takes effect immediately. See Speaker Focus for full details.
{
"message": "UpdateSpeakerFocus",
"speaker_focus": {
"focus_speakers": ["S1"],
"ignore_speakers": [],
"focus_mode": "retain"
}
}
GetSpeakers
Requests voice identifiers for all speakers diarized so far in the session. The server responds with a SpeakersResult message. See Speaker ID for full details.
{
"message": "GetSpeakers"
}
API Reference - Server Messages
This section covers Voice Agent API-specific messages only. For shared messages (RecognitionStarted, AudioAdded, AddPartialTranscript, AddTranscript, EndOfUtterance, EndOfTranscript, Info, Warning, Error), see the RT API reference.
StartOfTurn
Emitted when a speaker begins a new turn. Use this to signal to your agent that it should stop speaking if it currently is.
{
"message": "StartOfTurn",
"turn_id": 42
}
Fields:
- turn_id — monotonically increasing integer; pairs with the corresponding EndOfTurn
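A common use is barge-in handling: stop your agent's TTS the moment the user starts talking. A small illustrative helper:

```python
import json

def should_interrupt(raw_message: str, agent_is_speaking: bool) -> bool:
    """Barge-in check: stop the agent's TTS when the user starts a new turn."""
    msg = json.loads(raw_message)
    return bool(agent_is_speaking and msg.get("message") == "StartOfTurn")

# Example dispatch (tts is a hypothetical TTS handle):
#   if should_interrupt(raw, tts.playing):
#       tts.stop()
```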
EndOfTurn
Emitted when turn detection decides the speaker has finished. This is the trigger for your agent to respond. The finalised transcript for the turn is in the preceding AddSegment.
{
"message": "EndOfTurn",
"turn_id": 42,
"metadata": {
"start_time": 0.84,
"end_time": 3.24
}
}
Fields:
- turn_id — matches the StartOfTurn for this turn
- metadata.start_time / metadata.end_time — audio time range for the turn, in seconds from session start
AddPartialSegment
Interim transcript update emitted continuously while the speaker is talking. Each new AddPartialSegment replaces the previous one — do not concatenate them.
{
"message": "AddPartialSegment",
"segments": [
{
"speaker_id": "S1",
"is_active": true,
"timestamp": "2025-01-01T12:00:00.000+00:00",
"language": "en",
"text": "Good evening",
"is_eou": false,
"metadata": {
"start_time": 0.84,
"end_time": 1.24
}
}
],
"metadata": {
"start_time": 0.84,
"end_time": 1.24,
"processing_time": 0.23
}
}
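A minimal handler illustrating the replace-not-concatenate rule (the helper is illustrative, not part of any SDK):

```python
import json

def latest_partial(raw_message: str, current: str) -> str:
    """Interim display text: each AddPartialSegment replaces, never extends,
    the previous one. Other message types leave the text unchanged."""
    msg = json.loads(raw_message)
    if msg.get("message") != "AddPartialSegment":
        return current
    return " ".join(seg["text"] for seg in msg["segments"])
```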
AddSegment
The final, complete transcript for a turn. Emitted just before EndOfTurn. This is the stable output to pass to your LLM — do not use AddPartialSegment for this.
In multi-speaker scenarios, a single AddSegment may contain segments from multiple speakers, returned in time order.
{
"message": "AddSegment",
"segments": [
{
"speaker_id": "S1",
"is_active": true,
"timestamp": "2025-01-01T12:00:00.000+00:00",
"language": "en",
"text": "Good evening.",
"is_eou": true,
"metadata": {
"start_time": 0.84,
"end_time": 1.56
}
}
],
"metadata": {
"start_time": 0.84,
"end_time": 1.56,
"processing_time": 0.25
}
}
Segment fields:
- speaker_id — speaker label (e.g. S1, S2, or a custom label if using Speaker ID)
- is_active — true if this speaker is in your current focus list; false if they are a background speaker (see Speaker Focus)
- is_eou — true on final segments, false on partials
- text — clean, punctuated transcript text
- metadata.start_time / metadata.end_time — time range of this segment in seconds from session start
Message-level fields:
- metadata.processing_time — transcription latency in seconds for this message
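For example, an illustrative helper (not part of any SDK) that turns a final AddSegment into speaker-labelled lines for an LLM prompt, dropping background speakers:

```python
import json

def segment_to_prompt(raw_message: str) -> str:
    """Format a final AddSegment as speaker-labelled lines for an LLM prompt.
    Background speakers (is_active == False) are skipped."""
    msg = json.loads(raw_message)
    lines = [
        f"{seg['speaker_id']}: {seg['text']}"
        for seg in msg["segments"]
        if seg["is_active"]
    ]
    return "\n".join(lines)
```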
SpeakerStarted / SpeakerEnded
Emitted when a specific speaker starts or stops being heard. These are voice activity events — they fire based on detected speech, independently of turn boundaries.
{
"message": "SpeakerStarted",
"speaker_id": "S1",
"is_active": true,
"time": 0.84,
"metadata": { "start_time": 0.84, "end_time": 0.84 }
}
{
"message": "SpeakerEnded",
"speaker_id": "S1",
"is_active": true,
"time": 3.24,
"metadata": { "start_time": 0.84, "end_time": 3.24 }
}
Fields:
- speaker_id — the speaker whose activity changed
- is_active — whether this speaker is in your current focus list
- time — seconds from session start when the activity was detected
- metadata.start_time — when this speaker started their current speaking interval
- metadata.end_time — when this speaker stopped speaking (SpeakerEnded only)
SessionMetrics
Emitted every 5 seconds and once at the end of the session.
{
"message": "SessionMetrics",
"total_time": 4.6,
"total_time_str": "00:00:04",
"total_bytes": 148480,
"processing_time": 0.295
}
SpeakerMetrics
Emitted each time a speaker produces a recognised word.
{
"message": "SpeakerMetrics",
"speakers": [
{
"speaker_id": "S1",
"word_count": 6,
"last_heard": 2.36,
"volume": 5.2
}
]
}
SpeakersResult
Emitted in response to GetSpeakers. Contains voice identifiers for all diarized speakers so far. See Speaker ID for how to store and use these.
{
"message": "SpeakersResult",
"speakers": [
{ "label": "S1", "speaker_identifiers": ["<id1>"] },
{ "label": "S2", "speaker_identifiers": ["<id2>"] }
]
}
EndOfTurnPrediction
Emitted by adaptive and smart profiles when the model predicts the current turn is about to end. Can be used to begin preparing a response before EndOfTurn arrives, reducing perceived latency.
{
"message": "EndOfTurnPrediction",
"turn_id": 2,
"predicted_wait": 0.73,
"metadata": {
"ttl": 0.73,
"reasons": ["not__ends_with_eos"]
}
}
Fields:
- turn_id — the turn this prediction applies to
- predicted_wait — estimated seconds until the turn ends
- metadata.ttl — time to live; how long this prediction remains valid
- metadata.reasons — internal signals that contributed to the prediction
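For example, an illustrative helper that decides when to start generating a reply early; the 1.0-second threshold is a tuning choice, not an API parameter:

```python
import json

def should_prewarm(raw_message: str, threshold: float = 1.0) -> bool:
    """True when an EndOfTurnPrediction expects the turn to end within
    `threshold` seconds, a reasonable moment to start preparing the
    response before EndOfTurn actually arrives."""
    msg = json.loads(raw_message)
    return (msg.get("message") == "EndOfTurnPrediction"
            and msg["predicted_wait"] <= threshold)
```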
SmartTurnResult
This message is currently emitted as SmartTurnResult during preview. It will be renamed to SmartTurnPrediction at GA.
Emitted by the smart profile only. A higher-confidence acoustic prediction of turn completion, based on the ML model that analyses vocal cues.
{
"message": "SmartTurnResult",
"prediction": {
"prediction": true,
"probability": 0.979,
"processing_time": 0.128
},
"metadata": {
"start_time": 0.0,
"end_time": 2.2,
"language": "en",
"speaker_id": "S1",
"total_time": 2.2
}
}
Fields:
- prediction.prediction — true if the model predicts the turn is complete
- prediction.probability — confidence score (0–1)
- prediction.processing_time — time taken by the ML model in seconds
- metadata.start_time / metadata.end_time — audio window analysed
- metadata.total_time — total session time at point of prediction
- metadata.speaker_id — speaker being analysed (null if not yet identified)
SpeechStarted / SpeechEnded
Voice activity detection events. Emitted when speech is first detected in the audio stream (SpeechStarted) or stops (SpeechEnded). These fire independently of speaker identity and turn boundaries.
{
"message": "SpeechStarted",
"probability": 0.508,
"transition_duration_ms": 192.0,
"metadata": {
"start_time": 2.1,
"end_time": 2.1
}
}
{
"message": "SpeechEnded",
"probability": 0.307,
"transition_duration_ms": 192.0,
"metadata": {
"start_time": 0.4,
"end_time": 2.5
}
}
Fields:
- probability — VAD confidence score (0–1)
- transition_duration_ms — duration of the speech/silence transition in milliseconds
- metadata.start_time — when speech began (SpeechStarted: same as end_time; SpeechEnded: when the speaking interval started)
- metadata.end_time — when the event was detected
Features
Speaker Focus
Speaker focus lets you control which speakers' output your agent acts on. By default, all detected speakers are active and their transcripts are included in AddSegment output.
Speaker IDs (S1, S2, etc.) are assigned automatically when diarization is enabled, and persist for the lifetime of the session. Send UpdateSpeakerFocus at any point during the session to change who is in focus — the new config takes effect immediately and replaces the previous one.
{
"message": "UpdateSpeakerFocus",
"speaker_focus": {
"focus_speakers": ["S1"],
"ignore_speakers": ["S3"],
"focus_mode": "retain"
}
}
Fields:
- focus_speakers — speaker IDs to treat as active. Their segments appear with is_active: true.
- ignore_speakers — speaker IDs to exclude entirely. Their speech is dropped and does not affect turn detection.
- focus_mode — what happens to speakers who are neither in focus_speakers nor ignore_speakers:
  - retain — they remain in the output as passive speakers (is_active: false)
  - ignore — they are excluded from the output entirely
Speaker ID
Speaker ID lets you recognise the same person across separate sessions. At the end of a session, you can retrieve voice identifiers for each speaker and store them. In future sessions, pass those identifiers into StartRecognition and the system will tag matching speakers with a consistent label rather than a generic S1, S2.
Getting identifiers
Send GetSpeakers at any point during a session to retrieve identifiers for all diarized speakers so far. The server responds with a SpeakersResult message.
Store the speaker_identifiers values from the response. These are opaque tokens tied to a speaker's voice profile — treat them as credentials and store them securely.
Using identifiers in future sessions
Pass stored identifiers into StartRecognition via transcription_config.known_speakers. You can assign any label:
{
"message": "StartRecognition",
"transcription_config": {
"language": "en",
"known_speakers": [
{ "label": "Alice", "speaker_identifiers": ["<alice_id>"] },
{ "label": "Bob", "speaker_identifiers": ["<bob_id>"] }
]
}
}
When those speakers are detected, their segments will carry "Alice" or "Bob" as the speaker_id instead of generic labels. Any unrecognised speakers are still assigned generic labels (S1, S2, etc.).
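End to end, the round trip is: call GetSpeakers, persist each speaker_identifiers list, then rebuild known_speakers for the next session. An illustrative helper (not part of any SDK) for the last step:

```python
def known_speakers_config(speakers_result: dict, labels: dict) -> list:
    """Convert a stored SpeakersResult payload into the known_speakers list
    for a future session's StartRecognition. `labels` maps the old session's
    labels (e.g. "S1") to the names you want (e.g. "Alice"); unmapped
    speakers keep their original label."""
    return [
        {
            "label": labels.get(s["label"], s["label"]),
            "speaker_identifiers": s["speaker_identifiers"],
        }
        for s in speakers_result["speakers"]
    ]
```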
Code Examples
For working code examples in Python and JavaScript, see the Speechmatics Academy.
Feedback
This is a preview, and your feedback shapes what goes to GA (General Availability). We'd love to hear from you — tell us what works well, which features you use, whether something didn't work as expected, whether a profile behaved differently than you anticipated, or what you'd want before we ship this more broadly.
Specific areas of interest:
- Integration experience (documentation, SDKs, API messages/metadata)
- Accuracy and latency (including data capture where relevant, e.g. phone numbers or spelled-out names and account numbers)
- Turn detection and experience with different profiles
- Any missing capabilities which would make your product better
- What would stop you using this in production
We'd love to get on a call with you to discuss your feedback, or you can fill in this form. You can also reach us via your Speechmatics contact or the channel shared in your preview welcome email.