API reference
Deployments: SaaS
Status: Early Access
A Flow conversation has a live audio input stream and produces a live audio output stream, along with the transcript of what is spoken by both parties.
The conversation is driven in real-time by a customer application that handles both audio streams and receives the transcripts over a WebSocket protocol known as the Conversation API.
Various control and data messages can be exchanged over the WebSocket to support capabilities such as conversation recording and moderation.
Limits
Flow is currently in Early Access, meaning that API behaviour and limits may change at any time. It is only production ready for selected development partners.
Usage is limited to:
- 1 concurrent stream
- 20 mins max session duration
- 50 hours monthly usage
To learn more about how you can bring the power of Flow to your product, speak to sales.
WebSocket Handshake
Handshake Request
We recommend establishing a WebSocket connection directly between the client device/browser and the Speechmatics server in order to minimise latency for the user.
To do so securely, first generate a Temporary Key which can be used to start any number of Flow Conversations for 60 seconds:
curl -L -X POST "https://mp.speechmatics.com/v1/api_keys?type=flow" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{"ttl": 60 }'
Next, provide the temporary key as the jwt query parameter of the WebSocket URL when opening the connection, before sending the StartConversation message:
wss://flow.api.speechmatics.com/v1/flow?jwt=<temporary-key>
When implementing your WebSocket client, we recommend using a ping/pong timeout of at least 60 seconds and a ping interval of 20 to 60 seconds. More details about ping/pong messages can be found in the WebSocket RFC.
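As a minimal sketch of the two recommendations above, the helper below builds the connection URL from a temporary key and collects ping settings within the suggested ranges. The names `build_flow_url` and `PING_SETTINGS` are illustrative and not part of the API. The keyword names follow the third-party `websockets` package, which is an assumption about the client library in use.

```python
from urllib.parse import urlencode

# Endpoint taken from the handshake URL shown above.
FLOW_WS_ENDPOINT = "wss://flow.api.speechmatics.com/v1/flow"

# Values inside the recommended ranges: ping every 20 s, allow 60 s for the pong.
# With the `websockets` package these could be passed as
# websockets.connect(url, **PING_SETTINGS); other clients use other names.
PING_SETTINGS = {"ping_interval": 20, "ping_timeout": 60}


def build_flow_url(temporary_key: str) -> str:
    """Return the Flow WebSocket URL carrying the temporary key as `jwt`."""
    if not temporary_key:
        raise ValueError("temporary_key must be a non-empty string")
    return f"{FLOW_WS_ENDPOINT}?{urlencode({'jwt': temporary_key})}"
```

The key is URL-encoded so that any reserved characters in it cannot corrupt the query string.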
Overview
A basic Flow session will have the following message exchanges:
'→' Indicates messages sent by the client
'←' Indicates messages sent by the service
Once-only at conversation start:
→ StartConversation
← ConversationStarted
Repeating during the conversation to cover the audio stream from the client and corresponding transcripts:
→ AddAudio (implicit name, means all inbound binary messages)
← AudioAdded (confirming ingestion of each binary audio message into the ASR)
← AddTranscript / AddPartialTranscript: the transcription of the user's voice
← ResponseStarted: the response from the LLM, sent immediately before the TTS audio playback begins
→ AudioReceived (confirming receipt of each binary audio message)
← ResponseCompleted (containing textual content of the utterance just spoken)
Function Calling - Exchanged during function calling over the WebSocket:
← ToolInvoke: sent by the service when a particular function needs to be called, including defined parameters
→ ToolResult: sent by the client, containing the ok, failed or rejected response to the function call
Once-only at conversation end:
→ AudioEnded when client is ending the session; indicates audio input has finished so transcription should be finalized
← ConversationEnding if agent is ending the session; indicates audio input transcription has stopped but response playback might not be finished. Audio input should end (with or without AudioEnded), but will be ignored regardless.
← ConversationEnded to indicate no further information will be sent and connection will be closed
Info (Details), Warning (Details) and Error (Details) messages will be sent as appropriate.
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
- message (String): The name of the message being sent.
- Any other fields depend on the value of message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message containing a chunk of audio, which will be referred to as AddAudio. Binary messages are sent from the Client to the Server for audio input and from the Server to the Client for audio output.
The server may also send other undocumented message types, but these are subject to change and should be ignored.
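The handling rules above can be sketched as a small dispatcher: binary frames are treated as audio, text frames are parsed as JSON and routed by their message field, and unknown message types are ignored. The function name `dispatch` and the handler registry are illustrative, not part of the API.

```python
import json
from typing import Callable, Dict, Union

Handler = Callable[[dict], None]


def dispatch(frame: Union[str, bytes],
             handlers: Dict[str, Handler],
             on_audio: Callable[[bytes], None]) -> bool:
    """Route one WebSocket frame; return True if a handler consumed it."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames carry raw audio only (AddAudio), with no metadata.
        on_audio(bytes(frame))
        return True
    payload = json.loads(frame)
    if not isinstance(payload, dict):
        return False
    handler = handlers.get(payload.get("message"))
    if handler is None:
        # Undocumented message types are subject to change, so skip them.
        return False
    handler(payload)
    return True
```

Returning a flag rather than raising on unknown types keeps the client tolerant of new server messages.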
The following values of the message field are supported:
StartConversation
- conversation_config (Object): Required.
  - template_id (String): Required. This configures a number of options, including the LLM used, the agent's voice and any Custom Dictionary used for the transcription. Enterprise customers can configure the LLM and TTS providers used through the use of custom templates. For self-service customers, the supported templates are:

    Language                     template_id
    English                      flow-service-assistant-humphrey
    English                      flow-service-assistant-amelia
    Arabic                       flow-service-assistant-sara
    Bilingual Spanish-English    flow-service-assistant-andres

  - template_variables (Object): Optional. Allows overriding the default values of agent configurations defined in the Template. Parameters:
    - persona (String): Optional
    - style (String): Optional
    - context (String): Optional
- audio_format (Object): Optional
  - type (String): Required. Must always be raw
  - encoding (String): Required. Default is pcm_s16le
  - sample_rate (Int): Required. Default is 16000
- tools (List): Optional
  - type (String): Required. Must always be function
  - function (Object): Required
    - name (String): Required. The name of the function. This will be passed in the ToolInvoke message by the API when the function needs to be called.
    - description (String): Optional. The description of what the function does. This describes to the LLM when the function must be called.
    - parameters (Object): Optional. A variable dictionary of input parameter keys, each defined with a corresponding object with type and description
      - type (String): Required. Should always be object
      - properties (Object): Optional. A list of parameters expected by the function, each defined as an object. The LLM can sometimes choose to call functions with partial information. Designating parameters as required ensures they're always passed by the LLM.
        - <parameter_name> (Object): Required. The name of the particular parameter. Passed as a key to an input parameter in ToolInvoke
          - type (String): Required. The type of the input parameter
          - description (String): Required. The description of what the input parameter is. This describes to the LLM how to detect the parameter during conversation.
      - required (List): Optional. The names of the parameters, from those specified in properties, that must be provided to call this function.
Example:
{
  "message": "StartConversation",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 44100
  },
  "conversation_config": {
    "template_id": "default",
    "template_variables": {
      "persona": "You are an aging English rock star named Goder Rogfrey.",
      "style": "Be charming but unpredictable. Take any opportunity you can to talk about the old days of rock'n'roll. If there are multiple speakers, get them to be sassy to each other.",
      "context": "You are taking a customer's order for fast food at the Soft Pebble Cafe. The only options on the menu are burgers and chicken nuggets. Please make them sound appealing!"
    }
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "function_call_name",
        "description": "Function call trigger.",
        "parameters": {
          "type": "object",
          "properties": {
            "param_name": {
              "type": "string",
              "description": "Parameter description."
            },
            "another_param_name": {
              "type": "string",
              "description": "Another parameter description."
            }
          },
          "required": ["param_name"]
        }
      }
    }
  ]
}
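A client might assemble this message programmatically. The sketch below is one hedged way to do so, using the field names and defaults listed above. The function name `build_start_conversation` and its validation rules are illustrative; the service itself may accept or reject inputs differently.

```python
import json
from typing import Optional

ALLOWED_VARIABLES = {"persona", "style", "context"}


def build_start_conversation(template_id: str,
                             template_variables: Optional[dict] = None,
                             encoding: str = "pcm_s16le",
                             sample_rate: int = 16000,
                             tools: Optional[list] = None) -> str:
    """Return a StartConversation message as a JSON string."""
    if not template_id:
        raise ValueError("template_id is required")
    variables = dict(template_variables or {})
    unknown = set(variables) - ALLOWED_VARIABLES
    if unknown:
        raise ValueError(f"unsupported template_variables: {sorted(unknown)}")
    for tool in tools or []:
        # Each tool must be of type "function" and name the function.
        if tool.get("type") != "function" or "name" not in tool.get("function", {}):
            raise ValueError("each tool needs type 'function' and a function name")
    message = {
        "message": "StartConversation",
        "audio_format": {"type": "raw", "encoding": encoding,
                         "sample_rate": sample_rate},
        "conversation_config": {"template_id": template_id},
    }
    if variables:
        message["conversation_config"]["template_variables"] = variables
    if tools:
        message["tools"] = list(tools)  # tools is a list, per the field table
    return json.dumps(message)
```

For example, `build_start_conversation("flow-service-assistant-amelia", {"persona": "A friendly barista."})` yields a message with the default audio format and no tools.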
ConversationStarted
Confirmation that the Flow session is live and ready to accept the input audio stream.
The session id is provided for reference, along with metadata for interpreting the JSON format of the input audio transcripts.
Example:
{
"message": "ConversationStarted",
"id": "ae35a954-9841-4c17-8fa7-3ad3a1e426c8",
"asr_session_id": "807670e9-14af-4fa2-9e8f-5d525c22156e",
"language_pack_info": {
"adapted": false,
"itn": false,
"language_description": "English",
"word_delimiter": " ",
"writing_direction": "left-to-right"
}
}
AddAudio
AddAudio is a binary message containing a chunk of audio data and no additional metadata.
From client
The message payload is a single channel of raw audio in the format specified by audio_format in StartConversation.
From server
This refers to all binary audio messages on the WebSocket, sent from the Flow engine to the client application.
The outbound message payload is always a single channel of raw audio in PCM S16LE format.
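Because the outbound payload is signed 16-bit little-endian PCM, a client can unpack each chunk into integer samples before playback or analysis. This is a minimal sketch that assumes every chunk contains whole samples, which the text above does not state. The function name `decode_pcm_s16le` is illustrative.

```python
import struct
from typing import List


def decode_pcm_s16le(chunk: bytes) -> List[int]:
    """Unpack a mono PCM S16LE chunk into samples in the range -32768..32767."""
    if len(chunk) % 2:
        # Assumption: a chunk never splits a 2-byte sample across messages.
        raise ValueError("PCM S16LE payload must have an even number of bytes")
    count = len(chunk) // 2
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return list(struct.unpack(f"<{count}h", chunk))
```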
Managing stuttering: To minimize the risk of stuttering during audio playback, the client can add a small buffering delay when the first audio chunk is received if no audio is currently playing, to mask jitter in audio delivery. The delay value could potentially be tuned based on measurements of jitter throughout the session, and so should be reported to the Flow engine where it can be used to adjust processing decisions.
AudioAdded
The server sends AudioAdded to acknowledge ingest of the audio.
Use AudioAdded messages to keep track of which messages have been processed by the engine, and don't send more than 10s of audio data or 500 individual AddAudio messages ahead of time (whichever is lower).
This is the same as the existing Real-Time ASR WebSocket protocol. More details.
Example:
{
"message": "AudioAdded",
"seq_no": 134
}
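The send-ahead limits can be enforced on the client with a small window tracker. The sketch below assumes that seq_no counts AddAudio messages starting from 1, as the control flow illustrations suggest, and that audio is single-channel with a known sample size. The class name `SendWindow` is illustrative.

```python
from collections import deque


class SendWindow:
    """Tracks unacknowledged audio so the client stays within the limits."""

    MAX_SECONDS_AHEAD = 10.0   # at most 10 s of audio in flight
    MAX_MESSAGES_AHEAD = 500   # at most 500 AddAudio messages in flight

    def __init__(self, sample_rate: int = 16000, bytes_per_sample: int = 2):
        self._bytes_per_second = sample_rate * bytes_per_sample
        self._pending = deque()  # byte sizes of chunks not yet acknowledged
        self._acked = 0          # highest seq_no acknowledged so far

    def can_send(self, chunk_size: int) -> bool:
        """True if sending this chunk keeps both limits satisfied."""
        if len(self._pending) + 1 > self.MAX_MESSAGES_AHEAD:
            return False
        seconds = (sum(self._pending) + chunk_size) / self._bytes_per_second
        return seconds <= self.MAX_SECONDS_AHEAD

    def record_sent(self, chunk_size: int) -> None:
        self._pending.append(chunk_size)

    def record_added(self, seq_no: int) -> None:
        """Handle AudioAdded: release every chunk up to and including seq_no."""
        while self._acked < seq_no and self._pending:
            self._pending.popleft()
            self._acked += 1
```

A sender would call `can_send` before each binary message and wait for the next AudioAdded when it returns False.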
AudioReceived
The client must send this message as soon as it receives a binary (AddAudio) message. The purpose is to allow the Flow engine to measure the latency of audio playback.
The engine will note the local time when each AddAudio message is sent to the client and when the corresponding AudioReceived message arrives, measuring the round-trip-time (RTT) and the inter-chunk spacing.
The RTT is the lag of audio playback compared to the voice input experienced by the Flow client.
Example:
{
"message": "AudioReceived",
"seq_no": 245,
"buffering": 0.020
}
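A client could generate these acknowledgements with a simple counter. In this sketch, seq_no is assumed to be the 1-based count of binary messages received so far, and buffering is assumed to be the client's current playback buffering delay in seconds. Neither meaning is spelled out above, so treat both as assumptions. The class name `AudioReceipts` is illustrative.

```python
import json


class AudioReceipts:
    """Builds one AudioReceived message per binary audio message received."""

    def __init__(self) -> None:
        self._seq_no = 0

    def acknowledge(self, buffering: float = 0.0) -> str:
        """Call once per received binary message; returns the JSON to send."""
        self._seq_no += 1  # assumption: seq_no starts at 1 and increments by 1
        return json.dumps({
            "message": "AudioReceived",
            "seq_no": self._seq_no,
            "buffering": buffering,  # assumption: delay in seconds
        })
```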
AddTranscript
Transcription generated by the Real-Time ASR Engine for the input audio stream only. For the response from the LLM, see ResponseStarted and ResponseCompleted.
This is the same as the existing Real-Time ASR WebSocket protocol. More details.
AddPartialTranscript
Fastest, lower-confidence transcription generated by the Real-Time ASR Engine for only the input audio stream.
A partial transcript is an early output transcript that can be changed and expanded by a future AddTranscript or AddPartialTranscript message and corresponds to the part of audio since the last AddTranscript message.
This is the same as the existing Real-Time ASR WebSocket protocol. More details.
ResponseStarted
The text of a response that is about to be uttered by the agent is sent immediately before the TTS audio playback begins.
The start time is noted by the Flow engine based on the current position in the audio input stream, calculated by counting the bytes of raw audio received so far by the engine.
Clients could use this message to show a full caption for the Flow audio stream that's being played back.
Example:
{
"message": "ResponseStarted",
"content": "Hi, my name is Roger, I hope you're hungry!",
"start_time": 6.253
}
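The byte-counting rule behind start_time can be reproduced on the client to align its own events with the engine's timeline. This sketch assumes single-channel audio and the two encodings that appear in this document. The function name `stream_position_seconds` is illustrative.

```python
# Sample sizes for the encodings that appear in this document.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_f32le": 4}


def stream_position_seconds(bytes_sent: int,
                            sample_rate: int = 16000,
                            encoding: str = "pcm_s16le") -> float:
    """Position in the input stream after `bytes_sent` bytes of mono raw audio."""
    return bytes_sent / (sample_rate * BYTES_PER_SAMPLE[encoding])
```

With the defaults, 32000 bytes correspond to one second of audio, so the start_time of 6.253 in the example would correspond to 200096 bytes of input.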
ResponseCompleted
The text of a response that has just finished being uttered by the agent is sent immediately after the TTS audio playback has ended.
The end time is noted by the Flow engine based on the current position in the audio input stream, calculated by counting the bytes of raw audio received so far by the engine.
Clients should use this message to incorporate the agent's TTS utterances into the session transcript if needed.
This also marks the point at which the Flow playback caption can be removed from the UI as the utterance has reached its end.
Example:
{
"message": "ResponseCompleted",
"content": "Hi, my name is Roger, I hope you're hungry!",
"start_time": 6.253,
"end_time": 11.860
}
ResponseInterrupted
Sent by the service when a voice interruption to Flow is detected.
The message contains the text of the latest response that had been uttered by Flow up to the point it was stopped by the interruption.
Clients should use this message to incorporate the agent's TTS utterances, as heard by the end-user, into the session transcript, if needed.
This also marks the point at which the full Flow playback caption can be removed from the UI & replaced with only the spoken part.
Example:
{
"message": "ResponseInterrupted",
"content": "Hi, my name is Roger, I hope you're",
"start_time": 6.253,
"end_time": 10.808
}
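The guidance on building a session transcript can be sketched as follows: user speech comes from AddTranscript, and agent speech from ResponseCompleted or ResponseInterrupted, so that only what was actually heard is recorded. The class name `SessionTranscript` is illustrative, and the `metadata.transcript` path follows the control flow illustrations in this document.

```python
from typing import List, Tuple


class SessionTranscript:
    """Collects (speaker, text) entries in the order messages arrive."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []

    def on_message(self, payload: dict) -> None:
        kind = payload.get("message")
        if kind == "AddTranscript":
            text = payload.get("metadata", {}).get("transcript", "")
            if text.strip():  # skip empty final transcripts
                self.entries.append(("user", text))
        elif kind in ("ResponseCompleted", "ResponseInterrupted"):
            # For an interruption, content holds only the part that was spoken.
            self.entries.append(("agent", payload.get("content", "")))
```

ResponseStarted is deliberately not recorded here, since its content may be cut short by an interruption.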
AudioEnded
The client should send this message when the customer application has decided to stop the audio input.
The session doesn't end immediately, as transcription results lag behind the audio.
The client can close the WebSocket immediately if the final transcripts aren't important.
Example:
{
"message": "AudioEnded"
}
ToolInvoke
The following payload will be sent from the service over the WebSocket when a function call is triggered. It includes the id for the tool (generated by Flow on invocation of the function call) and the attributes.
The id will be a unique identifier for this function call instance and must be used when returning the result (see below).
Example:
{
"message": "ToolInvoke",
"id": "call_xxx_yyy",
"function": {
"name": "sentence",
"arguments": {}
}
}
ToolResult
The client must send this message when replying to a function call, following the format below.
The id must match the unique id for the original function call instance.
The status can be one of ok, rejected or failed. It is then up to the LLM to decide how to respond. The content field can contain a message that will assist the LLM with the response.
Example:
{
"message": "ToolResult",
"id": "call_xxx_yyy",
"status": "ok",
"content": ""
}
If the response returned from the function call does not require confirmation by the LLM (such as “turn on the lights”), then the content can include <NO_RESPONSE_REQUIRED/> and this will tell the LLM to note the result but not speak about it.
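A client-side handler for this exchange might look like the sketch below. The mapping of outcomes to statuses (unknown function gives rejected, an exception gives failed) is an assumption made for illustration, since the text above does not define when each status applies. The function name `handle_tool_invoke` and the registry are hypothetical.

```python
import json
from typing import Callable, Dict


def handle_tool_invoke(payload: dict,
                       registry: Dict[str, Callable[..., str]]) -> str:
    """Run the requested function and return the ToolResult JSON to send."""
    function = payload["function"]
    tool = registry.get(function["name"])
    if tool is None:
        # Assumption: an unknown function is reported as "rejected".
        status, content = "rejected", f"unknown function: {function['name']}"
    else:
        try:
            content = tool(**function.get("arguments", {}))
            status = "ok"
        except Exception as exc:  # assumption: any error maps to "failed"
            status, content = "failed", str(exc)
    return json.dumps({
        "message": "ToolResult",
        "id": payload["id"],  # must echo the id from ToolInvoke
        "status": status,
        "content": content,
    })
```

A function such as a light switch could return `"<NO_RESPONSE_REQUIRED/>"` as its content so that the agent does not comment on the result.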
ConversationEnding
It is possible in Flow for the service to decide to end a conversation. When that decision is taken, a ConversationEnding message is sent immediately to the client to indicate that the engine will not process client input beyond what has already been transcribed and reported so far, except perhaps for a final AddTranscript message.
The session will continue in one-sided mode during TTS playback of the final words, or until the final ASR transcript is emitted, before sending ConversationEnded as the last message.
The client can close the WebSocket immediately if the final words and transcript aren't important.
Example:
{
"message": "ConversationEnding"
}
ConversationEnded
ConversationEnded is sent by the Flow service to the client after all AddTranscript and TTS playback related messages, following the graceful termination of a conversation either by AudioEnded or ConversationEnding.
It indicates that the server will not send any further messages, so the WebSocket can be closed.
This is where post-session UI changes & resource cleanup should take place in the client.
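The end-of-conversation rules can be tracked with a small state holder so the client knows when to stop sending audio and when cleanup is safe. The state names and the class `SessionLifecycle` are illustrative and not part of the protocol.

```python
class SessionLifecycle:
    """Tracks closedown: active -> draining or ending -> ended."""

    def __init__(self) -> None:
        self.state = "active"

    def on_audio_ended_sent(self) -> None:
        """Client sent AudioEnded; remaining transcripts may still arrive."""
        if self.state == "active":
            self.state = "draining"

    def on_server_message(self, payload: dict) -> None:
        kind = payload.get("message")
        if kind == "ConversationEnding" and self.state == "active":
            # The service is ending the session; further input is ignored.
            self.state = "ending"
        elif kind == "ConversationEnded":
            self.state = "ended"

    @property
    def should_send_audio(self) -> bool:
        return self.state == "active"

    @property
    def can_clean_up(self) -> bool:
        """True once ConversationEnded arrived and the socket can be closed."""
        return self.state == "ended"
```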
Illustrations - Control Flow
The audio output stream could start immediately after StartConversation to allow the Flow client to measure arrival jitter and decide on the size of its jitter buffer before the session actually has TTS audio to deliver. This also provides the option to play a predefined sound for feedback while waiting for the Flow session to be accepted, or if the session cannot be accepted.
session start request
→ {"message": "StartConversation", "conversation_config": ...}
audio output streaming loop
← AddAudio (binary)
← AddAudio
→ {"message": "AudioReceived", "seq_no": 1}
← AddAudio
→ {"message": "AudioReceived", "seq_no": 2}
← AddAudio
…
session start confirmation (meaning ASR is now active)
← {"message": "ConversationStarted", "id": "ae35...", "asr_session_id": "8076...", ...}
audio input streaming loop
→ AddAudio (binary)
→ AddAudio
← {"message": "AudioAdded", "seq_no": 1}
→ AddAudio
← {"message": "AudioAdded", "seq_no": 2}
→ AddAudio
…
transcript streaming of individual words or pairs
← {"message": "AddTranscript", "metadata": {"transcript": "i say", ...}, ...}
← {"message": "AddTranscript", "metadata": {"transcript": " hello", ...}, ...}
content of responses being uttered in TTS playback
← {"message": "ResponseStarted", "content": "Hey there!"}
… (corresponding TTS speech mixed into audio output stream)
← {"message": "ResponseCompleted", "content": "Hey there!"}
← {"message": "ResponseStarted", "content": "Welcome to the event."}
…
← {"message": "ResponseCompleted", "content": "Welcome to the event."}
← {"message": "ResponseStarted", "content": "Could I please confirm your name"}
…
← {"message": "ResponseCompleted", "content": "Could I please confirm your name"}
← {"message": "ResponseStarted", "content": " and company, and any dietary restrictions"}
…
← {"message": "ResponseCompleted", "content": " and company, and any dietary restrictions"}
← {"message": "ResponseStarted", "content": " you have?"}
…
← {"message": "ResponseCompleted", "content": " you have?"}
interrupt during response playback
← {"message": "ResponseStarted", "content": " and company, and any dietary restrictions"}
… (TTS playback occurring on audio output, user starts speaking on audio input)
… (ASR produces partial transcripts building up enough words to trigger TTS interruption; these will be sent to the Flow client as well if they are requested)
← {"message": "AddPartialTranscript", "metadata": {"transcript": "stop", ...}, ...}
…
← {"message": "AddPartialTranscript", "metadata": {"transcript": "stop there", ...}, ...}
… (Flow engine determines the interrupt threshold has been reached, and starts tailing off TTS playback, eg. tapering volume while allowing final word or part-word to finish)
… (TTS playback has ended; Flow knows or estimates which words were uttered, and includes this in the session context for the LLM)
← {"message": "ResponseInterrupted", "content": " and company, and any"}
← {"message": "AddTranscript", "metadata": {"transcript": "Stop there ", ...}, ...}
← {"message": "AddTranscript", "metadata": {"transcript": ", I just ", ...}, ...}
← {"message": "AddTranscript", "metadata": {"transcript": " want to", ...}, ...}
← {"message": "AddTranscript", "metadata": {"transcript": " say", ...}, ...}
…
session closedown by client after user and agent stop talking
… (audio in/out streaming and ASR transcription as normal)
← {"message": "AddTranscript", "metadata": {"transcript": " bye now", ...}, ...}
… (TTS prepares utterance, normal audio input and ASR processing continue)
← {"message": "ResponseStarted", "content": "Okay, have a good day."}
… (TTS playback runs to completion, user happens to not be talking, then client initiates closedown)
→ AddAudio (final audio chunk, containing no speech)
→ {"message": "AudioEnded", "last_seq_no": 928}
← {"message": "AudioAdded", "seq_no": 928}
… (ASR finishes processing to end of audio input)
← {"message": "ConversationEnded"}
session closedown by client while TTS is playing
← {"message": "ResponseStarted", "content": "As I was saying, that's an interesting idea."}
… (TTS playback starts in audio output, client initiates session closedown)
→ AddAudio (final audio chunk)
→ {"message": "AudioEnded", "last_seq_no": 928}
… (Flow engine treats end of audio as an interrupt, and starts tailing off TTS playback, eg. tapering volume while allowing final word or part-word to finish. Session continues, if the client allows it, until all remaining transcripts and response messages are sent.)
← {"message": "AddTranscript", "metadata": {"transcript": "", ...}, ...} (final transcript produced)
… (TTS playback continues if necessary to reach the nearest stop point)
← {"message": "ResponseInterrupted", "content": "As I was saying, that's"}
← {"message": "ConversationEnded"}
session closedown initiated by Flow service (whether user is talking or not)
← {"message": "ConversationEnding"}
… (TTS prepares final utterance, ASR processing halted and all remaining/further audio and transcripts are discarded)
← {"message": "ResponseStarted", "content": "Goodbye."}
… (TTS playback of final response content without interruption, if client allows it)
← {"message": "ResponseCompleted", "content": "Goodbye."}
← {"message": "ConversationEnded"}