API reference

Deployments: SaaS | Status: Early Access

A Flow conversation has a live audio input stream and produces a live audio output stream, along with the transcript of what is spoken by both parties.

The conversation is driven in real-time by a customer application that handles both audio streams and receives the transcripts over a WebSocket protocol known as the Conversation API.

Various control and data messages can be exchanged over the WebSocket to support capabilities such as conversation recording and moderation.

Limits

Flow is currently in Early Access, meaning that API behaviour and limits may change at any time. It is only production ready for selected development partners.

Usage is limited to:

  • 1 concurrent stream
  • 20 mins max session duration
  • 50 hours monthly usage

To learn more about how you can bring the power of Flow to your product, speak to sales.

WebSocket Handshake

Handshake Request

We recommend establishing a WebSocket connection directly between the client device/browser and the Speechmatics server in order to minimise latency for the user.

To do so securely, first generate a Temporary Key which can be used to start any number of Flow Conversations for 60 seconds:

curl -L -X POST "https://mp.speechmatics.com/v1/api_keys?type=flow" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $API_KEY" \
     -d '{"ttl": 60 }'
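The same request can be prepared from Python. The sketch below only builds the request with the standard library and does not send it; the helper name build_temp_key_request is hypothetical, and the response body is deliberately not parsed because its shape is not described in this section.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
TEMP_KEY_URL = "https://mp.speechmatics.com/v1/api_keys?type=flow"


def build_temp_key_request(api_key: str, ttl: int = 60) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example.

    build_temp_key_request is a hypothetical helper name, not part of any SDK.
    """
    body = json.dumps({"ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        TEMP_KEY_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_temp_key_request("YOUR_API_KEY")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Because the long-lived API key is used here, this step would normally run on a backend, with only the temporary key passed on to the client device or browser.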

Next, open the WebSocket connection with the temporary key in the jwt query parameter, then send the StartConversation message.

 wss://flow.api.speechmatics.com/v1/flow?jwt=<temporary-key>

When implementing your WebSocket client, we recommend using a ping/pong timeout of at least 60 seconds and a ping interval of 20 to 60 seconds. More details about ping/pong messages can be found in the WebSocket RFC.
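As a rough illustration of the two points above, the following sketch builds the connection URL and collects ping settings picked from the recommended ranges. The function names are hypothetical, and the keyword names ping_interval and ping_timeout assume a client such as the third-party websockets package, whose connect function accepts them; other clients may name these options differently.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL shown above.
FLOW_ENDPOINT = "wss://flow.api.speechmatics.com/v1/flow"

# Values picked from the recommended ranges in the text above.
PING_INTERVAL_SECONDS = 20
PING_TIMEOUT_SECONDS = 60


def flow_url(temporary_key: str) -> str:
    """Return the WebSocket URL with the temporary key as the jwt parameter."""
    return f"{FLOW_ENDPOINT}?{urlencode({'jwt': temporary_key})}"


def connect_kwargs() -> dict:
    """Ping settings, named as the `websockets` package expects (assumption)."""
    return {
        "ping_interval": PING_INTERVAL_SECONDS,
        "ping_timeout": PING_TIMEOUT_SECONDS,
    }


print(flow_url("<temporary-key>"))
```

With the websockets package installed, the connection could then be opened with something like `websockets.connect(flow_url(key), **connect_kwargs())`.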

Overview

A basic Flow session will have the following message exchanges:

'→' indicates messages sent by the client

'←' indicates messages sent by the service

Once-only at conversation start:

→ StartConversation

← ConversationStarted

Repeating during the conversation to cover the audio stream from the client and corresponding transcripts:

→ AddAudio (implicit name, means all inbound binary messages)

← AudioAdded (confirming ingestion of each binary audio message into the ASR)

← AddTranscript / AddPartialTranscript (the transcription of the user's voice)

← ResponseStarted (the response from the LLM, sent immediately before the TTS audio playback begins)

→ AudioReceived (confirming receipt of each binary audio message)

← ResponseCompleted (containing textual content of the utterance just spoken)

Function calling, exchanged over the WebSocket:

← ToolInvoke (sent by the service when a particular function needs to be called, including defined parameters)

→ ToolResult (sent by the client, containing the ok, failed or rejected response to the function call)

Once-only at conversation end:

→ AudioEnded when the client is ending the session; indicates audio input has finished so transcription should be finalized

← ConversationEnding if the agent is ending the session; indicates audio input transcription has stopped but response playback might not be finished. Audio input should end (with or without AudioEnded), but will be ignored regardless.

← ConversationEnded to indicate no further information will be sent and the connection will be closed

Info (Details), Warning (Details) and Error (Details) messages will be sent as appropriate.

Message Handling

Each message that the Server accepts is a stringified JSON object with the following fields:

  • message (String): The name of the message we are sending.
  • Any other fields depend on the value of the message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is a binary message sent from the Client to the Server containing a chunk of audio which will be referred to as AddAudio.

Note: The server may also send other undocumented message types, but these are subject to change and should be ignored.
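A client receive loop therefore has to tell binary audio frames from JSON text frames, and skip message names it does not recognise. The sketch below shows one way to do that; classify_frame and KNOWN_SERVER_MESSAGES are hypothetical names, and the set simply lists the server-sent messages that appear in this reference.

```python
import json
from typing import Any, Optional, Tuple

# Server-sent JSON message names documented in this reference.
KNOWN_SERVER_MESSAGES = {
    "ConversationStarted", "AudioAdded", "AddTranscript", "AddPartialTranscript",
    "ResponseStarted", "ResponseCompleted", "ResponseInterrupted", "ToolInvoke",
    "ConversationEnding", "ConversationEnded", "Info", "Warning", "Error",
}


def classify_frame(frame) -> Optional[Tuple[str, Any]]:
    """Return (name, payload) for a frame, or None if it should be ignored.

    Binary frames are labelled "AddAudio"; text frames are parsed as JSON and
    dispatched on their "message" field. Undocumented names yield None.
    """
    if isinstance(frame, (bytes, bytearray)):
        return ("AddAudio", bytes(frame))
    payload = json.loads(frame)
    name = payload.get("message")
    if name not in KNOWN_SERVER_MESSAGES:
        return None
    return (name, payload)


print(classify_frame('{"message": "AudioAdded", "seq_no": 134}'))
```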

The following values of the message field are supported:

StartConversation

  • conversation_config (Object): Required.
    • template_id (String): Required. This configures a number of options, including the LLM used, the agent's voice and any Custom Dictionary used for the transcription. Enterprise customers can configure the LLM and TTS providers used through the use of custom templates. For self-service customers, the supported templates are:
      Language                    template_id
      English                     flow-service-assistant-humphrey
      English                     flow-service-assistant-amelia
      Arabic                      flow-service-assistant-sara
      Bilingual Spanish-English   flow-service-assistant-andres
    • template_variables (Object): Optional. Allows overriding the default values of agent configurations defined in the Template. Parameters:
      • persona (String): Optional
      • style (String): Optional
      • context (String): Optional
  • audio_format (Object): Optional
    • type (String): Required. Must always be raw
    • encoding (String): Required. Default is pcm_s16le
    • sample_rate (Int): Required. Default is 16000
  • tools (List): Optional
    • type (String): Required. Must always be function
    • function (Object) : Required
      • name (String) : Required. The name of the function. This will be passed in the message ToolInvoke by the API when the function needs to be called.
      • description (String) : Optional. The description of what the function does. This describes to the LLM when the function must be called.
      • parameters (Object) : Optional. A variable dictionary of input parameter keys, each defined with a corresponding object with type and description
        • type (String) : Required. Should always be object
        • properties (Object) : Optional. A map of the parameters expected by the function, keyed by parameter name, each defined as an object. The LLM can sometimes choose to call functions with partial information. Designating parameters as required ensures they're always passed by the LLM.
          • <parameter_name> (Object) : Required. The name of the particular parameter. Passed as a key to an input parameter in ToolInvoke
            • type (String) : Required. The type of the input parameter
            • description (String) : Required. The description of what the input parameter is. This describes to the LLM how to detect the parameter during conversation.
        • required (List) : Optional. The names of the parameters, chosen from those defined in properties, that must be provided for this function to be called.

Example:

{
  "message": "StartConversation",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 44100
  },
  "conversation_config": {
    "template_id": "default",
    "template_variables": {
      "persona": "You are an aging English rock star named Goder Rogfrey.",
      "style": "Be charming but unpredictable. Take any opportunity you can to talk about the old days of rock'n'roll. If there are multiple speakers, get them to be sassy to each other.",
      "context": "You are taking a customer's order for fast food at the Soft Pebble Cafe. The only options on the menu are burgers and chicken nuggets. Please make them sound appealing!"
    }
  },
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "function_call_name",
        "description": "Function call trigger.",
        "parameters": {
          "type": "object",
          "properties": {
            "param_name": {
              "type": "string",
              "description": "Parameter description."
            },
            "another_param_name": {
              "type": "string",
              "description": "Another parameter description."
            }
          },
          "required": ["param_name"]
        }
      }
    }
  ]
}
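Building this message programmatically can help avoid malformed JSON. The following is a minimal sketch that follows the field list above; function_tool and build_start_conversation are hypothetical helper names, and the get_order tool is invented purely for illustration.

```python
import json
from typing import Optional


def function_tool(name: str, description: str, properties: dict,
                  required: tuple = ()) -> dict:
    """Build one entry of the tools list in the shape documented above."""
    missing = [key for key in required if key not in properties]
    if missing:
        raise ValueError(f"required names not defined in properties: {missing}")
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(required),
            },
        },
    }


def build_start_conversation(template_id: str,
                             template_variables: Optional[dict] = None,
                             audio_format: Optional[dict] = None,
                             tools: Optional[list] = None) -> str:
    """Return the StartConversation message as a JSON string."""
    conversation_config = {"template_id": template_id}
    if template_variables:
        conversation_config["template_variables"] = template_variables
    message = {"message": "StartConversation",
               "conversation_config": conversation_config}
    if audio_format is not None:  # optional; omitted means server defaults
        message["audio_format"] = audio_format
    if tools:
        message["tools"] = tools
    return json.dumps(message)


# Hypothetical tool, used only to show the shape.
order_tool = function_tool(
    "get_order", "Look up an order.",
    {"order_id": {"type": "string", "description": "Order number."}},
    required=("order_id",),
)
print(build_start_conversation("flow-service-assistant-amelia", tools=[order_tool]))
```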

ConversationStarted

Confirmation that the Flow session is live and ready to accept the input audio stream.

The session id is provided for reference, along with metadata for interpreting the JSON format of the input audio transcripts.

Example:

{
  "message": "ConversationStarted",
  "id": "ae35a954-9841-4c17-8fa7-3ad3a1e426c8",
  "asr_session_id": "807670e9-14af-4fa2-9e8f-5d525c22156e",
  "language_pack_info": {
    "adapted": false,
    "itn": false,
    "language_description": "English",
    "word_delimiter": " ",
    "writing_direction": "left-to-right"
  }
}

AddAudio

AddAudio is a binary message containing a chunk of audio data and no additional metadata.

From client

The message payload is a single channel of raw audio in the format specified by audio_format in StartConversation.
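To make the payload format concrete, here is a small sketch that splits a buffer of single-channel raw audio into binary messages. The 20 ms chunk length and the name chunk_audio are assumptions for illustration; this section does not prescribe a chunk size. The byte widths cover only the two encodings that appear in this reference.

```python
from typing import Iterator

# Bytes per sample for the encodings that appear in this reference.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_f32le": 4}


def chunk_audio(pcm: bytes, encoding: str = "pcm_s16le",
                sample_rate: int = 16000, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield successive binary payloads of roughly chunk_ms each.

    chunk_ms=20 is an assumed value, not a documented requirement.
    The final chunk may be shorter than the others.
    """
    samples_per_chunk = sample_rate * chunk_ms // 1000
    chunk_bytes = samples_per_chunk * BYTES_PER_SAMPLE[encoding]
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


one_second_of_silence = bytes(16000 * 2)  # pcm_s16le at 16 kHz
chunks = list(chunk_audio(one_second_of_silence))
print(len(chunks), len(chunks[0]))  # 50 640
```

Each yielded chunk would be sent as one binary WebSocket message.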

From server

This refers to all binary audio messages on the WebSocket, sent from the Flow engine to the client application.

The outbound message payload is always a single channel of raw audio in PCM S16LE format.

Managing stuttering: To minimize the risk of stuttering during audio playback, the client can add a small buffering delay when the first audio chunk is received if no audio is currently playing, to mask jitter in audio delivery. The delay value could potentially be tuned based on measurements of jitter throughout the session, and so should be reported to the Flow engine where it can be used to adjust processing decisions.
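One possible way to tune such a delay is sketched below. This heuristic is entirely an assumption and not a documented algorithm: it takes the worst observed lateness between consecutive chunks and clamps it between a floor and a ceiling. The name suggest_buffering and the default bounds are hypothetical.

```python
from typing import Sequence


def suggest_buffering(arrival_times: Sequence[float], chunk_seconds: float,
                      floor: float = 0.02, ceiling: float = 0.2) -> float:
    """Suggest a playback buffering delay in seconds (illustrative heuristic).

    arrival_times are local timestamps at which audio chunks arrived;
    chunk_seconds is the playback duration of one chunk. The delay is the
    largest amount by which a gap exceeded chunk_seconds, clamped to
    [floor, ceiling]. The bounds are assumed values.
    """
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    if not gaps:
        return floor
    worst_lateness = max(gap - chunk_seconds for gap in gaps)
    return min(ceiling, max(floor, worst_lateness))


# One chunk arrived 30 ms later than an ideal 20 ms cadence would predict.
print(round(suggest_buffering([0.0, 0.02, 0.07, 0.09], 0.02), 3))
```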

AudioAdded

The server sends AudioAdded to acknowledge ingest of the audio.

Use AudioAdded messages to keep track of which messages have been processed by the engine, and don't send more than 10s of audio data or 500 individual AddAudio messages ahead of time (whichever is lower).

This is the same as the existing Real-Time ASR WebSocket protocol. More details.

Example:

{
  "message": "AudioAdded",
  "seq_no": 134
}
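The send-ahead limit can be enforced with a small counter on the client. The sketch below assumes that seq_no counts the client's AddAudio messages starting from 1 (as the illustrations later in this page suggest) and that every chunk has the same duration; SendWindow is a hypothetical name.

```python
class SendWindow:
    """Tracks how far the client is ahead of the server's AudioAdded acks."""

    MAX_SECONDS_AHEAD = 10.0   # limit from the text above
    MAX_MESSAGES_AHEAD = 500   # limit from the text above

    def __init__(self, chunk_seconds: float) -> None:
        self.chunk_seconds = chunk_seconds  # duration of each AddAudio chunk
        self.sent = 0    # AddAudio messages sent so far
        self.acked = 0   # highest seq_no seen in AudioAdded (assumed 1-based)

    @property
    def in_flight(self) -> int:
        return self.sent - self.acked

    def can_send(self) -> bool:
        """True if sending one more chunk stays within both limits."""
        after_send = self.in_flight + 1
        return (after_send <= self.MAX_MESSAGES_AHEAD
                and after_send * self.chunk_seconds <= self.MAX_SECONDS_AHEAD)

    def on_sent(self) -> None:
        self.sent += 1

    def on_audio_added(self, message: dict) -> None:
        self.acked = max(self.acked, message["seq_no"])


window = SendWindow(chunk_seconds=0.5)
while window.can_send():
    window.on_sent()
print(window.in_flight)  # 20 chunks of 0.5 s is 10 s of audio
```

A sender would wait for the next AudioAdded whenever can_send() returns False.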

AudioReceived

The client must send this message as soon as it receives a binary (AddAudio) message. The purpose is to allow the Flow engine to measure the latency of audio playback.

The engine will note the local time when each AddAudio message is sent to the client and when the corresponding AudioReceived message arrives, measuring the round-trip-time (RTT) and the inter-chunk spacing.

The RTT is the lag of audio playback compared to the voice input experienced by the Flow client.

Example:

{
  "message": "AudioReceived",
  "seq_no": 245,
  "buffering": 0.020
}
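A client can generate these acknowledgements with a simple counter. In the sketch below, the assumptions are that seq_no counts the binary messages received from the server starting at 1, and that buffering is the client's current buffering delay in seconds; neither is stated explicitly in this section. AudioReceivedAcker is a hypothetical name.

```python
import json


class AudioReceivedAcker:
    """Builds an AudioReceived message for each binary message received."""

    def __init__(self, buffering: float = 0.0) -> None:
        self.seq_no = 0              # assumed to count from 1
        self.buffering = buffering   # assumed to be seconds of client buffering

    def ack(self, audio_chunk: bytes) -> str:
        """Return the JSON text to send back for this audio chunk."""
        self.seq_no += 1
        return json.dumps({
            "message": "AudioReceived",
            "seq_no": self.seq_no,
            "buffering": self.buffering,
        })


acker = AudioReceivedAcker(buffering=0.02)
print(acker.ack(b"\x00\x00"))
```

Sending the acknowledgement before queuing the chunk for playback keeps the measured round-trip time close to the network delay.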

AddTranscript

Transcription generated by the Real-Time ASR Engine for the input audio stream only. For the response from the LLM, see ResponseStarted and ResponseCompleted.

This is the same as the existing Real-Time ASR WebSocket protocol. More details.

AddPartialTranscript

Fastest, lower-confidence transcription generated by the Real-Time ASR Engine for only the input audio stream.

A partial transcript is an early output transcript that can be changed and expanded by a future AddTranscript or AddPartialTranscript message and corresponds to the part of audio since the last AddTranscript message.

This is the same as the existing Real-Time ASR WebSocket protocol. More details.
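The relationship between partial and final transcripts can be sketched as follows: finals are appended, while each partial replaces the previous one and is discarded when the next final arrives. The sketch assumes the text is found under metadata.transcript, as in the illustrations later on this page; TranscriptAssembler is a hypothetical name.

```python
class TranscriptAssembler:
    """Combines AddTranscript and AddPartialTranscript into display text."""

    def __init__(self) -> None:
        self.final = ""     # confirmed text from AddTranscript messages
        self.partial = ""   # latest provisional text since the last final

    def handle(self, message: dict) -> None:
        # Assumes the text lives under metadata.transcript.
        text = message.get("metadata", {}).get("transcript", "")
        if message.get("message") == "AddTranscript":
            self.final += text
            self.partial = ""       # the final supersedes earlier partials
        elif message.get("message") == "AddPartialTranscript":
            self.partial = text     # each partial replaces the previous one

    @property
    def display(self) -> str:
        return self.final + self.partial


assembler = TranscriptAssembler()
assembler.handle({"message": "AddPartialTranscript", "metadata": {"transcript": "stop"}})
assembler.handle({"message": "AddPartialTranscript", "metadata": {"transcript": "stop there"}})
print(repr(assembler.display))
assembler.handle({"message": "AddTranscript", "metadata": {"transcript": "Stop there "}})
print(repr(assembler.display))
```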

ResponseStarted

The text of a response that is about to be uttered by the agent is sent immediately before the TTS audio playback begins.

The start time is noted by the Flow engine based on the current position in the audio input stream, calculated by counting the bytes of raw audio received so far by the engine.
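As a worked example of the byte counting described above, the position in a single-channel raw stream is the byte count divided by the sample rate times the bytes per sample. The sketch below assumes the default pcm_s16le format at 16000 Hz; stream_position_seconds is a hypothetical name.

```python
# Bytes per sample for the encodings that appear in this reference.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_f32le": 4}


def stream_position_seconds(bytes_received: int, sample_rate: int = 16000,
                            encoding: str = "pcm_s16le") -> float:
    """Convert a count of raw single-channel audio bytes into seconds."""
    return bytes_received / (sample_rate * BYTES_PER_SAMPLE[encoding])


# 200096 bytes of 16 kHz pcm_s16le audio is 200096 / 32000 seconds.
print(stream_position_seconds(200096))  # 6.253
```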

Clients could use this message to show a full caption for the Flow audio stream that's being played back.

Example:

{
  "message": "ResponseStarted",
  "content": "Hi, my name is Roger, I hope you're hungry!",
  "start_time": 6.253
}

ResponseCompleted

The text of a response that has just finished being uttered by the agent is sent immediately after the TTS audio playback has ended.

The end time is noted by the Flow engine based on the current position in the audio input stream, calculated by counting the bytes of raw audio received so far by the engine.

Clients should use this message to incorporate the agent's TTS utterances into the session transcript if needed.

This also marks the point at which the Flow playback caption can be removed from the UI as the utterance has reached its end.

Example:

{
  "message": "ResponseCompleted",
  "content": "Hi, my name is Roger, I hope you're hungry!",
  "start_time": 6.253,
  "end_time": 11.860
}

ResponseInterrupted

Sent by the service when a voice interruption to Flow is detected.

The message contains the text of the latest response that had been uttered by Flow up to the point it was stopped by the interruption.

Clients should use this message to incorporate the agent's TTS utterances, as heard by the end-user, into the session transcript, if needed.

This also marks the point at which the full Flow playback caption can be removed from the UI and replaced with only the spoken part.

Example:

{
  "message": "ResponseInterrupted",
  "content": "Hi, my name is Roger, I hope you're",
  "start_time": 6.253,
  "end_time": 10.808
}
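The three response messages can drive both the on-screen caption and the agent side of the session transcript. The sketch below is one possible client-side arrangement, not a prescribed one; AgentCaptions is a hypothetical name.

```python
from typing import List, Optional


class AgentCaptions:
    """Tracks the current caption and the agent's side of the transcript."""

    def __init__(self) -> None:
        self.caption: Optional[str] = None   # text currently shown in the UI
        self.transcript: List[str] = []      # what the agent actually said

    def handle(self, message: dict) -> None:
        name = message.get("message")
        if name == "ResponseStarted":
            # Show the full text while TTS playback is in progress.
            self.caption = message["content"]
        elif name == "ResponseCompleted":
            # Playback finished: record the utterance and clear the caption.
            self.transcript.append(message["content"])
            self.caption = None
        elif name == "ResponseInterrupted":
            # Keep only the part that was spoken before the interruption.
            self.transcript.append(message["content"])
            self.caption = message["content"]


captions = AgentCaptions()
captions.handle({"message": "ResponseStarted",
                 "content": "Hi, my name is Roger, I hope you're hungry!"})
captions.handle({"message": "ResponseInterrupted",
                 "content": "Hi, my name is Roger, I hope you're"})
print(captions.caption)
```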

AudioEnded

The client should send this message when the customer application has decided to stop the audio input.

The session doesn't end immediately, as transcription results lag behind the audio input.

The client can close the WebSocket immediately if the final transcripts aren't important.

Example:

{
  "message": "AudioEnded"
}

ToolInvoke

The following payload will be sent from the service over the WebSocket when a function call is triggered. It includes the id for the tool (generated by Flow on invocation of the function call) and the attributes.

The id will be a unique identifier for this function call instance and must be used when returning the result (see below).

Example:

{
  "message": "ToolInvoke",
  "id": "call_xxx_yyy",
  "function": {
    "name": "sentence",
    "arguments": {}
  }
}

ToolResult

The client must send this message when replying to a function call, following the format below.

The id must match the unique id for the original function call instance.

The status can be one of ok, rejected or failed. It is then up to the LLM to decide how to respond. The content field can contain a message that will assist the LLM with the response.

Example:

{
  "message": "ToolResult",
  "id": "call_xxx_yyy",
  "status": "ok",
  "content": ""
}

Note: If the response returned from the function call does not require confirmation by the LLM (such as “turn on the lights”), then the content can include <NO_RESPONSE_REQUIRED/> and this will tell the LLM to note the result but not speak about it.
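The ToolInvoke and ToolResult exchange can be handled with a small dispatcher. In the sketch below, mapping an unknown function name to rejected and a raised exception to failed is an assumption about how a client might use the three statuses; handle_tool_invoke and the lights_on tool are hypothetical.

```python
import json
from typing import Callable, Dict


def handle_tool_invoke(message: dict, handlers: Dict[str, Callable[..., str]]) -> str:
    """Run the requested function and return the ToolResult JSON text."""
    function = message["function"]
    handler = handlers.get(function["name"])
    if handler is None:
        # Assumed policy: functions the client does not know are rejected.
        status, content = "rejected", f"Unknown function: {function['name']}"
    else:
        try:
            content = handler(**function.get("arguments", {}))
            status = "ok"
        except Exception as exc:  # assumed policy: any error is reported as failed
            status, content = "failed", str(exc)
    return json.dumps({
        "message": "ToolResult",
        "id": message["id"],   # must match the id from ToolInvoke
        "status": status,
        "content": content,
    })


def lights_on(room: str) -> str:
    """Hypothetical tool whose result needs no spoken confirmation."""
    return "<NO_RESPONSE_REQUIRED/>"


invoke = {"message": "ToolInvoke", "id": "call_xxx_yyy",
          "function": {"name": "lights_on", "arguments": {"room": "kitchen"}}}
print(handle_tool_invoke(invoke, {"lights_on": lights_on}))
```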

ConversationEnding

It is possible in Flow for the service to decide to end a conversation. When that decision is taken, a ConversationEnding message is sent immediately to the client to indicate that the engine will not process client input beyond what has already been transcribed and reported so far, except perhaps for a final AddTranscript message.

The session will continue in one-sided mode during TTS playback of the final words, or until the final ASR transcript is emitted, before sending ConversationEnded as the last message.

The client can close the WebSocket immediately if the final words and transcript aren't important.

Example:

{
  "message": "ConversationEnding"
}

ConversationEnded

ConversationEnded is sent by the Flow service to the client after all AddTranscript and TTS playback related messages, following the graceful termination of a conversation either by AudioEnded or ConversationEnding.

It indicates that the server will not send any further messages, so the WebSocket can be closed.

This is where post-session UI changes and resource cleanup should take place in the client.
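The shutdown paths described for AudioEnded, ConversationEnding and ConversationEnded can be modelled as a small state machine on the client. The state names and the SessionLifecycle class below are hypothetical; the transitions follow the descriptions above.

```python
from enum import Enum


class SessionState(Enum):
    ACTIVE = "active"        # audio flows in both directions
    DRAINING = "draining"    # client sent AudioEnded, awaiting final messages
    ENDING = "ending"        # server sent ConversationEnding
    ENDED = "ended"          # server sent ConversationEnded


class SessionLifecycle:
    """Tracks where a Flow session is in its graceful shutdown."""

    def __init__(self) -> None:
        self.state = SessionState.ACTIVE

    def on_client_audio_ended(self) -> None:
        if self.state is SessionState.ACTIVE:
            self.state = SessionState.DRAINING

    def on_server_message(self, name: str) -> None:
        if name == "ConversationEnded":
            self.state = SessionState.ENDED
        elif name == "ConversationEnding" and self.state is not SessionState.ENDED:
            self.state = SessionState.ENDING

    @property
    def should_send_audio(self) -> bool:
        # After AudioEnded or ConversationEnding, further input is not processed.
        return self.state is SessionState.ACTIVE

    @property
    def can_clean_up(self) -> bool:
        return self.state is SessionState.ENDED


session = SessionLifecycle()
session.on_server_message("ConversationEnding")
print(session.state.value, session.should_send_audio)
session.on_server_message("ConversationEnded")
print(session.state.value, session.can_clean_up)
```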

Illustrations - Control Flow

The audio output stream could start immediately after StartConversation to allow the Flow client to measure arrival jitter and decide on the size of its jitter buffer before the session actually has TTS audio to deliver. This also provides the option to play a predefined sound for feedback while waiting for the Flow session to be accepted, or if the session cannot be accepted.

session start request

→ {"message": "StartConversation", "conversation_config": ...}

audio output streaming loop

← AddAudio (binary)

← AddAudio

→ {"message": "AudioReceived", "seq_no": 1}

← AddAudio

→ {"message": "AudioReceived", "seq_no": 2}

← AddAudio


session start confirmation (meaning ASR is now active)

← {"message": "ConversationStarted", "id": "ae35...", "asr_session_id": "8076...", ...}

audio input streaming loop

→ AddAudio (binary)

→ AddAudio

← {"message": "AudioAdded", "seq_no": 1}

→ AddAudio

← {"message": "AudioAdded", "seq_no": 2}

→ AddAudio


transcript streaming of individual words or pairs

← {"message": "AddTranscript", "metadata": {"transcript": "i say", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": " hello", ...}, ...}

content of responses being uttered in TTS playback

← {"message": "ResponseStarted", "content": "Hey there!"}

… (corresponding TTS speech mixed into audio output stream)

← {"message": "ResponseCompleted", "content": "Hey there!"}

← {"message": "ResponseStarted", "content": "Welcome to the event."}


← {"message": "ResponseCompleted", "content": "Welcome to the event."}

← {"message": "ResponseStarted", "content": "Could I please confirm your name"}


← {"message": "ResponseCompleted", "content": "Could I please confirm your name"}

← {"message": "ResponseStarted", "content": " and company, and any dietary restrictions"}


← {"message": "ResponseCompleted", "content": " and company, and any dietary restrictions"}

← {"message": "ResponseStarted", "content": " you have?"}


← {"message": "ResponseCompleted", "content": " you have?"}

interrupt during response playback

← {"message": "ResponseStarted", "content": " and company, and any dietary restrictions"}

… (TTS playback occurring on audio output, user starts speaking on audio input)

… (ASR produces partial transcripts building up enough words to trigger TTS interruption; these will be sent to the Flow client as well if they are requested)

← {"message": "AddPartialTranscript", "metadata": {"transcript": "stop", ...}, ...}


← {"message": "AddPartialTranscript", "metadata": {"transcript": "stop there", ...}, ...}

… (Flow engine determines the interrupt threshold has been reached, and starts tailing off TTS playback, e.g. tapering volume while allowing final word or part-word to finish)

… (TTS playback has ended; Flow knows or estimates which words were uttered, and includes this in the session context for the LLM)

← {"message": "ResponseInterrupted", "content": " and company, and any"}

← {"message": "AddTranscript", "metadata": {"transcript": "Stop there ", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": ", I just ", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": " want to", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": " say", ...}, ...}


session closedown by client after user and agent stop talking

… (audio in/out streaming and ASR transcription as normal)

← {"message": "AddTranscript", "metadata": {"transcript": " bye now", ...}, ...}

… (TTS prepares utterance, normal audio input and ASR processing continue)

← {"message": "ResponseStarted", "content": "Okay, have a good day."}

… (TTS playback runs to completion, user happens to not be talking, then client initiates closedown)

→ AddAudio (final audio chunk, containing no speech)

→ {"message": "AudioEnded", "last_seq_no": 928}

← {"message": "AudioAdded", "seq_no": 928}

… (ASR finishes processing to end of audio input)

← {"message": "ConversationEnded"}

session closedown by client while TTS is playing

← {"message": "ResponseStarted", "content": "As I was saying, that's an interesting idea."}

… (TTS playback starts in audio output, client initiates session closedown)

→ AddAudio (final audio chunk)

→ {"message": "AudioEnded", "last_seq_no": 928}

… (Flow engine treats end of audio as an interrupt, and starts tailing off TTS playback, e.g. tapering volume while allowing final word or part-word to finish. Session continues, if the client allows it, until all remaining transcripts and response messages are sent.)

← {"message": "AddTranscript", "metadata": {"transcript": "", ...}, ...} (final transcript produced)

… (TTS playback continues if necessary to reach the nearest stop point)

← {"message": "ResponseInterrupted", "content": "As I was saying, that's"}

← {"message": "ConversationEnded"}

session closedown initiated by Flow service (whether user is talking or not)

← {"message": "ConversationEnding"}

… (TTS prepares final utterance, ASR processing halted and all remaining/further audio and transcripts are discarded)

← {"message": "ResponseStarted", "content": "Goodbye."}

… (TTS playback of final response content without interruption, if client allows it)

← {"message": "ResponseCompleted", "content": "Goodbye."}

← {"message": "ConversationEnded"}