
API reference

Deployments: SaaS | Status: Early Access

A Flow conversation has a live audio input stream and produces a live audio output stream, along with the transcript of what is spoken by both parties.

The conversation is driven in real-time by a customer application that handles both audio streams and receives the transcripts over a WebSocket protocol known as the Conversation API.

Various control and data messages can be exchanged over the WebSocket to support capabilities such as conversation recording and moderation.

WebSocket Handshake

Handshake Request

We recommend establishing a WebSocket connection directly between the client device/browser and the Speechmatics server in order to minimise latency for the user.

To do so securely, first generate a Temporary Key which can be used to start any number of Flow Conversations for 60 seconds:

curl -L -X POST "https://mp.speechmatics.com/v1/api_keys?type=flow" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $API_KEY" \
     -d '{"ttl": 60 }'

Next, provide the temporary key as a query parameter when opening the WebSocket connection:

 wss://flow.api.speechmatics.com/v1/flow?jwt=<temporary-key>

When implementing your WebSocket client, we recommend using a ping/pong timeout of at least 60 seconds and a ping interval of 20 to 60 seconds. More details about ping/pong messages can be found in the WebSocket RFC.
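
For example, the whole handshake can be scripted. The sketch below uses Node with the third-party ws package (which, unlike the browser WebSocket API, exposes ping control); the key_value response field is an assumption about the Temporary Key response shape, so verify it against the Temporary Keys documentation.

import WebSocket from "ws";

// Assumption: the temporary-key response carries the key in a "key_value"
// field; check the Temporary Keys documentation for the exact shape.
async function fetchTemporaryKey(apiKey: string): Promise<string> {
  const res = await fetch("https://mp.speechmatics.com/v1/api_keys?type=flow", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ ttl: 60 }),
  });
  if (!res.ok) throw new Error(`Temporary key request failed: ${res.status}`);
  const { key_value } = (await res.json()) as { key_value: string };
  return key_value;
}

async function connect(apiKey: string): Promise<WebSocket> {
  const jwt = await fetchTemporaryKey(apiKey);
  const ws = new WebSocket(`wss://flow.api.speechmatics.com/v1/flow?jwt=${jwt}`);

  // Keep-alive: ping every 30 s; pair this with a pong timeout of 60 s or more.
  const pings = setInterval(() => ws.ping(), 30_000);
  ws.on("close", () => clearInterval(pings));

  return ws;
}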

Overview

A basic Flow session will have the following message exchanges:

'→' Indicates messages sent by the client

'←' Indicates messages sent by the service

Once-only at conversation start:

→ StartConversation

← ConversationStarted

Repeating during the conversation to cover the audio stream from the client and corresponding transcripts:

→ AddAudio (implicit name, meaning all inbound binary messages)

← AudioAdded (confirming ingestion of each binary audio message into the ASR)

← AddTranscript / AddPartialTranscript (the transcription of the user's voice)

← ResponseStarted (the response from the LLM, sent immediately before the TTS audio playback begins)

→ AudioReceived (confirming receipt of each binary audio message)

← ResponseCompleted (containing the textual content of the utterance just spoken)

Once-only at conversation end:

→ AudioEnded when the client is ending the session; indicates audio input has finished so transcription should be finalized

← ConversationEnding if the agent is ending the session; indicates audio input transcription has stopped but response playback might not be finished. The client should stop sending audio (with or without AudioEnded); any further audio will be ignored regardless.

← ConversationEnded to indicate no further information will be sent and the connection will be closed

← Info, Warning and Error messages will be sent as appropriate.

Message Handling

Each message that the Server accepts is a stringified JSON object with the following fields:

  • message (String): The name of the message we are sending.
  • Any other fields depend on the value of the message and are described below. The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which will be referred to as AddAudio.

Info: The server may also send other undocumented message types, but these are subject to change and should be ignored.

The following values of the message field are supported:

StartConversation

  • conversation_config (Object): Required. Parameters:
    • template_id (String): Required. One of: default, flow-service-assistant-amelia or flow-service-assistant-humphrey. This configures a number of options, including the LLM used, the agent's voice and any Custom Dictionary used for the transcription. Enterprise customers can configure the LLM and TTS providers used through the use of custom templates.
    • template_variables (Object): Optional. Allows overriding the default values of agent configuration defined in the Template. Parameters:
      • persona (String): Optional
      • style (String): Optional
      • context (String): Optional
  • audio_format (Object): Optional
    • type (String): Required. Must always be raw
    • encoding (String): Required. Default is pcm_s16le
    • sample_rate (Int): Required. Default is 16000

Example:

  "message": "StartConversation",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 44100
  },
  "conversation_config": {
    "template_id": "default",
    "template_variables": {
      "persona": "You are an aging English rock star named Roger Godfrey.",
      "style": "Be charming but unpredictable. Take any opportunity you can to talk about the old days of rock'n'roll. If there are multiple speakers, get them to be sassy to each other.",
      "context": "You are taking a customer's order for fast food at the Hard Rock Cafe. The only options on the menu are burgers and chicken nuggets. Please make them sound appealing!"
    }
  }
}
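
Once the WebSocket is open, StartConversation must be the first client message. A minimal sketch (Node ws), reusing the default values documented above; the surrounding wiring is illustrative:

import WebSocket from "ws";

declare const ws: WebSocket; // the connection opened during the handshake

ws.on("open", () => {
  // First client message: choose a template and declare the input audio format.
  ws.send(JSON.stringify({
    message: "StartConversation",
    audio_format: { type: "raw", encoding: "pcm_s16le", sample_rate: 16000 },
    conversation_config: { template_id: "default" },
  }));
});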

ConversationStarted

Confirmation that the Flow session is live and ready to accept the input audio stream.

The session id is provided for reference, along with metadata for interpreting the JSON format of the input audio transcripts.

Example:

{
  "message": "ConversationStarted",
  "id": "ae35a954-9841-4c17-8fa7-3ad3a1e426c8",
  "asr_session_id": "807670e9-14af-4fa2-9e8f-5d525c22156e",
  "language_pack_info": {
    "adapted": false,
    "itn": false,
    "language_description": "English",
    "word_delimiter": " ",
    "writing_direction": "left-to-right"
  }
}

AddAudio

AddAudio is a binary message containing a chunk of audio data and no additional metadata.

From client

The message payload is a single channel of raw audio in the format specified by audio_format in StartConversation. Audio must be added in chunks of 256 bytes.

From server

This refers to all binary audio messages on the WebSocket, sent from the Flow engine to the client application.

The outbound message payload is always a single channel of raw audio in PCM S16LE format.

Managing stuttering: To minimize the risk of stuttering during audio playback, the client can add a small buffering delay when the first audio chunk is received if no audio is currently playing, to mask jitter in audio delivery. The delay value could potentially be tuned based on measurements of jitter throughout the session, and so should be reported to the Flow engine where it can be used to adjust processing decisions.
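
As a sketch of the buffering approach just described (TypeScript; the 150 ms initial delay and the playAudio hand-off are illustrative assumptions, not part of the protocol):

const INITIAL_BUFFER_MS = 150;      // assumed starting delay; tune per network

let playing = false;
const queue: ArrayBuffer[] = [];

declare function playAudio(chunk: ArrayBuffer): void; // app-specific playback

function onServerAudio(chunk: ArrayBuffer): void {
  queue.push(chunk);
  if (!playing) {
    playing = true;
    // Hold back briefly so a few chunks accumulate before playback starts,
    // masking delivery jitter at the cost of a little extra latency.
    setTimeout(drainQueue, INITIAL_BUFFER_MS);
  }
}

function drainQueue(): void {
  while (queue.length > 0) {
    playAudio(queue.shift()!);
  }
  playing = false; // the next chunk after a gap re-applies the buffering delay
}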

AudioAdded

The server sends AudioAdded to acknowledge ingest of the audio.

Use AudioAdded messages to keep track of which messages have been processed by the engine, and don't send more than 10 seconds of audio data or 500 individual AddAudio messages ahead of time (whichever is lower).

This is the same as the existing Real-Time ASR WebSocket protocol. More details.

Example:

{
  "message": "AudioAdded",
  "seq_no": 134
}
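
A sketch of this flow control in TypeScript, assuming the default pcm_s16le/16000 input format for the byte-rate arithmetic (16,000 samples/s × 2 bytes = 32,000 bytes/s); names are illustrative:

import WebSocket from "ws";

const BYTES_PER_SECOND = 16000 * 2;                // pcm_s16le at 16 kHz
const MAX_BUFFERED_BYTES = 10 * BYTES_PER_SECOND;  // at most 10 s in flight
const MAX_IN_FLIGHT = 500;                         // at most 500 chunks in flight

let sentSeqNo = 0;                  // binary chunks sent so far
let ackedSeqNo = 0;                 // highest seq_no confirmed by AudioAdded
let unackedBytes = 0;
const pendingSizes: number[] = [];  // byte size of each unacknowledged chunk

function trySendAudio(ws: WebSocket, chunk: Uint8Array): boolean {
  const withinLimits =
    sentSeqNo - ackedSeqNo < MAX_IN_FLIGHT &&
    unackedBytes + chunk.byteLength <= MAX_BUFFERED_BYTES;
  if (!withinLimits) return false;  // caller retries after the next AudioAdded
  ws.send(chunk);                   // binary frames are the implicit AddAudio
  sentSeqNo += 1;
  pendingSizes.push(chunk.byteLength);
  unackedBytes += chunk.byteLength;
  return true;
}

function onAudioAdded(msg: { seq_no: number }): void {
  // AudioAdded acknowledges chunks in order, one seq_no per chunk.
  while (ackedSeqNo < msg.seq_no && pendingSizes.length > 0) {
    unackedBytes -= pendingSizes.shift()!;
    ackedSeqNo += 1;
  }
}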

AudioReceived

The client must send this message as soon as it receives a binary (AddAudio) message. The purpose is to allow the Flow engine to measure the latency of audio playback.

The engine will note the local time when each AddAudio message is sent to the client and when the corresponding AudioReceived message arrives, measuring the round-trip-time (RTT) and the inter-chunk spacing.

The RTT is the lag of audio playback compared to the voice input experienced by the Flow client.

Example:

{
  "message": "AudioReceived",
  "seq_no": 245,
  "buffering": 0.020
}
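
A sketch of this acknowledgement loop (Node ws); the locally counted seq_no, the reported buffering value and the enqueueForPlayback hand-off are illustrative assumptions:

import WebSocket from "ws";

declare const ws: WebSocket;                              // the open Flow connection
declare function enqueueForPlayback(chunk: Buffer): void; // app-specific hand-off

let receivedSeqNo = 0;              // counted locally; binary frames carry no metadata
const bufferingDelaySeconds = 0.02; // the jitter-buffer delay currently in use

ws.on("message", (data, isBinary) => {
  if (!isBinary) return;            // JSON messages are handled elsewhere
  receivedSeqNo += 1;
  ws.send(JSON.stringify({
    message: "AudioReceived",
    seq_no: receivedSeqNo,
    buffering: bufferingDelaySeconds,
  }));
  enqueueForPlayback(data as Buffer); // then queue the chunk for playback
});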

AddTranscript

Transcription generated by the Real-Time ASR Engine for the input audio stream only. For the response from the LLM, see ResponseStarted and ResponseCompleted.

This is the same as the existing Real-Time ASR WebSocket protocol. More details.

AddPartialTranscript

Fastest, lower-confidence transcription generated by the Real-Time ASR Engine for only the input audio stream.

A partial transcript is an early output transcript that can be changed and expanded by a future AddTranscript or AddPartialTranscript message and corresponds to the part of audio since the last AddTranscript message.

This is the same as the existing Real-Time ASR WebSocket protocol. More details.

prompt (Deprecated)

Note: This message is shortly going to be removed and replaced by ResponseStarted and ResponseCompleted.

This message contains the transcribed user prompt and the LLM generated text response. Once the user has finished speaking, the TTS audio will be streamed back in the AddAudio messages.

Note that the transcribed user output is also returned in the AddTranscript and AddPartialTranscript messages. These should be used to ensure forwards compatibility and reduce perceived latency.

Example:

{
  "message": "prompt",
  "prompt": {
    "id": "ae35a954-9841-4c17-8fa7-3ad3a1e426c8",    
    "prompt": "hello there",
    "response": "Hello, how can I help?",
  }
}

ResponseStarted (Coming Soon!)

Note: This message is shortly going to be added to replace the prompt message sent by the server.

The text of a response that is about to be uttered by the agent is sent immediately before the TTS audio playback begins.

The start time is noted by the Flow engine based on the current position in the audio input stream, calculated by counting the bytes of raw audio received so far by the engine.

Clients could use this message to show a full caption for the Flow audio stream that's being played back.

Example:

{
  "message": "ResponseStarted",
  "content": "Hi, my name is Roger, I hope you're hungry!",
  "start_time": 6.253
}

ResponseCompleted (Coming Soon!)

Note: This message is shortly going to be added to replace the prompt message sent by the server.

The text of a response that has just finished being uttered by the agent is sent immediately after the TTS audio playback has ended.

The end time is noted by the Flow engine based on the current position in the audio input stream, calculated by counting the bytes of raw audio received so far by the engine.

Clients should use this message to incorporate the agent's TTS utterances into the session transcript if needed.

This also marks the point at which the Flow playback caption can be removed from the UI as the utterance has reached its end.

Example:

{
  "message": "ResponseCompleted",
  "content": "Hi, my name is Roger, I hope you're hungry!",
  "start_time": 6.253,
  "end_time": 11.860
}

ResponseInterrupted (Coming Soon!)

Sent by the service when a voice interruption to Flow is detected.

The message contains the text of the latest response that had been uttered by Flow up to the point it was stopped by the interruption.

Clients should use this message to incorporate the agent's TTS utterances, as heard by the end-user, into the session transcript, if needed.

This also marks the point at which the full Flow playback caption can be removed from the UI and replaced with only the spoken part.

Example:

{
  "message": "ResponseInterrupted",
  "content": "Hi, my name is Roger, I hope you're",
  "start_time": 6.253,
  "end_time": 10.808
}
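
Taken together, the three response messages are enough to drive a live caption and an agent-side transcript. A sketch, where showCaption, clearCaption and appendTranscript are hypothetical app-side helpers:

type ResponseMsg = {
  message: "ResponseStarted" | "ResponseCompleted" | "ResponseInterrupted";
  content: string;
  start_time?: number;
  end_time?: number;
};

declare function showCaption(text: string): void;
declare function clearCaption(): void;
declare function appendTranscript(speaker: "agent" | "user", text: string): void;

function onResponseMessage(msg: ResponseMsg): void {
  switch (msg.message) {
    case "ResponseStarted":
      showCaption(msg.content);               // full caption while TTS plays
      break;
    case "ResponseCompleted":
      clearCaption();
      appendTranscript("agent", msg.content); // the full utterance was spoken
      break;
    case "ResponseInterrupted":
      // Only the part heard before the interruption was actually spoken.
      showCaption(msg.content);
      appendTranscript("agent", msg.content);
      break;
  }
}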

AudioEnded

This message should be sent when the customer application has decided to stop the audio input.

The session doesn't end immediately, as transcription results will lag behind the audio input.

The client can close the WebSocket immediately if the final transcripts aren't important.

Example:

{
  "message": "AudioEnded"
}
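
A sketch of a graceful client-initiated shutdown (Node ws). The last_seq_no field mirrors the control-flow illustrations later on this page; waiting for ConversationEnded is only needed when the final transcripts matter:

import WebSocket from "ws";

function endSession(ws: WebSocket, lastSeqNo: number): void {
  // Stop capturing first, then signal the end of the input stream.
  ws.send(JSON.stringify({ message: "AudioEnded", last_seq_no: lastSeqNo }));
  // Wait for ConversationEnded if the final transcripts matter...
  ws.on("message", (data, isBinary) => {
    if (!isBinary && JSON.parse(data.toString()).message === "ConversationEnded") {
      ws.close();
    }
  });
  // ...otherwise ws.close() can be called right away.
}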

ConversationEnding

It is possible in Flow for the service to decide to end a conversation. When that decision is taken, a ConversationEnding message is sent immediately to the client to indicate that the engine will not process client input beyond what has already been transcribed and reported so far, except perhaps for a final AddTranscript message.

The session will continue in one-sided mode during TTS playback of the final words, or until the final ASR transcript is emitted, before sending ConversationEnded as the last message.

The client can close the WebSocket immediately if the final words and transcript aren't important.

Example:

{
  "message": "ConversationEnding"
}

ConversationEnded

ConversationEnded is sent by the Flow service to the client after all AddTranscript and TTS playback related messages, following the graceful termination of a conversation either by AudioEnded or ConversationEnding.

It indicates that the server will not send any further messages, so the WebSocket can be closed.

This is where post-session UI changes and resource cleanup should take place in the client.
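
Putting the message reference together: a client typically routes all server traffic through a single handler, treating binary frames as audio and everything else as stringified JSON keyed on the message field. A sketch, with the individual handlers assumed from the earlier examples:

import WebSocket from "ws";

declare const ws: WebSocket;
declare function handleServerAudio(chunk: Buffer): void;  // server AddAudio
declare function onAudioAdded(msg: { seq_no: number }): void;
declare function renderUserTranscript(msg: any): void;
declare function onResponseMessage(msg: any): void;
declare function cleanUp(): void;

ws.on("message", (data, isBinary) => {
  if (isBinary) {
    handleServerAudio(data as Buffer); // audio chunks are binary frames
    return;
  }
  const msg = JSON.parse(data.toString());
  switch (msg.message) {
    case "ConversationStarted":  /* store msg.id; start mic capture */ break;
    case "AudioAdded":           onAudioAdded(msg); break;
    case "AddTranscript":
    case "AddPartialTranscript": renderUserTranscript(msg); break;
    case "ResponseStarted":
    case "ResponseCompleted":
    case "ResponseInterrupted":  onResponseMessage(msg); break;
    case "ConversationEnding":   /* stop sending audio input */ break;
    case "ConversationEnded":    cleanUp(); ws.close(); break;
    case "Error":                console.error(msg); break;
    default:                     break; // ignore undocumented message types
  }
});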

Illustrations - Control Flow

The audio output stream could start immediately after StartConversation to allow the Flow client to measure arrival jitter and decide on the size of its jitter buffer before the session actually has TTS audio to deliver. This also provides the option to play a predefined sound for feedback while waiting for the Flow session to be accepted, or if the session cannot be accepted.

session start request

→ {"message": "StartConversation", "conversation_config": ...}

audio output streaming loop

← AddAudio (binary)

← AddAudio

→ {"message": "AudioReceived", "seq_no": 1}

← AddAudio

→ {"message": "AudioReceived", "seq_no": 2}

← AddAudio


session start confirmation (meaning ASR is now active)

← {"message": "ConversationStarted", "id": "ae35...", "asr_session_id": "8076...", ...}

audio input streaming loop

→ AddAudio (binary)

→ AddAudio

← {"message": "AudioAdded", "seq_no": 1}

→ AddAudio

← {"message": "AudioAdded", "seq_no": 2}

→ AddAudio


transcript streaming of individual words or pairs

← {"message": "AddTranscript", "metadata": {"transcript": "i say", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": " hello", ...}, ...}

content of responses being uttered in TTS playback

← {"message": "ResponseStarted", "content": "Hey there!"}

… (corresponding TTS speech mixed into audio output stream)

← {"message": "ResponseCompleted", "content": "Hey there!"}

← {"message": "ResponseStarted", "content": "Welcome to the event."}


← {"message": "ResponseCompleted", "content": "Welcome to the event."}

← {"message": "ResponseStarted", "content": "Could I please confirm your name"}


← {"message": "ResponseCompleted", "content": "Could I please confirm your name"}

← {"message": "ResponseStarted", "content": " and company, and any dietary restrictions"}


← {"message": "ResponseCompleted", "content": " and company, and any dietary restrictions"}

← {"message": "ResponseStarted", "content": " you have?"}


← {"message": "ResponseCompleted", "content": " you have?"}

interrupt during response playback

← {"message": "ResponseStarted", "content": " and company, and any dietary restrictions"}

… (TTS playback occurring on audio output, user starts speaking on audio input)

… (ASR produces partial transcripts building up enough words to trigger TTS interruption; these will be sent to the Flow client as well if they are requested)

← {"message": "AddPartialTranscript", "metadata": {"transcript": "stop", ...}, ...}


← {"message": "AddPartialTranscript", "metadata": {"transcript": "stop there", ...}, ...}

… (Flow engine determines the interrupt threshold has been reached, and starts tailing off TTS playback, e.g. tapering volume while allowing the final word or part-word to finish)

… (TTS playback has ended; Flow knows or estimates which words were uttered, and includes this in the session context for the LLM)

← {"message": "ResponseInterrupted", "content": " and company, and any"}

← {"message": "AddTranscript", "metadata": {"transcript": "Stop there ", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": ", I just ", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": " want to", ...}, ...}

← {"message": "AddTranscript", "metadata": {"transcript": " say", ...}, ...}


session closedown by client after user and agent stop talking

… (audio in/out streaming and ASR transcription as normal)

← {"message": "AddTranscript", "metadata": {"transcript": " bye now", ...}, ...}

… (TTS prepares utterance, normal audio input and ASR processing continue)

← {"message": "ResponseStarted", "content": "Okay, have a good day."}

… (TTS playback runs to completion, user happens to not be talking, then client initiates closedown)

→ AddAudio (final audio chunk, containing no speech)

→ {"message": "AudioEnded", "last_seq_no": 928}

← {"message": "AudioAdded", "seq_no": 928}

… (ASR finishes processing to end of audio input)

← {"message": "ConversationEnded"}

session closedown by client while TTS is playing

← {"message": "ResponseStarted", "content": "As I was saying, that's an interesting idea."}

… (TTS playback starts in audio output, client initiates session closedown)

→ AddAudio (final audio chunk)

→ {"message": "AudioEnded", "last_seq_no": 928}

… (Flow engine treats the end of audio as an interrupt, and starts tailing off TTS playback, e.g. tapering volume while allowing the final word or part-word to finish. The session continues, if the client allows it, until all remaining transcripts and response messages are sent.)

← {"message": "AddTranscript", "metadata": {"transcript": "", ...}, ...} (final transcript produced)

… (TTS playback continues if necessary to reach the nearest stop point)

← {"message": "ResponseInterrupted", "content": "As I was saying, that's"}

← {"message": "ConversationEnded"}

session closedown initiated by Flow service (whether user is talking or not)

← {"message": "ConversationEnding"}

… (TTS prepares final utterance, ASR processing halted and all remaining/further audio and transcripts are discarded)

← {"message": "ResponseStarted", "content": "Goodbye."}

… (TTS playback of final response content without interruption, if client allows it)

← {"message": "ResponseCompleted", "content": "Goodbye."}

← {"message": "ConversationEnded"}