Introduction

Deployments: SaaS | Status: Early Access

Info: Early-bird pricing is $0.15/minute, with limited free usage available for evaluation.

Flow is our Conversational AI API that allows you to add responsive, real-time speech-to-speech interactions to any product.

Flow is engineered to engage in natural and fluid conversations by automatically handling interruptions, responding to multiple speakers, and understanding different dialects and accents.

How Flow Works

Built on top of Speechmatics' industry-leading ASR, the latest LLMs, and text-to-speech, Flow is engineered to engage in natural and fluid conversations.

You can configure your Flow Conversation using Templates and Template Variables. Simply stream in audio, and Flow will provide the TTS response as well as other useful information as detailed below.

Component Models

The three base components of the Flow Engine are Speech-to-Text, Large Language Model, and Text-to-Speech.

Speech-to-Text (ASR)

Flow is built on the foundations of Speechmatics' market-leading real-time ASR. The client passes streaming audio to the Flow service through the WebSocket. The service then processes multiple speech & non-speech signals such as the spoken words, tonality, & audio events before passing the context to the LLM to formulate a response.
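As a rough illustration of the streaming step described above, the sketch below slices raw audio into small fixed-duration chunks that a client could send over the WebSocket one at a time. The audio format (16 kHz, 16-bit mono PCM), the 20 ms chunk size, and the helper name `iter_audio_chunks` are assumptions made for this example; they are not taken from the Flow documentation.

```python
from typing import Iterator


def iter_audio_chunks(
    pcm: bytes,
    sample_rate: int = 16000,   # assumed sample rate, not specified by this page
    bytes_per_sample: int = 2,  # assumed 16-bit mono PCM
    chunk_ms: int = 20,         # assumed chunk duration
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw PCM buffer."""
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be at least one byte")
    for start in range(0, len(pcm), chunk_bytes):
        # The final chunk may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]
```

Each yielded chunk would be sent as one WebSocket message. Smaller chunks generally reduce the delay before the service sees new audio, at the cost of more messages.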

Flow natively supports multiple speaker detection (Speaker Diarization). Flow can be configured to ignore, acknowledge or engage with non-primary speakers when setting up Templates.

The transcribed text is also streamed back to the client as soon as it is generated, to support any client-driven recording, monitoring & analytics workflows.

To improve accuracy on product-specific terminology we recommend using a Custom Dictionary when setting up Templates.

Large Language Model (LLM)

Flow’s conversational understanding & knowledge is powered by LLMs. The transcribed text from the ASR is passed, together with the Flow configuration, to the LLM to formulate a natural-sounding response.

Response generation can be influenced by defining a persona, style, and context when setting up Templates.

Text-To-Speech (TTS)

Output generated by the LLM, when ready to be spoken, is converted to audio by the chosen TTS engine. These engines were selected to provide the most natural-sounding responses without trading off latency. The audio is then streamed back to the client, which must play it back to the user.
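Because the client receives both TTS audio and transcript or response events on the same connection, it needs to separate the two. The sketch below shows one way to do that, assuming that audio arrives as binary frames and events arrive as JSON text frames with a `message` field naming the event type. Both points are assumptions for illustration, and `FlowClientState` and `handle_frame` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Union


@dataclass
class FlowClientState:
    """Hypothetical client-side state: queued audio plus received events."""
    playback_buffer: bytearray = field(default_factory=bytearray)
    events: list = field(default_factory=list)


def handle_frame(state: FlowClientState, frame: Union[bytes, str]) -> str:
    """Route one inbound frame and return a label describing what it was."""
    if isinstance(frame, (bytes, bytearray)):
        # Assumed: binary frames carry TTS audio to be played to the user.
        state.playback_buffer.extend(frame)
        return "audio"
    # Assumed: text frames carry JSON events with a "message" type field.
    event = json.loads(frame)
    state.events.append(event)
    return event.get("message", "unknown")
```

A real client would drain `playback_buffer` into an audio output device rather than letting it grow without bound.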

Flow Engine

Understanding disfluencies & pacing

Everyone has a different style of speaking. Natural speech is coloured with filler sounds and the pace of speech can vary from speaker to speaker. A one-size-fits-all voice agent can add a lot of friction to the experience if it keeps interrupting you. We’ve designed Flow to adapt to your speaking style and not be over-eager to interrupt, helping to make users feel comfortable.

Handling interruptions

Flow has been modelled on real-world human conversations. Whether it is to stop Flow from going off-track or to correct wrong assumptions, you can interrupt it. We’ve built our own interruption engine that intelligently ignores unintentional interruptions and gracefully handles the ones that it needs to. To avoid sounding abrupt and unnatural when interrupted, Flow will finish the current word that’s being spoken and gradually fade out the next one.

End-of-utterance detection

Based on your voice & what you’ve been saying, Flow can smartly detect when you’re done speaking before it responds for a natural and responsive experience. Flow is built to be human-centric and, while we could achieve much lower latencies, it’s rude to interrupt mid-thought.

Working with Flow

Setting Up Flow

A conversation template covers multiple elements that typically need to be configured in concert to power a specific class of conversations in a human-facing application.

Flow can be configured using the following parameters:

template_id - Required in the StartConversation message in the Conversation API to configure the agent's LLM and voice. Enterprise customers can configure the LLM and TTS providers used through custom templates.

template_variables

  • persona - The agent personality, e.g., “You are a middle-aged British man named Humphrey.”
  • style - The agent interaction style, e.g., “Be chatty and answer questions in a friendly and informal manner, as in a human conversation.”
  • context - The context for conversations that the agent will have, e.g., “You are welcoming people to a new product launch event, and specifically confirming their name, company affiliation and dietary requirements, while answering questions related to the venue and the food and drinks being offered.”

For more details, refer to the StartConversation API reference.
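The parameters above might be assembled into a StartConversation message along the following lines. This is a minimal sketch: `template_id` and the three template variables come from this page, but the envelope keys `message` and `conversation_config`, the helper name `build_start_conversation`, and the template ID used in the usage note are assumptions. Check the StartConversation API reference for the actual schema.

```python
import json
from typing import Optional


def build_start_conversation(
    template_id: str,
    persona: Optional[str] = None,
    style: Optional[str] = None,
    context: Optional[str] = None,
) -> str:
    """Serialise a StartConversation message (assumed envelope layout)."""
    if not template_id:
        raise ValueError("template_id is required")
    candidates = {"persona": persona, "style": style, "context": context}
    # Only include the template variables that were actually provided.
    variables = {key: value for key, value in candidates.items() if value is not None}
    payload = {
        "message": "StartConversation",  # assumed key for the message type
        "conversation_config": {         # assumed wrapper key
            "template_id": template_id,
            "template_variables": variables,
        },
    }
    return json.dumps(payload)
```

For example, `build_start_conversation("example-template", persona="You are a middle-aged British man named Humphrey.")` produces a message with only the `persona` variable set, where `"example-template"` is a placeholder rather than a real template ID.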

Limitations

Flow is currently in Early Access, meaning that API behaviour and limits may change at any time. It is production-ready only for selected development partners.

Usage is limited to:

  • 1 concurrent stream
  • 20 mins max session duration
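A client may want to track these limits locally so it can wrap up a conversation before the session is cut off. The sketch below is one way to do that; the class name `SessionBudget` and its methods are hypothetical, and how the service itself enforces the limits is not described here.

```python
import time
from typing import Callable, Optional

MAX_SESSION_SECONDS = 20 * 60  # 20 mins max session duration (from the limits above)


class SessionBudget:
    """Client-side guard for the single-stream and session-duration limits."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at: Optional[float] = None

    def start(self) -> None:
        # Only one concurrent stream is allowed, so refuse a second start.
        if self._started_at is not None:
            raise RuntimeError("a stream is already active (limit: 1 concurrent stream)")
        self._started_at = self._clock()

    def remaining_seconds(self) -> float:
        if self._started_at is None:
            raise RuntimeError("no active session")
        elapsed = self._clock() - self._started_at
        return max(0.0, MAX_SESSION_SECONDS - elapsed)

    def stop(self) -> None:
        self._started_at = None
```

Call `start()` when the conversation begins, poll `remaining_seconds()` to decide when to end the conversation gracefully, and call `stop()` once the session has closed.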

Function calling [Coming soon!]

There are various systems that Flow would need to work with to be useful in the real world. These could include sources of real-time information such as opening/closing times, validation services for authentication, or action APIs that control a fast food system while placing a drive-thru order. The client can configure detection of these triggers through function calling.

The client must instruct Flow in the StartConversation message about these triggers along with the parameters needed from the conversation to fulfil the function call.

Flow has the ability to play a holding message automatically when the function call is first triggered. The client must then ascertain and inform the Flow service if subsequent filler messages need to be played due to execution delays in the function call, to maintain a good customer experience.

The client must inform the service of either the function call failing with an error message or succeeding with the result. There is no automatic timeout on the Flow service.

Function calling is fully asynchronous. Once the client is informed of the function call, the conversation will continue to progress until a function call status update is received from the client. This is to continue providing a natural conversational experience to the customer.
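Since the service applies no automatic timeout, the client is the only party that can bound how long a function call runs. The sketch below wraps a client-side function in a timeout and always produces either a success result or an error, matching the two outcomes described above. Function calling is marked as coming soon, so the shape of the status dictionary, the helper name `run_function_call`, and the 5-second default are purely hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def run_function_call(
    func: Callable[..., Awaitable[Any]],
    arguments: Dict[str, Any],
    timeout_s: float = 5.0,  # hypothetical client-chosen limit
) -> Dict[str, Any]:
    """Run a client-side function and return a status to report back to Flow."""
    try:
        result = await asyncio.wait_for(func(**arguments), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The client enforces its own timeout because the service has none.
        return {"status": "error", "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    return {"status": "ok", "result": result}
```

The returned dictionary would then be translated into whatever status update message the API defines once function calling is available.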

Moderating & controlling conversations

You might want to control an ongoing conversation based on what the user says or what the LLM outputs. This could involve situations where the agent is asked to do things that are out of scope, or where the conversation is heading in unintended directions. We enable this by sharing the real-time transcript of the user's speech (AddPartialTranscript/AddTranscript) and the entire response from the LLM just before it begins to speak (ResponseStarted). We recommend building monitoring on top of these streams and either sending AudioEnded to end the session or, if the final transcript is unimportant, closing the WebSocket directly.
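A very simple monitor over these streams could look like the sketch below, which checks final transcripts and LLM responses against a list of blocked terms. The message names come from this page, but the payload fields (`message`, `content`, `metadata.transcript`), the function names, and the keyword-matching policy are assumptions for illustration. A production system would likely use a more robust classifier.

```python
from typing import Iterable

# Events inspected by this sketch; AddPartialTranscript could be added
# for faster reaction at the cost of acting on unstable text.
MONITORED = {"AddTranscript", "ResponseStarted"}


def extract_text(event: dict) -> str:
    """Pull the text out of an event (field names are assumed)."""
    if event.get("message") == "ResponseStarted":
        return str(event.get("content", ""))
    return str(event.get("metadata", {}).get("transcript", ""))


def moderation_action(event: dict, blocked_terms: Iterable[str]) -> str:
    """Return "end_session" if a blocked term appears, otherwise "continue"."""
    if event.get("message") not in MONITORED:
        return "continue"
    text = extract_text(event).lower()
    if any(term.lower() in text for term in blocked_terms):
        return "end_session"
    return "continue"
```

When the result is `"end_session"`, the client would send AudioEnded or close the WebSocket, as described above.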

Managing call recordings & transcripts

Clients are responsible for maintaining their own recordings & conversation logs. This is possible because the audio is already routed entirely through the client, and conversation transcripts are provided in real time through AddPartialTranscript, AddTranscript, ResponseStarted, ResponseCompleted, and ResponseInterrupted.
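The messages listed above can be folded into a simple turn-by-turn log on the client. The sketch below keeps final user transcripts and the agent responses that were completed or interrupted, and skips partial transcripts and ResponseStarted to avoid duplicate entries. The payload fields (`message`, `content`, `metadata.transcript`) and the class name `ConversationLog` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ConversationLog:
    """Accumulates (speaker, text) turns from Flow events."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, event: dict) -> None:
        kind = event.get("message")
        if kind == "AddTranscript":
            # Assumed: final user text lives under metadata.transcript.
            text = str(event.get("metadata", {}).get("transcript", "")).strip()
            if text:
                self.turns.append(("user", text))
        elif kind == "ResponseCompleted":
            # Assumed: agent text lives under content.
            self.turns.append(("agent", str(event.get("content", ""))))
        elif kind == "ResponseInterrupted":
            self.turns.append(("agent (interrupted)", str(event.get("content", ""))))
        # AddPartialTranscript and ResponseStarted are intentionally ignored.
```

The resulting `turns` list can be written to storage alongside the recorded audio that already passes through the client.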