Introduction
Deployments: SaaS
Status: Early Access
Flow is currently free for evaluation with some limitations. Early-bird pricing totals $0.15/minute, including LLM costs.
Flow is our Conversational AI API that allows you to add responsive, real-time speech-to-speech interactions to any product.
Flow is engineered to engage in natural and fluid conversations by automatically handling interruptions, responding to multiple speakers, and understanding different dialects and accents.
This page shows you how to use the Flow Conversational API to build natural, intuitive conversational AI, using an easy-to-use interactive code editor.
How Flow Works
Built on top of Speechmatics' industry-leading ASR, the latest LLMs, and text-to-speech, Flow is engineered to engage in natural and fluid conversations.
Simply stream in audio, and Flow will provide the TTS response as well as other useful information.
Component Models
The three base components of the Flow Engine are Speech-to-Text, Large Language Model, and Text-to-Speech.
Speech-to-Text (ASR)
Flow is built on the foundations of Speechmatics' market-leading real-time ASR. The client streams audio to the Flow service over a WebSocket. The service then processes multiple speech and non-speech signals, such as the spoken words, tonality, and audio events, before passing the context to the LLM to formulate a response.
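As an illustration of the client side of this step, the sketch below splits raw audio into small fixed-duration chunks that could be sent as successive binary WebSocket messages. The audio format (16 kHz, mono, 16-bit PCM), the 20 ms chunk size, and the helper name `chunk_pcm` are assumptions made for this example, not values confirmed by this page; the actual connection and message protocol is not shown.

```python
from typing import Iterator

# Assumed audio format for this sketch: 16 kHz, mono, 16-bit signed PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2


def chunk_pcm(audio: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw PCM audio.

    Each chunk could be sent as one binary WebSocket message.
    The final chunk may be shorter than the others.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


# One second of silence splits into fifty 20 ms chunks of 640 bytes each.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm(one_second))
```

Smaller chunks reduce the delay before the service sees new audio, at the cost of more messages per second.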
Flow natively supports multiple speaker detection (Speaker Diarization). Flow can be configured to ignore, acknowledge or engage with non-primary speakers when setting up Templates.
This transcribed text is also streamed back to the client as soon as it is generated to support any client-driven recording, monitoring & analytics workflows.
To improve accuracy on product-specific terminology we recommend using a Custom Dictionary when setting up Templates.
Large Language Model (LLM)
Flow’s conversational understanding and knowledge are powered by LLMs. The transcribed text from the ASR is passed, together with the Flow configuration, to the LLM to formulate a natural-sounding response.
Response generation can be influenced by defining a persona, style, and context when setting up Templates.
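To make the Template options mentioned on this page concrete, here is a minimal sketch that gathers them into a single JSON-serialisable structure. Every field name below (`persona`, `style`, `context`, `custom_dictionary`, `non_primary_speakers`) and the function `build_template_config` are hypothetical, chosen only for illustration; the real Template schema may differ and should be taken from the Templates documentation.

```python
import json
from typing import Dict, List

# The three behaviours this page lists for non-primary speakers.
SPEAKER_MODES = {"ignore", "acknowledge", "engage"}


def build_template_config(
    persona: str,
    style: str,
    context: str,
    custom_dictionary: List[str],
    non_primary_speakers: str = "ignore",
) -> Dict[str, object]:
    """Collect Template settings into one dict (hypothetical field names)."""
    if non_primary_speakers not in SPEAKER_MODES:
        raise ValueError(f"expected one of {sorted(SPEAKER_MODES)}")
    return {
        "persona": persona,
        "style": style,
        "context": context,
        "custom_dictionary": list(custom_dictionary),
        "non_primary_speakers": non_primary_speakers,
    }


config = build_template_config(
    persona="A friendly support agent",
    style="Concise and polite",
    context="The caller is asking about order status",
    custom_dictionary=["Speechmatics", "Flow"],
    non_primary_speakers="acknowledge",
)
payload = json.dumps(config, indent=2)
```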
Text-To-Speech (TTS)
Output generated by the LLM, when ready to be spoken, is converted to audio by the chosen TTS engine. These engines were selected to provide the most natural-sounding responses without trading off latency. This audio is then streamed back to the client, which must then play it back to the user.
Flow Engine
Understanding disfluencies & pacing
Everyone has a different style of speaking. Natural speech is coloured with filler sounds and the pace of speech can vary from speaker to speaker. A one-size-fits-all voice agent can add a lot of friction to the experience if it keeps interrupting you. We’ve designed Flow to adapt to your speaking style and not be over-eager to interrupt, helping to make users feel comfortable.
Handling interruptions
Flow has been modelled on real-world human conversations. Whether it is to stop Flow from going off-track or to correct wrong assumptions, you can interrupt it. We’ve built our own interruption engine that intelligently ignores unintentional interruptions and gracefully handles the ones that it needs to. To avoid sounding abrupt and unnatural when interrupted, Flow will finish the current word that’s being spoken and gradually fade out the next one.
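The "finish the current word, fade out the next" behaviour can be pictured with a small sketch over 16-bit PCM samples. This is not Flow's interruption engine: the linear gain ramp, the function name `fade_out_on_interrupt`, and the assumption that word boundaries are already known as sample indices are all illustrative choices.

```python
from array import array


def fade_out_on_interrupt(samples: array, word_end: int, fade_len: int) -> array:
    """Keep audio up to word_end at full volume, then fade linearly to zero.

    samples:  16-bit PCM samples of the response being played
    word_end: sample index where the current word finishes (assumed known)
    fade_len: number of samples over which the next word is faded out
    Audio after the fade is dropped.
    """
    if word_end < 0 or fade_len < 0:
        raise ValueError("word_end and fade_len must be non-negative")
    end = min(len(samples), word_end + fade_len)
    out = array("h", samples[:end])
    n = max(0, end - word_end)  # number of samples actually faded
    for i in range(n):
        gain = 1.0 - (i + 1) / n  # ramps down so the final sample is silent
        out[word_end + i] = int(out[word_end + i] * gain)
    return out


# Ten samples of constant amplitude; the current word ends at index 4.
playing = array("h", [1000] * 10)
faded = fade_out_on_interrupt(playing, word_end=4, fade_len=4)
```

Here the first four samples are untouched, the next four ramp down to silence, and the remaining two are never played.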
End-of-utterance detection
Based on your voice & what you’ve been saying, Flow can smartly detect when you’re done speaking before it responds for a natural and responsive experience. Flow is built to be human-centric and, while we could achieve much lower latencies, it’s rude to interrupt mid-thought.
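The latency trade-off described here can be seen even in a naive baseline. The sketch below is a plain silence-timeout detector, which is a much simpler technique than Flow's approach (Flow also considers the voice and what has been said). It assumes a hypothetical upstream step has already labelled each audio frame as speech or non-speech; the function name and parameters are illustrative.

```python
from typing import Optional, Sequence


def end_of_utterance_index(
    frame_is_speech: Sequence[bool], min_silence_frames: int
) -> Optional[int]:
    """Return the frame index at which the utterance is declared finished.

    The utterance ends once min_silence_frames consecutive non-speech frames
    follow some speech. Returns None if that never happens.
    """
    if min_silence_frames < 1:
        raise ValueError("min_silence_frames must be at least 1")
    heard_speech = False
    silence_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silence_run = 0
        elif heard_speech:
            silence_run += 1
            if silence_run >= min_silence_frames:
                return i
    return None


# Speech, a short mid-thought pause, more speech, then a long pause.
T, F = True, False
frames = [F, T, T, F, F, T, T, F, F, F, F]
eager = end_of_utterance_index(frames, min_silence_frames=2)
patient = end_of_utterance_index(frames, min_silence_frames=4)
```

With a threshold of two frames the detector fires during the mid-thought pause (frame 4); with four frames it waits until the speaker has really finished (frame 10), at the cost of a slower response.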
Help and support
For any additional issues, please reach out to the Flow Support team at flow-help@speechmatics.com.