Voice Agents — Flow

Overview

Build conversational AI agents with the Flow API

Try Flow free for up to 50 hours per month. Production pricing starts from $0.08/minute, including LLM costs.

Flow is our Voice Agent API that allows you to add responsive, real-time speech-to-speech interactions to any product.

Flow is engineered to engage in natural and fluid conversations by automatically handling interruptions, responding to multiple speakers, and understanding different dialects and accents.

How Flow works

Flow combines Speechmatics' industry-leading ASR with the latest LLMs and text-to-speech engines.

Stream audio in, and Flow returns the spoken (TTS) response along with other useful information, such as a transcript of the conversation.

Component models

The three base components of the Flow Engine are Speech-to-Text, Large Language Model, and Text-to-Speech.

Speech-to-text (ASR)

Flow is built on the foundations of Speechmatics' market-leading real-time ASR. The client streams audio to the Flow service over a WebSocket connection. The service then processes multiple speech and non-speech signals, such as the spoken words, tonality, and audio events, before passing the context to the LLM to formulate a response.
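As a rough illustration of the client side of this step, the sketch below splits raw audio into small fixed-duration frames that a client could then send over its WebSocket connection. The sample rate, sample width, and frame length are assumptions chosen for the example, and `frame_size_bytes` and `iter_frames` are hypothetical helper names; check the Flow API reference for the audio formats the service actually accepts.

```python
from typing import Iterator

# Assumed audio format, for illustration only (not taken from the Flow docs).
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def frame_size_bytes(frame_ms: int) -> int:
    """Number of bytes in one frame of the given duration."""
    return SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE


def iter_frames(pcm: bytes, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames; the last one may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1 s of silence
    frames = list(iter_frames(one_second))
    # Each frame would be sent as one binary WebSocket message.
    print(len(frames), len(frames[0]))
```

Small frames keep latency low because the service can start processing speech before the user has finished talking; the trade-off is more messages per second.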

Flow natively supports multiple speaker detection (Speaker Diarization). Flow can be configured to ignore, acknowledge or engage with non-primary speakers when setting up Agents.

The transcribed text is also streamed back to the client as soon as it is generated, to support any client-driven recording, monitoring, and analytics workflows.

To improve accuracy on product-specific terminology, we recommend using a Custom Dictionary when setting up Agents in the Portal.
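A client-driven monitoring workflow could collect the streamed transcript roughly as follows. This is a minimal sketch: `TranscriptSegment` and `TranscriptLog` are hypothetical names, and the fields (speaker label, text, start time) are assumptions for illustration, not the Flow message schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptSegment:
    # Hypothetical shape for illustration; not the Flow message schema.
    speaker: str
    text: str
    start_s: float


@dataclass
class TranscriptLog:
    """Accumulates segments as they arrive and renders them in time order."""
    segments: List[TranscriptSegment] = field(default_factory=list)

    def add(self, segment: TranscriptSegment) -> None:
        self.segments.append(segment)

    def render(self) -> str:
        ordered = sorted(self.segments, key=lambda s: s.start_s)
        return "\n".join(
            f"[{seg.start_s:6.2f}s] {seg.speaker}: {seg.text}" for seg in ordered
        )


if __name__ == "__main__":
    log = TranscriptLog()
    log.add(TranscriptSegment("S1", "Hi, I'd like to change my booking.", 0.25))
    log.add(TranscriptSegment("S2", "Can you hear me?", 3.10))
    print(log.render())
```

Keeping the speaker label on each segment means the same log can serve analytics that separate the primary speaker from others.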

Large language model (LLM)

Flow’s conversational understanding and knowledge are powered by LLMs. The transcribed text from the ASR is passed, together with the Flow configuration, to the LLM to formulate a natural-sounding response.

You can influence response generation by defining a persona, style, and context when setting up Templates.

Text-to-speech (TTS)

When the LLM's output is ready to be spoken, it is converted to audio by the chosen TTS engine. These engines were selected to provide the most natural-sounding responses without trading off latency. The audio is then streamed back to the client, which must play it back to the user.
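Because playback is the client's responsibility, the client typically needs a small buffer between the network and the audio device. The sketch below is one possible approach, not part of the Flow API: `PlaybackBuffer` is a hypothetical class that queues incoming audio chunks and hands out fixed-size blocks, padding with silence if audio arrives late. It assumes signed PCM audio, where zero bytes are silence.

```python
from collections import deque


class PlaybackBuffer:
    """Queues received audio chunks and serves fixed-size playback blocks."""

    def __init__(self) -> None:
        self._chunks = deque()
        self.buffered = 0  # bytes currently waiting to be played

    def push(self, chunk: bytes) -> None:
        """Store a chunk received from the service."""
        if chunk:
            self._chunks.append(chunk)
            self.buffered += len(chunk)

    def pull(self, n: int) -> bytes:
        """Return exactly n bytes for the audio device."""
        out = bytearray()
        while len(out) < n and self._chunks:
            head = self._chunks.popleft()
            need = n - len(out)
            out += head[:need]
            if len(head) > need:
                # Put the unused remainder back at the front of the queue.
                self._chunks.appendleft(head[need:])
        self.buffered -= len(out)
        out += bytes(n - len(out))  # pad any shortfall with silence
        return bytes(out)

    def clear(self) -> None:
        """Drop queued audio, for example when playback should stop."""
        self._chunks.clear()
        self.buffered = 0
```

An audio callback would call `pull` with the device's block size, while the network handler calls `push` for each received chunk.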

Flow engine

Understanding disfluencies & pacing

Everyone has a different style of speaking. Natural speech is coloured with filler sounds and the pace of speech can vary from speaker to speaker. A one-size-fits-all voice agent can add a lot of friction to the experience if it keeps interrupting you. We’ve designed Flow to adapt to your speaking style and not be over-eager to interrupt, helping to make users feel comfortable.

Handling interruptions

Flow has been modelled on real-world human conversations. Whether it is to stop Flow from going off-track or to correct wrong assumptions, you can interrupt it. We’ve built our own interruption engine that intelligently ignores unintentional interruptions and gracefully handles the ones that it needs to. To avoid sounding abrupt and unnatural when interrupted, Flow will finish the current word that’s being spoken and gradually fade out the next one.
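To make the fade-out idea concrete, here is a minimal sketch of fading one word's audio to silence. It is an illustration only, not Flow's interruption engine: it assumes 16-bit PCM in the machine's native byte order and a linear ramp, and `fade_out` and `interrupted_tail` are hypothetical names. In Flow this handling happens in the service, so clients do not need to implement it.

```python
from array import array


def fade_out(pcm: bytes) -> bytes:
    """Linearly ramp 16-bit PCM samples from full volume down to silence."""
    samples = array("h")  # signed 16-bit integers, native byte order
    samples.frombytes(pcm)
    n = len(samples)
    for i in range(n):
        # Gain goes from 1.0 on the first sample to 0.0 on the last.
        gain = (n - 1 - i) / (n - 1) if n > 1 else 0.0
        samples[i] = int(samples[i] * gain)
    return samples.tobytes()


def interrupted_tail(current_word: bytes, next_word: bytes) -> bytes:
    """Audio to play after an interruption: finish one word, fade the next."""
    return current_word + fade_out(next_word)
```

Finishing the current word avoids a clipped syllable, and the ramp on the following word avoids the audible click that an abrupt cut to silence can produce.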

End-of-turn detection

Flow uses a small language model (SLM) architecture that draws on your voice and what you’ve been saying to detect when you’ve finished speaking before it responds, for a natural and responsive experience. Flow is built to be human-centric: while we could achieve much lower latencies, it’s rude to interrupt mid-thought.

Help and support

For any additional issues, please reach out to the Flow Support team at flow-help@speechmatics.com.
