Build a conversational AI web app with Next.js and Flow

In this guide, we'll walk you through building a conversational AI web application using Next.js and Flow. You'll learn how to set up your development environment, create a Next.js project, integrate Flow, and implement a simple conversational AI feature.

A more comprehensive example can be found here.

Prerequisites

Before getting started, ensure you have:

  1. A Speechmatics account and API key (available from the Speechmatics user portal).
  2. Node.js and npm installed.

Project Setup

Start by creating a fresh Next.js app:

npx create-next-app@latest

You'll see the following prompts; you can answer them as follows:

What is your project named? … nextjs-flow-guide
Would you like to use TypeScript? … Yes
Would you like to use ESLint? … Yes
Would you like to use Tailwind CSS? … Yes
Would you like your code inside a `src/` directory? … No
Would you like to use App Router? (recommended) … Yes
Would you like to use Turbopack for `next dev`? … Yes
Would you like to customize the import alias (`@/*` by default)? … No

After the prompts, create-next-app will create a folder with your project name and install the required dependencies.

Let's install our main dependencies:

# Speechmatics Flow client for React-based apps
npm i @speechmatics/flow-client-react

# Speechmatics browser audio input (provides the PCM audio worklet)
npm i @speechmatics/browser-audio-input

# Speechmatics browser audio input for React-based apps
npm i @speechmatics/browser-audio-input-react

# Package for playing PCM audio in the browser
npm i @speechmatics/web-pcm-player-react

# Speechmatics auth package
npm i @speechmatics/auth

We are going to install some development dependencies as well:

# UI component library (Tailwind CSS plugin)
npm i daisyui -D

# Plugin that lets us configure Next.js to copy the PCM audio worklet file
# from @speechmatics/browser-audio-input into the public directory
npm i copy-webpack-plugin -D

Now let's configure Next.js so that it serves the PCM audio worklet from the public directory. Edit next.config.ts so that it looks like this:

import path from "node:path";
import CopyWebpackPlugin from "copy-webpack-plugin";
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  webpack: (config, { isServer }) => {
    // Use CopyWebpackPlugin to copy the file to the public directory
    if (!isServer) {
      config.plugins.push(
        new CopyWebpackPlugin({
          patterns: [
            {
              from: path.resolve(
                __dirname,
                "node_modules/@speechmatics/browser-audio-input/dist/pcm-audio-worklet.min.js"
              ),
              to: path.resolve(__dirname, "public/js/[name][ext]"),
            },
          ],
        })
      );
    }

    return config;
  },
};

export default nextConfig;

Initial app structure

Now we are going to build a minimal app structure and configure the Flow client and browser audio input.

Edit /app/page.tsx

import { fetchPersonas, FlowProvider } from "@speechmatics/flow-client-react";
import { PCMAudioRecorderProvider } from "@speechmatics/browser-audio-input-react";

export default async function Home() {
  const personas = await fetchPersonas();

  return (
    // Two context providers:
    // 1. For the audio recorder (see https://github.com/speechmatics/speechmatics-js-sdk/blob/main/packages/browser-audio-input-react/README.md)
    // 2. For the Flow API client (see https://github.com/speechmatics/speechmatics-js-sdk/blob/main/packages/flow-client-react/README.md)
    <PCMAudioRecorderProvider workletScriptURL="/js/pcm-audio-worklet.min.js">
      <FlowProvider
        appId="nextjs-example"
        audioBufferingMs={500} // How many milliseconds of agent audio to buffer before playing back
        websocketBinaryType="arraybuffer" // This is optional, but does lead to better audio performance, particularly on Firefox
      >
        <div className="container p-4 mx-auto max-xl:container">
          <h1 className="text-2xl font-bold">
            Speechmatics ❤️ NextJS Flow Example
          </h1>
          {/* Our custom components here will have access to flow client and browser audio functionality */}
        </div>
      </FlowProvider>
    </PCMAudioRecorderProvider>
  );
}

We now have the basic skeleton in place. All of our custom components will be rendered inside the Flow and PCM audio recorder providers, so any component can access the functionality they provide.
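
To illustrate, any client component rendered inside these providers can call the SDK hooks directly. The following is a minimal sketch (the ConnectionStatus component is our own illustration, not part of the final app) that reads the connection state from useFlow:

"use client";
import { useFlow } from "@speechmatics/flow-client-react";

// Illustrative component: because it renders inside FlowProvider,
// it can read the Flow client's connection state straight from the hook.
export function ConnectionStatus() {
  const { socketState, sessionId } = useFlow();
  return (
    <p>
      Socket: {socketState ?? "idle"}
      {sessionId ? ` (session ${sessionId})` : ""}
    </p>
  );
}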

Flow integration

Let's complete our app by adding the missing functionality: we need to establish a connection to the Flow backend, send microphone audio to it, and play back the audio responses.

To keep things organised, let's create a couple of folders in the root directory of the project:

  1. /components: we'll put the files implementing our custom components here.
  2. /hooks: custom hooks implementations will live here.

Custom hooks

It's time to add a custom hook to the /hooks folder.

/hooks/useFlowWithBrowserAudio.ts

1"use client";
2import { useCallback, useState } from "react";
3import {
4  type AgentAudioEvent,
5  useFlow,
6  useFlowEventListener,
7} from "@speechmatics/flow-client-react";
8import { getJWT } from "./actions";
import {
  usePCMAudioListener,
  usePCMAudioRecorder,
} from "@speechmatics/browser-audio-input-react";
import { usePCMAudioPlayer } from "@speechmatics/web-pcm-player-react";

const RECORDING_SAMPLE_RATE = 16_000;

// Hook to set up two-way audio between the browser and Flow
export function useFlowWithBrowserAudio() {
  const { startConversation, endConversation, sendAudio } = useFlow();
  const { startRecording, stopRecording } = usePCMAudioRecorder();
  const [audioContext, setAudioContext] = useState<AudioContext>();

  // Normally we would be able to use the same audio context for playback and recording,
  // but there is a bug in Firefox which prevents capturing microphone audio at 16,000 Hz.
  // So in Firefox, we need to use a separate audio context for playback.
  const [playbackAudioContext, setPlaybackAudioContext] =
    useState<AudioContext>();

  const { playAudio } = usePCMAudioPlayer(playbackAudioContext);

  // Send audio to Flow when we receive it from the active input device
  usePCMAudioListener((audio: Float32Array) => {
    sendAudio(audio.buffer);
  });

  // Play back audio when we receive it from Flow
  useFlowEventListener(
    "agentAudio",
    useCallback(
      ({ data }: AgentAudioEvent) => {
        playAudio(data);
      },
      [playAudio]
    )
  );

  const startSession = useCallback(
    async ({
      personaId,
      deviceId,
    }: {
      personaId: string;
      deviceId: string;
    }) => {
      const jwt = await getJWT("flow");

      const isFirefox = navigator.userAgent.includes("Firefox");
      const audioContext = new AudioContext({
        sampleRate: isFirefox ? undefined : RECORDING_SAMPLE_RATE,
      });
      setAudioContext(audioContext);

      const playbackAudioContext = isFirefox
        ? new AudioContext({ sampleRate: 16_000 })
        : audioContext;
      setPlaybackAudioContext(playbackAudioContext);

      await startConversation(jwt, {
        config: {
          template_id: personaId,
          template_variables: {
            // We can set up any template variables here
          },
        },
        audioFormat: {
          type: "raw",
          encoding: "pcm_f32le",
          sample_rate: audioContext.sampleRate,
        },
      });

      await startRecording({
        deviceId,
        audioContext,
      });
    },
    [startConversation, startRecording]
  );

  const closeAudioContext = useCallback(() => {
    if (audioContext?.state !== "closed") {
      audioContext?.close();
    }
    setAudioContext(undefined);
    if (playbackAudioContext?.state !== "closed") {
      playbackAudioContext?.close();
    }
    setPlaybackAudioContext(undefined);
  }, [audioContext, playbackAudioContext]);

  const stopSession = useCallback(async () => {
    endConversation();
    stopRecording();
    closeAudioContext();
  }, [endConversation, stopRecording, closeAudioContext]);

  return { startSession, stopSession };
}

The hook defined above uses a function to retrieve a JWT. A JWT is required to talk to the Flow API, and this temporary token should be obtained from a server rather than from client-side code: generating JWTs requires an API key, and we don't want to expose API keys in client-side code.

We can keep this functionality on the server side with Next.js. Create a file named actions.ts in the root directory of the project with the following contents (note the 'use server' directive). More information about calling server actions from client components can be found here.

"use server";

import { createSpeechmaticsJWT } from "@speechmatics/auth";

export async function getJWT(type: "flow" | "rt") {
  const apiKey = process.env.API_KEY;
  if (!apiKey) {
    throw new Error("Please set the API_KEY environment variable");
  }

  return createSpeechmaticsJWT({ type, apiKey, ttl: 60 });
}

As mentioned above, this code needs an API key, which we pass to our Next.js app through the API_KEY environment variable. Let's create a .env file at the root of the project with the following content:

API_KEY='YOUR-API-KEY-GOES-HERE'

API keys can be retrieved from the Speechmatics user portal.

Custom components

We'll also add some components that will be rendered within our existing skeleton app.

/components/MicrophoneSelect.tsx

"use client";
import { useAudioDevices } from "@speechmatics/browser-audio-input-react";

export function MicrophoneSelect({ disabled }: { disabled?: boolean }) {
  const devices = useAudioDevices();

  switch (devices.permissionState) {
    case "prompt":
      return (
        <Select
          label="Enable mic permissions"
          onClick={devices.promptPermissions}
          onKeyDown={devices.promptPermissions}
        />
      );
    case "prompting":
      return <Select label="Enable mic permissions" aria-busy="true" />;
    case "granted": {
      return (
        <Select label="Select audio device" name="deviceId" disabled={disabled}>
          {devices.deviceList.map((d) => (
            <option key={d.deviceId} value={d.deviceId}>
              {d.label}
            </option>
          ))}
        </Select>
      );
    }
    case "denied":
      return <Select label="Enable mic permissions" disabled />;
    default:
      devices satisfies never;
      return null;
  }
}

interface SelectProps extends React.SelectHTMLAttributes<HTMLSelectElement> {
  label: string;
  children?: React.ReactNode;
}

export const Select = ({
  label,
  children,
  className,
  ...props
}: SelectProps) => (
  <label className="form-control w-full max-w-xs">
    <div className="label">
      <span className="font-semibold">{label}</span>
    </div>
    <select className={`select select-bordered ${className || ""}`} {...props}>
      {children}
    </select>
  </label>
);

/components/Controls.tsx

"use client";
import { type FormEventHandler, useCallback, useMemo } from "react";
import { MicrophoneSelect, Select } from "./MicrophoneSelect";
import { useFlow } from "@speechmatics/flow-client-react";
import { useFlowWithBrowserAudio } from "../hooks/useFlowWithBrowserAudio";

export function Controls({
  personas,
}: {
  personas: Record<string, { name: string }>;
}) {
  const { socketState, sessionId } = useFlow();
  const { startSession, stopSession } = useFlowWithBrowserAudio();
  const handleSubmit = useCallback<FormEventHandler<HTMLFormElement>>(
    (e) => {
      e.preventDefault();
      const formData = new FormData(e.target as HTMLFormElement);
      const personaId = formData.get("personaId")?.toString();
      if (!personaId) throw new Error("No persona selected!");
      const deviceId = formData.get("deviceId")?.toString();
      if (!deviceId) throw new Error("No device selected!");

      startSession({ personaId, deviceId });
    },
    [startSession]
  );

  const conversationButton = useMemo(() => {
    if (socketState === "open" && sessionId) {
      return (
        <button
          type="button"
          className="flex-1 btn btn-primary text-md"
          onClick={stopSession}
        >
          End conversation
        </button>
      );
    }
    if (
      socketState === "connecting" ||
      socketState === "closing" ||
      (socketState === "open" && !sessionId)
    ) {
      return (
        <button
          type="button"
          className="flex-1 btn btn-primary text-md"
          disabled
        >
          <span className="loading loading-spinner" />
        </button>
      );
    }
    return (
      <button type="submit" className="flex-1 btn btn-primary text-md">
        Start conversation
      </button>
    );
  }, [socketState, sessionId, stopSession]);

  return (
    <form onSubmit={handleSubmit}>
      <MicrophoneSelect />
      <Select label="Select a persona" name="personaId">
        {Object.entries(personas).map(([id, persona]) => (
          <option key={id} value={id} label={persona.name} />
        ))}
      </Select>
      <div>{conversationButton}</div>
    </form>
  );
}

The Controls component renders a form that lets us choose a persona and an input device and start a conversation. It uses the useFlowWithBrowserAudio hook we created earlier to start and stop sessions, and useFlow to track the connection state.

We just need to include the <Controls/> component in the skeleton we created earlier on the main page.

app/page.tsx

import { FlowProvider, fetchPersonas } from "@speechmatics/flow-client-react";
import { PCMAudioRecorderProvider } from "@speechmatics/browser-audio-input-react";
import { Controls } from "../components/Controls";

export default async function Home() {
  const personas = await fetchPersonas();
  return (
    // Two context providers:
    // 1. For the audio recorder (see https://github.com/speechmatics/speechmatics-js-sdk/blob/main/packages/browser-audio-input-react/README.md)
    // 2. For the Flow API client (see https://github.com/speechmatics/speechmatics-js-sdk/blob/main/packages/flow-client-react/README.md)
    <PCMAudioRecorderProvider workletScriptURL="/js/pcm-audio-worklet.min.js">
      <FlowProvider
        appId="nextjs-example"
        audioBufferingMs={500}
        websocketBinaryType="arraybuffer" // This is optional, but does lead to better audio performance, particularly on Firefox
      >
        <div className="container p-4 mx-auto max-xl:container">
          <h1 className="text-2xl font-bold mb-4">
            Speechmatics ❤️ NextJS Flow Example
          </h1>
          <Controls personas={personas} />
        </div>
      </FlowProvider>
    </PCMAudioRecorderProvider>
  );
}

Running the app

First, let's build the app so that the copy-webpack-plugin configuration we created earlier can run and copy pcm-audio-worklet.min.js into the public directory.

npm run build

Now we can run the app by starting it with:

npm run start

or by running it in dev mode:

npm run dev

Additional resources

This guide covers the minimum steps to get up and running with the Flow client library. A full example showcasing additional features of the Flow API, such as displaying the transcript of the conversation, can be found here.
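
As a taste of what that involves, here is an illustrative sketch of a transcript view, not a definitive implementation: it assumes the Flow client also emits a "message" event for each JSON message received from the server, and that user speech arrives in "AddTranscript" messages carrying a metadata.transcript field. Check the full example and the Flow API reference for the exact event names and message shapes.

"use client";
import { useState } from "react";
import { useFlowEventListener } from "@speechmatics/flow-client-react";

// Illustrative sketch: collect transcript lines as they arrive from the server.
// The "message" event name and the "AddTranscript" message shape are assumptions
// here; verify them against the Flow API reference before relying on them.
export function TranscriptView() {
  const [lines, setLines] = useState<string[]>([]);

  useFlowEventListener("message", ({ data }) => {
    if (data.message === "AddTranscript") {
      setLines((prev) => [...prev, data.metadata.transcript]);
    }
  });

  return (
    <ul>
      {lines.map((line, i) => (
        // Index-based keys are acceptable for this append-only sketch
        <li key={`${i}-${line}`}>{line}</li>
      ))}
    </ul>
  );
}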

Dive deeper into the tools used in this guide:

Speechmatics JS SDK