Flow


Prerequisites

Speechmatics Flow Client

You will require the speechmatics-flow client, which can be installed from PyPI with pip install speechmatics-flow.

Hugging Face

For an on-prem LLM deployment, you will need a Hugging Face token with read access to the meta-llama/Llama-3.2-3B-Instruct model, or to any other model you wish to use.

The Helm chart requires a valid Hugging Face token stored in a Kubernetes secret named vllm-secret on the cluster.

You can create the secret with the following command:

kubectl create secret generic vllm-secret \
  --from-literal=hf-token-secret="$HUGGING_FACE_TOKEN"

Alternatively, you can configure the chart to create the secret for you using the following values:

flow:
  vllm:
    hfTokenSecret:
      createSecret: true
      token: "$BASE64_ENCODED_HF_TOKEN"
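With this approach the chart expects the token value base64-encoded. A minimal way to produce the encoded value, assuming a Linux shell with GNU coreutils and the token already exported as HUGGING_FACE_TOKEN:

```shell
# Base64-encode the Hugging Face token for use in the chart values.
# -w0 disables line wrapping (GNU coreutils base64).
BASE64_ENCODED_HF_TOKEN="$(printf '%s' "$HUGGING_FACE_TOKEN" | base64 -w0)"
echo "$BASE64_ENCODED_HF_TOKEN"
```

You can sanity-check the value by piping it back through base64 -d and comparing it with the original token.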

Hardware Requirements

The example in this documentation has been tested on a Kubernetes cluster running the following Azure node sizes:

| Service | Node Size |
| --- | --- |
| STT (Inference Server) | Standard_NC4as_T4_v3 |
| STT (Transcriber) | Standard_E16s_v5 |
| TTS | Standard_NC4as_T4_v3 |
| vLLM | Standard_NV72ads_A10_v5 |
| All Other Services | Standard_D*s_v5 |

Depending on the LLM model you choose, the GPU requirements may vary.

Configuration

The Speechmatics Helm chart can be configured to deploy and manage the services required to run Speechmatics Flow in on-premise environments.

The example YAML below configures the Helm chart to deploy Flow fully on-premise, using a local LLM and TTS service.

# flow.values.yaml
flow:
  enabled: true

  vllm:
    hfTokenSecret:
      token: HUGGING_FACE_TOKEN

    config:
      # -- Model to use for vLLM
      model: meta-llama/Llama-3.2-3B-Instruct
      numGPUs: 2

      # -- Use less GPU memory to fit the 3B model on Standard_NV72ads_A10_v5
      dtype: float16
      disableLogRequests: true
      enablePrefixCaching: false
      additionalArgs:
        max-num-seqs: "1"
        gpu-memory-utilization: "0.75"
    
    resources:
      limits:
        nvidia.com/gpu: "2"

Installation

helm upgrade --install speechmatics-realtime \
  oci://speechmaticspublic.azurecr.io/sm-charts/sm-realtime \
  --version 0.5.7 \
  --set proxy.ingress.url="speechmatics.example.com" \
  -f flow.values.yaml

Have a Conversation

To list the agents available to converse with, run the following command:

kubectl get configmap proxy-agent-config -o yaml | yq '.data | keys'

Example output:

- 32f4725b-fde5-4521-a57c-e899e245d0b0_latest.json

# 32f4725b-fde5-4521-a57c-e899e245d0b0 is your agent ID
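The agent ID is the configmap key with its _latest.json suffix removed. One way to strip the suffix, assuming a POSIX shell and the example key above:

```shell
# Key taken from the example output above
AGENT_KEY="32f4725b-fde5-4521-a57c-e899e245d0b0_latest.json"

# Strip the trailing "_latest.json" to recover the agent ID
AGENT_ID="${AGENT_KEY%_latest.json}"
echo "$AGENT_ID"   # 32f4725b-fde5-4521-a57c-e899e245d0b0
```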

You can use this agent ID to begin your conversation using the speechmatics-flow client.

speechmatics-flow --url wss://speechmatics.example.com/v1/flow \
  --ssl-mode insecure \
  --assistant 32f4725b-fde5-4521-a57c-e899e245d0b0

On-Premise LLM with vLLM

The on-prem LLM uses vLLM and, by default, runs the meta-llama/Llama-3.2-3B-Instruct model from Hugging Face.
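To run a different model, point the chart at another Hugging Face model ID and adjust the GPU settings to match its memory footprint. A sketch of the relevant values, where the model name and GPU counts are illustrative rather than a recommendation:

```yaml
flow:
  vllm:
    config:
      # Illustrative alternative model -- any Hugging Face model ID
      # your token has access to should work here
      model: meta-llama/Llama-3.1-8B-Instruct
      numGPUs: 2
    resources:
      limits:
        nvidia.com/gpu: "2"
```

Larger models generally need more GPU memory, so you may also need to revisit dtype and gpu-memory-utilization from the earlier example.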