Transcription GPU Inference Container



System Requirements

The system must have:

  • Nvidia GPU(s) with at least 16GB of GPU memory
  • Nvidia drivers (see below for supported versions)
  • CUDA compute capability of 7.5 or above, which corresponds to the Turing architecture. Cards with the Volta architecture (7.0) or below are not able to run the models
  • 24 GB RAM
  • The nvidia-container-toolkit installed
  • Docker version > 19.03

The raw image size of the GPU Inference Container is around 15GB.

Nvidia Drivers

The GPU Inference Container is based on CUDA 11.7.1, which requires the following Nvidia drivers:

  • 515 or later

If you are running on a data center GPU (e.g., a T4) you can use these drivers:

  • 450.51 or later R450
  • 470.57 or later R470
  • 510.47 or later R510

Driver installation can be validated by running nvidia-smi. This command should return the Nvidia driver version and show additional information about the GPU(s).
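For a more specific check against the requirements above, nvidia-smi's query mode can report the driver version and GPU memory directly (a sketch; the exact query fields available depend on your nvidia-smi version):

```shell
# Report GPU name, driver version, and total memory (should be at least 16 GB)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```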

Azure Instances

The GPU node can be provisioned in the cloud. Our SaaS deployment uses

but any NC- or ND-series instance with sufficient memory should work.

Running the Image

Currently, each GPU Inference Container can only run on a single GPU. If a system has more than one GPU, select the device by setting the CUDA_VISIBLE_DEVICES environment variable or by passing the --gpus argument to docker run. See the Nvidia/CUDA documentation for details.

docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus '"device=0"' \
  -p 8001:8001 \
  <gpu-inference-container-image>

When the Container starts you should see output similar to this, indicating that the server has started and is ready to serve requests.

I1207 09:34:22.462341 1]
| Model          | Version | Status |
| kaldi_enhanced | 1       | READY  |
| kaldi_standard | 1       | READY  |
| lm_enhanced    | 1       | READY  |
I1207 09:34:22.612211 1] Started GRPCInferenceService at
I1207 09:34:22.624076 1] Started HTTPService at
I1207 09:34:22.665759 1] Started Metrics Service at

Batch and Real-Time Inference

The Inference server can run in two modes: batch, for processing whole files and returning the transcript at the end, and real-time, for processing audio streams. The default mode is batch. To configure the GPU server for real-time, set the environment variable SM_BATCH_MODE=false by passing it into the docker run command.
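For example, to start the server in real-time mode (a sketch based on the docker run example above; the final image name is a placeholder for the image you were provided):

```shell
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus '"device=0"' \
  -e SM_BATCH_MODE=false \
  -p 8001:8001 \
  <gpu-inference-container-image>
```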

The modes correspond to the two types of client speech Container, which are distinguished by their name:

  • rt-asr-transcriber-en:<version>
  • batch-asr-transcriber-en:<version>

The server can only support one of these modes at a time.

Linking to a GPU Inference Container

Once the GPU Server is running, follow the Instructions for Linking a CPU Container.

Running Only One Operating Point

Operating Points represent different levels of model complexity. To reduce GPU memory usage, you can run the server with only one Operating Point loaded. To do this, pass the SM_OPERATING_POINT environment variable to the container and set it to either standard or enhanced.
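For example, to load only the standard Operating Point (a sketch; the image name is a placeholder for the image you were provided):

```shell
docker run --rm -it \
  -v $PWD/license.json:/license.json \
  --gpus '"device=0"' \
  -e SM_OPERATING_POINT=standard \
  -p 8001:8001 \
  <gpu-inference-container-image>
```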

Monitoring the Server

The inference server is based on Nvidia's Triton architecture and as such can be monitored using Triton's inbuilt Prometheus metrics, or the GRPC/HTTP APIs. To expose these, configure an external mapping for port 8002 (Prometheus) or 8000 (HTTP).
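Once port 8002 is mapped, the metrics can be sanity-checked from the host (a sketch; Triton's metric names are prefixed with nv_, but the exact set varies by version):

```shell
# Fetch the Prometheus metrics endpoint and show Triton's nv_-prefixed metrics
curl -s http://localhost:8002/metrics | grep '^nv_'
```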

Operating Points in GPU Inference

When inference is offloaded to a GPU server, alternative GPU-specific models are used, so you should not expect identical results compared to CPU-based inference. For convenience, the GPU models are also designated as 'standard' and 'enhanced'.

Docker-Compose example

This docker-compose file will create a Speechmatics GPU inference server:

(assumes your license.json file is in the current working directory)

version: '3.8'

networks:
  transcriber:
    driver: bridge

services:
  triton:
    # image: set to the GPU Inference Container image you were provided
    image: <gpu-inference-container-image>
    container_name: triton
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              ### Limit to N GPUs
              # count: 1
              ### Pick specific GPUs by device ID
              # device_ids:
              #   - 0
              #   - 3
              capabilities:
                - gpu
    networks:
      - transcriber
    expose:
      - 8000/tcp
      - 8001/tcp
      - 8002/tcp
    environment:
      - NVIDIA_REQUIRE_CUDA=cuda>=11.6
    volumes:
      - $PWD/license.json:/license.json:ro

Deploying on Kubernetes


You will have to either install Nvidia drivers on the node, or use a base image with the drivers already installed.


These manifests define the Kubernetes Deployment and Service objects:


The suggested way to license the inference server in Kubernetes is via a secret. With your file license.json from Speechmatics, run this command to generate a secret:

kubectl create secret \
  generic transcriber-license \
  --from-file=license.json

which can then be mapped into the container as shown in the example manifest.
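As an illustration, the mapping in the Deployment's pod spec might look like this (a sketch, assuming the secret name above; consult the example manifest for the authoritative version):

```yaml
# Pod spec fragment: mount the license secret as /license.json in the container
volumes:
  - name: license
    secret:
      secretName: transcriber-license
containers:
  - name: triton
    volumeMounts:
      - name: license
        mountPath: /license.json
        subPath: license.json
        readOnly: true
```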

GPU Inference Performance

This is a comparison of the performance and estimated running costs of CPU-based and GPU-based transcription running on standard Azure VMs.

| Operating Point | Standard | Enhanced | Standard | Enhanced |
|---|---|---|---|---|
| Lowest Processing Cost (US ¢ per hour) | | | | |
| Cost vs CPU (%) | -- | -- | 38% | 59% |
| Maximum throughput¹ | 53.2 | 23.7 | 117.7 | 33.54 |
| Minimum Real-Time Factor (RTF)² | 0.14 | 0.33 | 0.043 | 0.088 |
| Transcriber count | 20 | 20 | 50 | 13 |

The benchmark used the following configuration:

Benchmark details:

  • Price basis: Azure PAYG East US, Linux, Standard
¹ Throughput is measured as hours of audio per hour of system runtime. A throughput of 50 would mean that in one hour, the system as a whole can transcribe fifty hours of audio.
² An RTF of 1 would mean that a one hour file would take one hour to transcribe. An RTF of 0.1 would mean that a one hour file would take six minutes to transcribe.
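These two definitions can be checked with a quick calculation (a sketch using awk, not part of the benchmark):

```shell
# RTF = processing time / audio duration, so processing time = audio duration * RTF.
# A one-hour (60-minute) file at RTF 0.1 takes 60 * 0.1 = 6 minutes.
awk 'BEGIN { printf "%.0f minutes\n", 60 * 0.1 }'

# Throughput is audio hours per hour of runtime: at throughput 50,
# one hour of runtime transcribes 50 hours of audio.
awk 'BEGIN { printf "%.0f hours of audio\n", 1 * 50 }'
```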