
Translation GPU Inference Container



System Requirements

Note: The system requirements for the translation inference server match those of the GPU inference container for transcription, except that the RAM and CPU requirements are lower. The two servers cannot share a GPU.

The system must have:

  • Nvidia GPU(s) with at least 16GB of GPU memory
  • Nvidia drivers (see below for supported versions)
  • CUDA compute capability of 7.5 or above, which corresponds to the Turing architecture or newer. Cards with the Volta architecture (7.0) or older cannot run the models
  • 5GB RAM
  • 4 vCPUs
  • The nvidia-container-toolkit installed
  • Docker version > 19.03
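
The two GPU-specific minimums above (compute capability 7.5 and 16GB of GPU memory) can be checked with a few lines of Python. This is a minimal sketch, not a Speechmatics tool: the function names are hypothetical, and memory_gb is assumed to be the card's nominal size, since tools such as nvidia-smi may report slightly less than the nominal figure.

```python
def parse_compute_capability(value: str) -> tuple:
    """Turn a string such as "7.5" into a comparable (major, minor) tuple."""
    major, _, minor = value.partition(".")
    return int(major), int(minor or 0)


def gpu_meets_requirements(compute_capability: str, memory_gb: float) -> bool:
    """Check a GPU against the documented minimums: capability 7.5 and 16GB."""
    return (
        parse_compute_capability(compute_capability) >= (7, 5)
        and memory_gb >= 16  # nominal card size, an assumption of this sketch
    )
```

For example, gpu_meets_requirements("7.5", 16) accepts a Turing card with 16GB, while gpu_meets_requirements("7.0", 32) rejects a Volta card regardless of its memory.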

The raw Docker image size of the Translation Container is around 10GB.

Nvidia Drivers

The Translation Container is based on CUDA 11.7.1, which requires the following Nvidia drivers:

  • 515 or later

If you are running on a data center GPU (e.g., a T4), you can also use the following drivers:

  • 450.51 or later in the R450 branch
  • 470.57 or later in the R470 branch
  • 510.47 or later in the R510 branch

Driver installation can be validated by running nvidia-smi. This command should return the Nvidia driver version and show additional information about the GPU(s).
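
The driver rules above can be expressed as a small check. The sketch below is illustrative only: driver_supported is a hypothetical helper, and it assumes that driver branches not listed above (for example R460) are unsupported on data center GPUs.

```python
# Minimum minor version for each driver branch listed for data center GPUs.
DATA_CENTER_MINIMUMS = {450: 51, 470: 57, 510: 47}


def driver_supported(version: str, data_center_gpu: bool = False) -> bool:
    """Return True if a driver version such as "515.65.01" satisfies the rules."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if major >= 515:
        return True  # 515 or later works on any supported GPU
    if data_center_gpu and major in DATA_CENTER_MINIMUMS:
        return minor >= DATA_CENTER_MINIMUMS[major]
    return False  # assumption: unlisted branches are treated as unsupported
```

The version string can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader and passed to the function.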

Azure Instances

The GPU node can be provisioned in the cloud. Our SaaS deployment uses

but any NC or ND series with sufficient memory should work.

Running the image

Currently, each Translation Container can only run on a single GPU. If a system has more than one GPU, the device must be specified, either by setting CUDA_VISIBLE_DEVICES or by selecting the device with the --gpus argument. See the Nvidia/CUDA documentation for details.

# The gRPC endpoint uses port 8001; it can be mapped to any host port.
docker run --rm -it \
  --gpus '"device=0"' \
  -p 8001:8001 \

On startup you will see logs detailing available GPU memory. As set out in the requirements section, the system must have a minimum of 16GB of GPU memory, though extra GPU memory may be used if available.

Total GPU memory: 40960MiB
Approx. size models: 5GB
Available GPU memory after models loaded: 35GB

Sending Requests

Batch and Real-Time (RT) transcribers handle sending requests to the translation inference server. To run a transcription job with translation, follow the instructions for running the CPU container and additionally:

  • Set the environment variable SM_TRANSLATION_ENDPOINT in the transcriber to the gRPC endpoint of the running translation inference server, in the form <server_ip_address>:<port>, where the port is the host port mapped to port 8001 of the translation Docker container (see Running the image)
  • Include a translation_config in your job config.
  • Use a transcriber version 10.3.0 or newer.
  • Ensure you use a license which allows translation.
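
The endpoint value is easy to get wrong when the container port is mapped to a different host port. As a rough sketch, the hypothetical helper below derives the SM_TRANSLATION_ENDPOINT value from the mapping passed to docker's -p flag. It assumes the simple host_port:container_port form of the mapping and is not part of any Speechmatics tooling.

```python
def endpoint_from_mapping(server_ip_address: str, port_mapping: str) -> str:
    """Build the SM_TRANSLATION_ENDPOINT value from a "host:container" mapping."""
    host_port, _, container_port = port_mapping.partition(":")
    if container_port != "8001":
        # The gRPC endpoint listens on port 8001 inside the container.
        raise ValueError("mapping must target container port 8001")
    return f"{server_ip_address}:{int(host_port)}"


def transcriber_env(server_ip_address: str, port_mapping: str) -> dict:
    """Environment variables to pass to the transcriber container."""
    return {
        "SM_TRANSLATION_ENDPOINT": endpoint_from_mapping(
            server_ip_address, port_mapping
        )
    }
```

With a server at 10.0.0.5 started with -p 9001:8001, the transcriber needs the endpoint 10.0.0.5:9001, not 10.0.0.5:8001.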

Translation Language Pairs

The translation inference container is not language specific, meaning that all 69 supported translation language pairs can run on a single inference container. The source language is defined by the language of the transcriber sending requests.

By default, a maximum of 5 target languages can be requested at once. This behaviour can be changed by setting the environment variable SM_TRANSLATION_MAX_TARGET_LANGUAGES in the transcriber. Setting this to 0 will disable the limit.
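
The limit described above can be mirrored in a client-side check before a job is submitted. This is only a sketch of the documented rule (a default of 5, with 0 disabling the limit); the function name is hypothetical, and the transcriber's own validation remains authoritative.

```python
DEFAULT_MAX_TARGET_LANGUAGES = 5


def within_target_limit(target_languages, max_target_languages=DEFAULT_MAX_TARGET_LANGUAGES):
    """Return True if the request stays within the configured limit."""
    if max_target_languages == 0:
        return True  # 0 disables the limit
    return len(target_languages) <= max_target_languages
```

Here max_target_languages stands for the value of SM_TRANSLATION_MAX_TARGET_LANGUAGES set in the transcriber.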

Example of running translation

Assuming the following config file, where target_languages lists the languages to translate into and enables translation:

{
  "type": "transcription",
  "transcription_config": {
    "operating_point": "enhanced",
    "language": "en"
  },
  "translation_config": {
    "target_languages": ["es", "de"]
  }
}

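A config file like the one above can also be generated programmatically, which avoids hand-written JSON errors. The helper below is a hypothetical sketch that only reproduces the fields shown in this example.

```python
import json


def build_job_config(language, target_languages, operating_point="enhanced"):
    """Return a job config with translation enabled for the given languages."""
    return {
        "type": "transcription",
        "transcription_config": {
            "operating_point": operating_point,
            "language": language,
        },
        "translation_config": {
            # Listing target languages here is what enables translation.
            "target_languages": list(target_languages),
        },
    }


if __name__ == "__main__":
    # Print the config as JSON so it can be redirected into a config file.
    print(json.dumps(build_job_config("en", ["es", "de"]), indent=2))
```
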
You can run batch transcription and translation with:

cat ~/$AUDIO_FILE | docker run -i \
  -v ~/$CONFIG_FILE:/config.json \
  -e LICENSE_TOKEN=eyJhbGciOiJ... \
  -e SM_TRANSLATION_ENDPOINT=<server>:<port> \

Or start a translation enabled real-time container with:

# SM_TRANSLATION_MAX_TARGET_LANGUAGES raises the allowed number of target languages.
docker run -p 9000:9000 -e LICENSE_TOKEN=eyJhbGciOiJ... \
    -e SM_TRANSLATION_ENDPOINT=<server>:<port> \
    -e SM_TRANSLATION_MAX_TARGET_LANGUAGES=10 \

Monitoring the server

The inference server is based on Nvidia's Triton architecture, so it can be monitored using Triton's built-in Prometheus metrics or its gRPC/HTTP APIs. To expose these, configure an external mapping for port 8002 (Prometheus) or port 8000 (HTTP).
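
Once port 8002 is mapped, the metrics can be scraped by Prometheus or read directly. The sketch below fetches Triton's /metrics endpoint and parses the Prometheus text format into a dictionary. It assumes the server is reachable at the given host and that the container exposes Triton's standard metrics without timestamps; the metric names in the sample are Triton's, but the label values are made up for illustration.

```python
import urllib.request


def fetch_metrics(host: str, port: int = 8002) -> str:
    """Download the Prometheus text exposition from the mapped metrics port."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict:
    """Map each sample (metric name plus labels) to its float value."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


# Hypothetical sample output, for illustration only.
SAMPLE = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="example",version="1"} 42
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.25
"""
```

Calling parse_metrics(fetch_metrics("localhost")) against a running server returns the live values; parse_metrics(SAMPLE) shows the shape of the result without a server.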

Docker-Compose example

This docker-compose file will create a Speechmatics GPU translation server:

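version: '3.8'

networks:
  transcriber:
    driver: bridge

services:
  triton:
    image: <translation-image>:<tag>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              ### Limit to N GPUs
              # count: 1
              ### Pick specific GPUs by device ID
              # device_ids:
              #   - 0
              #   - 3
              capabilities:
                - gpu
    container_name: triton
    networks:
      - transcriber
    expose:
      - 8000/tcp
      - 8001/tcp
      - 8002/tcp
    environment:
      - NVIDIA_REQUIRE_CUDA=cuda>=11.6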

Deploying on Kubernetes

See the instructions for deploying the transcription GPU inference server.


Translation running on a 4-core T4 has a real-time factor (RTF) of roughly 0.008. It can therefore handle up to 125 hours of batch audio per hour, or 125 real-time transcription streams. However, each translation target language counts as a stream: a single real-time transcription stream that requests 5 target languages places the same load on the translation inference server as 5 transcription streams that each request a single target language.
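
The capacity figure follows from the RTF: 1 / 0.008 = 125. The sketch below turns that arithmetic into a rough sizing check. The function names are hypothetical, and it assumes that load scales linearly with the number of target languages, as the paragraph above describes.

```python
import math
from typing import Sequence


def capacity_in_streams(rtf: float) -> int:
    """Number of single-language streams one server can sustain at this RTF."""
    # The small epsilon guards against rounding just below a whole number.
    return math.floor(1.0 / rtf + 1e-9)


def translation_load(targets_per_stream: Sequence[int]) -> int:
    """Each target language of each stream counts as one stream of load."""
    return sum(targets_per_stream)


def fits_on_server(rtf: float, targets_per_stream: Sequence[int]) -> bool:
    """Return True if the combined load stays within the server's capacity."""
    return translation_load(targets_per_stream) <= capacity_in_streams(rtf)
```

For example, 25 real-time streams that each request 5 target languages use the full capacity of 125, so fits_on_server(0.008, [5] * 25) holds, while a 26th such stream would not fit.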