Translation GPU Inference Container

Prerequisites

System Requirements

Note: System requirements for the translation inference server are the same as for the GPU inference container for transcription, except for the RAM and CPU requirements, which are lower. The two servers cannot use the same GPU.

The system must have:

  • Nvidia GPU(s) with at least 16GB of GPU memory
  • Nvidia drivers (see below for supported versions)
  • CUDA compute capability of 7.5-9.0 inclusive, which corresponds to the Turing, Ampere, Lovelace, and Hopper architectures. Cards with the Volta architecture or older cannot run the models
  • 5GB RAM
  • 4 vCPUs
  • The nvidia-container-toolkit installed
  • Docker version > 19.03

The raw Docker image size of the Translation Container is around 10GB.
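
To verify that a card meets the memory and compute-capability requirements, you can query it with nvidia-smi (note: the compute_cap query field is only available in more recent driver releases):

nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv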

Nvidia Drivers

The Translation Container is based on CUDA 11.8, which requires the following Nvidia drivers:

  • 520 or later

If you are running on a data center GPU (e.g., a T4), you can use the following drivers:

  • 450.51 or later R450
  • 470.57 or later R470
  • 510.47 or later R510
  • 515.65 or later R515

Driver installation can be validated by running nvidia-smi. This command should return the Nvidia driver version and show additional information about the GPU(s).
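
For a script-friendly check of the driver version alone:

nvidia-smi --query-gpu=driver_version --format=csv,noheader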

Azure Instances

The GPU node can be provisioned in the cloud. Our SaaS deployment uses …, but any NC or ND series instance with sufficient memory should work.
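
As a rough sketch (the Azure CLI invocation, resource names, and instance size below are assumptions for illustration, not a recommendation from this page), a T4-backed NC-series VM can be provisioned like so:

# Hypothetical example: provision an NC-series T4 VM with the Azure CLI.
# Resource group, VM name, and image are placeholders; adjust to your environment.
az vm create \
  --resource-group my-rg \
  --name translation-gpu-node \
  --size Standard_NC4as_T4_v3 \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys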

Running the image

Currently, each Translation Container can only run on a single GPU. If a system has more than one GPU, the device must be specified using CUDA_VISIBLE_DEVICES or by selecting the device with the --gpus argument. See the Nvidia/CUDA documentation for details.

# The gRPC endpoint listens on port 8001; it can be mapped to any host port.
docker run --rm -it \
  --gpus '"device=0"' \
  -e CUDA_VISIBLE_DEVICES \
  -p 8001:8001 \
  speechmatics-docker-public.jfrog.io/sm-translation-inference-server:10.5.1

On startup you will see logs detailing available GPU memory. As set out in the requirements section, the system must have a minimum of 16GB of GPU memory, though extra GPU memory may be used if available.

Total GPU memory: 40960MiB
Approx. size models: 5GB
Available GPU memory after models loaded: 35GB
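
Once the models are loaded you can confirm the server is ready. A minimal check, assuming you also publish Triton's HTTP port with -p 8000:8000, uses the standard Triton readiness endpoint, which returns HTTP 200 when the server is ready:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready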

Sending Requests

Batch and Real-Time (RT) transcribers handle sending requests to the translation inference server. To run a transcription job with translation, follow the instructions for running the CPU container and additionally:

  • Set the environment variable SM_TRANSLATION_ENDPOINT in the transcriber to the gRPC endpoint of the running translation inference server, in the form <server_ip_address>:<port>, where the port is the one bound to port 8001 of the translation Docker container (see Running the image)
  • Include a translation_config inside of your job config. More details.
  • Use a transcriber version 10.3.0 or newer.
  • Ensure you use a license which allows translation.

Translation Language Pairs

The translation inference container is not language specific, meaning that all 69 supported translation language pairs can run on a single inference container. The source language is defined by the language of the transcriber sending requests.

By default, a maximum of 5 target languages can be requested at once. This behaviour can be changed by setting the environment variable SM_TRANSLATION_MAX_TARGET_LANGUAGES in the transcriber. Setting this to 0 will disable the limit.

Example of running translation

Assuming the following config file, where the target_languages field enables translation:

{
  "type": "transcription",
  "transcription_config": {
    "operating_point": "enhanced",
    "language": "en"
  },
  "translation_config": {
    "target_languages": ["es", "de"]
  }
}

You can run batch transcription and translation with:

cat ~/$AUDIO_FILE | docker run -i \
  -v ~/$CONFIG_FILE:/config.json \
  -e LICENSE_TOKEN=eyJhbGciOiJ... \
  -e SM_TRANSLATION_ENDPOINT=<server>:<port> \
  batch-asr-transcriber-en:10.5.1

Or start a translation-enabled real-time container with:

# SM_TRANSLATION_MAX_TARGET_LANGUAGES raises the allowed number of target languages.
docker run -p 9000:9000 -e LICENSE_TOKEN=eyJhbGciOiJ... \
    -e SM_TRANSLATION_ENDPOINT=<server>:<port> \
    -e SM_TRANSLATION_MAX_TARGET_LANGUAGES=10 \
    rt-asr-transcriber-en:10.5.1

Monitoring the server

The inference server is based on Nvidia's Triton architecture and as such can be monitored using Triton's built-in Prometheus metrics, or the gRPC/HTTP APIs. To expose these, configure an external mapping for port 8002 (Prometheus) or port 8000 (HTTP).
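
For example, if the container is started with an additional -p 8002:8002 mapping, the Prometheus metrics can be scraped directly; the nv_inference_ prefix below is standard Triton metric naming:

curl -s localhost:8002/metrics | grep nv_inference_request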

Docker-Compose example

This docker-compose file will create a Speechmatics GPU translation server:

---
version: '3.8'

networks:
  transcriber:
    driver: bridge

services:
  triton:
    image: speechmatics-docker-public.jfrog.io/sm-translation-inference-server:10.5.1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              ### Limit to N GPUs
              # count: 1
              ### Pick specific GPUs by device ID
              # device_ids:
              #   - 0
              #   - 3
              capabilities:
                - gpu
    container_name: triton
    networks:
      - transcriber
    expose:
      - 8000/tcp
      - 8001/tcp
      - 8002/tcp
    environment:
      - NVIDIA_REQUIRE_CUDA=cuda>=11.8
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=0
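
The stack can then be brought up with Docker Compose. A transcriber attached to the same transcriber network can reach the server by service name, so (as an illustration) SM_TRANSLATION_ENDPOINT would be triton:8001:

docker compose up -d
# From another service on the "transcriber" network:
#   SM_TRANSLATION_ENDPOINT=triton:8001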

Deploying on Kubernetes

See the instructions for deploying the transcription GPU inference server.

Error handling

Unsupported Target Language - Batch

If one or more of the target languages are not supported for the source language, an error message will be included in the final JSON output. No translations will be returned for that language pair.

{
  "job": { ... },
  "metadata": {
    "created_at": "2023-05-26T15:01:48.412714Z",
    "type": "transcription",
    "transcription_config": {...},
    "translation_config": {
      "target_languages": [
        "es",
        "zz"
      ]
    },
    "translation_errors": [
      {"type": "unsupported_translation_pair", "message": "Translation from en to zz currently not supported"}
    ],
    ...
  },
  "results": [...]
}
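
To check programmatically for failed language pairs, the translation_errors array can be extracted from the final JSON, for example with jq:

# output.json is a placeholder for the saved batch job output
jq '.metadata.translation_errors' output.json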

Please note, this behaviour is different when using our SaaS deployment.

For all other errors, please see the documentation here.

Performance

Translation running on a 4-core T4 has a real-time factor (RTF) of roughly 0.008, so it can handle up to 125 hours of batch audio per hour, or 125 concurrent real-time transcription streams. However, each translation target language is counted as a stream: a single real-time transcription stream which requests 5 target languages adds the same load on the translation inference server as 5 transcription streams each requesting a single target language.
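
As a quick capacity sketch (the 125-stream figure is the T4 estimate above; the stream and language counts are illustrative):

# 40 real-time streams, each requesting 3 target languages
echo $(( 40 * 3 ))   # => 120 effective translation streams, within the ~125-stream budget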