
Performance and cost

Get an overview of the performance and cost of Speechmatics container deployments

Speech to text containers

This is a comparison of the performance and estimated running costs of Speechmatics transcription containers running on standard Azure VMs. The comparison highlights the maximum number of concurrent real-time sessions (session density) and the maximum throughput for batch jobs on a single instance.

Batch transcription

| Operating Point | CPU Standard | CPU Enhanced | GPU Standard | GPU Enhanced |
| --- | --- | --- | --- | --- |
| Lowest Processing Cost (US ¢ per hour) | 1.7 | 3.8 | 0.71 | 2.21 |
| Cost vs CPU Standard (%) | - | 224% | 42% | 130% |
| Cost vs CPU Enhanced (%) | 45% | - | 19% | 58% |
| Maximum Throughput¹ | 53.2 | 23.7 | 170 | 34 |
| Representative Real-Time Factor (RTF)² | 0.085 | 0.2 | 0.035 | 0.08 |
| Transcriber Count | 20 | 20 | 20 | 13 |
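The cost-comparison rows follow directly from the per-hour processing costs; a quick sketch of the calculation, using the batch figures above (the function name is illustrative):

```python
# Lowest processing cost in US cents per hour of audio, from the batch table.
costs = {
    "CPU Standard": 1.7,
    "CPU Enhanced": 3.8,
    "GPU Standard": 0.71,
    "GPU Enhanced": 2.21,
}

def cost_vs(baseline: str) -> dict:
    """Express each operating point's cost as a percentage of the baseline."""
    base = costs[baseline]
    return {op: round(100 * c / base) for op, c in costs.items() if op != baseline}

print(cost_vs("CPU Standard"))
# {'CPU Enhanced': 224, 'GPU Standard': 42, 'GPU Enhanced': 130}
```

Running the same calculation against the CPU Enhanced baseline reproduces the 45% / 19% / 58% row.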

The benchmark uses the following configuration:

| Benchmark details | Value |
| --- | --- |
| CPU | D16ds_v5 |
| GPU Standard | Standard_NC16as_T4_v3 |
| GPU Enhanced | Standard_NC8as_T4_v3 |
| Price Basis | Azure PAYG East US, Linux, Standard |

For GPU Operating Points, transcribers and inference servers were all run on a single VM node.

Real-Time transcription

| Operating Point | CPU Standard | CPU Enhanced | GPU Standard | GPU Enhanced |
| --- | --- | --- | --- | --- |
| Lowest Processing Cost (US ¢ per hour) | 1.97 | 2.95 | 0.86 | 2.51 |
| Cost vs. CPU Standard (%) | - | 150% | 44% | 127% |
| Cost vs. CPU Enhanced (%) | 67% | - | 29% | 85% |
| Session Density³ | 40 | 24 | 140 | 330 |

This benchmark uses the following configuration⁴:

| Benchmark details | Value |
| --- | --- |
| CPU | D16ds_v5 |
| GPU Standard | Standard_NC16as_T4_v3 |
| GPU Enhanced | Standard_NC8as_T4_v3 |
| Price Basis | Azure PAYG East US, Linux, Standard |

For GPU Operating Points, the transcribers and inference servers were run on a single VM node.

For its first session, each transcriber requires 0.25 cores under both Operating Points, plus 1.2 GB of memory (Standard OP) or 3 GB of memory (Enhanced OP). Each additional session consumes a further 0.1 cores and 100 MB of memory.
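These per-session figures imply a simple sizing formula; a minimal sketch, assuming the 0.25-core / 0.1-core and memory increments stated above (the function name and rounding are illustrative):

```python
def transcriber_resources(sessions: int, enhanced: bool = False) -> tuple[float, float]:
    """Return (cores, memory_gb) for one transcriber handling `sessions` sessions.

    First session: 0.25 cores plus 1.2 GB (Standard OP) or 3 GB (Enhanced OP).
    Each additional session: 0.1 cores and 100 MB (0.1 GB).
    """
    if sessions < 1:
        return 0.0, 0.0
    base_mem = 3.0 if enhanced else 1.2
    extra = sessions - 1
    return round(0.25 + 0.1 * extra, 2), round(base_mem + 0.1 * extra, 2)

print(transcriber_resources(10))  # (1.15, 2.1) -- 10 sessions on the Standard OP
```

This makes it easy to check that a planned session count fits the cores and memory allocated to each transcriber container.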

Translation (GPU)

Translation running on a 4-core T4 has an RTF of roughly 0.008, so it can handle up to 125 hours of batch audio per hour, or 125 Real-Time Transcription streams. However, each target language counts as a separate stream: a single Real-Time Transcription stream requesting 5 target languages places the same load on the Translation Inference Server as 5 transcription streams each requesting a single target language.

Footnotes

  1. Throughput is measured as hours of audio per hour of runtime. A throughput of 50 would mean that in one hour, the system as a whole can transcribe 50 hours of audio.

  2. An RTF of 1 would mean that a one hour file would take one hour to transcribe. An RTF of 0.1 would mean that a one hour file would take six minutes to transcribe. Benchmark RTFs are representative for processing audio files over 20 minutes in duration using parallel=4.

  3. Multiple sessions are handled by a single worker configured with the required concurrency.

  4. Benchmark results reflect performance on a fully loaded inference server operating at the session density recommended for the respective GPU platform.
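The RTF definition in footnote 2 converts directly to wall-clock processing time; a minimal sketch using the RTF values from the batch table (the function name is illustrative):

```python
def processing_minutes(audio_minutes: float, rtf: float) -> float:
    """Wall-clock minutes needed to transcribe a file at a given real-time factor."""
    return audio_minutes * rtf

# An RTF of 0.1 means a one-hour file takes six minutes (footnote 2):
print(processing_minutes(60, 0.1))  # 6.0

# Batch GPU Standard (RTF 0.035): a one-hour file in about 2.1 minutes.
print(round(processing_minutes(60, 0.035), 1))  # 2.1
```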