Core Speech CPU Container
Transcription: Batch, Real-Time | Deployments: Container
Prerequisites
System Requirements
Speechmatics containerized deployments are built on the Docker platform. At present a separate Docker image is required for each language to be transcribed. Each Docker image takes about 3GB of storage. Each running container will require the following resources:
- 1 vCPU
- 2-5GB RAM
- 100MB hard disk space
- If you are using the Enhanced model, it is recommended to use the upper limit of the RAM recommendations
Using the parallel processing functionality of the batch container requires more resources because it is memory intensive. When using parallel processing, we recommend provisioning N × the RAM requirement above, where N is the number of cores intended for parallel processing. For example, if 2 cores are used for parallel processing, the RAM requirement would be up to 10GB (2 × 5GB).
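The sizing rule above can be sketched as a small calculation. This is only an illustration: the function name and defaults are hypothetical, and the 2GB and 5GB bounds are taken from the per-container RAM range listed earlier.

```python
from __future__ import annotations


def parallel_ram_range_gb(
    cores: int,
    per_core_min_gb: float = 2.0,  # lower bound of the per-container RAM range
    per_core_max_gb: float = 5.0,  # upper bound (suggested for the Enhanced model)
) -> tuple[float, float]:
    """Return the (minimum, maximum) RAM in GB for N parallel cores (N x RAM)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return (cores * per_core_min_gb, cores * per_core_max_gb)


if __name__ == "__main__":
    low, high = parallel_ram_range_gb(2)
    print(f"2 cores: between {low:g}GB and {high:g}GB of RAM")
```

For the Enhanced model, the upper value of the returned range is the one to plan for.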
Host Recommended Specifications
Standard Operating Point
- The host machine requires a processor with at least a Broadwell class microarchitecture or newer, with AVX2 instruction support
- If you are using a hypervisor, you should check it is configured to allow VM access to the AVX2 instructions
Enhanced Operating Point
- The host machine should have a processor with at least a Cascade Lake class microarchitecture or newer, with AVX512-VNNI instruction support. This will greatly improve transcription processing speed. Support for AVX2 instructions is required
- If you are using a hypervisor, you should check it is configured to allow VM access to the AVX2 and AVX512-VNNI instructions
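One way to check these instruction sets from inside a Linux host or VM is to read the CPU flags that the kernel reports. The sketch below assumes a Linux system exposing /proc/cpuinfo and assumes the kernel's flag names avx2 and avx512_vnni; the function names are hypothetical and this is not a Speechmatics tool.

```python
from __future__ import annotations

from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Extract the CPU feature flags from the text of /proc/cpuinfo (Linux)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return set(value.split())
    return set()


def instruction_support(flags: set[str]) -> dict[str, bool]:
    """Report the instruction sets named in the host recommendations."""
    return {
        "avx2": "avx2" in flags,                # required for both operating points
        "avx512_vnni": "avx512_vnni" in flags,  # recommended for Enhanced
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(instruction_support(cpu_flags(cpuinfo.read_text())))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Running it inside the VM, rather than on the hypervisor host, shows whether the hypervisor actually passes the instructions through.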
Architecture
Batch Transcription
- Processes one input file and outputs a resulting transcript in a predefined language in a number of supported output formats
  - These outputs and relevant metadata are described in more detail in the Speech API guide
- Is licensed for languages and speech features which vary depending upon each individual contract
  - Speech features are described in the Speech API guide
- Requires either a license file or license token before transcription starts
- Can run in a mode that parallelises processing across multiple cores
- Supports input files up to 2 hours in length or 4GB in size
- Treats all data as transitory. Once a container completes its transcription it removes all record of the operation
Real-Time Transcription
- Provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file
- Speech features are described in the Speech API guide
- Multiple instances of the container can be run on the same Docker host. This enables scaling of a single language or multiple languages as required
- All data is transitory. Once a container completes its transcription it removes all record of the operation, and no data is persisted
Docker Run
Once the Docker image has been pulled into a local environment, it can be started using the docker run command. More details about operating and managing the container are available in the Docker API documentation.
Batch Transcription
Input Methods
There are two different methods for passing an audio file into a container:
# Stream the audio through the container via standard input (STDIN)
cat ~/$AUDIO_FILE | docker run -i \
-e LICENSE_TOKEN=$TOKEN_VALUE \
batch-asr-transcriber-en:11.0.1
# Pull an audio file from a mapped directory into the container
# NOTE: the audio file must be mapped into the container with `:/input.audio`
docker run -i -v ~/$AUDIO_FILE:/input.audio \
-e LICENSE_TOKEN=$TOKEN_VALUE \
batch-asr-transcriber-en:11.0.1
The docker run options used are:
Name | Description |
---|---|
--env, -e | Set environment variables |
--interactive, -i | Keep STDIN open even if not attached |
--volume, -v | Bind mount a volume |
See Docker docs for a full list of the available options.
Both methods produce the same transcript and write a JSON response to standard output (STDOUT).
The intermediate files created during transcription are stored in /home/smuser/work. This is the case whether the container runs as a root or non-root user.
Here is an example output:
{
  "format": "2.9",
  "metadata": {
    "created_at": "2023-08-02T15:43:50.871Z",
    "type": "transcription",
    "language_pack_info": {
      "adapted": false,
      "itn": true,
      "language_description": "English",
      "word_delimiter": " ",
      "writing_direction": "left-to-right"
    },
    "transcription_config": {
      "language": "en",
      "diarization": "none"
    }
  },
  "results": [
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "Are",
          "language": "en",
          "speaker": "UU"
        }
      ],
      "end_time": 3.61,
      "start_time": 3.49,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "on",
          "language": "en",
          "speaker": "UU"
        }
      ],
      "end_time": 3.73,
      "start_time": 3.61,
      "type": "word"
    }
  ]
}
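As a rough illustration of consuming this output, the sketch below joins the word results into plain text. It relies only on fields visible in the example above (results, type, alternatives, content, and word_delimiter); the function name is hypothetical, and result types other than word are simply skipped here.

```python
import json


def transcript_text(transcript: dict) -> str:
    """Join the top alternative of every word result into one string."""
    delimiter = (
        transcript.get("metadata", {})
        .get("language_pack_info", {})
        .get("word_delimiter", " ")
    )
    words = [
        result["alternatives"][0]["content"]
        for result in transcript.get("results", [])
        if result.get("type") == "word" and result.get("alternatives")
    ]
    return delimiter.join(words)


if __name__ == "__main__":
    raw = """
    {"metadata": {"language_pack_info": {"word_delimiter": " "}},
     "results": [
       {"type": "word", "alternatives": [{"content": "Are"}]},
       {"type": "word", "alternatives": [{"content": "on"}]}
     ]}
    """
    print(transcript_text(json.loads(raw)))
```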
Determining Success
The exit code of the Container indicates whether the transcription was successful. There are two possibilities:
- Exit Code == 0: The transcription succeeded; the output will contain the transcript as JSON (see the example output above)
- Exit Code != 0: The transcription failed; the output will contain a stack trace and other useful information. Include this output in any communication with Speechmatics Support to aid understanding and resolution of the problem
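A wrapper script can apply this rule by checking the exit code before parsing anything. The following is a minimal sketch, assuming the caller supplies the full docker run command as a list; run_transcription and TranscriptionError are hypothetical names, and the diagnostics are collected from both STDOUT and STDERR because the text above does not say which stream carries the stack trace.

```python
from __future__ import annotations

import json
import subprocess


class TranscriptionError(RuntimeError):
    """Raised when the container exits with a non-zero code."""


def run_transcription(command: list[str], audio: bytes | None = None) -> dict:
    """Run a batch container command and return the parsed JSON transcript."""
    completed = subprocess.run(
        command,
        input=audio,  # optional audio bytes piped to STDIN
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if completed.returncode != 0:
        # Keep everything the container printed so it can be sent to support.
        diagnostics = (completed.stdout + completed.stderr).decode(
            "utf-8", errors="replace"
        )
        raise TranscriptionError(
            f"container exited with code {completed.returncode}:\n{diagnostics}"
        )
    return json.loads(completed.stdout)
```

With the STDIN input method shown earlier, the command would be the list form of that docker run invocation (including -i and the LICENSE_TOKEN variable), and audio would be the bytes of the audio file.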
Modifying the Image
Building an Image
Using STDIN to pass files in and obtain the transcription may not be sufficient for all use cases. If your workflow requires it, you can build a new Docker Image that uses the Speechmatics Image as a layer. To include the Speechmatics Docker Image inside another image, reference the pulled Docker Image in the Dockerfile for the new application.
Requirements for a Custom Image
To ensure the Speechmatics Docker Image works as expected inside the custom image, please consider the following:
- Any audio that needs to be transcribed must be copied to a file called /input.audio inside the running Container
- To initiate transcription, call the application pipeline. The pipeline will start the transcription service and use /input.audio as the audio source
- When running pipeline, the working directory must be set to /opt/orchestrator, using either the Dockerfile WORKDIR directive, the cd command or similar means
- Once pipeline finishes transcribing, ensure you move the transcription data outside the Container
Dockerfile
To add a Speechmatics Docker Image into a custom one, the Dockerfile must be modified to include the full image name of the locally available image.
Example: Adding Global English (en) with tag 11.0.1 to the Dockerfile
FROM batch-asr-transcriber-en:11.0.1
ADD download_audio.sh /usr/local/bin/download_audio.sh
RUN chmod +x /usr/local/bin/download_audio.sh
CMD ["/usr/local/bin/download_audio.sh"]
Once the above image is built, and a Container instantiated from it, a script called download_audio.sh will be executed (this could do something like pulling a file from a webserver and copying it to /input.audio before starting the pipeline application). This is a very basic Dockerfile to demonstrate a way of orchestrating the Speechmatics Docker Image.
For support purposes, it is assumed that the Docker Image provided by Speechmatics has not been modified. If you experience issues, Speechmatics support will require you to replicate them with the unmodified Docker image, e.g. batch-asr-transcriber-en:11.0.1
Parallel Processing Guide
For customers who are looking to improve job turnaround time and who are able to assign sufficient resources, it is possible to pass a parallel transcription parameter to the container to take advantage of multiple CPUs. The parameter is called parallel and the following example shows how it can be used. In this case to use 4 cores to process the audio you would run the Container like this:
docker run -i --rm -v ~/tmp/shipping-forecast.wav:/input.audio \
-v ~/tmp/config.json:/config.json \
batch-asr-transcriber-en:11.0.1 \
--parallel=4
Depending on your hardware, you may need to experiment to find the optimum performance. We've noticed significant improvement in turnaround time for jobs by using this approach.
If the number of CPUs available to the Container is limited (for example, your platform restricts the number of cores you can use, or you use the --cpus flag in your docker run command), make sure the parallel value does not exceed the number of available cores. If you set a value in excess of your free resources, the Container will only use the available cores.
If you simply increase the parallel setting to a large number you will see diminishing returns. Moreover, because files are split into 5 minute chunks for parallel processing, if your files are shorter than 5 minutes then you will see no parallelization (in general the longer your audio files the more speedup you will see by using parallel processing).
If you are running the container on a shared resource you may experience different results depending on what other processes are running at the same time.
The optimum number of cores is N/5, where N is the length of the audio in minutes. Values higher than this will deliver little to no value, as there will be more cores than chunks of work. A typical approach will be to increment the parallel setting to a point where performance plateaus, and leave it at that (all else being equal).
For large files and large numbers of cores, the time taken by the first and last stages of processing (which cannot be parallelized) will start to dominate, with diminishing returns.
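The guidance above can be turned into a starting value for the parallel setting. This is a sketch under stated assumptions: the function name is hypothetical, N/5 is rounded down (the text does not say how fractional values should be treated), and the result is capped at the number of available cores. It gives a starting point for experimentation rather than a guaranteed optimum.

```python
def suggested_parallel(
    audio_minutes: float,
    available_cores: int,
    chunk_minutes: int = 5,  # files are split into 5 minute chunks
) -> int:
    """Suggest a starting value for --parallel based on audio length and cores."""
    if available_cores < 1:
        raise ValueError("available_cores must be at least 1")
    by_length = int(audio_minutes // chunk_minutes)  # N/5, rounded down
    # Never exceed the available cores, and never go below a single core.
    return max(1, min(available_cores, by_length))


if __name__ == "__main__":
    for minutes in (3, 20, 60):
        value = suggested_parallel(minutes, available_cores=8)
        print(f"{minutes} minutes of audio, 8 cores: --parallel={value}")
```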
Generating Multiple Transcript Formats
In addition to our primary JSON format, the Speechmatics container can output transcripts in plain text (TXT) and SubRip (SRT) subtitle formats. To do this, pass the --all-formats option followed by the directory where all supported transcript formats will be saved. The --allformats spelling is also accepted and produces the same result.
This directory must be mounted into the container so the transcripts can be retrieved after the container finishes. You will receive a transcript in all currently supported formats: JSON, TXT, and SRT.
The following example shows how to use the --all-formats parameter. In this scenario, after processing the file, three separate transcripts would be found in the ~/tmp/output directory. These transcripts would be in JSON, TXT, and SRT format.
docker run \
-v ~/Projects/ba-test/data/shipping-forecast.wav:/input.audio \
-v ~/tmp/config.json:/config.json \
-v ~/tmp/output:/example_output_dir_name \
-e LICENSE_TOKEN=$TOKEN_VALUE \
batch-asr-transcriber-en:11.0.1 \
--all-formats /example_output_dir_name
Real-Time Transcription
Here's an example of how to start the Container from the command line:
docker run \
-p 9000:9000 \
-p 8001:8001 \
-e LICENSE_TOKEN=$TOKEN_VALUE \
rt-asr-transcriber-en:11.0.1
The docker run options used are:
Name | Description |
---|---|
--publish, -p | Publish ports on the container so that they are accessible from the host |
--env, -e | Set the value of an environment variable |
See Docker docs for a full list of the available options.
Input Modes
The supported method for passing audio to a Real-Time Container is to use a WebSocket. A session is set up with configuration parameters passed in using a StartRecognition message, and thereafter audio is sent to the container in binary chunks, with transcripts being returned in AddTranscript messages.
Each AddTranscript message returns individual result segments, corresponding to audio segments defined by pauses (and other latency measurements).
Output
The results list in the V2 output format is sorted by increasing start_time, with ties broken by decreasing end_time. See below for an example:
{
  "message": "AddTranscript",
  "format": "2.9",
  "metadata": {
    "transcript": "full tell radar",
    "start_time": 0.11,
    "end_time": 1.07
  },
  "results": [
    {
      "type": "word",
      "start_time": 0.11,
      "end_time": 0.4,
      "alternatives": [{ "content": "full", "confidence": 0.7 }]
    },
    {
      "type": "word",
      "start_time": 0.41,
      "end_time": 0.62,
      "alternatives": [{ "content": "tell", "confidence": 0.6 }]
    },
    {
      "type": "word",
      "start_time": 0.65,
      "end_time": 1.07,
      "alternatives": [{ "content": "radar", "confidence": 1.0 }]
    }
  ]
}
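The ordering rule for the results list can be expressed as a sort key. The sketch below is illustrative only: it assumes each result carries start_time and end_time as in the example above, and the helper names are hypothetical.

```python
from __future__ import annotations


def result_sort_key(result: dict) -> tuple[float, float]:
    """Increasing start_time first, then decreasing end_time."""
    return (result["start_time"], -result["end_time"])


def is_in_documented_order(results: list[dict]) -> bool:
    """Check that a results list already follows the documented ordering."""
    keys = [result_sort_key(result) for result in results]
    return keys == sorted(keys)


if __name__ == "__main__":
    results = [
        {"type": "word", "start_time": 0.11, "end_time": 0.4},
        {"type": "word", "start_time": 0.41, "end_time": 0.62},
        {"type": "word", "start_time": 0.65, "end_time": 1.07},
    ]
    print(is_in_documented_order(results))
```

The same key can be passed to sorted() if results from several messages are merged on the client side.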
Transcription Duration Information
The Container will output a log message after every transcription session to indicate the duration of speech transcribed during that session. This duration only includes speech, and not any silence or background noise which was present in the audio. It may be useful to parse these log messages if you are asked to report usage back to us, or simply for your own records.
The format of the log messages produced should match the following example:
2020-04-13 22:48:05.312 INFO sentryserver Transcribed 52 seconds of speech
Consider using the following regular expression to extract just the seconds part from the line if you are parsing it:
^.+ .+ INFO sentryserver Transcribed (\d+) seconds of speech$
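As an illustration of using that regular expression, the sketch below sums the reported seconds across a set of log lines. The function name is hypothetical, and how the log lines are collected (for example from docker logs) is left to the caller.

```python
import re
from typing import Iterable

# The regular expression given above, compiled once.
SPEECH_DURATION = re.compile(
    r"^.+ .+ INFO sentryserver Transcribed (\d+) seconds of speech$"
)


def total_speech_seconds(log_lines: Iterable[str]) -> int:
    """Sum the transcribed speech duration over all matching log lines."""
    total = 0
    for line in log_lines:
        match = SPEECH_DURATION.match(line.strip())
        if match:
            total += int(match.group(1))
    return total


if __name__ == "__main__":
    lines = [
        "2020-04-13 22:48:05.312 INFO sentryserver Transcribed 52 seconds of speech",
        "an unrelated log line",
    ]
    print(total_speech_seconds(lines))
```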
Read-Only Mode
Users may wish to run the Container in read-only mode. This may be necessary due to their regulatory environment, or a requirement not to write any media file to disk. An example of how to do this is below.
docker run -it --read-only \
-p 9000:9000 \
--tmpfs /tmp \
-e LICENSE_TOKEN=$TOKEN_VALUE \
rt-asr-transcriber-en:11.0.1
The Container still requires a temporary directory with write permissions. Users can provide a directory (e.g. /tmp) by using the --tmpfs Docker argument. A tmpfs mount is temporary, and only persisted in the host memory. When the Container stops, the tmpfs mount is removed, and files written there won’t be persisted.
If customers want to use the shared Custom Dictionary Cache feature, they must also specify the location of the cache and mount it as a volume:
docker run -it --read-only \
-p 9000:9000 \
--tmpfs /tmp \
-v /cachelocation:/cache \
-e LICENSE_TOKEN=$TOKEN_VALUE \
-e SM_CUSTOM_DICTIONARY_CACHE_TYPE=shared \
rt-asr-transcriber-en:11.0.1
Running Container as a Non-Root User
A Real-Time Container can be run as a non-root user with no impact to feature functionality. This may be required if a hosting environment or a company's internal regulations specify that a Container must be run as a named user.
Users can specify the non-root user with docker run --user $USERNUMBER:$GROUPID. The user number and group ID are non-zero numerical values between 1 and 65535.
An example is below:
docker run -it --user 100:100 \
-p 9000:9000 \
-e LICENSE_TOKEN=$TOKEN_VALUE \
rt-asr-transcriber-en:11.0.1
How to use a Shared Custom Dictionary Cache
The Speechmatics Real-Time Container includes an optional Custom Dictionary cache mechanism to reduce session initialisation times.
You will see improvements when reusing an identical Custom Dictionary from the second time onwards.
The cache volume is safe to use from multiple Containers concurrently if the operating system and its filesystem support file locking operations. The cache can store multiple Custom Dictionaries in any language used for transcription. It can support multiple Custom Dictionaries in the same language.
When the shared cache is enabled, any Custom Dictionary that is small enough to fit within the cache volume is cached automatically.
For more information about how the shared cache storage management works, please see Maintaining the Shared Cache.
We highly recommend you ensure any location you use for the shared cache has enough space for the number of Custom Dictionaries you plan to allocate there. How to allocate Custom Dictionaries to the shared cache is documented below.
How to Set Up the Shared Cache
The shared cache is enabled by setting the following values when running transcription:
- Cache Location: You must volume map the directory location you plan to use as the shared cache to /cache when submitting a job
- SM_CUSTOM_DICTIONARY_CACHE_TYPE: (mandatory if using the shared cache) This environment variable must be set to shared
- SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE: (optional if using the shared cache) This determines the maximum size, in bytes, of any single Custom Dictionary that can be stored within the shared cache
  - E.g. a SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE with a value of 10000000 would set a max storage size of any Custom Dictionary at 10MB
  - For reference, a Custom Dictionary wordlist with 1000 words produces a cache entry of size around 200 kB, or 200000 bytes
  - A value of -1 will allow every Custom Dictionary to be stored within the shared cache. This is the default assumed value
  - A Custom Dictionary Cache entry larger than the SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE will still be used in transcription, but will not be cached
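To make the size rule concrete, here is a small sketch of the decision it describes. The helper names are hypothetical. The per-word estimate is derived from the reference figure above (1000 words producing roughly 200000 bytes), and scaling it linearly to other wordlist sizes is an assumption rather than documented behaviour.

```python
BYTES_PER_WORD_ESTIMATE = 200  # 200000 bytes / 1000 words, from the reference figure
UNLIMITED = -1                 # default: every Custom Dictionary may be cached


def estimated_entry_bytes(word_count: int) -> int:
    """Rough cache entry size for a wordlist, assuming linear scaling."""
    return word_count * BYTES_PER_WORD_ESTIMATE


def will_be_cached(entry_bytes: int, entry_max_size: int = UNLIMITED) -> bool:
    """Apply the SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE rule to one entry."""
    if entry_max_size == UNLIMITED:
        return True
    # Entries larger than the limit are still used, but are not cached.
    return entry_bytes <= entry_max_size


if __name__ == "__main__":
    size = estimated_entry_bytes(1000)
    print(size, will_be_cached(size, entry_max_size=10_000_000))
```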
Maintaining the Shared Cache
If you specify the shared cache to be used and your Custom Dictionary is within the permitted size, the Speechmatics Real-Time Container will always try to cache the Custom Dictionary. If a Custom Dictionary cannot fit in the shared cache because of other cached Custom Dictionaries, older Custom Dictionaries are removed from the cache to free up as much space as necessary for the new one. Entries are removed in order of least recent use.
Therefore, you must ensure your cache allocation is large enough to handle the number of Custom Dictionaries you plan to store. We recommend a relatively large cache (e.g. 50 MB) to avoid this situation if you are processing multiple Custom Dictionaries using the batch container. If you don't allocate sufficient storage, one or more Custom Dictionaries may be deleted when you try to store a new Custom Dictionary.
It is recommended to use a Docker volume with a dedicated filesystem with a limited size. If a user decides to use a volume that shares filesystem with the host, it is the user's responsibility to purge the cache if necessary.
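The eviction behaviour described in this section can be modelled in a few lines, which may help when sizing the cache. This is a toy model of least-recently-used eviction, not the Container's actual implementation; the class and method names are hypothetical, and skipping entries larger than the whole cache is an assumption.

```python
from __future__ import annotations

from collections import OrderedDict


class SharedCacheModel:
    """Toy model of a size-limited cache with least-recently-used eviction."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self._entries: OrderedDict[str, int] = OrderedDict()  # oldest use first

    def use(self, dictionary_id: str, size_bytes: int) -> list[str]:
        """Record one use of a Custom Dictionary; return the ids evicted for it."""
        evicted: list[str] = []
        if dictionary_id in self._entries:
            self._entries.move_to_end(dictionary_id)  # now the most recently used
            return evicted
        if size_bytes > self.capacity_bytes:
            return evicted  # assumption: an entry that can never fit is not cached
        while sum(self._entries.values()) + size_bytes > self.capacity_bytes:
            oldest_id, _ = self._entries.popitem(last=False)
            evicted.append(oldest_id)
        self._entries[dictionary_id] = size_bytes
        return evicted

    def cached_ids(self) -> list[str]:
        return list(self._entries)


if __name__ == "__main__":
    cache = SharedCacheModel(capacity_bytes=500_000)
    cache.use("a", 200_000)
    cache.use("b", 200_000)
    cache.use("a", 200_000)         # "a" becomes the most recently used
    print(cache.use("c", 200_000))  # evicts the least recently used entry
    print(cache.cached_ids())
```

Replaying an expected sequence of Custom Dictionary uses through such a model gives a rough idea of whether a planned cache size would cause frequent evictions.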
Creating the Shared Cache
In the example below, transcription is run where an example local docker volume is created for the shared cache. It will allow a Custom Dictionary of up to 5MB to be cached.
- Batch
- Real-Time