Container Usage Reporting
Speechmatics offers two ways for you to report your transcription usage.
- Automatic Usage Reporting is the most convenient way to report on usage. This works by automatically sending periodic HTTPS requests to Speechmatics over the course of a transcription session.
- The Usage Container can be used if you have specific restrictions preventing connection to the Internet. This produces an output which needs to be sent to Speechmatics via email.
For further information, please also refer to What Data Do We Record?
Automatic Usage Reporting
Transcription: Batch, Real-Time; Deployments: Container; Status: Beta
Getting Started
The most convenient way of reporting usage to Speechmatics is by allowing Automatic Usage Reporting. The transcriber will automatically connect to Speechmatics servers to send required usage analytics.
This feature works by sending periodic HTTPS requests to Speechmatics over the course of a transcription session. Information recorded includes the job configuration, the duration of transcription, and the amount of audio being transcribed. We aim to be completely transparent about exactly What Data We Record.
Compatibility
To enable automatic usage reporting, you must be running one of the following ASR Container versions:
- Batch Container 10.1.0 onwards
- Real-Time Container 10.1.0 onwards
Introduction
This feature is turned ON by default and is currently opt-out. It is turned off by setting the environment variable SM_ENABLE_USAGE_REPORTING=false (false, no or 0 are equally valid) when running the transcriber. For example:
docker run -i -v ~/$AUDIO_FILE:/input.audio \
-e LICENSE_TOKEN=eyJhbGciOiJ... \
-e SM_ENABLE_USAGE_REPORTING=false \
batch-asr-transcriber-en:11.0.1
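As a minimal sketch, a deployment wrapper script could check which setting is in effect before it starts the transcriber. The helper name usage_reporting_disabled is hypothetical, and it only recognises the three opt-out values listed above, matched case-sensitively; how the transcriber treats any other value is not covered here.

```shell
# Hypothetical helper: succeeds (exit 0) if the given value is one of the
# documented opt-out values for SM_ENABLE_USAGE_REPORTING.
usage_reporting_disabled() {
    case "$1" in
        false|no|0) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the effective setting; an unset variable counts as enabled,
# because the feature is ON by default.
if usage_reporting_disabled "${SM_ENABLE_USAGE_REPORTING:-}"; then
    echo "Automatic Usage Reporting: disabled"
else
    echo "Automatic Usage Reporting: enabled"
fi
```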
Automatic Usage Reporting will be ON by default, starting from version 10.6.0.
For further information see automatic usage Technical Details
We will never send customer audio data over the network. See What Data Do We Record for a full description of what information will be recorded.
Offline Usage Reporting
Transcription: Batch, Real-Time; Deployments: Container
Terminology
Throughout this document there are references to different types of containers:
- ASR Containers - Speechmatics containers that transcribe media or audio files into a transcript. Two types are available: those that can process media in batch, and those that can process media in real-time. When these are specifically referred to, they are called the Batch or Real-Time Containers
- Usage Containers - a new container that stores event-specific data from ASR Containers
Getting Started
The ASR Usage Container can be retrieved from Speechmatics Docker Registry as a Docker Image. To access the Usage Container, you should use the same credentials that you use to access Speechmatics' ASR Containers from its Docker Registry. This information should already be provided to you by Support when you are onboarded.
You will also need to know the following information:
- Docker Registry URL, e.g. https://speechmatics-docker-public.jfrog.io
- Image name, e.g. asr-usage
- Image tag, e.g. 0.5.0
The image can be downloaded by using the standard Docker workflow:
# Login
docker login https://speechmatics-docker-public.jfrog.io
# Download image
docker pull speechmatics-docker-public.jfrog.io/asr-usage:0.5.0
Speechmatics requires all customers to cache a copy of the Docker images within their own environment. Please do not pull directly from the Speechmatics Docker Registry for each deployment.
System Requirements
The ASR Usage Container requires the following resources:
- 1 vCPU
- 1 GB memory
- At least 1 GB of persistent storage per Usage Container deployed. Every 25 MB can store data for up to 13,000 batch jobs or up to 1250 (60 minute) Real-Time sessions.
Persisting storage to temporary locations (e.g. tmpfs) is supported where this is necessary as part of a user's workflow, but is not recommended. If you are required to use tmpfs or other such directories as a storage solution, Speechmatics recommends sending usage reports more frequently to avoid any potential data loss.
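The storage figures above can be turned into a rough sizing estimate. The sketch below assumes storage grows linearly at the quoted rate (25 MB per 13,000 batch jobs or per 1,250 sixty-minute Real-Time sessions); the helper name estimate_usage_storage_mb is hypothetical and not part of any Speechmatics tooling.

```shell
# Rough sizing sketch based on the figures above: 25 MB holds up to
# 13,000 batch jobs or up to 1,250 sixty-minute Real-Time sessions.
# estimate_usage_storage_mb is a hypothetical helper name.
estimate_usage_storage_mb() {
    batch_jobs=$1
    rt_sessions=$2
    # Round each share up to a whole MB (integer ceiling division).
    batch_mb=$(( (batch_jobs * 25 + 12999) / 13000 ))
    rt_mb=$(( (rt_sessions * 25 + 1249) / 1250 ))
    echo $(( batch_mb + rt_mb ))
}

# Example: 100,000 batch jobs plus 5,000 one-hour Real-Time sessions
estimate_usage_storage_mb 100000 5000   # prints 293
```

Because data is retained for up to 90 days, the job counts passed in would be those expected within a single retention window.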
Configuration
The following section will show you how to set up an environment where you have a running ASR Usage Container that can accept data from one or multiple ASR Containers. It will show in order:
- How to set up and run an ASR Usage Container
- How to ensure an ASR Batch or Real-Time Container can send all required data to an ASR Usage Container during transcription.
You must set up a Usage Container before running Speechmatics' Batch or Real-Time ASR Containers in order to ensure that all usage data is captured. A Usage Container is persistent, which means it does not shut down after receiving transcription data.
Prerequisites
When setting up an environment with one or multiple Speechmatics ASR Container(s) and one or multiple Usage Container(s) please ensure:
- That all Batch or Real-Time Containers you require to send data to the Usage Container can exchange communication with each other in their environment
- That all communication between Docker containers is via HTTPS
- That when running Usage Containers, you enable the required ports when necessary to send and extract data. More detail is below
Compatibility
To use the Usage Container, you must be running the following ASR Container versions:
- Batch Container 8.2.0 onwards
- Real-Time Container 1.4.1 onwards
The ASR Usage Container has been tested using Docker Version 20. Compatibility with previous versions of Docker has not been tested.
Early Access
The Usage Container has been released as an early access product that any customer using either Speechmatics' Batch or Real-Time ASR Containers is entitled to use. Speechmatics encourages customers to try this solution, in order to simplify their usage logging and reporting processes.
Speechmatics encourages feedback on the Usage Container, and the raising of any bugs or usability issues. These will be subject to our normal bug triage process, and should be submitted to Support.
Workflow
The following workflow is recommended:
- The user downloads the Usage Container from Speechmatics' Docker Registry using their existing credentials
- The user must cache a copy of each Container they download within their own environment
- The user will run one or multiple Usage Container(s) depending on their requirements
- Any Usage Container must be assigned its own persistent data volume. The user is responsible for allocating and backing up this persistent volume
- Any Usage Container must also have all relevant ports opened to allow data exchange and export where necessary
- When requesting transcription from a Batch or Real-Time Container, the user must specify the hostname or IP address of the Usage Container via a new environment variable
- Data will then be stored in the Usage Container for up to 90 days
- At intervals of no more than a calendar month, the user will extract usage data processed in that interval from the ASR Usage Container via the RESTful API
- The user will then send this data to a designated Speechmatics email address (billing-reporting@speechmatics.com).
ASR Usage Container
The ASR Usage Container always requires a persistent storage volume to store the data.
This volume must be mounted inside the container at /data.
Endpoints
The ASR Usage Container has 2 endpoints:
Endpoint | Use | Port | How to Set |
---|---|---|---|
v1/log | Receives transcription event data from Batch and Real-Time Containers | 9090 | set the EATS_PORT environment variable on the Usage Container; point ASR Containers at this port with the SM_EATS_URL environment variable |
v1/export | All event data, or time-specific event data, can be extracted from this endpoint as a compressed file | 8000 | use the docker -p $PORT:$PORT command. If you need to change the default port use -e PANDAS_PORT environment variable as well as docker -p with your required port |
By default, Docker containers do not expose any ports. You must explicitly open these ports to ensure that transcription events are captured and that data can be extracted.
The example below starts a Usage Container with:
- A persistent volume mounted to /data
- Port 9090 open via the EATS_PORT environment variable to allow the Container to accept transcription event data
- Port 8000 open via the docker -p command to allow data to be exported from the Usage Container
# Create volume
docker volume create volume-1
# Run the Usage Container with the volume mounted at /data
docker run -it \
-v volume-1:/data \
-e EATS_PORT=9090 \
-p 8000:8000 \
speechmatics-docker-public.jfrog.io/asr-usage:0.2.0
Further documentation on using persistent storage volumes on popular container orchestration engines:
- Kubernetes https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- Nomad https://www.nomadproject.io/docs/job-specification/volume
Speechmatics recommends setting up backup policies for the persistent volume. The ASR Usage Container cannot perform recovery by itself if the data file or volume is corrupted.
The ASR Usage Container accepts the following configuration option, which can be set via environment variables.
Key | Default | Type | Description |
---|---|---|---|
EATS_PORT | 9090 | int | Listening port for incoming data from transcribers. Must be set to accept usage data from Batch or Real-Time ASR Containers |
Sending Transcription Data from ASR Container to ASR Usage Container
An ASR Container must be explicitly configured to send data to the ASR Usage Container when starting. By default, this is via HTTPS.
The following configuration options must be specified when running the ASR Container to send usage data:
Key | Default | Type | Description |
---|---|---|---|
SM_EATS_URL | none | string | Address and listening port of the ASR Usage Container you wish to send data to |
To correctly configure the transcriber, set the SM_EATS_URL environment variable to point to the ASR Usage Container, e.g. SM_EATS_URL=asr-usage.example.net:9090 or SM_EATS_URL=10.244.8.32:9090, where asr-usage.example.net and 10.244.8.32 correspond to the relevant ASR Usage Container instance. The port 9090 is the default listening port for incoming data from transcribers. The port number can be changed by using the EATS_PORT environment variable.
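A small pre-flight check can catch a malformed address before the transcriber starts. The sketch below is an assumption-laden illustration: valid_eats_url is a hypothetical helper that only checks the value has the same HOST:PORT shape as the examples above (no scheme, no path, numeric port). It does not contact the Usage Container and does not handle IPv6 literals.

```shell
# Hypothetical pre-flight check: succeeds if the value looks like HOST:PORT
# with a numeric port between 1 and 65535. Purely a format check.
valid_eats_url() {
    case "$1" in
        -*|*/*) return 1 ;;   # reject a stray leading hyphen or a scheme/path
        *:*) ;;               # a colon separating host and port is required
        *) return 1 ;;
    esac
    host=${1%:*}
    port=${1##*:}
    [ -n "$host" ] || return 1
    case "$port" in
        ''|*[!0-9]*) return 1 ;;   # port must be non-empty and all digits
    esac
    [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}

# Example with the address used in this section
if valid_eats_url "asr-usage.example.net:9090"; then
    echo "SM_EATS_URL looks well-formed"
fi
```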
Below is a working example of running an ASR Batch Container that will then send transcription event data to a running ASR Usage Container:
docker run -i -v $AUDIO_FILE:/input.audio \
-v $CONFIG_FILE:/config.json:ro \
-e LICENSE_TOKEN=$TOKEN_VALUE \
-e SM_EATS_URL=asr-usage.example.net:9090 \
batch-asr-transcriber-en:8.2.0
Below is a similar example of a Real-Time Container that will send transcription event data to a running ASR Usage Container:
docker run -p 9000:9000 \
-e LICENSE_TOKEN=$TOKEN_VALUE \
-e SM_EATS_URL=asr-usage.example.net:9090 \
rt-asr-transcriber-en:1.4.1
Logging
The Usage Container will log event data sent by an ASR Batch or Real-Time Container:
- during transcription
- when transcription has finished
  - For the Batch Container, this is when transcription finishes, as it is not a persistent container.
  - For the Real-Time Container, this is after an endOfTranscription websocket message, at which point the Real-Time Container sends a SESSION_ENDED message to the Usage Container.
- when the Container is shut down or terminated by the user, or due to a system error during transcription itself (e.g. SIGTERM)
Example Logs - Success
The following is an example of a log from a Batch or Real-Time ASR Container when it successfully sends data to the ASR Usage Container:
2021-07-19 11:24:31.314 INFO sentryserver Transcription usage registered with EATS
The following is an example of a log from the Usage Container when it successfully receives data from a Batch or Real-Time ASR Container:
[2021-08-25T10:45:37Z INFO actix_web::middleware::logger] 172.19.0.3:39068 "POST /v1/log HTTP/1.1" 201 0 "-" "Go-http-client/1.1" 0.009459
The following is an example of a log from the Usage Container when a customer successfully exports data:
[2021-09-03T14:54:00Z INFO actix_web::middleware::logger] 172.19.0.1:55820 "GET /v1/export HTTP/1.1" 200 12912 "-" "curl/7.64.1" 0.006313
Example Logs - Failure
If data cannot be sent from the ASR Container to the ASR Usage Container, the following error message is shown in the ASR Container:
2021-07-19 11:27:43.158 ERROR sentryserver Error 'Post "https://asr-usage.net:9090/v1/log": dial tcp 172.25.0.2:9090: connect: connection refused' occurred when logging EATS data: retrying
Example Logs - Failure Upon Container Termination
If a Container is shut down or terminated, both the Batch and the Real-Time Container will attempt retries for up to 1 minute after receiving SIGTERM. For Batch, the Container will attempt to send data when transcription finishes. For the Real-Time Container, this is when Container termination is requested. After this point, any unsent data is lost, with the following message:
2021-07-19 11:28:55.288 WARNING sentryserver Some activity events could not be sent to EATS: count: 4
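The failure messages above can be searched for automatically. The following is a minimal sketch, assuming the ASR Container's log output is available as text on standard input; check_eats_logs is a hypothetical helper, and its patterns are copied from the example log lines in this section rather than from a full list of possible messages. It flags both retry errors and the final data-loss warning.

```shell
# Hypothetical helper: reads ASR Container log lines on stdin, prints any
# line that reports a problem delivering usage data, and fails if it finds one.
check_eats_logs() {
    if grep -E "occurred when logging EATS data|could not be sent to EATS"; then
        return 1
    fi
    return 0
}

# Usage (the container name is an assumption):
#   docker logs my-transcriber 2>&1 | check_eats_logs || echo "usage data may be missing"
```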
Orchestrating multiple ASR Usage Containers
How many ASR Usage Containers to deploy relative to the number of ASR Containers depends on the customer's risk tolerance, internal topology and orchestration. Speechmatics recommends at least one Usage Container in each environment in which Batch or Real-Time Containers are deployed. Customers can implement multiple Usage Containers in each environment for redundancy and to reduce the risk of failure.
If a customer has ASR containers in multiple availability zones or clusters, assigning Usage Containers per environment or cluster reduces latency and the requirement to send messages between clusters.
Orchestrating multiple ASR Usage Containers allows redundancy in the event of network or storage failure. It is possible to deploy multiple ASR Usage Containers in a single environment and have usage data distributed to those Containers. A basic example scenario is shown below.
The docker-compose example below illustrates this scenario with:
- Two Speechmatics ASR Usage Containers
- One proxy Container, to route telemetry data
- One Speechmatics ASR Batch Container
---
#
# Example docker-compose file using multiple telemetry Containers.
#
version: '3.4'

# Common setup for ASR Usage Containers
x-usage-template: &usage-template
  image: asr-usage:x.y.z
  labels:
    - 'traefik.enable=true'
    - 'traefik.tcp.routers.usage.rule=HostSNI(`*`)'
    - 'traefik.tcp.routers.usage.entrypoints=custom'
    - 'traefik.tcp.routers.usage.tls=true'
    - 'traefik.tcp.routers.usage.tls.passthrough=true'
    - 'traefik.tcp.routers.usage.service=telemeter'
    # The service name here must match the one the router refers to
    - 'traefik.tcp.services.telemeter.loadbalancer.server.port=9090'
  depends_on:
    - proxy

services:
  # Traefik reverse proxy, to route telemetry events to multiple ASR Usage
  # Containers
  proxy:
    image: traefik:v2.4
    command: --providers.docker --providers.docker.exposedByDefault=false --entrypoints.custom.address=:9090
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  usagecontainer1:
    <<: *usage-template
    ports:
      - '8001:8000'

  usagecontainer2:
    <<: *usage-template
    ports:
      - '8002:8000'

  transcriber-batch:
    image: batch-asr-transcriber-en:x.y.z
    environment:
      SM_EATS_URL: proxy:9090
    volumes:
      - ./input/10_sec_news.wav:/input.audio
      - ./input/license.json:/license.json
    depends_on:
      - proxy
      - usagecontainer1
      - usagecontainer2
The example configures the ASR Batch Container to send data, using SM_EATS_URL, to the proxy container instead of a specific ASR Usage Container. When receiving usage data, the proxy will forward it to one ASR Usage Container, using round robin balancing.
Each ASR Usage Container will need its own persistent storage volume to store usage data. This means that when generating reports to send to Speechmatics, an export request must be made for each ASR Usage Container the user has in operation. There will be as many reports as there are ASR Usage Containers deployed.
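One export per Usage Container can be scripted as a loop. The sketch below assumes the export ports are published on a single host, as with ports 8001 and 8002 in the docker-compose example; the function names, the localhost host name and the output file names are all illustrative choices.

```shell
# Print one export address per Usage Container.
# Arguments: HOST followed by one published export port per container.
export_urls() {
    host=$1
    shift
    for port in "$@"; do
        echo "${host}:${port}/v1/export"
    done
}

# Download one report per Usage Container (only runs when called).
export_all() {
    host=$1
    shift
    for port in "$@"; do
        # -f makes curl fail on HTTP errors instead of saving the error body
        curl -fsS "${host}:${port}/v1/export" -o "usage_${port}.json.gz" || return 1
    done
}

# Usage (assumed host and ports): export_all localhost 8001 8002
```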
Exporting Usage Data
The exported data must not be modified in any way before sending to Speechmatics. Speechmatics will request a new unmodified data export if it is found that data has been altered.
Data is retained in the Usage Container for 90 days, after which point it is purged.
Currently, Speechmatics requires data to be sent via email to billing-reporting@speechmatics.com.
The data must be exported from each ASR Usage Container you have used, and then sent to Speechmatics for calculation. The ASR Usage Container has a REST API to export transcription data. You will need to send at least as many reports as you have ASR Usage Containers. If your transcription usage is heavy, you may have to provide multiple reports for a single ASR Usage Container. You can send multiple attachments in one email, or each attachment in a separate email, so long as you stay under your email provider's limits for sending files.
To remain under the 25MB email attachment limit, we recommend compressed files with no more than 10,000 batch jobs or 1250 Real-Time sessions of one hour.
Data is exported in compressed json.gz format. All files must be sent in this format to Speechmatics. The name of the file does not matter.
The complete API reference for extracting usage data can be found in the API Reference section.
# To export all data
curl 'asr-usage.net:8000/v1/export' > ExportExampleFile.json.gz
# To export data within a date window, e.g. 1-Jan-2020 to 1-Feb-2020
curl 'asr-usage.net:8000/v1/export?since=2020-01-01T00:00:00.000000Z&until=2020-02-01T00:00:00Z' > ExportExampleFile-2020-01-01_2020-02-01.json.gz
If the number of jobs extracted is too large, a 4XX response may be returned. In testing, this has generally occurred at approximately 25,000 Batch jobs or 5,000 hour-long Real-Time jobs.
In such cases, please select a smaller time window with the since and until parameters.
It is fine to have overlapping reports with duplicate data. Transcriptions will always be billed once; the billing cycle will be determined by their time of completion.
The following example script exports reports by each week for the whole month:
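#!/bin/bash
# Export one report per 7-day window across a month (requires GNU date).
# Use ISO-8601 format
START="2021-11-01T00:00:00Z"
END="2021-12-01T00:00:00Z"
CHUNK="7 day"
# -u keeps all date arithmetic in UTC, matching the Z suffix in the timestamps
d=$(date -u -d "$START" -I)
# Stop once a window would start at or after END; the last window may run past
# END, which is fine because overlapping reports are accepted
while [ "$(date -u -d "$d" +%s)" -lt "$(date -u -d "$END" +%s)" ]; do
  SINCE=$(date -u -d "$d" +"%Y-%m-%dT%H:%M:%SZ")
  d=$(date -u -I -d "$d + $CHUNK")
  UNTIL=$(date -u -d "$d" +"%Y-%m-%dT%H:%M:%SZ")
  curl "asr-usage.net:8000/v1/export?since=${SINCE}&until=${UNTIL}" > "exported_$(date -u -d "${SINCE}" -I)_$(date -u -d "${UNTIL}" -I).json.gz"
done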
Once the data has been exported, it must be emailed as attachment(s) to billing-reporting@speechmatics.com. Any exported files should remain compressed when sending to Speechmatics.
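Before attaching a report, it can be worth confirming that the file is an intact gzip archive and is small enough to email. The sketch below is illustrative only: check_export_file is a hypothetical helper, and the 25 MB limit is taken as 25 * 1024 * 1024 bytes, which is an assumption about how the limit is measured. The check reads the file without modifying it.

```shell
# Hypothetical pre-send check for an exported report.
# Succeeds if the file is a readable gzip archive under the assumed size limit.
check_export_file() {
    file=$1
    limit=$(( 25 * 1024 * 1024 ))
    # gzip -t tests archive integrity without writing the unpacked data
    gzip -t "$file" 2>/dev/null || { echo "$file: not a valid gzip file"; return 1; }
    # The arithmetic expansion strips any padding that wc adds on some systems
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -ge "$limit" ]; then
        echo "$file: $size bytes, too large to attach"
        return 1
    fi
    echo "$file: OK ($size bytes)"
}

# Usage: check_export_file ExportExampleFile.json.gz
```

If a file is too large, a narrower since and until window produces a smaller export.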
The exported usage data is a compressed JSON file; it is possible to inspect the contents by unpacking it and opening the text file. The following example uses the jq JSON parser.
$ cat exported_2020-01-01_2020-02-01.json.gz | gunzip | jq .
{
"header": {
"alg": "HS512"
},
"payload": {
"events": [
{
...
For further information see offline usage Technical Details