Input
Learn about configuration and supported input audio formats for the Speechmatics Batch API. This page documents audio inputs for transcription by REST API (a.k.a. Batch SaaS).
- For Real-time transcription, see the Real-time Transcription input.
- For Flow Voice AI, see the Flow Voice AI supported formats and limits.
Supported File Types
The following file formats are supported for transcription by REST API:
- wav
- mp3
- aac
- ogg
- mpeg
- amr
- m4a
- mp4
- flac
The list above is exhaustive: any file format not listed is explicitly not supported.
Only files where the type can be determined by data inspection are supported. Raw audio formats where the codec is not embedded in the file cannot be processed in batch mode. This includes files commonly given extensions like ".raw" or ".g729" where the codec is only hinted at in the name.
Job configuration options
Jobs are configured by passing a JSON string to the config field of the CreateJobRequest (see the API reference).
JobConfig schema
Below are the complete fields of the configuration object:
type
Possible values: [alignment, transcription]
fetch_data
A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
fetch_text
A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
alignment_config
transcription_config
Language model to process the audio input, normally specified as an ISO language code
Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
Language locale to be used when generating the transcription output, normally specified as an ISO language code
Specify an operating point to use. Operating points change the transcription process in a high-level way, such as altering the acoustic model. The default is standard.
- standard: the default operating point.
- enhanced: transcription will take longer but be more accurate than 'standard'.
Possible values: [standard, enhanced]
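For example, a minimal configuration requesting the more accurate model might look like the sketch below; the operating_point field name is taken from the Speechmatics API reference rather than this page, so treat it as an assumption:
{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced"
  }
}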
additional_vocab object[]
List of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition.
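A sketch of custom dictionary entries is shown below. The content and sounds_like field names follow the Speechmatics custom dictionary format and are not defined on this page, so treat them as illustrative:
"additional_vocab": [
  { "content": "gnocchi", "sounds_like": ["nyohki", "nokey"] },
  { "content": "Speechmatics" }
]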
punctuation_overrides
Control punctuation settings.
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
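As an illustration, a punctuation_overrides block that restricts the accepted marks and lowers the sensitivity could look like the following sketch; the permitted_marks and sensitivity field names are assumptions based on the descriptions above:
"punctuation_overrides": {
  "permitted_marks": [".", ",", "?"],
  "sensitivity": 0.4
}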
Specify whether speaker or channel labels are added to the transcript. The default is none.
- none: no speaker or channel labels are added.
- speaker: speaker attribution is performed based on acoustic matching; all input channels are mixed into a single stream for processing.
- channel: multiple input channels are processed individually and collated into a single transcript.
Possible values: [none, speaker, channel]
Transcript labels to use when collating separate input channels.
Possible values: Value must match regular expression ^[A-Za-z0-9._]+$
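For instance, a two-channel call could be transcribed with per-channel labels as sketched below. The diarization field appears in the example later on this page, while the channel_diarization_labels field name is an assumption:
{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Agent", "Caller"]
  }
}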
Include additional 'entity' objects in the transcription results (e.g. dates, numbers) and their original spoken form. These entities are interleaved with other types of results. The concatenation of these words is represented as a single entity with the concatenated written form present in the 'content' field. The entities contain a 'spoken_form' field, which can be used in place of the corresponding 'word' type results, in case a spoken form is preferred to a written form. They also contain a 'written_form', which can be used instead of the entity, if you want a breakdown of the words without spaces. They can still contain non-breaking spaces and other special whitespace characters, as they are considered part of the word for the formatting output. In case of a written_form, the individual word times are estimated and might not be accurate if the order of the words in the written form does not correspond to the order they were actually spoken (such as 'one hundred million dollars' and '$100 million').
Whether or not to enable flexible endpointing and allow the entity to continue to be spoken.
Possible values: [fixed, flexible]
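A minimal sketch of requesting entity results is shown below, assuming the flag is named enable_entities (the field name is not confirmed on this page):
"transcription_config": {
  "language": "en",
  "enable_entities": true
}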
transcript_filtering_config
Configuration for applying filtering to the transcription
If true, words that are identified as disfluencies will be removed from the transcript. If false (default), they are tagged in the transcript as 'disfluency'.
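For example, disfluency removal might be enabled as follows; the remove_disfluencies field name is an assumption based on the description above:
"transcript_filtering_config": {
  "remove_disfluencies": true
}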
speaker_diarization_config
Configuration for speaker diarization
Controls how sensitive the algorithm is in terms of keeping similar speakers separate, as opposed to combining them into a single speaker. Higher values will typically lead to more speakers, as the degree of difference between speakers in order to allow them to remain distinct will be lower. A lower value for this parameter will conversely guide the algorithm towards being less sensitive in terms of retaining similar speakers, and as such may lead to fewer speakers overall. The default is 0.5.
Possible values: >= 0 and <= 1
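A possible configuration combining speaker diarization with a higher sensitivity is sketched below; the speaker_sensitivity field name is an assumption based on the description above:
"transcription_config": {
  "language": "en",
  "diarization": "speaker",
  "speaker_diarization_config": {
    "speaker_sensitivity": 0.7
  }
}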
notification_config object[]
The url to which a notification message will be sent upon completion of the job. The job id and status are added as query parameters, and any combination of the job inputs and outputs can be included by listing them in contents.
If contents is empty, the body of the request will be empty.
If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json.
If multiple items are listed they will be sent as named file attachments using the multipart content type.
If contents is not specified, the transcript item will be sent as a file attachment named data_file, for backwards compatibility.
If the job was rejected or failed during processing, that will be indicated by the status, and any output items that are not available as a result will be omitted. The body formatting rules will still be followed as if all items were available.
The user-agent header is set to Speechmatics-API/2.0, or Speechmatics API V2 in older API versions.
Specifies a list of items to be attached to the notification message. When multiple items are requested, they are included as named file attachments.
Possible values: [jobinfo, transcript, transcript.json-v2, transcript.txt, transcript.srt, alignment, alignment.word_start_and_end, alignment.one_per_line, data, text]
The method to be used with http and https urls. The default is post.
Possible values: [post, put]
A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
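Putting these options together, a notification_config entry might look like the sketch below. The url and contents fields appear in the example later on this page, while the method and auth_headers field names are assumptions and the token is a placeholder:
"notification_config": [
  {
    "url": "https://collector.example.com/callback",
    "contents": ["transcript.json-v2", "jobinfo"],
    "method": "post",
    "auth_headers": ["Authorization: Bearer <token>"]
  }
]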
tracking
The title of the job.
External system reference.
Customer-defined JSON structure.
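A tracking block could be populated as in the illustrative sketch below; the title, reference and details field names are assumptions matching the three descriptions above:
"tracking": {
  "title": "Support call 2024-03-01",
  "reference": "CRM-12345",
  "details": { "team": "support", "priority": "high" }
}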
output_config object
srt_overrides object
Parameters that override the default values of SRT conversion. max_line_length: sets the maximum number of characters per subtitle line, including white space. max_lines: sets the maximum number of lines in a subtitle section.
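For example, to keep subtitles to two lines of at most 37 characters, the overrides could be set as follows (max_line_length and max_lines are the names given in the description above):
"output_config": {
  "srt_overrides": {
    "max_line_length": 37,
    "max_lines": 2
  }
}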
translation_config
Possible values: <= 5 (a job can request at most five translation target languages).
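A translation request could be sketched as below, assuming the list of languages is supplied in a target_languages field (an assumption, as the field name is not shown on this page):
"translation_config": {
  "target_languages": ["es", "de", "fr"]
}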
language_identification_config
Action to take if all of the predicted languages are below the confidence threshold
Possible values: [allow, reject, use_default_language]
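As an illustrative sketch, language identification might be configured as follows. The expected_languages and low_confidence_action field names are assumptions; the action values are those listed above:
"language_identification_config": {
  "expected_languages": ["en", "de", "es"],
  "low_confidence_action": "use_default_language"
}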
summarization_config object
Configuration options for summarization.
Choose from three options:
- conversational: Best suited for dialogues involving multiple participants, such as calls, meetings or discussions. It focuses on summarizing key points of the conversation.
- informative: Recommended for more structured information delivered by one or more people, making it ideal for videos, podcasts, lectures, and presentations.
- auto: Automatically selects the most appropriate content type based on an analysis of the transcript.
Possible values: [auto, informative, conversational]
Default value: auto
Determines the depth of the summary:
- brief: Provides a succinct summary, condensing the content into just a few sentences.
- detailed: Provides a longer, structured summary. For conversational content, it includes key topics and a summary of the entire conversation. For informative content, it logically divides the audio into sections and provides a summary for each.
Possible values: [brief, detailed]
Default value: brief
Possible values: [paragraphs, bullets]
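Combining the options above, a summarization request might be sketched as follows; the content_type, summary_length and summary_type field names are assumptions, while the values are those listed above:
"summarization_config": {
  "content_type": "conversational",
  "summary_length": "detailed",
  "summary_type": "bullets"
}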
topic_detection_config
audio_events_config object
Fetch URL
If you store your digital media in cloud storage (for example AWS S3 or Azure Blob Storage), you can also submit a job by providing the URL of the audio file. The configuration uses a fetch_data section, which looks like this:
Configuration example
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
},
"fetch_data": {
"url": "${URL}/{FILENAME}"
}
}
In SaaS, fetch requests made to the URL in fetch_data have the user-agent header set to Speechmatics-API/2.0.
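If the storage location requires authentication, the additional fetch headers described earlier can be supplied alongside the URL. A hedged sketch is shown below; the auth_headers field name is an assumption, and the bearer token is a placeholder:
{
  "type": "transcription",
  "transcription_config": {
    "language": "en"
  },
  "fetch_data": {
    "url": "https://example.com/recordings/call.mp3",
    "auth_headers": ["Authorization: Bearer <token>"]
  }
}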
Fetch failure
If the Speechmatics Batch SaaS is unable to retrieve audio from the specified online location, the job will fail with a status of rejected, and no transcript will be generated. Users can retrieve failure information by making a GET /jobs/$JOBID request and use it to diagnose the problem.
If the job has failed, the response will include an additional errors element, which shows all failure messages the Speechmatics Batch SaaS encountered when carrying out the fetch request. Note that there can be multiple failed attempts associated with one submitted job, as there is a retry mechanism in place.
{
"job": {
"config": {
"fetch_data": {
"url": "https://example.com/average-files/punctuation1.mp3"
},
"notification_config": [
{
"contents": ["jobinfo"],
"url": "https://example.com/"
}
],
"transcription_config": {
"language": "de"
},
"type": "transcription"
},
"created_at": "2021-07-19T12:55:03.754Z",
"data_name": "",
"duration": 0,
"errors": [
{
"message": "unable to fetch audio: http status code 404",
"timestamp": "2021-07-19T12:55:05.425Z"
},
{
"message": "unable to fetch audio: http status code 404",
"timestamp": "2021-07-19T12:55:07.649Z"
},
{
"message": "unable to fetch audio: http status code 404",
"timestamp": "2021-07-19T12:55:17.665Z"
},
{
"message": "unable to fetch audio: http status code 404",
"timestamp": "2021-07-19T12:55:37.643Z"
}
],
"id": "a81ko4eqjl",
"status": "rejected"
}
}