
Input

Learn about configuration and supported input audio formats for the Speechmatics Batch API

This page documents the audio inputs supported for transcription via the REST API (also known as Batch SaaS).

Supported File Types

The following file formats are supported for transcription by REST API:

  • wav
  • mp3
  • aac
  • ogg
  • mpeg
  • amr
  • m4a
  • mp4
  • flac

This list is exhaustive: any file format not listed above is explicitly not supported.

Only files where the type can be determined by data inspection are supported. Raw audio formats where the codec is not embedded in the file cannot be processed in batch mode. This includes files commonly given extensions like ".raw" or ".g729" where the codec is only hinted at in the name.

Job configuration options

Jobs are configured by passing a JSON string to the config field of the CreateJobRequest (see the API reference).
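A minimal configuration for a transcription job might look like the following sketch (the language code is purely illustrative):

{
  "type": "transcription",
  "transcription_config": {
    "language": "en"
  }
}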

JobConfig schema

The complete set of fields of the configuration object is listed below:

type (string, required)

Possible values: [alignment, transcription]

fetch_data (object)

url (string, required)

auth_headers (string[])

A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
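For example, a sketch of a fetch_data section that supplies an OAuth2 bearer token as an additional header (the URL and ${TOKEN} value are placeholders, and each entry is assumed to be a complete "Name: value" header string):

"fetch_data": {
  "url": "https://example.com/media/recording.mp3",
  "auth_headers": ["Authorization: Bearer ${TOKEN}"]
}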

fetch_text (object)

url (string, required)

auth_headers (string[])

A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.

alignment_config (object)

language (string, required)

transcription_config (object)

language (string, required)

Language model to process the audio input, normally specified as an ISO language code.

domain (string)

Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".

output_locale (string)

Language locale to be used when generating the transcription output, normally specified as an ISO language code.

operating_point (string)

Specify an operating point to use. Operating points change the transcription process in a high-level way, such as altering the acoustic model. The default is standard.

  • standard: the default operating point.
  • enhanced: transcription will take longer but be more accurate than 'standard'.

Possible values: [standard, enhanced]

additional_vocab (object[])

List of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition. Each entry has the following fields (see the sketch after this list):

  • content (string, required)
  • sounds_like (string[])
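As an illustrative sketch (the vocabulary entries are made up), additional_vocab is part of transcription_config and could be configured like this:

"transcription_config": {
  "language": "en",
  "additional_vocab": [
    { "content": "gnocchi", "sounds_like": ["nyohki", "nokey"] },
    { "content": "Speechmatics" }
  ]
}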
punctuation_overrides (object)

Control punctuation settings.

sensitivity (float)

Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.

Possible values: >= 0 and <= 1

permitted_marks (string[])

The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.

Possible values: each value must match the regular expression ^(.|all)$
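For instance, a sketch that restricts output to full stops and commas and raises the sensitivity (the values are arbitrary):

"transcription_config": {
  "language": "en",
  "punctuation_overrides": {
    "permitted_marks": [".", ","],
    "sensitivity": 0.7
  }
}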

diarization (string)

Specify whether speaker or channel labels are added to the transcript. The default is none.

  • none: no speaker or channel labels are added.
  • speaker: speaker attribution is performed based on acoustic matching; all input channels are mixed into a single stream for processing.
  • channel: multiple input channels are processed individually and collated into a single transcript.

Possible values: [none, speaker, channel]

channel_diarization_labels (string[])

Transcript labels to use when collating separate input channels.

Possible values: each value must match the regular expression ^[A-Za-z0-9._]+$
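A sketch of channel diarization with custom labels (the label names are arbitrary examples):

"transcription_config": {
  "language": "en",
  "diarization": "channel",
  "channel_diarization_labels": ["Agent", "Caller"]
}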

enable_entities (boolean)

Include additional 'entity' objects in the transcription results (e.g. dates, numbers) and their original spoken form. These entities are interleaved with other types of results. The concatenation of these words is represented as a single entity with the concatenated written form present in the 'content' field.

The entities contain a 'spoken_form' field, which can be used in place of the corresponding 'word' type results, in case a spoken form is preferred to a written form. They also contain a 'written_form', which can be used instead of the entity, if you want a breakdown of the words without spaces. They can still contain non-breaking spaces and other special whitespace characters, as they are considered part of the word for the formatting output. In case of a written_form, the individual word times are estimated and might not be accurate if the order of the words in the written form does not correspond to the order they were actually spoken (such as 'one hundred million dollars' and '$100 million').

max_delay_mode (string)

Whether or not to enable flexible endpointing and allow the entity to continue to be spoken.

Possible values: [fixed, flexible]

transcript_filtering_config (object)

Configuration for applying filtering to the transcription.

remove_disfluencies (boolean)

If true, words that are identified as disfluencies will be removed from the transcript. If false (default), they are tagged in the transcript as 'disfluency'.

speaker_diarization_config (object)

Configuration for speaker diarization.

speaker_sensitivity (float)

Controls how sensitive the algorithm is to keeping similar speakers separate rather than combining them into a single speaker. Higher values typically lead to more speakers, because less difference between speakers is required for them to remain distinct. Lower values make the algorithm less sensitive to retaining similar speakers, and may therefore lead to fewer speakers overall. The default is 0.5.

Possible values: >= 0 and <= 1
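A sketch of speaker diarization with a raised sensitivity (the value 0.8 is only an example):

"transcription_config": {
  "language": "en",
  "diarization": "speaker",
  "speaker_diarization_config": {
    "speaker_sensitivity": 0.8
  }
}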

notification_config (object[])

url (string, required)

The url to which a notification message will be sent upon completion of the job. The job id and status are added as query parameters, and any combination of the job inputs and outputs can be included by listing them in contents.

If contents is empty, the body of the request will be empty.

If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json.

If multiple items are listed they will be sent as named file attachments using the multipart content type.

If contents is not specified, the transcript item will be sent as a file attachment named data_file, for backwards compatibility.

If the job was rejected or failed during processing, that will be indicated by the status, and any output items that are not available as a result will be omitted. The body formatting rules will still be followed as if all items were available.

The user-agent header is set to Speechmatics-API/2.0, or Speechmatics API V2 in older API versions.

contents (string[])

Specifies a list of items to be attached to the notification message. When multiple items are requested, they are included as named file attachments.

Possible values: [jobinfo, transcript, transcript.json-v2, transcript.txt, transcript.srt, alignment, alignment.word_start_and_end, alignment.one_per_line, data, text]

method (string)

The method to be used with http and https urls. The default is post.

Possible values: [post, put]

auth_headers (string[])

A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
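For example, a sketch of a notification that posts the JSON transcript and job info to a webhook, authenticated with a bearer token (the URL and ${TOKEN} are placeholders):

"notification_config": [
  {
    "url": "https://example.com/webhook",
    "contents": ["transcript.json-v2", "jobinfo"],
    "method": "post",
    "auth_headers": ["Authorization: Bearer ${TOKEN}"]
  }
]

Because two items are listed in contents, they would be sent as named file attachments using the multipart content type.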
tracking (object)

title (string)

The title of the job.

reference (string)

External system reference.

tags (string[])

details (object)

Customer-defined JSON structure.

output_config (object)

srt_overrides (object)

Parameters that override the default values used for SRT conversion.

max_line_length (integer)

Sets the maximum number of characters per subtitle line, including white space.

max_lines (integer)

Sets the maximum number of lines in a subtitle section.
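A sketch limiting SRT output to at most 37 characters per line and 2 lines per subtitle section (the numbers are arbitrary):

"output_config": {
  "srt_overrides": {
    "max_line_length": 37,
    "max_lines": 2
  }
}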
translation_config (object)

target_languages (string[], required)

Possible values: at most 5 target languages.
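A sketch requesting translation into two target languages, assuming the same ISO language codes used elsewhere in the configuration:

"translation_config": {
  "target_languages": ["fr", "de"]
}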

language_identification_config (object)

expected_languages (string[])

low_confidence_action (string)

Action to take if all of the predicted languages are below the confidence threshold.

Possible values: [allow, reject, use_default_language]

default_language (string)
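A sketch that restricts identification to a short list of expected languages and falls back to a default when confidence is low (all values are illustrative):

"language_identification_config": {
  "expected_languages": ["en", "de", "fr"],
  "low_confidence_action": "use_default_language",
  "default_language": "en"
}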
summarization_config (object)

Configuration options for summarization.

content_type (string)

Choose from three options:

  • conversational - Best suited for dialogues involving multiple participants, such as calls, meetings or discussions. It focuses on summarizing key points of the conversation.
  • informative - Recommended for more structured information delivered by one or more people, making it ideal for videos, podcasts, lectures, and presentations.
  • auto - Automatically selects the most appropriate content type based on an analysis of the transcript.

Possible values: [auto, informative, conversational]

Default value: auto

summary_length (string)

Determines the depth of the summary:

  • brief - Provides a succinct summary, condensing the content into just a few sentences.
  • detailed - Provides a longer, structured summary. For conversational content, it includes key topics and a summary of the entire conversation. For informative content, it logically divides the audio into sections and provides a summary for each.

Possible values: [brief, detailed]

Default value: brief

summary_type (string)

Possible values: [paragraphs, bullets]
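A sketch requesting a brief, bullet-point summary of conversational content:

"summarization_config": {
  "content_type": "conversational",
  "summary_length": "brief",
  "summary_type": "bullets"
}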

sentiment_analysis_config (object)

topic_detection_config (object)

topics (string[])

auto_chapters_config (object)

audio_events_config (object)

types (string[])
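As a final sketch, the remaining analysis features are configured with their own objects; the topic names below are arbitrary, and the empty objects are assumed to request the feature with its default behaviour:

"sentiment_analysis_config": {},
"topic_detection_config": {
  "topics": ["pricing", "cancellations"]
},
"auto_chapters_config": {}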

Fetch URL

If you store your digital media in cloud storage (for example AWS S3 or Azure Blob Storage), you can also submit a job by providing the URL of the audio file. The configuration uses a fetch_data section, which looks like this:

Configuration example

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  },
  "fetch_data": {
    "url": "${URL}/{FILENAME}"
  }
}

In SaaS, fetch requests made to the URL in fetch_data have the user-agent header set to Speechmatics-API/2.0.

Fetch failure

If the Speechmatics Batch SaaS is unable to retrieve audio from the specified online location, the job will fail with a status of rejected, and no transcript will be generated. Users can retrieve failure information by making a GET /jobs/$JOBID request and use it to diagnose the problem.

If the job has failed, the response will contain an additional errors element, which lists all the failure messages Speechmatics Batch SaaS encountered while carrying out the fetch request. Please note that there can be multiple failed attempts associated with one submitted job, as there is a retry mechanism in place.

{
  "job": {
    "config": {
      "fetch_data": {
        "url": "https://example.com/average-files/punctuation1.mp3"
      },
      "notification_config": [
        {
          "contents": ["jobinfo"],
          "url": "https://example.com/"
        }
      ],
      "transcription_config": {
        "language": "de"
      },
      "type": "transcription"
    },
    "created_at": "2021-07-19T12:55:03.754Z",
    "data_name": "",
    "duration": 0,
    "errors": [
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:05.425Z"
      },
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:07.649Z"
      },
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:17.665Z"
      },
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:37.643Z"
      }
    ],
    "id": "a81ko4eqjl",
    "status": "rejected"
  }
}