
Input

Learn about configuration and supported input audio formats for the Speechmatics Batch API

This page documents the audio inputs supported for transcription via the REST API (also known as Batch SaaS).

Supported File Types

The following file formats are supported for transcription by REST API:

  • wav
  • mp3
  • aac
  • ogg
  • mpeg
  • amr
  • m4a
  • mp4
  • flac

This list is exhaustive: any file format not listed above is explicitly not supported.

Only files where the type can be determined by data inspection are supported. Raw audio formats where the codec is not embedded in the file cannot be processed in batch mode. This includes files commonly given extensions like ".raw" or ".g729" where the codec is only hinted at in the name.

Job configuration options

Jobs are configured by passing a JSON string to the config field of the CreateJobRequest (see the API reference).
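A minimal configuration for a transcription job might look like the following sketch (the language code is purely illustrative):

{
  "type": "transcription",
  "transcription_config": {
    "language": "en"
  }
}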

JobConfig schema

The complete set of fields of the configuration object is listed below:

type (string, required)

Possible values: [alignment, transcription]

fetch_data (object)

url (string, required)

auth_headers (string[])

A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
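For example, a sketch of a fetch_data section that supplies an OAuth2 bearer token as an additional header (the URL and ${TOKEN} value are placeholders, and each entry is assumed to be a complete "Name: value" header string):

"fetch_data": {
  "url": "https://example.com/media/recording.mp3",
  "auth_headers": ["Authorization: Bearer ${TOKEN}"]
}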

fetch_text (object)

url (string, required)

auth_headers (string[])

A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.

alignment_config (object)

language (string, required)

transcription_config (object)

language (string, required)

Language model to process the audio input, normally specified as an ISO language code.

domain (string)

Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".

output_locale (string)

Language locale to be used when generating the transcription output, normally specified as an ISO language code.

operating_point (string)

Specify an operating point to use. Operating points change the transcription process in a high-level way, such as altering the acoustic model. The default is standard.

  • standard: the default operating point.
  • enhanced: transcription will take longer but be more accurate than 'standard'.

Possible values: [standard, enhanced]

additional_vocab (object[])

List of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition. Each entry has the following fields (see the sketch after this list):

  • content (string, required)
  • sounds_like (string[])
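As an illustrative sketch (the vocabulary entries are made up), additional_vocab is part of transcription_config and could be configured like this:

"transcription_config": {
  "language": "en",
  "additional_vocab": [
    { "content": "gnocchi", "sounds_like": ["nyohki", "nokey"] },
    { "content": "Speechmatics" }
  ]
}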
punctuation_overrides (object)

Control punctuation settings.

sensitivity (float)

Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.

Possible values: >= 0 and <= 1

permitted_marks (string[])

The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.

Possible values: each value must match the regular expression ^(.|all)$
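For instance, a sketch that restricts output to full stops and commas and raises the sensitivity (the values are arbitrary):

"transcription_config": {
  "language": "en",
  "punctuation_overrides": {
    "permitted_marks": [".", ","],
    "sensitivity": 0.7
  }
}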

diarization (string)

Specify whether speaker or channel labels are added to the transcript. The default is none.

  • none: no speaker or channel labels are added.
  • speaker: speaker attribution is performed based on acoustic matching; all input channels are mixed into a single stream for processing.
  • channel: multiple input channels are processed individually and collated into a single transcript.

Possible values: [none, speaker, channel]

channel_diarization_labels (string[])

Transcript labels to use when collating separate input channels.

Possible values: each value must match the regular expression ^[A-Za-z0-9._]+$
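A sketch of channel diarization with custom labels (the label names are arbitrary examples):

"transcription_config": {
  "language": "en",
  "diarization": "channel",
  "channel_diarization_labels": ["Agent", "Caller"]
}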

enable_entities (boolean)

Include additional 'entity' objects in the transcription results (e.g. dates, numbers) and their original spoken form. These entities are interleaved with other types of results. The concatenation of these words is represented as a single entity with the concatenated written form present in the 'content' field.

The entities contain a 'spoken_form' field, which can be used in place of the corresponding 'word' type results, in case a spoken form is preferred to a written form. They also contain a 'written_form', which can be used instead of the entity, if you want a breakdown of the words without spaces. They can still contain non-breaking spaces and other special whitespace characters, as they are considered part of the word for the formatting output. In case of a written_form, the individual word times are estimated and might not be accurate if the order of the words in the written form does not correspond to the order they were actually spoken (such as 'one hundred million dollars' and '$100 million').

max_delay_mode (string)

Whether or not to enable flexible endpointing and allow the entity to continue to be spoken.

Possible values: [fixed, flexible]

transcript_filtering_config (object)

Configuration for applying filtering to the transcription.

remove_disfluencies (boolean)

If true, words that are identified as disfluencies will be removed from the transcript. If false (default), they are tagged in the transcript as 'disfluency'.

speaker_diarization_config (object)

Configuration for speaker diarization.

speaker_sensitivity (float)

Controls how sensitive the algorithm is to keeping similar speakers separate rather than combining them into a single speaker. Higher values typically lead to more speakers, because less difference between speakers is required for them to remain distinct. Lower values make the algorithm less sensitive to retaining similar speakers, and may therefore lead to fewer speakers overall. The default is 0.5.

Possible values: >= 0 and <= 1
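A sketch of speaker diarization with a raised sensitivity (the value 0.8 is only an example):

"transcription_config": {
  "language": "en",
  "diarization": "speaker",
  "speaker_diarization_config": {
    "speaker_sensitivity": 0.8
  }
}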

notification_config (object[])

url (string, required)

The url to which a notification message will be sent upon completion of the job. The job id and status are added as query parameters, and any combination of the job inputs and outputs can be included by listing them in contents.

If contents is empty, the body of the request will be empty.

If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json.

If multiple items are listed they will be sent as named file attachments using the multipart content type.

If contents is not specified, the transcript item will be sent as a file attachment named data_file, for backwards compatibility.

If the job was rejected or failed during processing, that will be indicated by the status, and any output items that are not available as a result will be omitted. The body formatting rules will still be followed as if all items were available.

The user-agent header is set to Speechmatics-API/2.0, or Speechmatics API V2 in older API versions.

contents (string[])

Specifies a list of items to be attached to the notification message. When multiple items are requested, they are included as named file attachments.

Possible values: [jobinfo, transcript, transcript.json-v2, transcript.txt, transcript.srt, alignment, alignment.word_start_and_end, alignment.one_per_line, data, text]

method (string)

The method to be used with http and https urls. The default is post.

Possible values: [post, put]

auth_headers (string[])

A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.
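For example, a sketch of a notification that posts the JSON transcript and job info to a webhook, authenticated with a bearer token (the URL and ${TOKEN} are placeholders):

"notification_config": [
  {
    "url": "https://example.com/webhook",
    "contents": ["transcript.json-v2", "jobinfo"],
    "method": "post",
    "auth_headers": ["Authorization: Bearer ${TOKEN}"]
  }
]

Because two items are listed in contents, they would be sent as named file attachments using the multipart content type.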
tracking (object)

title (string)

The title of the job.

reference (string)

External system reference.

tags (string[])

details (object)

Customer-defined JSON structure.

output_config (object)

srt_overrides (object)

Parameters that override the default values used for SRT conversion.

max_line_length (integer)

Sets the maximum number of characters per subtitle line, including white space.

max_lines (integer)

Sets the maximum number of lines in a subtitle section.
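A sketch limiting SRT output to at most 37 characters per line and 2 lines per subtitle section (the numbers are arbitrary):

"output_config": {
  "srt_overrides": {
    "max_line_length": 37,
    "max_lines": 2
  }
}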
translation_config (object)

target_languages (string[], required)

Possible values: at most 5 target languages.
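A sketch requesting translation into two target languages, assuming the same ISO language codes used elsewhere in the configuration:

"translation_config": {
  "target_languages": ["fr", "de"]
}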

language_identification_config (object)

expected_languages (string[])

low_confidence_action (string)

Action to take if all of the predicted languages are below the confidence threshold.

Possible values: [allow, reject, use_default_language]

default_language (string)
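A sketch that restricts identification to a short list of expected languages and falls back to a default when confidence is low (all values are illustrative):

"language_identification_config": {
  "expected_languages": ["en", "de", "fr"],
  "low_confidence_action": "use_default_language",
  "default_language": "en"
}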
summarization_config (object)

Configuration options for summarization.

content_type (string)

Choose from three options:

  • conversational - Best suited for dialogues involving multiple participants, such as calls, meetings or discussions. It focuses on summarizing key points of the conversation.
  • informative - Recommended for more structured information delivered by one or more people, making it ideal for videos, podcasts, lectures, and presentations.
  • auto - Automatically selects the most appropriate content type based on an analysis of the transcript.

Possible values: [auto, informative, conversational]

Default value: auto

summary_length (string)

Determines the depth of the summary:

  • brief - Provides a succinct summary, condensing the content into just a few sentences.
  • detailed - Provides a longer, structured summary. For conversational content, it includes key topics and a summary of the entire conversation. For informative content, it logically divides the audio into sections and provides a summary for each.

Possible values: [brief, detailed]

Default value: brief

summary_type (string)

Possible values: [paragraphs, bullets]
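A sketch requesting a brief, bullet-point summary of conversational content:

"summarization_config": {
  "content_type": "conversational",
  "summary_length": "brief",
  "summary_type": "bullets"
}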

sentiment_analysis_config (object)

topic_detection_config (object)

topics (string[])

auto_chapters_config (object)

audio_events_config (object)

types (string[])
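As a final sketch, the remaining analysis features are configured with their own objects; the topic names below are arbitrary, and the empty objects are assumed to request the feature with its default behaviour:

"sentiment_analysis_config": {},
"topic_detection_config": {
  "topics": ["pricing", "cancellations"]
},
"auto_chapters_config": {}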

Fetch URL

If you store your digital media in cloud storage (for example AWS S3 or Azure Blob Storage), you can also submit a job by providing the URL of the audio file. The configuration uses a fetch_data section, which looks like this:

Configuration example

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  },
  "fetch_data": {
    "url": "${URL}/{FILENAME}"
  }
}

In SaaS, fetch requests made to the URL in fetch_data have the user-agent header set to Speechmatics-API/2.0.

Fetch failure

If the Speechmatics Batch SaaS is unable to retrieve audio from the specified online location, the job will fail with a status of rejected, and no transcript will be generated. Users can retrieve failure information by making a GET /jobs/$JOBID request and use it to diagnose the problem.

If the job has failed, the response will contain an additional errors element, which lists all the failure messages Speechmatics Batch SaaS encountered while carrying out the fetch request. Please note that there can be multiple failed attempts associated with one submitted job, as there is a retry mechanism in place.

{
  "job": {
    "config": {
      "fetch_data": {
        "url": "https://example.com/average-files/punctuation1.mp3"
      },
      "notification_config": [
        {
          "contents": ["jobinfo"],
          "url": "https://example.com/"
        }
      ],
      "transcription_config": {
        "language": "de"
      },
      "type": "transcription"
    },
    "created_at": "2021-07-19T12:55:03.754Z",
    "data_name": "",
    "duration": 0,
    "errors": [
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:05.425Z"
      },
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:07.649Z"
      },
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:17.665Z"
      },
      {
        "message": "unable to fetch audio: http status code 404",
        "timestamp": "2021-07-19T12:55:37.643Z"
      }
    ],
    "id": "a81ko4eqjl",
    "status": "rejected"
  }
}