Output

Learn about the supported output formats for the Speechmatics Batch API

Transcription jobs are processed asynchronously. You can check the status of a job to see if it has been completed.

You can also configure notifications to be sent to a webhook when a job is completed. See Notifications for more details.

Check single job status

If you wish to retrieve a particular job, you can do so using the job ID for up to 7 days, after which time it will be automatically deleted in accordance with our Data Retention Policy.

You can make a GET request to check the status of a job as follows:

# $JOB_ID is from the submit command output
curl -L -X GET "https://asr.api.speechmatics.com/v2/jobs/$JOB_ID" \
-H "Authorization: Bearer $API_KEY"

The response is a JSON object containing details of the job, with the status field showing whether the job is still processing or not. The possible values are:

  • running: The job is still processing.
  • done: The job has been completed. The transcript is available for download at the /jobs/:jobid/transcript endpoint.
  • rejected: Transcription was not possible. This will be accompanied by an error message.

An example response looks like this:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "job": {
    "config": {
      "notification_config": null,
      "transcription_config": {
        "additional_vocab": null,
        "channel_diarization_labels": null,
        "language": "en"
      },
      "type": "transcription"
    },
    "created_at": "2019-01-17T17:50:54.113Z",
    "data_name": "example.wav",
    "duration": 275,
    "id": "yjbmf9kqub",
    "status": "running"
  }
}
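
If you are scripting against the API, a minimal polling sketch might look like the following. This is only an illustration of the status values above, not an official client pattern, and it assumes the jq command-line JSON processor is installed:

# Poll every 10 seconds until the job leaves the "running" state, then report the final status
while true; do
  STATUS=$(curl -sL "https://asr.api.speechmatics.com/v2/jobs/$JOB_ID" \
    -H "Authorization: Bearer $API_KEY" | jq -r '.job.status')
  [ "$STATUS" != "running" ] && break
  sleep 10
done
echo "Job $JOB_ID finished with status: $STATUS"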

Check multiple job statuses

You can retrieve the statuses of the 100 most recent jobs submitted in the past 7 days. This is done by making a GET request without a job ID as follows:

curl -L -X GET "https://asr.api.speechmatics.com/v2/jobs/" \
-H "Authorization: Bearer $API_KEY"

Note that if a job has been deleted it will not be included in the list, unless specifically requested, as described in the API Reference.
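
For example, assuming the list response wraps the job summaries in a jobs array (as described in the API Reference), you could tabulate the ID, status and file name of each job with jq:

# Print one line per job: id, status and data_name (sketch; assumes jq is installed)
curl -sL "https://asr.api.speechmatics.com/v2/jobs/" \
  -H "Authorization: Bearer $API_KEY" \
  | jq -r '.jobs[] | "\(.id)\t\(.status)\t\(.data_name)"'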

Load and process the transcript

Transcripts can be retrieved from the /jobs/:jobid/transcript endpoint in three formats: JSON, plain text or SRT. You can request a format using the format query parameter; see the API Reference for details, and the example requests after the list below.

A few useful things to know about transcript formats:

  • The default format is JSON.
  • Use the format=txt query parameter to get the transcript in plain text. Useful for quick access to the transcript.
  • Use the format=srt query parameter to get the transcript in SRT format. Useful for displaying the transcript in a subtitle file.
  • To access other data, including word timestamps, translations, and speech intelligence features, use the default JSON format.
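
For example, to download the same transcript in each of the supported formats:

# Default JSON format
curl -L -X GET "https://asr.api.speechmatics.com/v2/jobs/$JOB_ID/transcript" \
-H "Authorization: Bearer $API_KEY"

# Plain text
curl -L -X GET "https://asr.api.speechmatics.com/v2/jobs/$JOB_ID/transcript?format=txt" \
-H "Authorization: Bearer $API_KEY"

# SRT subtitles
curl -L -X GET "https://asr.api.speechmatics.com/v2/jobs/$JOB_ID/transcript?format=srt" \
-H "Authorization: Bearer $API_KEY"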

Transcript response schema

Below is the schema for the transcript response when using the default JSON format.

Please refer to our API Reference for further details.
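
To give a sense of the overall shape before going through the fields one by one, here is a heavily trimmed, illustrative example assembled from the field descriptions below. It is not a real API response, and most optional fields are omitted:

{
  "format": "2.1",
  "job": {
    "created_at": "2019-01-17T17:50:54.113Z",
    "data_name": "example.wav",
    "duration": 275,
    "id": "yjbmf9kqub"
  },
  "metadata": {
    "created_at": "2019-01-17T17:55:02.340Z",
    "type": "transcription",
    "transcription_config": { "language": "en" }
  },
  "results": [
    {
      "type": "word",
      "start_time": 0.55,
      "end_time": 1.2,
      "alternatives": [{ "content": "Hello", "confidence": 0.97, "language": "en" }]
    },
    {
      "type": "punctuation",
      "start_time": 1.2,
      "end_time": 1.2,
      "is_eos": true,
      "attaches_to": "previous",
      "alternatives": [{ "content": ".", "confidence": 1.0, "language": "en" }]
    }
  ]
}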

  • format (string, required): Speechmatics JSON transcript format version number. Example: 2.1

  • job (object, required): Summary information about an ASR job, to support identification and tracking.

      • created_at (date-time, required): The UTC date time the job was created. Example: 2018-01-09T12:29:01.853047Z
      • data_name (string, required): Name of data file submitted for job.
      • duration (integer, required): The data file audio duration (in seconds). Possible values: >= 0
      • id (string, required): The unique id assigned to the job. Example: a1b2c3d4e5
      • text_name (string): Name of the text file submitted to be aligned to audio.
      • tracking (object):
          • title (string): The title of the job.
          • reference (string): External system reference.
          • tags (string[])
          • details (object): Customer-defined JSON structure.

  • metadata (object, required): Summary information about the output from an ASR job, comprising the job type and configuration parameters used when generating the output.

      • created_at (date-time, required): The UTC date time the transcription output was created. Example: 2018-01-09T12:29:01.853047Z
      • type (string, required): Possible values: [alignment, transcription]
      • transcription_config (object):
          • language (string, required): Language model to process the audio input, normally specified as an ISO language code.
          • domain (string): Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
          • output_locale (string): Language locale to be used when generating the transcription output, normally specified as an ISO language code.
          • operating_point (string): Specify an operating point to use. Operating points change the transcription process in a high level way, such as altering the acoustic model. The default is standard. Possible values: [standard, enhanced]
              • standard: the default operating point.
              • enhanced: transcription will take longer but be more accurate than 'standard'.
          • additional_vocab (object[]): List of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition. Each entry contains:
              • content (string, required)
              • sounds_like (string[])
          • punctuation_overrides (object): Control punctuation settings.
              • sensitivity (float): Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5. Possible values: >= 0 and <= 1
              • permitted_marks (string[]): The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process. Each value must match the regular expression ^(.|all)$
          • diarization (string): Specify whether speaker or channel labels are added to the transcript. The default is none. Possible values: [none, speaker, channel]
              • none: no speaker or channel labels are added.
              • speaker: speaker attribution is performed based on acoustic matching; all input channels are mixed into a single stream for processing.
              • channel: multiple input channels are processed individually and collated into a single transcript.
          • channel_diarization_labels (string[]): Transcript labels to use when collating separate input channels. Each value must match the regular expression ^[A-Za-z0-9._]+$
          • enable_entities (boolean): Include additional 'entity' objects in the transcription results (e.g. dates, numbers) and their original spoken form. These entities are interleaved with other types of results. The concatenation of these words is represented as a single entity with the concatenated written form present in the 'content' field. The entities contain a 'spoken_form' field, which can be used in place of the corresponding 'word' type results, in case a spoken form is preferred to a written form. They also contain a 'written_form', which can be used instead of the entity, if you want a breakdown of the words without spaces. These can still contain non-breaking spaces and other special whitespace characters, as they are considered part of the word for the formatting output. In the case of a written_form, the individual word times are estimated and might not be accurate if the order of the words in the written form does not correspond to the order they were actually spoken (such as 'one hundred million dollars' and '$100 million').
          • max_delay_mode (string): Whether or not to enable flexible endpointing and allow the entity to continue to be spoken. Possible values: [fixed, flexible]
          • transcript_filtering_config (object): Configuration for applying filtering to the transcription.
              • remove_disfluencies (boolean): If true, words that are identified as disfluencies will be removed from the transcript. If false (the default), they are tagged in the transcript as 'disfluency'.
          • speaker_diarization_config (object): Configuration for speaker diarization.
              • speaker_sensitivity (float): Controls how sensitive the algorithm is in terms of keeping similar speakers separate, as opposed to combining them into a single speaker. Higher values will typically lead to more speakers, as the degree of difference required for speakers to remain distinct is lower; a lower value conversely makes the algorithm less sensitive in retaining similar speakers, and as such may lead to fewer speakers overall. The default is 0.5. Possible values: >= 0 and <= 1
      • orchestrator_version (string): The engine version used to generate the transcription output. Example: 2024.12.26085+a0a32e61ad.HEAD
      • translation_errors (object[]): List of errors that occurred in the translation stage. Each entry contains:
          • type (string): Possible values: [translation_failed, unsupported_translation_pair]
          • message (string): Human readable error message.
      • summarization_errors (object[]): List of errors that occurred in the summarization stage. Each entry contains:
          • type (string): Possible values: [summarization_failed, unsupported_language]
          • message (string): Human readable error message.
      • sentiment_analysis_errors (object[]): List of errors that occurred in the sentiment analysis stage. Each entry contains:
          • type (string): Possible values: [sentiment_analysis_failed, unsupported_language]
          • message (string): Human readable error message.
      • topic_detection_errors (object[]): List of errors that occurred in the topic detection stage. Each entry contains:
          • type (string): Possible values: [topic_detection_failed, unsupported_list_of_topics, unsupported_language]
          • message (string): Human readable error message.
      • auto_chapters_errors (object[]): List of errors that occurred in the auto chapters stage. Each entry contains:
          • type (string): Possible values: [auto_chapters_failed, unsupported_language]
          • message (string): Human readable error message.
      • alignment_config (object):
          • language (string, required)
      • output_config (object):
          • srt_overrides (object): Parameters that override default values of SRT conversion. max_line_length: sets maximum count of characters per subtitle line including white space. max_lines: sets maximum count of lines in a subtitle section.
              • max_line_length (integer)
              • max_lines (integer)
      • language_pack_info (object): Properties of the language pack.
          • language_description (string): Full descriptive name of the language, e.g. 'Japanese'.
          • word_delimiter (string, required): The character to use to separate words.
          • writing_direction (string): The direction that words in the language should be written and read in. Possible values: [left-to-right, right-to-left]
          • itn (boolean): Whether or not ITN (inverse text normalization) is available for the language pack.
          • adapted (boolean): Whether or not language model adaptation has been applied to the language pack.
      • language_identification (object): Result of the language identification of the audio, configured using language_identification_config, or by setting the transcription language to auto.
          • results (object[]): Each entry contains:
              • alternatives (object[]): each with language (string) and confidence (number).
              • start_time (number)
              • end_time (number)
          • error (string): Possible values: [LOW_CONFIDENCE, UNEXPECTED_LANGUAGE, NO_SPEECH, FILE_UNREADABLE, OTHER]
          • message (string)

  • results (RecognitionResult[], required): Each result item contains:

      • channel (string)
      • start_time (float, required)
      • end_time (float, required)
      • volume (float): An indication of the volume of audio across the time period the word was spoken. Possible values: >= 0 and <= 100
      • is_eos (boolean): Whether the punctuation mark is an end of sentence character. Only applies to punctuation marks.
      • type (string, required): New types of items may appear without being requested; unrecognized item types can be ignored. Possible values: [word, punctuation, entity]
      • written_form (object[]): Each entry contains:
          • alternatives (object[], required): each with:
              • content (string, required)
              • confidence (float, required)
              • language (string, required)
              • display (object): contains direction (string, required); possible values: [ltr, rtl]
              • speaker (string)
              • tags (string[])
          • start_time (float, required)
          • end_time (float, required)
          • type (string, required): What kind of object this is. See #/Definitions/RecognitionResult for definitions of the enums. Possible values: [word]
      • spoken_form (object[]): Each entry contains:
          • alternatives (object[], required): each with:
              • content (string, required)
              • confidence (float, required)
              • language (string, required)
              • display (object): contains direction (string, required); possible values: [ltr, rtl]
              • speaker (string)
              • tags (string[])
          • start_time (float, required)
          • end_time (float, required)
          • type (string, required): What kind of object this is. See #/Definitions/RecognitionResult for definitions of the enums. Possible values: [word, punctuation]
      • alternatives (object[]): each with:
          • content (string, required)
          • confidence (float, required)
          • language (string, required)
          • display (object): contains direction (string, required); possible values: [ltr, rtl]
          • speaker (string)
          • tags (string[])
      • attaches_to (string): Attachment direction of the punctuation mark. This only applies to punctuation marks. This information can be used to produce a well-formed text representation by placing the word_delimiter from language_pack_info on the correct side of the punctuation mark. Possible values: [previous, next, both, none]

  • translations (object): Translations of the transcript into other languages. It is a map of ISO language codes to arrays of translated sentences. Configured using translation_config.

      • [property name: string] (object[]): each translated sentence contains:
          • start_time (float)
          • end_time (float)
          • content (string)
          • speaker (string)
          • channel (string)

  • summary (object): Summary of the transcript, configured using summarization_config.

      • content (string)

  • sentiment_analysis (object): The main object that holds sentiment analysis data.

      • sentiment_analysis (object): Holds the detailed sentiment analysis information.
          • segments (object[]): An array of objects that represent a segment of text and its associated sentiment. Each segment contains:
              • text (string): Represents the transcript of the analysed segment.
              • sentiment (string): The assigned sentiment to the segment, which can be positive, neutral or negative.
              • start_time (float): The timestamp corresponding to the beginning of the transcription segment.
              • end_time (float): The timestamp corresponding to the end of the transcription segment.
              • speaker (string): The speaker label for the segment, if speaker diarization is enabled.
              • channel (string): The channel label for the segment, if channel diarization is enabled.
              • confidence (float): A confidence score in the range of 0-1.
          • summary (object): An object that holds overall sentiment information, and per-speaker and per-channel sentiment data.
              • overall (object): Summary for all segments in the file.
                  • positive_count (integer)
                  • negative_count (integer)
                  • neutral_count (integer)
              • speakers (object[]): An array of objects that represent sentiment data for a specific speaker. Each entry contains:
                  • speaker (string)
                  • positive_count (integer)
                  • negative_count (integer)
                  • neutral_count (integer)
              • channels (object[]): An array of objects that represent sentiment data for a specific channel. Each entry contains:
                  • channel (string)
                  • positive_count (integer)
                  • negative_count (integer)
                  • neutral_count (integer)

  • topics (object): Main object that holds topic detection results.

      • segments (object[]): An array of objects that represent a segment of text and its associated topic information. Each segment contains:
          • text (string)
          • start_time (float)
          • end_time (float)
          • topics (object[]): each with:
              • topic (string)
      • summary (object): An object that holds overall information on the topics detected.
          • overall (object): Summary of overall topic detection results.
              • [property name: string] (integer)

  • chapters (object[]): An array of objects that represent summarized chapters of the transcript. Each chapter contains:

      • title (string): The auto-generated title for the chapter.
      • summary (string): An auto-generated paragraph-style, short summary of the chapter.
      • start_time (number): The start time of the chapter in the audio file.
      • end_time (number): The end time of the chapter in the audio file.

  • audio_events (object[]): Timestamped audio events, only set if audio_events_config is in the config. Each event contains:

      • type (string): Kind of audio event, e.g. music.
      • start_time (float): Time (in seconds) at which the audio event starts.
      • end_time (float): Time (in seconds) at which the audio event ends.
      • confidence (float): Prediction confidence associated with this event.
      • channel (string): Input channel this event occurred on.

  • audio_event_summary (object): Summary statistics per event type, keyed by type, e.g. music.

      • overall (object): Overall summary on all channels, keyed by event type.
          • [event type: string] (object): Summary statistics for this audio event type.
              • total_duration (float): Total duration (in seconds) of all audio events of this type.
              • count (number): Number of events of this type.
      • channels (object): Summary keyed by channel, only set if channel diarization is enabled. Each channel maps event types to their statistics.
          • [channel: string] (object):
              • [event type: string] (object): Summary statistics for this audio event type.
                  • total_duration (float): Total duration (in seconds) of all audio events of this type.
                  • count (number): Number of events of this type.
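
As a rough sketch of how the results fields above fit together (assuming jq is installed), the following commands pull per-word timings from the JSON transcript and rebuild a plain-text approximation. The text reconstruction deliberately simplifies the attaches_to and word_delimiter rules by always joining punctuation onto the previous word, so treat it as an illustration rather than a replacement for format=txt:

# Save the JSON transcript locally
curl -sL "https://asr.api.speechmatics.com/v2/jobs/$JOB_ID/transcript" \
  -H "Authorization: Bearer $API_KEY" > transcript.json

# Per-word timings: start time, end time and the best alternative's content
jq -r '.results[] | select(.type == "word")
       | "\(.start_time)\t\(.end_time)\t\(.alternatives[0].content)"' transcript.json

# Rough plain-text reconstruction: words separated by spaces, punctuation attached to the previous word
jq -r '[.results[] | select(.type == "word" or .type == "punctuation")
        | {c: .alternatives[0].content, t: .type}]
       | reduce .[] as $r (""; . + (if $r.t == "punctuation" or . == "" then "" else " " end) + $r.c)' transcript.json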