Speech to Text

Formatting

Control how numbers, punctuation, and special text appear in your transcripts.

Output locale

Some languages have multiple spelling conventions that vary by region. To ensure consistent spelling throughout your transcript, specify an output locale:

{
  "type": "transcription",
  "transcription_config": {
    "model": "enhanced",
    "language": "en",
    "output_locale": "en-GB"
  }
}

Available English locales:

British English (en-GB)
US English (en-US)
Australian English (en-AU)

Available Mandarin locales:

Simplified Mandarin (cmn-Hans, default)
Traditional Mandarin (cmn-Hant)

Specifying an output locale is recommended for English transcription. Without one, spelling may be inconsistent within the same transcript.
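As a minimal sketch, a client could check a locale against the lists above before submitting a job. The helper name build_config and the OUTPUT_LOCALES table are illustrative assumptions, not part of any Speechmatics SDK; only the JSON keys come from this page.

```python
# Locales listed on this page, keyed by language code.
OUTPUT_LOCALES = {
    "en": ["en-GB", "en-US", "en-AU"],
    "cmn": ["cmn-Hans", "cmn-Hant"],
}


def build_config(language, output_locale=None, model="enhanced"):
    """Return a job config dict (hypothetical helper, not an SDK call)."""
    transcription_config = {"model": model, "language": language}
    if output_locale is not None:
        # Reject locales that this page does not list for the language.
        if output_locale not in OUTPUT_LOCALES.get(language, []):
            raise ValueError(
                f"{output_locale!r} is not a listed locale for {language!r}"
            )
        transcription_config["output_locale"] = output_locale
    return {"type": "transcription", "transcription_config": transcription_config}


config = build_config("en", "en-GB")
print(config["transcription_config"]["output_locale"])
```

Validating on the client side surfaces a mistyped locale before the job is sent, rather than relying on the service to reject it.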

Profanities

You can tag profanities to identify or censor offensive language in your workflow. Profanity tagging is available for:

English (en)
Italian (it)
Spanish (es)

Tagged profanities appear in the transcript with the profanity tag:

"results": [
  {
    "alternatives": [
      {
        "confidence": 1.0,
        "content": "$PROFANITY",
        "language": "en",
        "tags": [
          "profanity"
        ]
      }
    ],
    "end_time": 18.03,
    "start_time": 17.61,
    "type": "word"
  }
]

For other languages, consider using word replacement to identify profanities.
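One way to act on the profanity tag client-side is to mask tagged words when rebuilding the text. The sketch below assumes the results array has the shape shown above and always reads the first alternative; the function name and the mask string are illustrative choices.

```python
def censor_profanities(results, mask="***"):
    """Return the words in order, replacing any tagged profanity with mask."""
    words = []
    for item in results:
        alternative = item["alternatives"][0]
        # The tags list is only present on some words, so default to empty.
        if "profanity" in alternative.get("tags", []):
            words.append(mask)
        else:
            words.append(alternative["content"])
    return words


# Hypothetical sample in the shape shown above.
sample_results = [
    {
        "alternatives": [{"confidence": 1.0, "content": "well", "language": "en"}],
        "end_time": 17.58,
        "start_time": 17.20,
        "type": "word",
    },
    {
        "alternatives": [
            {
                "confidence": 1.0,
                "content": "$PROFANITY",
                "language": "en",
                "tags": ["profanity"],
            }
        ],
        "end_time": 18.03,
        "start_time": 17.61,
        "type": "word",
    },
]

print(" ".join(censor_profanities(sample_results)))  # prints "well ***"
```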

Disfluencies

Disfluencies are hesitation sounds like "um", "uh", and "hmm". Speechmatics automatically marks them with the disfluency tag in the transcript output:

"results": [
  {
    "alternatives": [
      {
        "confidence": 1.0,
        "content": "hmm",
        "language": "en",
        "tags": [
          "disfluency"
        ]
      }
    ],
    "end_time": 18.03,
    "start_time": 17.61,
    "type": "word"
  }
]
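Because disfluencies are tagged rather than removed by default, a client can count them, for example to measure how hesitant a speaker is. This is a sketch that assumes the results shape shown above; the function name and sample data are hypothetical.

```python
from collections import Counter


def count_disfluencies(results):
    """Count tagged hesitation sounds, keyed by lower-cased content."""
    counts = Counter()
    for item in results:
        alternative = item["alternatives"][0]
        if "disfluency" in alternative.get("tags", []):
            counts[alternative["content"].lower()] += 1
    return counts


def word(content, tags=None):
    """Build a minimal word result for the example (hypothetical data)."""
    alternative = {"confidence": 1.0, "content": content, "language": "en"}
    if tags:
        alternative["tags"] = tags
    return {"alternatives": [alternative], "type": "word"}


sample_results = [
    word("Um", ["disfluency"]),
    word("hello"),
    word("hmm", ["disfluency"]),
    word("um", ["disfluency"]),
]

print(count_disfluencies(sample_results))
```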

Full list of tagged English disfluencies

huh
aha
ah
aw
eh
err
hmm
mm
um
uh
uh-oh
uh-huh
uh-uh
mhm
a-ha
aah
aahh
aaw
ah-ha
ahaa
ahh
ahha
aww
eeh
erm
hhm
hhmm
hm
huh-uh
m-hm
uggh
ugh
ughh
uhh
uhhm
uhm
uhmm
umm
uuh
uuhh
uum

Supported languages for disfluencies

Disfluency tagging and removal are available for the following languages. Each language has its own set of hesitation sounds; the list above covers English only.

Arabic (ar)
Danish (da)
Dutch (nl)
English (en)
Finnish (fi)
French (fr)
German (de)
Greek (el)
Hebrew (he)
Hindi (hi)
Hungarian (hu)
Italian (it)
Japanese (ja)
Mandarin (cmn)
Norwegian (no)
Polish (pl)
Portuguese (pt)
Russian (ru)
Spanish (es)
Swedish (sv)

Coverage of hesitation sounds varies by language. If you rely on disfluency removal for a specific language, test it with representative audio rather than assuming full coverage.

Removing disfluencies

You can automatically remove disfluencies from your transcript:

"transcription_config": {
  "model": "enhanced",
  "language": "en",
  "transcript_filtering_config": {
    "remove_disfluencies": true
  }
}

This simplifies client-side processing by removing hesitation sounds and properly adjusting capitalization and spacing. For example:

Without disfluency removal:

Um, what would you like, hmm?

With disfluency removal:

What would you like?

Disfluency removal is available for the languages listed under "Supported languages for disfluencies". The default setting is "remove_disfluencies": false.

Word replacement

Word replacement lets you substitute specific words or patterns in the transcript after processing:

"transcription_config": {
  "model": "enhanced",
  "language": "en",
  "transcript_filtering_config": {
    "replacements": [
      {"from": "foo", "to": "bar"},
      {"from": "heavy", "to": "light"}
    ]
  }
}

Common uses for word replacement:

Censoring profanities in languages without built-in support
Masking sensitive information (card numbers, personal data)
Standardizing terminology or brand names
Fixing known issues with particular words

Word replacement is case-sensitive and applied after transcription is complete. For example, "Foo" would not be replaced by "bar" in the example above.

For adding new vocabulary, use the custom dictionary feature instead.

Regex

You can use regular expressions (ECMAScript format) in the from field by adding forward-slash delimiters:

// Replace both "Hello" and "hello" with "goodbye"
{"from": "/^[hH]ello$/", "to": "goodbye"}

// Add brackets around "cheese" while preserving the original word
{"from": "/(cheese)/", "to": "[$1]"}

Word replacement rules:

Plain word replacements are processed first
If no match is found, regex replacements are tried in the order listed
Once a word matches a replacement, no further replacements are applied to it
Regex replacements are global (all matches are replaced)
Malformed regex patterns will cause the transcription to fail with an error
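The ordering rules above can be approximated in a few lines of Python. This is a sketch of the documented behaviour, not the service's implementation: it assumes plain replacements match a whole word exactly, uses Python's re module in place of an ECMAScript engine, and translates $1-style references into Python's \g<1> syntax.

```python
import re


def is_regex(pattern):
    """A 'from' value is treated as a regex when wrapped in forward slashes."""
    return len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/")


def apply_replacements(word, replacements):
    """Apply the first matching rule to one word: plain rules, then regexes."""
    # Plain, case-sensitive replacements are tried first.
    for rule in replacements:
        if not is_regex(rule["from"]) and word == rule["from"]:
            return rule["to"]
    # Regex replacements are tried in the order listed; the first match wins.
    for rule in replacements:
        if is_regex(rule["from"]):
            # A malformed pattern raises re.error here.
            pattern = re.compile(rule["from"][1:-1])
            if pattern.search(word):
                # Convert "$1" group references to Python's "\g<1>" form.
                template = re.sub(r"\$(\d+)", r"\\g<\1>", rule["to"])
                # sub() replaces every match in the word (global behaviour).
                return pattern.sub(template, word)
    return word


rules = [
    {"from": "foo", "to": "bar"},
    {"from": "/^[hH]ello$/", "to": "goodbye"},
    {"from": "/(cheese)/", "to": "[$1]"},
]

for spoken in ["foo", "Foo", "Hello", "cheese"]:
    print(spoken, "->", apply_replacements(spoken, rules))
```

Python and ECMAScript regex syntax differ in places (named groups and some escapes, for example), so a pattern that works in this sketch should still be tested against the service itself.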

Smart formatting

Smart formatting converts spoken numbers, dates, currencies, and other entities into properly formatted text. This makes transcripts more readable without losing timing information.

An entity is a spoken value that has a conventional written form, such as a number, date, currency, time, or measurement. Speechmatics detects each entity and converts it from the words as spoken into its written form. For example, the spoken words "nineteen ninety nine" become "1999" in the output.

Smart formatting is applied by default. Set enable_entities to true to also expose the structure of each entity in the JSON output: the class of entity, and the individual spoken and written words it is made from.

Configuration

To include detailed entity information in your JSON output, add enable_entities to your configuration:

{
  "type": "transcription",
  "transcription_config": {
    "model": "enhanced",
    "language": "en",
    "enable_entities": true
  }
}

By default, enable_entities is false. When enabled, entity metadata appears only in JSON output (SRT and TXT formats remain unchanged).

Output

The JSON output will include:

A new type field with value entity for formatted numeric entities
Full written form in the content section, including any spaces or symbols
An entity_class field describing how the entity was formatted
Start and end times spanning all words in the entity
Two additional representations:
- spoken_form: Original words as spoken, with individual timing and confidence
- written_form: Formatted words separated individually

Here's an example of a transcript with enable_entities set to true:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.99,
          "content": "17th of January 2022",
          "language": "en",
          "speaker": "UU"
        }
      ],
      "end_time": 3.14,
      "entity_class": "date",
      "spoken_form": [
        {
          "alternatives": [
            {
              "confidence": 1.0,
              "content": "seventeenth",
              "language": "en",
              "speaker": "UU"
            }
          ],
          "end_time": 1.41,
          "start_time": 0.72,
          "type": "word"
        },
        // Additional spoken words omitted for brevity
      ],
      "start_time": 0.72,
      "type": "entity",
      "written_form": [
        {
          "alternatives": [
            {
              "confidence": 0.99,
              "content": "17th",
              "language": "en",
              "speaker": "UU"
            }
          ],
          "end_time": 1.33,
          "start_time": 0.72,
          "type": "word"
        },
        // Additional written words omitted for brevity
      ]
    }
  ]
}

When enable_entities is false, the words appear individually in the output.
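To illustrate how the entity fields fit together, the sketch below pulls the class, written form, spoken words, and time span out of each entity result. It assumes the JSON shape shown above; the function name is hypothetical, and the sample is trimmed to the single spoken and written word shown on this page.

```python
def summarize_entities(results):
    """Return one summary dict per result whose type is "entity"."""
    summaries = []
    for item in results:
        if item.get("type") != "entity":
            continue
        # spoken_form holds the original words, each with its own alternatives.
        spoken_words = [
            spoken["alternatives"][0]["content"]
            for spoken in item.get("spoken_form", [])
        ]
        summaries.append(
            {
                "entity_class": item.get("entity_class"),
                "written": item["alternatives"][0]["content"],
                "spoken": " ".join(spoken_words),
                "start_time": item["start_time"],
                "end_time": item["end_time"],
            }
        )
    return summaries


# Trimmed sample: remaining spoken and written words are omitted, as above.
sample_results = [
    {
        "alternatives": [
            {"confidence": 0.99, "content": "17th of January 2022", "language": "en"}
        ],
        "end_time": 3.14,
        "entity_class": "date",
        "spoken_form": [
            {
                "alternatives": [
                    {"confidence": 1.0, "content": "seventeenth", "language": "en"}
                ],
                "end_time": 1.41,
                "start_time": 0.72,
                "type": "word",
            }
        ],
        "start_time": 0.72,
        "type": "entity",
        "written_form": [
            {
                "alternatives": [
                    {"confidence": 0.99, "content": "17th", "language": "en"}
                ],
                "end_time": 1.33,
                "start_time": 0.72,
                "type": "word",
            }
        ],
    }
]

for summary in summarize_entities(sample_results):
    print(summary)
```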

Entity classes

The system applies different formatting rules based on the type of entity detected. The following classes are available:

Entity class	Description	Spoken example	Written example
alphanum	Alphanumeric sequences (3+ characters)	"a z triple seven five four"	AZ77754
cardinal	Whole numbers (in English, numbers ≤10 remain as words)	"nineteen"	19
decimal	Numbers with decimal point	"eighteen point one two"	18.12
fraction	Fractions (complex ones use n/d format)	"three sixteenths"	3/16
ordinal	Position numbers with suffix	"forty second"	42nd
money	Currency values with symbol	"twenty dollars"	$20
percentage	Percentages with % symbol	"two hundred percent"	200%
date	Calendar dates and years	"fifteenth of January twenty twenty two"	15th of January 2022
time	Clock times with separators	"eleven forty a m"	11:40 a.m.
span	Ranges (x to y format)	"one hundred to two hundred million pounds"	100 to £200 million
credit card	Payment card number sequences	"one one one one..."	1111 2222 3333 4444
telephone	Phone number formatting	"five five five..."	(555) 429-2228
electronic	Email and web addresses	"bob at speechmatics dot com"	bob@speechmatics.com
measurement	Units with abbreviations	"ten kilometers per second"	10 km/s

The system chooses entity classes based on context, so occasionally a value might be classified differently than expected. For example, "2001" could be a "cardinal" number or a "date".

Languages for smart formatting

Each language follows its own conventions for:

Thousand separators
Decimal separators
Currency symbol position

Examples:

English: Uses commas for thousands (20,000), decimal points (10.5), and places currency symbols before values ($10)
German: Uses periods for thousands (20.000), commas for decimals (10,5), and places currency symbols after values with a non-breaking space (10 $)
French: Uses non-breaking spaces for thousands (20 000), commas for decimals (10,5), and places currency symbols after values with a non-breaking space (10 $)
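The conventions in these examples can be reproduced in a small client-side helper, which may be handy when writing tests that compare transcripts against expected text. The CONVENTIONS table and format_amount function are illustrative assumptions built from the three examples above; they do not describe how Speechmatics formats output internally.

```python
NBSP = "\u00a0"  # non-breaking space

# Separator and symbol conventions taken from the examples above.
CONVENTIONS = {
    "en": {"thousands": ",", "decimal": ".", "symbol_first": True},
    "de": {"thousands": ".", "decimal": ",", "symbol_first": False},
    "fr": {"thousands": NBSP, "decimal": ",", "symbol_first": False},
}


def format_amount(value, language, symbol=""):
    """Format a number, and optionally a currency symbol, for a language."""
    rules = CONVENTIONS[language]
    # Start from English-style grouping, e.g. 20000 -> "20,000".
    text = f"{value:,}"
    # Swap both separators in one pass so they cannot overwrite each other.
    text = text.translate(
        str.maketrans({",": rules["thousands"], ".": rules["decimal"]})
    )
    if not symbol:
        return text
    if rules["symbol_first"]:
        return symbol + text
    return text + NBSP + symbol


print(format_amount(20000, "en"))    # prints "20,000"
print(format_amount(20000, "de"))    # prints "20.000"
print(format_amount(10.5, "fr"))     # prints "10,5"
print(format_amount(10, "en", "$"))  # prints "$10"
```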

Smart formatting has been specifically tuned for consistent results in these languages:

Cantonese
Dutch
English
French
German
Hindi
Italian
Japanese
Mandarin (Simplified and Traditional)
Mandarin & English (bilingual)
Mandarin Malay Tamil & English (multilingual)
Norwegian
Portuguese
Russian
Spanish
Swedish
Tamil & English (bilingual)

Other languages still format numbers and entities on a best-effort basis through the model, with variable results. If you rely on formatting for a language that isn't listed, test it with representative audio rather than assuming full coverage.

Formatting coverage isn't reported by feature discovery, which covers transcription, translation, and language identification. This page is the reference for formatting language support.

Punctuation

All Speechmatics language packs support punctuation to improve transcript readability. Each language supports specific punctuation marks:

Language	Supported marks	End-of-sentence marks	Notes
Cantonese, Mandarin	，。？！、	。？！	Full-width punctuation
Japanese	。、	。	Full-width punctuation
Hindi	। ? , !	। ? !
All other languages	. , ! ?	. ! ?

Configuration

You can control which punctuation marks appear in your transcripts using the punctuation_overrides setting:

"transcription_config": {
   "model": "enhanced",
   "language": "en",
   "punctuation_overrides": {
      "permitted_marks": [".", ","],
      "sensitivity": 0.4
   }
}

This configuration:

Allows only periods and commas (no question or exclamation marks)
Sets punctuation sensitivity to 0.4 (lower than the default 0.5)

To permit every supported mark, set "permitted_marks": ["all"]. To remove punctuation from the output entirely, set permitted_marks to an empty list.

The sensitivity parameter accepts values from 0 to 1. Higher values produce more punctuation in the output.

Disabling punctuation may slightly reduce speaker diarization accuracy. See speaker diarization for details.
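A client might build the punctuation_overrides block with a small helper that enforces the documented 0 to 1 range for sensitivity. The helper name is a hypothetical choice; only the JSON keys and the value range come from this page.

```python
import json


def build_punctuation_overrides(permitted_marks=None, sensitivity=None):
    """Return a punctuation_overrides dict, validating sensitivity."""
    overrides = {}
    if permitted_marks is not None:
        # An empty list is kept as-is: it means "no punctuation marks".
        overrides["permitted_marks"] = list(permitted_marks)
    if sensitivity is not None:
        if not 0 <= sensitivity <= 1:
            raise ValueError("sensitivity must be between 0 and 1")
        overrides["sensitivity"] = sensitivity
    return overrides


transcription_config = {
    "model": "enhanced",
    "language": "en",
    "punctuation_overrides": build_punctuation_overrides([".", ","], 0.4),
}

print(json.dumps(transcription_config, indent=2))
```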

Next steps

Custom Dictionary: Improve recognition of specific words and phrases by adding them to a custom dictionary.
Diarization: Enhance your transcripts with speaker and channel information.
