Word Tagging

Transcription: Batch, Real-Time | Deployments: All

Speechmatics adds a metadata tag to words in the transcript to indicate whether a word is a profanity or a disfluency. You do not have to take any action to enable this: the tags are included in the transcription output as standard.

Profanity Tagging

You can use this tag to identify, redact, or obfuscate profanities, and to integrate this data into your own workflows.

Profanity tagging is available for the following languages:

  • English (EN)
  • Italian (IT)
  • Spanish (ES)

Note that the list of profanities for each language cannot be altered.

An example of the output is shown below.

"results": [
  {
    "alternatives": [
      {
        "confidence": 1.0,
        "content": "$PROFANITY",
        "language": "en",
        "speaker": "UU",
        "tags": [
          "profanity"
        ]
      }
    ],
    "end_time": 18.03,
    "start_time": 17.61,
    "type": "word"
  }
]
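
As an illustration of the redaction use case, the following is a minimal Python sketch, not an official client. It assumes the "results" array has already been parsed into a list of dictionaries shaped like the example above. The function name redact_profanities and the mask parameter are hypothetical names chosen for this sketch. For simplicity it only looks at items of type "word" and takes the first alternative of each, so punctuation is not reproduced.

```python
def redact_profanities(results, mask="****"):
    """Join word contents into text, replacing profanity-tagged words with mask.

    `results` is assumed to be the parsed "results" list from a transcript.
    Only items of type "word" are used; other item types are skipped.
    """
    words = []
    for item in results:
        if item.get("type") != "word":
            continue
        best = item["alternatives"][0]  # first alternative of the word
        if "profanity" in best.get("tags", []):
            words.append(mask)
        else:
            words.append(best["content"])
    return " ".join(words)


# Made-up sample data in the same shape as the example above.
sample = [
    {"alternatives": [{"content": "well", "tags": []}], "type": "word"},
    {"alternatives": [{"content": "$PROFANITY", "tags": ["profanity"]}], "type": "word"},
]
redacted = redact_profanities(sample)
print(redacted)
```

Checking the tag rather than matching the word text keeps the sketch independent of the profanity list itself.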

Disfluency Tagging

A disfluency here refers to a word from a fixed list of English words that imply hesitation or indecision. Although the term can cover a range of phenomena, such as stuttering and interjections, here it is used only to tag words such as 'hmm' or 'um'. The full list of words tagged as disfluencies is as follows:

huh
aha
ah
aw
eh
err
hmm
mm
um
uh
uh-oh
uh-huh
uh-uh
mhm
a-ha
aah
aahh
aaw
ah-ha
ahaa
ahh
ahha
aww
eeh
erm
hhm
hhmm
hm
huh-uh
m-hm
uggh
ugh
ughh
uhh
uhhm
uhm
uhmm
umm
uuh
uuhh
uum

You can use this tag in your own post-processing workflows, for example to hide disfluencies when displaying a transcript. An example of the output is shown below (English language only):

"results": [
  {
    "alternatives": [
      {
        "confidence": 1.0,
        "content": "hmm",
        "language": "en",
        "speaker": "UU",
        "tags": [
          "disfluency"
        ]
      }
    ],
    "end_time": 18.03,
    "start_time": 17.61,
    "type": "word"
  }
]
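
The post-processing step of hiding disfluencies could be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not an official client: it assumes the "results" array has been parsed into a list of dictionaries shaped like the example above, and the function name drop_disfluencies is a hypothetical name chosen here. Only the first alternative of each item is inspected.

```python
def drop_disfluencies(results):
    """Return the result items whose first alternative is not tagged "disfluency".

    `results` is assumed to be the parsed "results" list from a transcript.
    Items without a "tags" field are kept unchanged.
    """
    kept = []
    for item in results:
        alternatives = item.get("alternatives") or [{}]
        if "disfluency" in alternatives[0].get("tags", []):
            continue  # skip words tagged as disfluencies
        kept.append(item)
    return kept


# Made-up sample data in the same shape as the example above.
sample = [
    {"alternatives": [{"content": "hmm", "tags": ["disfluency"]}], "type": "word"},
    {"alternatives": [{"content": "hello"}], "type": "word"},
]
cleaned = drop_disfluencies(sample)
text = " ".join(item["alternatives"][0]["content"] for item in cleaned)
print(text)
```

Because the original items are returned rather than plain strings, timing fields such as start_time and end_time remain available to whatever displays the cleaned transcript.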