Translation
Translate your audio into multiple languages with a single API call, with over 30 languages supported.
Use cases
- Translate audio files for international distribution
- Power live subtitles and captions for global events
- Build voice assistants or AI agents that communicate in multiple languages
Configuration
Enable translation when processing a file or transcribing in real-time, in both SaaS and on-prem deployments.
New to Speechmatics? See our guides on transcribing a file or transcribing in real-time. Once set up, add the following configuration to enable translation:
{
"type": "transcription",
"transcription_config": {
"operating_point": "enhanced",
"language": "en"
},
"translation_config": {
"target_languages": ["es", "de"],
"enable_partials": true
}
}
You can configure up to five translation languages at a time.
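For a batch job, this configuration is submitted together with the audio file. The snippet below is a minimal sketch that uses the REST API directly (assuming the batch jobs endpoint at https://asr.api.speechmatics.com/v2/jobs and the requests library); the SDK-based examples further down show the same flow with the official Python client:

import json
import requests

API_KEY = "YOUR_API_KEY"
PATH_TO_FILE = "example.wav"

# The translation configuration shown above; enable_partials only applies
# to real-time sessions, so it is omitted for a batch job.
config = {
    "type": "transcription",
    "transcription_config": {
        "operating_point": "enhanced",
        "language": "en"
    },
    "translation_config": {
        "target_languages": ["es", "de"]
    }
}

# Submit the job as multipart form data: the JSON config plus the audio file.
with open(PATH_TO_FILE, "rb") as audio_file:
    response = requests.post(
        "https://asr.api.speechmatics.com/v2/jobs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        data={"config": json.dumps(config)},
        files={"data_file": audio_file},
    )

response.raise_for_status()
print("Job submitted:", response.json()["id"])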
Batch output
The returned JSON will include a new property called translations, which contains a list of translated text for each target language requested (using the same ISO language codes as for transcription).
In full, the JSON output contains:
- format: Speechmatics JSON transcript format version number, e.g. 2.1
- job (required): Summary information about an ASR job, to support identification and tracking. Includes the UTC date time the job was created (e.g. 2018-01-09T12:29:01.853047Z), the name of the data file submitted for the job, the data file audio duration in seconds (>= 0), the unique id assigned to the job (e.g. a1b2c3d4e5), the name of any text file submitted to be aligned to audio, and a tracking object with the title of the job, an external system reference and a customer-defined JSON structure.
- metadata (required): Summary information about the output from an ASR job, comprising the job type and configuration parameters used when generating the output. Includes the UTC date time the transcription output was created (e.g. 2018-01-09T12:29:01.853047Z) and the job type (possible values: alignment, transcription).
- transcription_config (within metadata): the transcription settings used, including:
  - The language model to process the audio input, normally specified as an ISO language code.
  - A specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
  - The language locale to be used when generating the transcription output, normally specified as an ISO language code.
  - operating_point: the operating point to use. Operating points change the transcription process in a high level way, such as altering the acoustic model. Possible values: standard, enhanced. The default is standard; with enhanced, transcription will take longer but be more accurate than standard.
  - additional_vocab: a list of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition.
  - punctuation_overrides: controls punctuation settings. Its sensitivity ranges between zero and one; higher values will produce more punctuation, and the default is 0.5. It also lists the punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default); unsupported marks are ignored, and this value is used to guide the transcription process (each value must match the regular expression ^(.|all)$).
  - diarization: specifies whether speaker or channel labels are added to the transcript. The default is none. Possible values: none (no speaker or channel labels are added), speaker (speaker attribution is performed based on acoustic matching; all input channels are mixed into a single stream for processing), channel (multiple input channels are processed individually and collated into a single transcript).
  - Transcript labels to use when collating separate input channels (each must match the regular expression ^[A-Za-z0-9._]+$).
  - A setting to include additional 'entity' objects in the transcription results (e.g. dates, numbers) and their original spoken form. These entities are interleaved with other types of results. The concatenation of these words is represented as a single entity with the concatenated written form present in the 'content' field. The entities contain a 'spoken_form' field, which can be used in place of the corresponding 'word' type results, in case a spoken form is preferred to a written form. They also contain a 'written_form', which can be used instead of the entity, if you want a breakdown of the words without spaces. They can still contain non-breaking spaces and other special whitespace characters, as they are considered part of the word for the formatting output. In case of a written_form, the individual word times are estimated and might not be accurate if the order of the words in the written form does not correspond to the order they were actually spoken (such as 'one hundred million dollars' and '$100 million').
  - Whether or not to enable flexible endpointing and allow the entity to continue to be spoken. Possible values: fixed, flexible.
  - transcript_filtering_config: configuration for applying filtering to the transcription. Includes a flag which, if true, removes words that are identified as disfluencies from the transcript; if false (the default), they are tagged in the transcript as 'disfluency'.
  - speaker_diarization_config: configuration for speaker diarization. Its sensitivity controls how sensitive the algorithm is in terms of keeping similar speakers separate, as opposed to combining them into a single speaker. Higher values will typically lead to more speakers, as the degree of difference between speakers in order to allow them to remain distinct will be lower. A lower value will conversely guide the algorithm towards being less sensitive in terms of retaining similar speakers, and as such may lead to fewer speakers overall. The default is 0.5 (range 0 to 1).
- The engine version used to generate the transcription output, e.g. 2024.12.26085+a0a32e61ad.HEAD.
- translation_errors: list of errors that occurred in the translation stage, each with a human readable error message. Possible values: translation_failed, unsupported_translation_pair.
- summarization_errors: list of errors that occurred in the summarization stage, each with a human readable error message. Possible values: summarization_failed, unsupported_language.
- sentiment_analysis_errors: list of errors that occurred in the sentiment analysis stage, each with a human readable error message. Possible values: sentiment_analysis_failed, unsupported_language.
- topic_detection_errors: list of errors that occurred in the topic detection stage, each with a human readable error message. Possible values: topic_detection_failed, unsupported_list_of_topics, unsupported_language.
- auto_chapters_errors: list of errors that occurred in the auto chapters stage, each with a human readable error message. Possible values: auto_chapters_failed, unsupported_language.
- alignment_config: the configuration used for alignment jobs.
- output_config: contains srt_overrides, parameters that override default values of SRT conversion: max_line_length sets the maximum count of characters per subtitle line including white space, and max_lines sets the maximum count of lines in a subtitle section.
- language_pack_info: properties of the language pack, including the full descriptive name of the language (e.g. 'Japanese'), the character to use to separate words, the direction that words in the language should be written and read in (left-to-right or right-to-left), whether or not ITN (inverse text normalization) is available for the language pack, and whether or not language model adaptation has been applied to the language pack.
- language_identification: result of the language identification of the audio, configured using language_identification_config or by setting the transcription language to auto. Contains a list of results, each with alternatives. Possible error values: LOW_CONFIDENCE, UNEXPECTED_LANGUAGE, NO_SPEECH, FILE_UNREADABLE, OTHER.
- results (RecognitionResult[], required): the recognized items in the transcript. Each result can include:
  - An indication of the volume of audio across the time period the word was spoken (range 0 to 100).
  - Whether the punctuation mark is an end of sentence character (only applies to punctuation marks).
  - The kind of object this is. Possible values: word, punctuation, entity. New types of items may appear without being requested; unrecognized item types can be ignored.
  - written_form: a list of alternatives (required), each with a display direction (possible values: ltr, rtl) and a type of word. See #/Definitions/RecognitionResult for definitions of the enums.
  - spoken_form: a list of alternatives (required), each with a display direction (possible values: ltr, rtl) and a type of word or punctuation. See #/Definitions/RecognitionResult for definitions of the enums.
  - alternatives: each with a display direction (possible values: ltr, rtl).
  - The attachment direction of the punctuation mark (only applies to punctuation marks). This information can be used to produce a well-formed text representation by placing the word_delimiter from language_pack_info on the correct side of the punctuation mark. Possible values: previous, next, both, none.
- translations: translations of the transcript into other languages. It is a map of ISO language codes to arrays of translated sentences. Configured using translation_config.
- summary: summary of the transcript, configured using summarization_config.
- sentiment_analysis: the main object that holds the detailed sentiment analysis information:
  - segments: an array of objects that represent a segment of text and its associated sentiment. Each segment contains the transcript of the analysed segment, the assigned sentiment (positive, neutral or negative), the timestamps corresponding to the beginning and end of the transcription segment, the speaker label for the segment (if speaker diarization is enabled), the channel label for the segment (if channel diarization is enabled), and a confidence score in the range of 0-1.
  - summary: an object that holds overall sentiment information, and per-speaker and per-channel sentiment data: overall (summary for all segments in the file), speakers (an array of objects that represent sentiment data for a specific speaker) and channels (an array of objects that represent sentiment data for a specific channel).
- topics: the main object that holds topic detection results: segments (an array of objects that represent a segment of text and its associated topic information, each with its own list of topics) and summary (an object that holds overall information on the topics detected, with an overall summary of the topic detection results).
- chapters: an array of objects that represent summarized chapters of the transcript. Each chapter has an auto-generated title, an auto-generated paragraph-style short summary of the chapter, and the start and end times of the chapter in the audio file.
- audio_events: timestamped audio events, only set if audio_events_config is in the config. Each event has a kind (e.g. music), the times (in seconds) at which the event starts and ends, the prediction confidence associated with the event, and the input channel the event occurred on.
- audio_event_summary: summary statistics per event type, keyed by type (e.g. music). overall holds the summary on all channels; channels holds the summary keyed by channel, and is only set if channel diarization is enabled. For each audio event type, the statistics include the total duration (in seconds) of all audio events of that type and the number of events of that type.
An example of the response is below:
{
"format": "2.9",
"job": {
"created_at": "2023-01-23T19:31:19.354Z",
"data_name": "example.wav",
"duration": 15,
"id": "ggqjaazkqf"
},
"metadata": {
"created_at": "2023-01-23T19:31:44.766Z",
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
},
"translation_config": {
"target_languages": [
"es"
]
}
},
"results": [
{
"start_time": 0.78,
"end_time": 1.32,
"type": "word",
"alternatives": [
{
"content": "Welcome",
"confidence": 1.0,
"language": "en",
"speaker": "S1"
}
]
},
...
],
"translations": {
"es": [
{
"start_time": 0.78,
"end_time": 2.58,
"content": "Bienvenidos a Speechmatics.",
"speaker": "S1"
},
{
"start_time": 3.0,
"end_time": 7.94,
"content": "Esperamos que tengas un gran día.",
"speaker": "S1"
},
...
]
}
}
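As a quick illustration of how this structure can be consumed, the sketch below groups the Spanish translation by speaker label. It assumes transcript already holds the parsed JSON shown above and that speaker diarization was enabled (otherwise the speaker field is absent and sentences fall under the placeholder label):

def translation_by_speaker(transcript, language="es"):
    # Group translated sentences by the speaker label attached to each sentence.
    grouped = {}
    for sentence in transcript.get("translations", {}).get(language, []):
        speaker = sentence.get("speaker", "unknown")  # "unknown" is a placeholder label
        grouped.setdefault(speaker, []).append(sentence["content"])
    return {speaker: " ".join(parts) for speaker, parts in grouped.items()}

# With the example above:
# translation_by_speaker(transcript)
# -> {'S1': 'Bienvenidos a Speechmatics. Esperamos que tengas un gran día.'}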
Realtime output
Realtime provides a stream of translation messages per language requested. Translation messages will arrive after transcription messages, but won't delay transcription. Realtime translations have the following schema:
AddTranslation
- format: Speechmatics JSON output format version number, e.g. 2.1
- results (required): the translated content for the requested language.
Translations arrive as lower latency partial results and higher latency, more accurate finals.
Partials
Partial translations typically correspond to unfinished sentences and have lower latency than final translations. By default, only final translations are produced. Enable partials using the enable_partials property in translation_config for the session. For example:
{
"format": "2.9",
"message": "AddPartialTranslation",
"language": "es",
"results": [
{
"start_time": 5.45999987795949,
"end_time": 5.889999870583415,
"content": "Bienvenidos a",
"speaker": "S1"
}
]
}
Finals
Final translations are the most accurate and complete translations, usually at the end of a sentence. These translations are considered final and will not be updated afterwards. For example:
{
"format": "2.9",
"message": "AddTranslation",
"language": "es",
"results": [
{
"start_time": 5.45999987795949,
"end_time": 6.189999870583415,
"content": "Bienvenidos a Speechmatics.",
"speaker": "S1"
}
]
}
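A common pattern for live captions is to show the latest partial for a language and replace it once the corresponding final arrives. The handler below is a minimal sketch of that idea, assuming only the message shapes shown above; render_caption is a hypothetical display function:

# Minimal caption buffer: the latest partial is displayed until the final
# for that sentence arrives, at which point it is committed.
committed = {}  # language -> list of finalised sentences
pending = {}    # language -> text of the most recent partial

def handle_translation_message(msg):
    language = msg["language"]
    text = " ".join(segment["content"] for segment in msg["results"])
    if msg["message"] == "AddPartialTranslation":
        pending[language] = text  # partials keep replacing each other
    else:  # "AddTranslation": final, will not be updated afterwards
        committed.setdefault(language, []).append(text)
        pending[language] = ""
    # render_caption is a placeholder for however you display captions:
    # render_caption(language, " ".join(committed.get(language, [])) + " " + pending[language])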
Examples
Python client example to translate a file using batch processing:
from speechmatics.models import ConnectionSettings
from speechmatics.batch_client import BatchClient
from httpx import HTTPStatusError
API_KEY = "YOUR_API_KEY"
PATH_TO_FILE = "example.wav"
LANGUAGE = "en" # Transcription language
TRANSLATION_LANGUAGES = ["es","de"]
settings = ConnectionSettings(
    url="https://asr.api.speechmatics.com/v2",
    auth_token=API_KEY,
)

# Define transcription parameters
conf = {
    "type": "transcription",
    "transcription_config": {
        "language": LANGUAGE
    },
    "translation_config": {
        "target_languages": TRANSLATION_LANGUAGES
    }
}

# Open the client using a context manager
with BatchClient(settings) as client:
    try:
        job_id = client.submit_job(
            audio=PATH_TO_FILE,
            transcription_config=conf,
        )
        print(f'job {job_id} submitted successfully, waiting for transcript')

        # Note that in production, you should set up notifications instead of polling.
        # Notifications are described here: https://docs.speechmatics.com/features-other/notifications
        transcript = client.wait_for_completion(job_id, transcription_format='json-v2')

        for language in TRANSLATION_LANGUAGES:
            # Print the translation for each language from the JSON
            print(f"Translation for {language}")
            translation = ""
            for translated_segment in transcript["translations"][language]:
                translation += translated_segment["content"] + " "
            print(translation)
    except HTTPStatusError as e:
        if e.response.status_code == 401:
            print('Invalid API key - Check your API_KEY at the top of the code!')
        elif e.response.status_code == 400:
            print(e.response.json()['detail'])
        else:
            raise e
Python client example to translate a file in real-time. See our Real-Time Transcription guide for more examples:
import speechmatics
from httpx import HTTPStatusError
API_KEY = "YOUR_API_KEY"
PATH_TO_FILE = "example.wav"
LANGUAGE = "en" # Transcription language
TRANSLATION_LANGUAGES = ["es","de"]
CONNECTION_URL = f"wss://eu2.rt.speechmatics.com/v2/{LANGUAGE}"
# Create a transcription client
ws = speechmatics.client.WebsocketClient(
    speechmatics.models.ConnectionSettings(
        url=CONNECTION_URL,
        auth_token=API_KEY,
    )
)

# Define an event handler to print the translations
def print_translation(msg):
    msg_type = "Final"
    if msg['message'] == "AddPartialTranslation":
        msg_type = "Partial"
    language = msg['language']  # language for translation message
    translations = []
    for translation_segment in msg['results']:
        translations.append(translation_segment['content'])
    translation = " ".join(translations).strip()
    print(f"{msg_type} translation for {language}: {translation}")

# Register the event handler for partial translation
ws.add_event_handler(
    event_name=speechmatics.models.ServerMessageType.AddPartialTranslation,
    event_handler=print_translation,
)

# Register the event handler for full translation
ws.add_event_handler(
    event_name=speechmatics.models.ServerMessageType.AddTranslation,
    event_handler=print_translation,
)

settings = speechmatics.models.AudioSettings()

# Define transcription parameters with translation
# Full list of parameters described here: https://speechmatics.github.io/speechmatics-python/models
translation_config = speechmatics.models.RTTranslationConfig(
    target_languages=TRANSLATION_LANGUAGES,
    # enable_partials=True  # Optional argument to provide translation of partial sentences
)

transcription_config = speechmatics.models.TranscriptionConfig(
    language=LANGUAGE,
    translation_config=translation_config
)

print("Starting transcription (type Ctrl-C to stop):")
with open(PATH_TO_FILE, 'rb') as fd:
    try:
        ws.run_synchronously(fd, transcription_config, settings)
    except KeyboardInterrupt:
        print("\nTranscription stopped.")
    except HTTPStatusError as e:
        if e.response.status_code == 401:
            print('Invalid API key - Check your API_KEY at the top of the code!')
        else:
            raise e
Languages
The following languages can be translated to and from English in realtime and batch:
- Bulgarian (bg)
- Catalan (ca)
- Mandarin (cmn)
- Czech (cs)
- Danish (da)
- German (de)
- Greek (el)
- Spanish (es)
- Estonian (et)
- Finnish (fi)
- French (fr)
- Galician (gl)
- Hindi (hi)
- Croatian (hr)
- Hungarian (hu)
- Indonesian (id)
- Italian (it)
- Japanese (ja)
- Korean (ko)
- Lithuanian (lt)
- Latvian (lv)
- Malay (ms)
- Dutch (nl)
- Norwegian (no)
- Polish (pl)
- Portuguese (pt)
- Romanian (ro)
- Russian (ru)
- Slovakian (sk)
- Slovenian (sl)
- Swedish (sv)
- Turkish (tr)
- Ukrainian (uk)
- Vietnamese (vi)
In batch, you can also translate from Norwegian Bokmål to Nynorsk.
Best practices
Follow these guidelines to achieve optimal translation results:
- Use enhanced operating point — Higher transcription accuracy directly leads to better translations
- Keep punctuation enabled — Maintain all punctuation settings at default levels for optimal translation quality
- Consider processing times — Each additional target language increases processing time in batch jobs
- Plan for connection closing — Realtime sessions may have a 5-second delay when finalizing translations
Be aware of these limitations:
- Language limit — Maximum 5 target languages per transcription
- Format restrictions — Only JSON format includes translations (text and SRT formats contain source language only); see the sketch after this list for requesting the JSON format
- Reduced metadata — Certain features (timestamps, confidence scores, word tagging, and regional spelling) are only available in the original language
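Because only the JSON output carries translations, request the json-v2 transcript format when fetching batch results. The snippet below is a minimal sketch (assuming the /v2/jobs/{job_id}/transcript endpoint and the requests library; JOB_ID is a placeholder for a completed job):

import requests

API_KEY = "YOUR_API_KEY"
JOB_ID = "a1b2c3d4e5"  # placeholder: id of a completed job

# Request the JSON transcript, the only format that includes translations.
response = requests.get(
    f"https://asr.api.speechmatics.com/v2/jobs/{JOB_ID}/transcript",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"format": "json-v2"},
)
response.raise_for_status()

transcript = response.json()
print(list(transcript.get("translations", {}).keys()))  # e.g. ['es', 'de']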
Next Steps
- Try the portal to see how translation works with your own audio.
- Use diarization to enhance your translations with speaker information.