Audio Events

Transcription: Batch | Deployments: SaaS

info

Audio Events is currently supported only for batch jobs through the Speechmatics SaaS API; real-time support and support for other deployments are coming soon.

The Audio Events feature, available through our Automatic Speech Recognition (ASR) API, detects and labels non-speech sounds in audio and video content, such as music, laughter, applause and silence.

Enable Audio Events in your application for file processing scenarios using the Speechmatics SaaS solution.

If you're new to Speechmatics, start by exploring our guides on Processing a File. To activate Audio Events, include the following configuration:

{
  "type": "transcription",
  "transcription_config": {
    "operating_point": "enhanced",
    "language": "en"
  },
  "audio_events_config": {
    "types": []  // An empty list requests all supported types
  }
}
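For reference, here is a minimal sketch of attaching that configuration when submitting a job to the batch jobs endpoint directly with Python's requests library. The multipart field names data_file and config follow the /v2/jobs endpoint; YOUR_API_KEY and example.wav are placeholders.

import json
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: your Speechmatics API key
PATH_TO_FILE = "example.wav"  # placeholder audio file

config = {
    "type": "transcription",
    "transcription_config": {
        "operating_point": "enhanced",
        "language": "en",
    },
    # An empty "types" list requests all supported audio event types
    "audio_events_config": {"types": []},
}

with open(PATH_TO_FILE, "rb") as audio:
    response = requests.post(
        "https://asr.api.speechmatics.com/v2/jobs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"data_file": audio},
        data={"config": json.dumps(config)},
    )

response.raise_for_status()
print(response.json()["id"])  # job id to poll or receive a notification for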

Quick Start

A Python client example that detects audio events in a file using batch processing.

from speechmatics.models import ConnectionSettings, BatchTranscriptionConfig, AudioEventsConfig
from speechmatics.batch_client import BatchClient

API_KEY = "YOUR_API_KEY"
PATH_TO_FILE = "example.wav"

settings = ConnectionSettings(
    url="https://asr.api.speechmatics.com/v2",
    auth_token=API_KEY,
)

with BatchClient(settings) as client:
    job_id = client.submit_job(
        audio=PATH_TO_FILE,
        transcription_config=BatchTranscriptionConfig(audio_events=AudioEventsConfig()),
    )
    print(f"Job {job_id} submitted successfully, waiting for analysis")

    # In production, consider using notifications instead of polling
    analysis = client.wait_for_completion(job_id, transcription_format="json-v2")
    print("Detected audio events:")
    for event in analysis["audio_events"]:
        # "channel" is only present when Channel Diarization is enabled
        print(f"{event['type']} from {event['start_time']} to {event['end_time']}, "
              f"confidence: {event['confidence']}, channel: {event.get('channel', 'n/a')}")

Audio Events Response

The JSON output for batch processing includes the following information about each detected audio event:

  • type: A string indicating the type of audio event detected: applause, laughter or music
  • start_time: A number indicating the start time of the event in the media file, in seconds
  • end_time: A number indicating the end time of the event in the media file, in seconds
  • confidence: A number indicating the model's confidence in the detected event
  • channel: The channel in which the event was detected; only returned if Channel Diarization is enabled

The JSON output also contains an audio_event_summary, which summarises all detected audio events, giving the number of occurrences and the total duration of each category of audio event. The summary also includes silence and speech events; total speech duration is calculated by adding together the durations of all the words spoken.
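As a rough illustration of that speech calculation, the sketch below assumes a parsed json-v2 response held in a dict named analysis, where each word item in results carries its own start and end times.

# Rough sketch: total speech duration as the sum of word durations.
# Assumes `analysis` is the parsed json-v2 response shown below.
speech_total = sum(
    item["end_time"] - item["start_time"]
    for item in analysis["results"]
    if item.get("type") == "word"
)
print(f"Approximate total speech duration: {speech_total:.2f}s")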

{
  "format": "2.9",
  "job": { ... },
  "metadata": {
    "created_at": "2022-09-26T15:01:48.412714Z",
    "type": "transcription",
    "transcription_config": { ... },
    "audio_events_config": { "types": [] },
    ...
  },
  "results": [ ... ],
  "audio_events": [
    {
      "channel": "channel_1",
      "confidence": 0.75,
      "end_time": 21.76,
      "start_time": 19.2,
      "type": "laughter"
    },
    {
      "channel": "channel_1",
      "confidence": 0.76,
      "end_time": 21.76,
      "start_time": 19.2,
      "type": "applause"
    },
    ...
  ],
  "audio_event_summary": {
    "applause": {
      "count": 6,
      "total_duration": 10.24
    },
    "laughter": {
      "count": 6,
      "total_duration": 19.84
    },
    "music": {
      "count": 8,
      "total_duration": 18.96
    },
    "silence": {
      "count": 5,
      "total_duration": 8.34
    },
    "speech": {
      "count": 135,
      "total_duration": 32.38
    },
    "channels": {
      "channel_1": {
        "applause": {
          "count": 3,
          "total_duration": 5.12
        },
        ...
      }
    }
  }
}
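To surface this summary in an application, one approach, sketched below under the assumption that the parsed response above is held in a dict named analysis, is to walk the audio_event_summary:

# Sketch: print count and total duration per event category.
# `analysis` is assumed to be the parsed json-v2 response shown above.
for event_type, stats in analysis["audio_event_summary"].items():
    if event_type == "channels":  # per-channel breakdown, structured separately
        continue
    print(f"{event_type}: {stats['count']} events, {stats['total_duration']}s in total")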

Supported Audio Events

Event Type   Availability
music        Batch
laughter     Batch
applause     Batch
silence      Batch (Audio Event Summary only)
speech       Batch (Audio Event Summary only)

Configuring Specific Types of Audio Events in a Request

The types applause, laughter and music can be requested individually as part of the audio_events_config payload in a transcription request. An example of a request for only applause and music:

{
  "type": "transcription",
  "transcription_config": {
    "operating_point": "enhanced",
    "language": "en"
  },
  "audio_events_config": {
    "types": ["applause", "music"]
  }
}
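With the Python client, the same restriction can be expressed through AudioEventsConfig; this sketch assumes its types parameter mirrors the JSON payload above.

from speechmatics.models import BatchTranscriptionConfig, AudioEventsConfig

# Assumption: AudioEventsConfig accepts a `types` list mirroring the JSON payload
config = BatchTranscriptionConfig(
    audio_events=AudioEventsConfig(types=["applause", "music"]),
)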

Considerations

  • The speech time summary is generated by adding together the durations of all words spoken
  • Gaps of less than 1s between two consecutive occurrences of the same type of event cause them to be merged into a single event, e.g. two music sequences separated by a break of 500ms will be returned as one event
  • Silence is only detected in gaps where there are no other events (speech, music, laughter, applause) for at least 1 second
  • Multiple overlapping audio events of different types can be detected simultaneously, including speech; for example, music, applause and speech can all be detected at the same time, as the sketch below illustrates
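A small sketch of what that overlap looks like in the output, again assuming the parsed response is held in a dict named analysis: any two returned events whose time ranges intersect were detected simultaneously.

# Sketch: list pairs of detected events that overlap in time,
# e.g. laughter and applause both spanning 19.2s to 21.76s above.
events = analysis["audio_events"]
overlaps = [
    (a["type"], b["type"])
    for i, a in enumerate(events)
    for b in events[i + 1:]
    if a["start_time"] < b["end_time"] and b["start_time"] < a["end_time"]
]
print(overlaps)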

Limitations

  • Audio Events is supported only in the JSON API response
  • While the occurrence of music can be detected, richer metadata about the music, such as title, artist or genre, cannot be identified
  • Only one instance of an event type can be tracked at a time, e.g. seamlessly switching between consecutive songs will be detected as a single music event
  • Audio Events cannot be used with the speaker_change diarization config option, since the latter is being deprecated