
Audio Events

Transcription: Batch, Real-Time | Deployments: On-Prem Containers, SaaS

The Audio Events feature, available through our Automatic Speech Recognition (ASR) API, detects and labels non-speech sounds, such as music, laughter, and applause, within audio and video content.

Audio Events can be enabled for file processing scenarios using either the Speechmatics SaaS or an On-Prem deployment.

If you're new to Speechmatics, start by exploring our guides on Processing a File or Analyzing in Real-Time. To activate Audio Events, include the following configuration:

{ 
  "type": "transcription", 
  "transcription_config": { 
    "operating_point": "enhanced", 
    "language": "en" 
  }, 
  "audio_events_config": {} 
} 

Quick Start

A Python client example that detects Audio Events in a file using batch processing:

from speechmatics.models import ConnectionSettings, BatchTranscriptionConfig, AudioEventsConfig
from speechmatics.batch_client import BatchClient

API_KEY = "YOUR_API_KEY"
PATH_TO_FILE = "example.wav"

settings = ConnectionSettings(
    url="https://asr.api.speechmatics.com/v2",
    auth_token=API_KEY,
)

with BatchClient(settings) as client:
    job_id = client.submit_job(
        audio=PATH_TO_FILE,
        transcription_config=BatchTranscriptionConfig(audio_events_config=AudioEventsConfig()),
    )
    print(f"Job {job_id} submitted successfully, waiting for analysis")

    # In production, consider using notifications instead of polling
    analysis = client.wait_for_completion(job_id, transcription_format="json-v2")
    print(f"Detected audio event summary:\n{analysis['audio_event_summary']}")
    print("Detected audio events:")
    for event in analysis["audio_events"]:
        print(f"{event['type']} from {event['start_time']} to {event['end_time']}, confidence: {event['confidence']}")


Audio Events Response

The JSON output for batch processing includes the following information about each detected Audio Event:

  • type: A string indicating the type of Audio Event detected: applause, laughter or music
  • start_time: A number indicating the start time of the event in the media file, in seconds
  • end_time: A number indicating the end time of the event in the media file, in seconds
  • confidence: A number indicating the model's confidence in the detected event
  • channel: Only returned if Channel Diarization is enabled; indicates the channel in which the event was detected

The JSON output also contains an audio_event_summary, which summarises all detected Audio Events, giving the number of occurrences and the total duration for each category of Audio Event. The summary also covers silence and speech events; the total speech duration is calculated by adding together the durations of all the words spoken.
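
The per-category arithmetic behind such a summary can be sketched in a few lines. This is an illustration of the counting and duration-summing described above, not the Speechmatics implementation; the input dictionaries follow the shape of the audio_events entries shown below.

```python
# Illustrative sketch: derive a per-type summary, mirroring the shape
# of "audio_event_summary", from a list of detected events.
def summarise_events(events):
    summary = {}
    for event in events:
        entry = summary.setdefault(event["type"], {"count": 0, "total_duration": 0.0})
        entry["count"] += 1
        entry["total_duration"] += round(event["end_time"] - event["start_time"], 2)
    return summary

events = [
    {"type": "laughter", "start_time": 19.2, "end_time": 21.76},
    {"type": "applause", "start_time": 19.2, "end_time": 21.76},
]
print(summarise_events(events))
# {'laughter': {'count': 1, 'total_duration': 2.56}, 'applause': {'count': 1, 'total_duration': 2.56}}
```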

{ 
  "format": "2.9", 
  "job": { ... }, 
  "metadata": { 
    "created_at": "2022-09-26T15:01:48.412714Z", 
    "type": "transcription", 
    "transcription_config": {...}, 
    "audio_events_config": {}, 
    ... 
  }, 

  "results": [...], 

  "audio_events": [ 
    {
      "channel": "channel_1",
      "confidence": 0.75,
      "end_time": 21.76,
      "start_time": 19.2,
      "type": "laughter"
    },
    {
      "channel": "channel_1",
      "confidence": 0.76,
      "end_time": 21.76,
      "start_time": 19.2,
      "type": "applause"
    }, 
    ... 
  ], 

  "audio_event_summary": { 
      "applause": {
        "count": 6,
        "total_duration": 10.24
      },
      "laughter": {
        "count": 6,
        "total_duration": 19.84
      },
      "music": {
        "count": 8,
        "total_duration": 18.96
      },
      "silence": {
        "count": 5,
        "total_duration": 8.34
      },
      "speech": {
        "count": 135,
        "total_duration": 32.38
      }
    }, 

  "channels": { 
    "channel_1": {
      "applause": {
        "count": 3,
        "total_duration": 5.12
      },
      ...
    }
  }
}
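
Once parsed, the response is plain JSON, so the summary can be read directly. The snippet below reproduces only two of the summary entries from the example above:

```python
# Reading totals out of a parsed response (trimmed to two entries).
response = {
    "audio_event_summary": {
        "music": {"count": 8, "total_duration": 18.96},
        "speech": {"count": 135, "total_duration": 32.38},
    }
}

for event_type, stats in response["audio_event_summary"].items():
    print(f"{event_type}: {stats['count']} occurrence(s), {stats['total_duration']:.2f}s in total")
# music: 8 occurrence(s), 18.96s in total
# speech: 135 occurrence(s), 32.38s in total
```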

Supported Audio Events

Event Type | Availability
laughter   | Batch & Real-Time
applause   | Batch & Real-Time
music      | Batch & Real-Time. Note: Real-Time Audio Events can be overly sensitive to music. We are investigating the root cause and aim to fix this soon.
silence    | Batch Audio Event Summary
speech     | Batch Audio Event Summary

Configuring Specific Types of Audio Events in a Request

The types applause, laughter and music can be requested individually in a transcription request as part of the audio_events_config payload.

An example of a request for only applause and music:

{ 
  "type": "transcription", 
  "transcription_config": { 
    "operating_point": "enhanced", 
    "language": "en" 
  }, 

  "audio_events_config": {
    "types": ["applause", "music"]
  } 

} 
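
The same payload can be assembled and sanity-checked in Python before submission. This is an illustrative helper, not part of the Speechmatics client; the supported-type set reflects the requestable types listed above.

```python
import json

# Requestable event types per the table above (silence and speech appear
# only in the batch summary and cannot be requested).
SUPPORTED_TYPES = {"applause", "laughter", "music"}

def audio_events_request(types):
    """Build a transcription request restricted to the given event types."""
    unknown = set(types) - SUPPORTED_TYPES
    if unknown:
        raise ValueError(f"Unsupported audio event types: {sorted(unknown)}")
    return {
        "type": "transcription",
        "transcription_config": {"operating_point": "enhanced", "language": "en"},
        "audio_events_config": {"types": list(types)},
    }

print(json.dumps(audio_events_request(["applause", "music"]), indent=2))
```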

Considerations

  • Speech time in the summary is calculated by adding together the durations of all words spoken
  • Gaps of less than 1s between two consecutive occurrences of the same type of event lead to the events being merged into a single event, e.g. two music sequences separated by a break of 500ms will be returned as a single event
  • Silence is only detected in gaps of at least 1 second where there are no other events (speech, music, laughter, applause)
  • Multiple overlapping Audio Events of different types can be detected simultaneously, including speech; for example, music, applause and speech can all be detected at the same time
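
The sub-1s merge rule can be illustrated with a short sketch. This mirrors the documented behaviour for explanation only; it is not the service's implementation:

```python
MERGE_GAP_S = 1.0  # documented threshold: gaps under 1s merge same-type events

def merge_events(events):
    """Merge same-type events separated by less than MERGE_GAP_S seconds."""
    merged = []
    for event in sorted(events, key=lambda e: (e["type"], e["start_time"])):
        prev = merged[-1] if merged else None
        if (prev and prev["type"] == event["type"]
                and event["start_time"] - prev["end_time"] < MERGE_GAP_S):
            prev["end_time"] = max(prev["end_time"], event["end_time"])
        else:
            merged.append(dict(event))
    return merged

# Two music segments 0.5s apart collapse into one event spanning both.
events = [
    {"type": "music", "start_time": 0.0, "end_time": 4.0},
    {"type": "music", "start_time": 4.5, "end_time": 9.0},
]
print(merge_events(events))
# [{'type': 'music', 'start_time': 0.0, 'end_time': 9.0}]
```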

Limitations

  • Audio Events is supported only in the JSON type API response
  • While the occurrence of music can be detected, richer metadata about the music, such as title, artist or genre, cannot be identified
  • Only one instance of an event type can be tracked at a point in time, e.g. seamlessly switching between consecutive songs will be detected as one single music event
  • For On-Prem Containers, Audio Events is available only for GPU Operating Points