Sound Event Detection ML task built on TensorFlow 2. A general audio classification repo built for learning, testing, and experimenting on niche tasks: testing on new data, exploratory data analysis, parameter tuning, real-time inference on an input audio device, and sending triggers on the Art-Net bus. Use with my ADRT dataset creator.



Sound Event Detection Lab

SED-Lab is a universal Sound Event Detection (SED) package designed for learning and experimenting with audio classification tasks. This repository includes a training notebook and real-time inference scripts built on the TensorFlow 2 backend. It supports various audio features, multiple model architectures, and easy transfer of configuration settings via JSON files saved with the trained model.
The intended usage is to experiment with niche SED tasks: use the integrated exploratory data analysis to try different features, scalers, and neural network architectures, and use tests to find the best-working parameters for a specific task.
The real-time inference class uses an input microphone audio device and sends triggers on the specified Art-Net bus.
Use ADRT (Audio Dataset Recorder Tools) to record custom audio datasets.

Sound Event Detection and Audio Classification

Why Use?

  • Educational: The training process runs in a Jupyter notebook with detailed descriptions in every step. Learn about audio ML, experiment with different audio features, normalization techniques, augmentations, and model architectures.
  • Easy deploy: After training, a config file is generated that includes all settings from the training phase, simplifying the deployment of various models.
  • Real-time inference: The inference script runs on real-time audio recordings and can send trigger commands via the Artnet bus.
  • Edge support: TF-Lite models and an inference script for mobile and edge computing are available.
  • Models: Choose from renowned model architectures, simple models, or custom-built models quickly during the training phase. Models can be browsed in the documentation.

Applications and HW

Tested on Linux and Windows.
Runs on CPU or a CUDA GPU.
Real-time inference on an audio input device with overlapping samples.
Offline inference on audio files (sorting).
See example applications below.


Windows 10/11 setup:

  1. Download and install Python 3.10.11:
    Check "Add python.exe to PATH" during installation.
    Restart Windows.

  2. Clone (download) the project:
    Open terminal and clone it or just download from Github.
    git clone

  3. cd (go) to the project directory:
    cd SED-Lab

  4. Create Virtual environment:
    python -m venv venv

  5. Activate virtual env.

  6. Update pip:
    python -m pip install --upgrade pip

  7. Install dependencies:
    pip install -r requirements.txt

How to use?

This is an all-in-one repo for SED (Sound Event Detection) tasks. It is meant to be adapted and optimized for a specific SED task. Experiment to find the best configuration for your task.


Prepare your local workspace, run: python src/utils/
Place your dataset in the DATA/DATASET directory. Audio samples are in .wav format, and every subdirectory of the DATASET directory is a unique label.
Folders with "-" at the beginning are treated as the "neg" label, i.e. negative samples (e.g. -noises, -cutlery). Folders with "_" at the beginning are treated as the same label as the folder without it (e.g. "horn" and "_horn").
If needed, use Audio-Dataset-Recorder-Tools for new recordings.
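
The folder naming rules above can be sketched as a small mapping function. This is only an illustration of the described convention, not the repository's own loader; the function name folder_to_label and the example folder names are hypothetical.

```python
def folder_to_label(folder_name: str) -> str:
    """Map a DATASET subdirectory name to its training label (hypothetical helper)."""
    if folder_name.startswith("-"):
        # Any folder starting with "-" holds negative samples.
        return "neg"
    if folder_name.startswith("_"):
        # "_horn" shares its label with "horn".
        return folder_name.lstrip("_")
    return folder_name


# Example folder names, assumed for illustration only.
folders = ["horn", "_horn", "-noises", "-cutlery", "glass"]
labels = sorted({folder_to_label(name) for name in folders})
print(labels)
```

With this convention, several negative folders collapse into a single "neg" class, so the number of classes can be smaller than the number of subdirectories.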


Open the notebook training_notebook.ipynb located in the notebooks directory.
Set up the first cell, "Training Parameters", according to your task.
In the training notebook, press "run all" and wait until it finishes.
You get MODEL, ENCODER, SCALER, CONFIG, and various plots from training and evaluation.

  • MODEL: Trained tf/tflite model of custom or preset architecture.
  • ENCODER: One-hot encoder for the labels saved as joblib.
  • SCALER: Selected scaler fitted on the training dataset saved as joblib.
  • CONFIG: Training configuration saved as json to be used in the inference script.
  • PLOTS: Training plots and the evaluation confusion matrix, saved in the MODEL/PLOTS directory.


From the MODEL directory, copy the CONFIG file of the desired trained model, put it in the config folder, and rename it to config.json.
Edit or just run this inference command in a terminal: python -conf MODEL/config.json -prob 0.9 -o 0.2 -ac 0 -au 0 -aip ""

If you did not train the model with this repo, you need to set up inference via arguments.

Arguments and Settings

This contains documentation of possible settings in training and inference scripts.

Training Settings

Open training_notebook.ipynb in the notebooks directory and set up the training parameters in the top notebook cell.

Audio Samples:

  • AUDIO_CHUNK: Sets the uniform length of the audio samples, in seconds. All training samples must have the same length. It is currently set to 0.4 seconds.
  • SLICE_AUDIO: Determines whether to trim/pad audio samples to the AUDIO_CHUNK length. Not necessary when the dataset was recorded with a uniform audio length.
  • DATA_RANGE: Specifies the range of values fed to the neural network. It can be 1 (tensor values in 0-1) or 255 (tensor values in 0-255). Some models require 255.
  • NUM_CHANNELS: The number of audio channels. Typically 1 for mono audio.
  • SAMPLE_RATE: The sample rate in Hz. 44100 Hz is a reasonable choice for general sounds.
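
As a rough illustration of what SLICE_AUDIO implies, the sketch below trims or zero-pads a mono sample to the AUDIO_CHUNK length. It assumes padding with silence at the end; the repository may pad or trim differently, and the function name fit_to_chunk is hypothetical.

```python
def fit_to_chunk(samples, sample_rate=44100, audio_chunk=0.4):
    """Trim or zero-pad a mono sample sequence to exactly audio_chunk seconds."""
    target = round(sample_rate * audio_chunk)  # 17640 samples for the values above
    fitted = list(samples[:target])            # trim anything past the target length
    # Assumption: pad the tail with silence (zeros) when the clip is too short.
    fitted.extend([0.0] * (target - len(fitted)))
    return fitted


short_clip = [0.5] * 10_000   # shorter than 0.4 s at 44.1 kHz
long_clip = [0.5] * 20_000    # longer than 0.4 s at 44.1 kHz
print(len(fit_to_chunk(short_clip)), len(fit_to_chunk(long_clip)))
```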

Audio Feature Parameters:

  • MAIN_FEATURE: Specifies the main feature type for audio processing. Can be 'mfcc', 'mel', or 'stft'. Meaning Mel Frequency Cepstral Coefficients (MFCC), Mel Spectrograms, or Short-Time Fourier Transform (STFT) spectrograms.
  • N_MELS: Number of Mel bands, applicable only when MAIN_FEATURE is set to 'mel'.
  • NFFT: Size of the window (frame), in samples, on which the Fourier transform is computed.
  • HOP_LENGTH: Number of samples between successive frames.
  • N_MFCC: Number of MFCC coefficients, applicable only when MAIN_FEATURE is set to 'mfcc'.
  • FMAX: Maximum frequency to include in the features. It is typically set to the Nyquist frequency (half of SAMPLE_RATE), the highest frequency the sampled signal can represent.
  • N_FRAMES: Number of frames in the audio feature. It is derived from SAMPLE_RATE, AUDIO_CHUNK, and HOP_LENGTH.
  • SCALER: Data normalization method. It is fitted on the whole dataset and then saved. Can be 'standard', 'minmax', 'robust', 'maxabs', or 'None'. Minmax worked well for spectrograms, but try experimenting here.
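
To make the relationship between these parameters concrete, here is a minimal sketch of how N_FRAMES and FMAX could be derived. The exact formula used by the notebook is not shown in this document, so the integer division below is an assumption; it does reproduce the --n_frames default of 34 for a 0.4 s chunk at 44100 Hz with a hop of 512. STFT implementations that pad the signal at both ends typically return one extra frame.

```python
def estimate_n_frames(sample_rate: int, audio_chunk: float, hop_length: int) -> int:
    """Assumed formula: number of whole hops that fit in one audio chunk."""
    n_samples = round(sample_rate * audio_chunk)
    return n_samples // hop_length


def nyquist(sample_rate: int) -> float:
    """Highest frequency representable at the given sample rate."""
    return sample_rate / 2


n_frames = estimate_n_frames(sample_rate=44100, audio_chunk=0.4, hop_length=512)
print(n_frames, nyquist(44100))
```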

Training Settings:

  • EPOCHS: Number of training epochs. For datasets of around 10k items, features of approximately 100x100x1, and a learning rate of 1e-3, 50 epochs is a reasonable starting point.
  • BATCH_SIZE: Number of audio samples the network processes at once before it computes the gradient and updates its weights. 16 is a common choice. Larger batches may generalize worse; very small batches can make the gradients unstable.
  • LEARNING_RATE: Step size of the weight updates. Start with 1e-3; if the training fluctuates too much, reduce it to 1e-4 and increase the number of epochs.
  • AUGMENTATION: Whether to use augmentation during training. A modest width and height shift is applied.
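
For a sense of scale, the arithmetic below counts weight updates for the figures mentioned above (about 10k items, batch size 16, 50 epochs). It assumes every item is used for training, which ignores any validation or test split the notebook may apply.

```python
import math

dataset_size = 10_000   # approximate dataset size from the text
batch_size = 16
epochs = 50

# One weight update per batch; the last batch of an epoch may be smaller.
updates_per_epoch = math.ceil(dataset_size / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```

Doubling BATCH_SIZE halves the number of updates per epoch, which is one reason a change in batch size often calls for revisiting EPOCHS or the learning rate.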

Model Settings:

  • MODEL_FORMAT: Format of the output model. Can be 'h5', 'keras', 'tf', or 'tflite'. 'keras' is the modern, well-optimized format.
  • LITE_VERSION: Indicates whether to also produce a TensorFlow Lite version of the model.
  • MODEL_ARCH: The architecture of the model to be used. Currently set to "SmallerVGGNet".
  • MODEL_TYPE: The type of model. Currently set to "default".
  • NEW_MODEL_NAME: A custom name for this session's outputs. It keeps you organised by marking every output from this session with this label.

Inference Arguments

Inference arguments are passed when calling the script.
Example: python -conf MODEL/config.json -prob 0.9 -tw "music, children, saw" -o 0.1 -ac 0 -au 0 -aip ""

Running the inference script requires several parameters. Most of them are stored in the config.json file, which is generated during the training phase. This file includes settings for audio processing, features, and the model, which automates the setup and reduces the need for manual input. Below is a description of the inference arguments:

Configuration File:

  • -conf, --config_path: Path to the configuration file (config.json). This file includes most of the necessary parameters for inference, such as model paths, feature extraction settings, and more.

Prediction and Thresholds:

  • -prob, --probability_threshold: Confidence threshold for predictions. (default: 0.9)
  • -o, --chunk_overlap: Realtime recording sample overlap in seconds. (default: 0.2)
  • -tw, --trigger_words: List of labels that will trigger the action. The labels from config.json can be overridden by this argument to select only some of them.
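
The interplay of these three arguments can be sketched as follows. This is an assumed decision rule, not the repository's inference class: the most probable label fires a trigger only if its probability reaches the threshold and, when trigger words are given, the label is among them. The function names window_step and pick_trigger are hypothetical.

```python
def window_step(audio_chunk: float, chunk_overlap: float) -> float:
    """Seconds between the starts of two consecutive overlapping windows."""
    if not 0 <= chunk_overlap < audio_chunk:
        raise ValueError("chunk_overlap must be in [0, audio_chunk)")
    return audio_chunk - chunk_overlap


def pick_trigger(probabilities, labels, threshold=0.9, trigger_words=None):
    """Return the label that should fire a trigger, or None if nothing qualifies."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return None  # prediction is not confident enough
    if trigger_words is not None and labels[best] not in trigger_words:
        return None  # confident, but not a label we react to
    return labels[best]


labels = ["music", "children", "saw", "neg"]
print(window_step(0.4, 0.2))
print(pick_trigger([0.02, 0.01, 0.95, 0.02], labels, trigger_words=["music", "saw"]))
print(pick_trigger([0.50, 0.30, 0.10, 0.10], labels))
```

A larger overlap means more predictions per second and a lower chance of an event falling across a window boundary, at the cost of more compute.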

Art-Net Configuration:

  • -ac, --artnet_channel: Art-Net channel to send data to. (default: 0)
  • -au, --artnet_universe: Art-Net universe to send data to. (default: 0)
  • -aip, --artnet_ip: IP address of the Art-Net node. (default: "")
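
For readers unfamiliar with Art-Net, the sketch below builds a single ArtDmx packet with one DMX channel set, using only the standard library. It follows the ArtDmx packet layout of the Art-Net protocol; how this repository actually builds and sends its triggers is not shown here, and treating the channel as a zero-based index (matching the default of 0) is an assumption. The function name build_artdmx_packet is hypothetical.

```python
import struct

ARTNET_PORT = 6454  # standard Art-Net UDP port


def build_artdmx_packet(universe: int, channel: int, value: int, sequence: int = 0) -> bytes:
    """Build an ArtDmx packet with one DMX channel set and all others at zero."""
    if not 0 <= universe < 32768:
        raise ValueError("universe must fit in 15 bits")
    if not 0 <= channel < 512:
        raise ValueError("channel must be in 0..511 (zero-based index, an assumption)")
    if not 0 <= value <= 255:
        raise ValueError("value must be in 0..255")

    dmx = bytearray(512)
    dmx[channel] = value

    header = b"Art-Net\x00"                  # packet ID
    header += struct.pack("<H", 0x5000)      # OpDmx opcode, low byte first
    header += struct.pack(">H", 14)          # protocol version, high byte first
    header += bytes([sequence, 0])           # sequence (0 = disabled), physical port
    header += struct.pack("<H", universe)    # port address, low byte first
    header += struct.pack(">H", len(dmx))    # DMX data length, high byte first
    return header + bytes(dmx)


packet = build_artdmx_packet(universe=0, channel=0, value=255)
print(len(packet))

# Sending would look like this (not executed here; the IP is a placeholder):
# import socket
# with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
#     sock.sendto(packet, ("<artnet_ip>", ARTNET_PORT))
```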

Device Configuration:

  • -dev, --device: Processing unit to use. Options are "cpu" or "gpu". (default: "cpu")
  • -mic, --mic_device: Microphone device index. (default: 0)
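
The arguments documented so far could be declared with argparse roughly as below. This is a sketch built from the flag names and defaults listed above, not the repository's actual parser; the helper names split_labels and build_parser, the help strings, and the comma-splitting of trigger words are assumptions.

```python
import argparse


def split_labels(text: str) -> list:
    """Turn 'music, children, saw' into ['music', 'children', 'saw']."""
    return [word.strip() for word in text.split(",") if word.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Real-time SED inference (sketch)")
    parser.add_argument("-conf", "--config_path", default=None,
                        help="Path to config.json generated during training")
    parser.add_argument("-prob", "--probability_threshold", type=float, default=0.9)
    parser.add_argument("-o", "--chunk_overlap", type=float, default=0.2,
                        help="Overlap of consecutive recordings in seconds")
    parser.add_argument("-tw", "--trigger_words", type=split_labels, default=None,
                        help="Comma-separated labels that fire a trigger")
    parser.add_argument("-ac", "--artnet_channel", type=int, default=0)
    parser.add_argument("-au", "--artnet_universe", type=int, default=0)
    parser.add_argument("-aip", "--artnet_ip", default="")
    parser.add_argument("-dev", "--device", choices=["cpu", "gpu"], default="cpu")
    parser.add_argument("-mic", "--mic_device", type=int, default=0)
    return parser


args = build_parser().parse_args(
    ["-conf", "MODEL/config.json", "-prob", "0.9",
     "-tw", "music, children, saw", "-o", "0.1", "-ac", "0", "-au", "0", "-aip", ""]
)
print(args.trigger_words, args.chunk_overlap, args.device)
```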

Parameters for Manual Setup (if config.json is not used):

If the config.json file is not available, the following parameters need to be set manually. This method is unreliable and not recommended.

  • --model_path: Path to the trained model file.
  • --labeler_path: Path to the labeler file.
  • --lite_model_path: Path to the trained Lite model file.
  • --scaler_path: Path to the feature scaler file.
  • --sample_rate: Audio sample rate (default: 44100).
  • --num_channels: Number of audio channels (default: 1).
  • --audio_chunk: Length of audio slice in seconds (default: 0.4).
  • --num_mels: Number of Mel bands to generate (default: 256).
  • --n_fft: Number of samples in each FFT window (default: 2048).
  • --fmax: Maximum frequency when computing MEL spectrograms (default: 22050).
  • --hop_length: Number of samples between successive FFT windows (default: 512).
  • --n_frames: Number of frames of audio to use for prediction (default: 34).
  • --data_range: Range of data values (1 or 255, default: 255).
  • --n_mfcc: Number of MFCCs to extract (default: 40).
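
Because manual setup is error-prone, it can help to sanity-check the values before running inference. The sketch below collects the documented defaults in a dataclass and flags a few obvious inconsistencies. The class name, field names, and the specific checks are hypothetical and are not part of the repository.

```python
from dataclasses import dataclass


@dataclass
class ManualSettings:
    """Documented defaults of the manual inference parameters (hypothetical container)."""
    sample_rate: int = 44100
    num_channels: int = 1
    audio_chunk: float = 0.4
    num_mels: int = 256
    n_fft: int = 2048
    fmax: int = 22050
    hop_length: int = 512
    n_frames: int = 34
    data_range: int = 255
    n_mfcc: int = 40

    def problems(self) -> list:
        """Return human-readable descriptions of inconsistent settings."""
        issues = []
        if self.fmax > self.sample_rate / 2:
            issues.append("fmax exceeds the Nyquist frequency (sample_rate / 2)")
        if self.hop_length > self.n_fft:
            issues.append("hop_length larger than n_fft skips samples between frames")
        if self.data_range not in (1, 255):
            issues.append("data_range must be 1 or 255")
        return issues


print(ManualSettings().problems())
print(ManualSettings(sample_rate=16000).problems())
```

The second call shows a typical mistake: lowering the sample rate without lowering fmax to match.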

Recommended Application Settings

  • Speech Commands: SAMPLE_RATE: 16000, N_MELS: 40, MAIN_FEATURE: mfcc, NFFT: 512, HOP_LENGTH: 256, N_MFCC: 13, AUDIO_CHUNK: 1.0, SCALER: minmax

  • Music Genre Classification: SAMPLE_RATE: 22050, N_MELS: 128, MAIN_FEATURE: mel, NFFT: 2048, HOP_LENGTH: 512, AUDIO_CHUNK: 3.0, SCALER: standard

  • Cheering Glasses: SAMPLE_RATE: 44100, N_MELS: 64, MAIN_FEATURE: mel, NFFT: 1024, HOP_LENGTH: 512, AUDIO_CHUNK: 0.6, SCALER: robust

  • Bird Song Recognition: SAMPLE_RATE: 48000, N_MELS: 128, MAIN_FEATURE: mel, NFFT: 1024, HOP_LENGTH: 256, AUDIO_CHUNK: 5.0, SCALER: minmax

  • Heartbeat Sound Detection: SAMPLE_RATE: 8000, N_MELS: 40, MAIN_FEATURE: stft, NFFT: 256, HOP_LENGTH: 128, AUDIO_CHUNK: 1.0, SCALER: standard

  • Urban Sound Classification: SAMPLE_RATE: 22050, N_MELS: 64, MAIN_FEATURE: mel, NFFT: 1024, HOP_LENGTH: 512, AUDIO_CHUNK: 4.0, SCALER: minmax
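
To compare the presets above, it can be useful to look at what NFFT and HOP_LENGTH imply at each sample rate. The sketch below computes the spacing between FFT bins (SAMPLE_RATE / NFFT) and the fraction by which consecutive windows overlap (1 - HOP_LENGTH / NFFT). The dictionary simply restates the values listed above; the function name is hypothetical.

```python
# (SAMPLE_RATE, NFFT, HOP_LENGTH) copied from the recommended settings above.
PRESETS = {
    "Speech Commands": (16000, 512, 256),
    "Music Genre Classification": (22050, 2048, 512),
    "Cheering Glasses": (44100, 1024, 512),
    "Bird Song Recognition": (48000, 1024, 256),
    "Heartbeat Sound Detection": (8000, 256, 128),
    "Urban Sound Classification": (22050, 1024, 512),
}


def describe(sample_rate: int, n_fft: int, hop_length: int) -> tuple:
    """Return (FFT bin spacing in Hz, window overlap as a fraction)."""
    bin_spacing_hz = sample_rate / n_fft
    overlap = 1 - hop_length / n_fft
    return bin_spacing_hz, overlap


for name, params in PRESETS.items():
    spacing, overlap = describe(*params)
    print(f"{name}: {spacing:.2f} Hz per bin, {overlap:.0%} window overlap")
```

A larger NFFT gives finer frequency resolution but blurs short events in time, which is why the short-event presets use smaller windows than the music preset.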


Neural networks consisting of three CNN blocks trained with default parameters on custom datasets of approximately 10,000 samples and up to 5 categories typically achieve high accuracy, often surpassing 95% in classification tasks.
While niche tasks can benefit from fine-tuning training parameters and feature engineering, increasing the dataset size and variety, as well as maintaining balanced categories, consistently yields the best results for model generalization.


This project is licensed under the Apache 2 License - see the file for details.

