Synthetic Audio Data (SAD) Training

This is the repository corresponding to Synthetic Audio Helps for Cognitive State Tasks, which was accepted to NAACL 2025 Findings. It contains the code and data used for our unimodal baseline experiments, synthetic audio data generation, and multimodal fusion runs.

In the paper, we perform experiments on 7 datasets related to cognitive state and 3 unrelated "control" datasets. We find that multimodal models fine-tuned for cognitive state tasks (i.e., belief, sentiment, emotion) on text paired with synthetically generated (text-to-speech) audio outperform text-only models, even for datasets which never had human audio to begin with.

Installation

Assuming conda and poetry are installed, the project dependencies can be set up with the following commands.

conda create -n sad-training python=3.10
conda activate sad-training
poetry install --no-root

By default, all scripts log their output to /home/{username}/scratch/logs/. To change this behavior, see ~line 40 of src/core/context.py.
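
For illustration only, the override might look roughly like the sketch below; the actual variable name used in src/core/context.py may differ, so check the file itself.

# Illustrative sketch -- the real name in src/core/context.py may differ.
import os

LOG_DIR = os.path.join(os.path.expanduser("~"), "scratch", "logs")  # point elsewhere to relocate logs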

Content

A summary of the content and structure of the repository is shown below.

sad-training/
|- bin/
|  |- download.py                 - downloads datasets.
|  |- sad_generation.py           - generates synthetic audio data.
|  |- sad_training.py             - runs training experiments.
|  |- tinfo.py                    - summarizes task information.
|  |- oinfo.py                    - summarizes experiment results.
|- configs/                       - example training configurations.
|- data/
|  |- cb/                         - commitment bank data.
|  |- ...                         - more task data.
|  |- sad/                        - per-task synthetic audio data.
|- keys/                          - expected location of API keys.
|- outputs/                       - default location for experimental outputs.
|- src/
|  |- ...                         - additional utilities and code.

Usage

We briefly summarize the main functionality provided by this repository.

Getting the Data

The data, including the synthetic audio used in our experiments, is available on Hugging Face. A script can be used to download and unpack datasets into their expected locations.

./bin/download.py  # No arguments downloads everything.
./bin/download.py -d commitment_bank swbd_s                 # Specific datasets.
./bin/download.py -d commitment_bank -o custom/output/path  # Custom output path.

Generating Synthetic Audio

The following commands generate synthetic audio data for a task.

./bin/sad_generation.py -t commitment_bank --voices nova        # Basic usage.
./bin/sad_generation.py -t commitment_bank --voices nova, echo  # Multiple voices.
./bin/sad_generation.py -t commitment_bank --voices matcha      # Matcha voice.

By default, the data is placed in data/sad/{task_name}. To use the OpenAI models, you must configure an API key on line 116 of bin/sad_generation.py. To use matcha as a voice, espeak or espeak-ng must be installed on your system; refer to their documentation for details.
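
For example, on Debian/Ubuntu or macOS with Homebrew, espeak-ng can typically be installed as shown below (exact package names may vary by platform).

sudo apt-get install espeak-ng   # Debian/Ubuntu
brew install espeak-ng           # macOS (Homebrew)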

Training Models

The sad_training.py script can be used to replicate our experiments or to train models on new tasks.

# Run on all tasks in default task config
./bin/sad_training.py configs/sad_training/early-fusion.json

# Run on just one task.
./bin/sad_training.py configs/sad_training/early-fusion.json -t commitment_bank

# Run on custom configs.
./bin/sad_training.py path/to/config.json -c path/to/tasks.json

The configurations can be modified should you wish to try additional datasets, models, or conditions.

Summarizing Data

Basic information about each configured task can be printed with tinfo.py.

./bin/tinfo.py                     # Basic usage.
./bin/tinfo.py -t commitment_bank  # Single task.

If your data is properly installed, it should print something along the lines of the following.

[2025-02-04 14:56:20] [    INFO] --- ==============================[ commitment_bank ]=============================== (tinfo.py:104)
[2025-02-04 14:56:20] [    INFO] --- character count   : 28,156 (avg=84) (tinfo.py:107)
[2025-02-04 14:56:20] [    INFO] --- token count       : 7,837 (avg=23) (tinfo.py:108)
[2025-02-04 14:56:20] [    INFO] --- OpenAI cost/voice : $0.84 (tinfo.py:109)
[2025-02-04 14:56:20] [    INFO] --- gold secs         : 1,548.2 (avg=4.6) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- shimmer secs      : 1,737.0 (avg=5.2) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- onyx secs         : 1,700.7 (avg=5.1) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- nova secs         : 1,666.8 (avg=5.0) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- fable secs        : 1,721.8 (avg=5.2) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- echo secs         : 1,710.0 (avg=5.1) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- alloy secs        : 1,674.1 (avg=5.0) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- matcha secs       : 1,889.7 (avg=5.7) (tinfo.py:113)
[2025-02-04 14:56:20] [    INFO] --- 0 / 334 entries contain an error (0.0%) (tinfo.py:116)
[2025-02-04 14:56:20] [    INFO] ---     0 / 334 (0.0%) contain a tts_character_limit error (avg_badness=0) (tinfo.py:125)
[2025-02-04 14:56:20] [    INFO] ---     0 / 334 (0.0%) contain a text_model_token_limit error (avg_badness=0) (tinfo.py:125)
[2025-02-04 14:56:20] [    INFO] ---     0 / 334 (0.0%) contain a audio_model_length_limit_gold error (avg_badness=0) (tinfo.py:125)
[2025-02-04 14:56:20] [    INFO] ---     0 / 334 (0.0%) contain a audio_model_length_limit_shimmer error (avg_badness=0) (tinfo.py:125)
...

Summarizing Outputs

Experimental results can be summarized using oinfo.py.

./bin/oinfo.py                         # No arguments needed.
./bin/oinfo.py -d path/to/outputs_dir  # Custom output dir.

This collects and summarizes information from the output of sad_training.py runs. When working properly, the output will look something like this.

--------------------------------------------------------------------------------------------------------------
                     task   modality fusion_strategy     signals  fold      seed  epoch        metric    value
==============================================================================================================
                    boolq audio-only            none matcha-only   0.0 14.333333   10.0 eval_accuracy 0.697436
                    boolq audio-only            none  alloy-only   0.0 14.333333   10.0 eval_accuracy 0.694872
                    ...   ...                   ...          ...   ... ...          ...           ...      ...

Adding a Task

Adding new datasets as supported tasks is fairly straightforward. For the sake of example, suppose we want to add a dataset called exdata.

First, create a module src/data/exdata.py. Then implement a function in this module named load_kfold that returns a datasets.DatasetDict with the train and test splits you want to use. The fold (which fold to load), k (total number of folds), and seed (random seed used for splitting) parameters are all required. Here we will assume there are two versions of exdata we might want to load and pass version as an additional parameter. Since k-fold splitting is not desired for exdata, we assert that the "first" fold is always the one being loaded.

import datasets
import pandas as pd


def load_kfold(
    fold: int, k: int = 5, seed: int = 42, version: int = 1
) -> datasets.DatasetDict:
    # This corpus ships with fixed splits, so only the "first" fold is supported.
    assert fold == 0, "KFold splitting not implemented"
    df = pd.read_csv(f"path/to/corpus_v{version}.csv")
    return datasets.DatasetDict({
        split: datasets.Dataset.from_pandas(
            df[df.split == split], preserve_index=False
        )
        for split in ("train", "test")
    })
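
As a quick smoke test (a sketch which assumes the CSV path above exists), the loader can be called directly to inspect the resulting splits.

# Sketch: load the fixed split and inspect it.
dd = load_kfold(fold=0, version=1)
print(dd)                          # DatasetDict with "train" and "test" splits
print(dd["train"].column_names)    # should include the text and label columns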

Now configure tasks for your new corpus by adding entries to src/data/tasks.json.

{
    ...,
    "exdata_v1": {
        "dataset": "exdata",
        "dataset_kwargs": {"version": 1},
        "text_column": "text",
        "label_column": "label"
    },
    "exdata_v2": {
        "dataset": "exdata",
        "dataset_kwargs": {"version": 2},
        "text_column": "text",
        "label_column": "label"
    },
    ...,
}
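
It is worth confirming that the text_column and label_column named above actually exist in the splits returned by the loader. A minimal check, assuming the exdata loader from the previous step, might look like this.

# Sketch: the columns configured in tasks.json must be present in every split.
dd = load_kfold(fold=0, version=1)
for split in ("train", "test"):
    assert "text" in dd[split].column_names
    assert "label" in dd[split].column_names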

The last change to make is updating the module mapping in src/data/corpora.py to use your new module.

from ..data import exdata


CMAP = {
    ...,
    "exdata": exdata,
    ...,
}

Our new tasks should now be ready to use. We can double-check by trying to get some summary statistics.

./bin/tinfo.py -t exdata_v1 exdata_v2

There is one final configuration change required if you want to use these tasks with sad_training.py. We need an entry in configs/sad_training/tasks.json to let the script know they exist and to configure any parameters that should be overridden during training.

    "exdata_v1": {
        "metric_for_classification": "f1_per_class", # How you would like to evaluate.
        "num_train_epochs": 10,                      # Any parameters can be overridden here.
        "do_regression": true,                       # Regression or classification?
        "audio_sources": ["alloy", "matcha", ...]    # Remember to generate them first!
    },

Take a look at some of the other tasks for inspiration.

Citation

@inproceedings{soubki-etal-2025-synthetic,
    title={Synthetic Audio Helps for Cognitive State Tasks},
    author={Adil Soubki and John Murzaku and Peter Zeng and Owen Rambow},
    year={2025},
    booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
    publisher={Association for Computational Linguistics},
    url={https://arxiv.org/abs/2502.06922},
}
