This is the repository corresponding to "Synthetic Audio Helps for Cognitive State Tasks", which was accepted to NAACL 2025 Findings. It contains the code and data used for our unimodal baseline experiments, synthetic audio data generation, and multimodal fusion runs.
In the paper, we perform a set of experiments on 7 cognitive state-related datasets and 3 unrelated "control" datasets, finding that multimodal models fine-tuned for cognitive state-related tasks (i.e., belief, sentiment, emotion) using synthetically generated (text-to-speech) audio and text outperform text-only models, even for datasets which never had human audio to begin with.
Assuming conda and poetry are installed, the project dependencies can be set up using the following commands.
conda create -n sad-training python=3.10
conda activate sad-training
poetry install --no-root
By default, all scripts will log their output to /home/{username}/scratch/logs/. To change this behavior, see ~line 40 of src/core/context.py.
A summary of the content and structure of the repository is shown below.
sad-training/
|- bin/
| |- download.py - downloads datasets.
| |- sad_generation.py - generates synthetic audio data.
| |- sad_training.py - runs training experiments.
| |- tinfo.py - summarizes task information.
| |- oinfo.py - summarizes experiment results.
|- configs/ - example training configurations.
|- data/
| |- cb/ - commitment bank data.
| |- ... - more task data.
| |- sad/ - per-task synthetic audio data.
|- keys/ - expected location of API keys.
|- outputs/ - default location for experimental outputs.
|- src/
| |- ... - additional utilities and code.
We briefly summarize the main functionality provided by this repository.
The data, including the synthetic audio used in our experiments, is available on Hugging Face. A script can be used to download and unpack datasets into their expected locations.
./bin/download.py # No arguments downloads everything.
./bin/download.py -d commitment_bank swbd_s # Specific datasets.
./bin/download.py -d commitment_bank -o custom/output/path # Custom output path.
The following command will generate synthetic audio data for a task.
./bin/sad_generation.py -t commitment_bank --voices nova # Basic usage.
./bin/sad_generation.py -t commitment_bank --voices nova echo # Multiple voices.
./bin/sad_generation.py -t commitment_bank --voices matcha # Matcha voice.
By default, the data is placed in data/sad/{task_name}. To use the OpenAI models, you must configure a key on line 116 of bin/sad_generation.py. To use matcha as a voice, espeak or espeak-ng must be installed on your system; refer to their documentation for details.
The sad_training.py script can be used to replicate our experiments or train new tasks.
# Run on all tasks in default task config
./bin/sad_training.py configs/sad_training/early-fusion.json
# Run on just one task.
./bin/sad_training.py configs/sad_training/early-fusion.json -t commitment_bank
# Run on custom configs.
./bin/sad_training.py path/to/config.json -c path/to/tasks.json
The configurations can be modified should you wish to try additional datasets, models, or conditions.
Basic information about each configured task can be printed with tinfo.py.
./bin/tinfo.py # Basic usage.
./bin/tinfo.py -t commitment_bank # Single task.
If your data is properly installed it should print something along the lines of the following.
[2025-02-04 14:56:20] [ INFO] --- ==============================[ commitment_bank ]=============================== (tinfo.py:104)
[2025-02-04 14:56:20] [ INFO] --- character count : 28,156 (avg=84) (tinfo.py:107)
[2025-02-04 14:56:20] [ INFO] --- token count : 7,837 (avg=23) (tinfo.py:108)
[2025-02-04 14:56:20] [ INFO] --- OpenAI cost/voice : $0.84 (tinfo.py:109)
[2025-02-04 14:56:20] [ INFO] --- gold secs : 1,548.2 (avg=4.6) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- shimmer secs : 1,737.0 (avg=5.2) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- onyx secs : 1,700.7 (avg=5.1) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- nova secs : 1,666.8 (avg=5.0) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- fable secs : 1,721.8 (avg=5.2) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- echo secs : 1,710.0 (avg=5.1) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- alloy secs : 1,674.1 (avg=5.0) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- matcha secs : 1,889.7 (avg=5.7) (tinfo.py:113)
[2025-02-04 14:56:20] [ INFO] --- 0 / 334 entries contain an error (0.0%) (tinfo.py:116)
[2025-02-04 14:56:20] [ INFO] --- 0 / 334 (0.0%) contain a tts_character_limit error (avg_badness=0) (tinfo.py:125)
[2025-02-04 14:56:20] [ INFO] --- 0 / 334 (0.0%) contain a text_model_token_limit error (avg_badness=0) (tinfo.py:125)
[2025-02-04 14:56:20] [ INFO] --- 0 / 334 (0.0%) contain a audio_model_length_limit_gold error (avg_badness=0) (tinfo.py:125)
[2025-02-04 14:56:20] [ INFO] --- 0 / 334 (0.0%) contain a audio_model_length_limit_shimmer error (avg_badness=0) (tinfo.py:125)
...
Experimental results can be summarized using oinfo.py.
./bin/oinfo.py # No arguments needed.
./bin/oinfo.py -d path/to/outputs_dir # Custom output dir.
This collects and summarizes information from the output of sad_training.py runs. When working properly, the output will look something like this.
--------------------------------------------------------------------------------------------------------------
task modality fusion_strategy signals fold seed epoch metric value
==============================================================================================================
boolq audio-only none matcha-only 0.0 14.333333 10.0 eval_accuracy 0.697436
boolq audio-only none alloy-only 0.0 14.333333 10.0 eval_accuracy 0.694872
... ... ... ... ... ... ... ... ...
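If you prefer to slice these results yourself rather than rely on oinfo.py's formatting, the summary table loads easily into pandas. The sketch below is hypothetical and not part of the repository; the two rows are copied from the example output above, and the aggregation (averaging a metric over folds and seeds per condition) is just one thing you might do with it.

```python
import io

import pandas as pd

# Two rows in the same shape as the oinfo.py summary table.
raw = io.StringIO(
    "task,modality,fusion_strategy,signals,fold,seed,epoch,metric,value\n"
    "boolq,audio-only,none,matcha-only,0.0,14.333333,10.0,eval_accuracy,0.697436\n"
    "boolq,audio-only,none,alloy-only,0.0,14.333333,10.0,eval_accuracy,0.694872\n"
)
df = pd.read_csv(raw)

# Average each metric over folds and seeds for every experimental condition.
summary = df.groupby(["task", "modality", "signals", "metric"])["value"].mean()
print(summary)
```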
Adding new datasets as supported tasks is fairly straightforward. For the sake of example, suppose we want to add a dataset called exdata.
First, create a module src/data/exdata.py. Then implement a function in this module named load_kfold that returns a dataset dict with the train and test splits you want to use. The fold (which fold to load), k (total number of folds), and seed (random seed used for splitting) are all required parameters. Here we will assume there are two versions of exdata we might want to load and pass version as an additional parameter. Since k-fold splitting is not desired for exdata, we assert that the "first" fold is always the one being loaded.
import datasets
import pandas as pd


def load_kfold(
    fold: int, k: int = 5, seed: int = 42, version: int = 1
) -> datasets.DatasetDict:
    # Only a single predefined (train, test) split exists for this corpus,
    # so reject any fold other than the first.
    assert fold == 0, "KFold splitting not implemented"
    df = pd.read_csv(f"path/to/corpus_v{version}.csv")
    return datasets.DatasetDict({
        split: datasets.Dataset.from_pandas(
            df[df.split == split], preserve_index=False
        )
        for split in ("train", "test")
    })
Now configure tasks for your new corpus by adding an entry to src/data/tasks.json.
{
    ...,
    "exdata_v1": {
        "dataset": "exdata",
        "dataset_kwargs": {"version": 1},
        "text_column": "text",
        "label_column": "label"
    },
    "exdata_v2": {
        "dataset": "exdata",
        "dataset_kwargs": {"version": 2},
        "text_column": "text",
        "label_column": "label"
    },
    ...
}
The last change to make is updating the module mapping in src/data/corpora.py to use your new module.
from ..data import exdata

CMAP = {
    ...,
    "exdata": exdata,
    ...,
}
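Conceptually, the task entry and the CMAP mapping combine to resolve a loader call. The sketch below is a hypothetical, self-contained illustration of that flow (the stand-in exdata class and its string return value are inventions for the example); it assumes dataset_kwargs are forwarded to load_kfold as keyword arguments.

```python
# Task configuration in the shape of the src/data/tasks.json entries above.
tasks = {
    "exdata_v1": {
        "dataset": "exdata",
        "dataset_kwargs": {"version": 1},
        "text_column": "text",
        "label_column": "label",
    },
}


class exdata:  # Stand-in for the real src/data/exdata.py module.
    @staticmethod
    def load_kfold(fold: int, k: int = 5, seed: int = 42, version: int = 1) -> str:
        return f"exdata v{version}, fold {fold}/{k}, seed {seed}"


CMAP = {"exdata": exdata}

# Resolve the task's dataset name to a module, forwarding dataset_kwargs.
cfg = tasks["exdata_v1"]
result = CMAP[cfg["dataset"]].load_kfold(fold=0, **cfg.get("dataset_kwargs", {}))
print(result)  # exdata v1, fold 0/5, seed 42
```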
Our new tasks should now be ready to use. We can double-check by trying to get some summary statistics.
./bin/tinfo.py -t exdata_v1 exdata_v2
There is one final configuration change required if you want to use these tasks with sad_training.py. We need an entry in configs/sad_training/tasks.json to let the script know they exist and to configure any parameters that should be overridden when training.
"exdata_v1": {
    "metric_for_classification": "f1_per_class",  # How you would like to evaluate.
    "num_train_epochs": 10,  # Any parameters can be overridden here.
    "do_regression": true,  # Regression or classification?
    "audio_sources": ["alloy", "matcha", ...]  # Remember to generate them first!
},
Take a look at some of the other tasks for inspiration.
@inproceedings{soubki-etal-2025-synthetic,
    title = {Synthetic Audio Helps for Cognitive State Tasks},
    author = {Adil Soubki and John Murzaku and Peter Zeng and Owen Rambow},
    year = {2025},
    booktitle = {Findings of the Association for Computational Linguistics: NAACL 2025},
    publisher = {Association for Computational Linguistics},
    url = {https://arxiv.org/abs/2502.06922},
}