This repository contains the code used in the AI CUP 2023 Spring Multimodal Pathological Voice Classification Competition, where we achieved 8th place on the public leaderboard and 1st place on the private leaderboard, with scores of 0.657057 and 0.641098, respectively.
You can clone this repository with the following command:

```bash
git clone https://github.com/jwliao1209/Multimodal-Pathological-Voice-Classification.git
```
The feature extraction process consists of two key steps, sketched in code after this list:
- Global Feature Extraction: We apply the Fast Fourier Transform (FFT) to obtain frequency-domain features and compute statistical indicators over the spectrum to construct global features.
- Local Feature Extraction: A pre-trained deep learning model extracts local features, which are then reduced with Principal Component Analysis (PCA) to retain the most informative feature combinations.
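A minimal sketch of the two branches, assuming `librosa` and `scikit-learn` are installed; the MFCC embedding stands in for the pre-trained deep model, and the specific statistics, sampling rate, and PCA dimensionality are illustrative assumptions rather than the exact competition settings:

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA


def extract_global_features(wav: np.ndarray) -> np.ndarray:
    """FFT magnitude spectrum summarized by simple statistical indicators."""
    spectrum = np.abs(np.fft.rfft(wav))
    return np.array([
        spectrum.mean(),
        spectrum.std(),
        spectrum.max(),
        np.median(spectrum),
        float(np.argmax(spectrum)),  # index of the dominant frequency bin
    ])


def extract_local_features(wavs: list[np.ndarray], sr: int = 16000) -> np.ndarray:
    """Per-clip embeddings reduced with PCA.

    MFCC means stand in here for the pre-trained deep model's embeddings;
    swap in the real model's outputs in practice.
    """
    embeddings = np.stack([
        librosa.feature.mfcc(y=w, sr=sr, n_mfcc=40).mean(axis=1) for w in wavs
    ])
    # Keep a handful of components; requires at least that many clips.
    return PCA(n_components=10).fit_transform(embeddings)
```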
For the model training phase, we train tree-based machine learning models, namely Random Forest and LightGBM, along with TabPFN, a transformer-based deep learning model for tabular data. An ensemble method then combines the predicted probabilities from these models to produce the final output, as sketched below.
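A minimal sketch of the probability-averaging ensemble, assuming the `scikit-learn`, `lightgbm`, and `tabpfn` packages; the hyperparameters and equal weights are illustrative placeholders, not the tuned competition values:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from tabpfn import TabPFNClassifier


def ensemble_predict(X_train, y_train, X_test, weights=(1.0, 1.0, 1.0)):
    """Average the class probabilities of the three models and take the argmax."""
    models = [
        RandomForestClassifier(n_estimators=500, random_state=0),
        LGBMClassifier(n_estimators=500, random_state=0),
        TabPFNClassifier(),
    ]
    probas = []
    for model in models:
        model.fit(X_train, y_train)
        probas.append(model.predict_proba(X_test))
    # Weighted soft voting over the per-model probability matrices.
    avg_proba = np.average(np.stack(probas), axis=0, weights=weights)
    return avg_proba.argmax(axis=1)
```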
To set up the environment, run the following commands:
```bash
conda create --name audio python=3.10
conda activate audio
pip install -r requirements.txt
```
To preprocess the dataset, run the command:
```bash
python process_data.py
```
To train the models, run the following command:

```bash
python train.py
```
For inference, run the following command:
```bash
python inference.py
```
We ran the code in an environment with Ubuntu 22.04.1, a 12th Gen Intel(R) Core(TM) i7-12700 CPU, and a single NVIDIA GeForce RTX 4090 GPU with 24 GB of dedicated memory.
```bibtex
@misc{multimodal_pathological_voice_classification_2023,
  title  = {Multimodal Pathological Voice Classification},
  author = {Jia-Wei Liao and Chun-Hsien Chen and Shu-Cheng Zheng and Yi-Cheng Hung},
  url    = {https://github.com/jwliao1209/Multimodal-Pathological-Voice-Classification},
  year   = {2024}
}
```