UCSD ML/AI Capstone Project: Audio Data Download and ASR Model Transcription Testing
This project is the culmination of my work for the UCSD ML/AI Capstone course. The goal was to build a robust framework for testing Automatic Speech Recognition (ASR) models on audio data sourced from YouTube and ESPN. The project has two main components:
- Audio Data Download and Preprocessing
- ASR Model Transcription Testing Framework
The primary objective was to improve the accuracy and efficiency of ASR models in transcribing baseball game commentary, using a custom dataset derived from YouTube and ESPN APIs.
The initial concept for the capstone project emerged from the challenge of improving real-time transcription of live audio, particularly in the sports domain. Although ASR models such as OpenAI's Whisper have demonstrated impressive results, they often struggle with proper nouns, event-specific jargon, and domain-specific vocabulary, especially when transcribing live commentary.
To address this, the project aimed to:
- Download relevant audio data from YouTube and ESPN, focusing on baseball commentary and related events.
- Preprocess the data to ensure it was properly cleaned, segmented, and aligned for transcription.
- Implement a testing framework to evaluate ASR model performance, focusing on transcription accuracy, word error rate (WER), and the model's handling of domain-specific vocabulary.
In the early stages, I conducted a survey of existing research and explored available solutions for improving ASR models. This involved experimenting with fine-tuning techniques and leveraging pre-trained models like OpenAI's Whisper. I also explored methods for efficiently collecting and processing large datasets, particularly from video sources like YouTube, to ensure data quality and consistency.
Key Research Topics:
- Fine-tuning Whisper for multilingual ASR
- QLoRA (Quantized Low-Rank Adaptation) to efficiently fine-tune large language models
- Existing work on ASR models in the sports domain, particularly baseball commentary
The data collection phase involved gathering approximately 350,000 baseball-related videos from YouTube using the yt-dlp library. Videos were filtered to ensure they were full-length games, in English, and included accurate metadata.
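The filtering criteria above can be sketched as a simple predicate over metadata records. This is a minimal illustration, not the project's actual code: the field names (`id`, `title`, `duration`, `language`), the two-hour cutoff for a "full-length" game, and the function names are all assumptions made for the example.

```python
from typing import Any

# Assumed cutoff for what counts as a "full-length" game (hypothetical value).
MIN_GAME_SECONDS = 2 * 60 * 60
# Assumed set of metadata fields a record must carry to be usable.
REQUIRED_FIELDS = ("id", "title", "duration", "language")


def is_usable(video: dict[str, Any]) -> bool:
    """Return True if the record has complete metadata, is English, and is long enough."""
    if any(video.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    if not str(video["language"]).lower().startswith("en"):
        return False
    return video["duration"] >= MIN_GAME_SECONDS


def filter_videos(videos: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop unusable records and duplicates (by video id), preserving input order."""
    seen: set[str] = set()
    kept = []
    for video in videos:
        if is_usable(video) and video["id"] not in seen:
            seen.add(video["id"])
            kept.append(video)
    return kept
```

Keeping the predicate separate from the de-duplication loop makes it easy to adjust the criteria (for example, a different duration cutoff) without touching the rest of the pipeline.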
Once the data was collected, preprocessing tasks included:
- Filtering and Normalization: Ensuring consistent metadata across the dataset.
- Cleaning: Removing duplicates and handling missing data.
- Audio Extraction: Extracting audio from the video content and segmenting it for model input.
For this, I wrote the yt_dlp_async repository from scratch. It downloads YouTube videos asynchronously and includes helper functions for audio extraction and metadata management. More details are available in the yt_dlp_async repository.
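To illustrate the general idea of asynchronous downloading with a cap on concurrency, here is a minimal sketch using only the standard library. It does not reproduce the yt_dlp_async code: `download_all`, `fake_fetch`, and the `max_concurrent` parameter are hypothetical names, and the actual download step is replaced by a stand-in coroutine.

```python
import asyncio
from typing import Awaitable, Callable


async def download_all(
    video_ids: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> dict[str, str]:
    """Run fetch(video_id) for every id, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: dict[str, str] = {}

    async def worker(video_id: str) -> None:
        async with semaphore:
            try:
                results[video_id] = await fetch(video_id)
            except Exception as exc:
                # Record the failure and keep going so one bad video
                # does not abort the whole batch.
                results[video_id] = f"error: {exc}"

    await asyncio.gather(*(worker(video_id) for video_id in video_ids))
    return results


async def fake_fetch(video_id: str) -> str:
    """Hypothetical stand-in for a real download; returns an output file name."""
    await asyncio.sleep(0.01)
    return f"{video_id}.wav"
```

Calling `asyncio.run(download_all(ids, fake_fetch))` returns a dict mapping each id to either an output name or an error string. In a real pipeline, `fetch` would wrap the actual download, likely in a thread or subprocess if the underlying call is blocking.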
To evaluate the performance of ASR models, I created a testing framework that would allow for:
- Model Evaluation: Comparing transcription results against ground-truth transcripts, measured by word error rate (WER) and accuracy.
- Dynamic Prompt Generation: Incorporating domain-specific vocabulary into the prompts for Whisper, ensuring better performance on sports-related jargon.
- Segmented Audio Testing: Splitting longer audio files into smaller chunks for efficient processing and evaluation.
This testing framework was built using the whisper_chunk_transcribe repository, which processes audio data and evaluates ASR model performance. More details are available in the whisper_chunk_transcribe repository.
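For reference, WER is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It is an illustration of the metric, not the framework's implementation, and the normalization it applies (lowercasing and whitespace splitting, with no punctuation handling) is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing from hypothesis
                    curr[j - 1] + 1,     # insertion: extra word in hypothesis
                    prev[j - 1] + cost,  # substitution, or a match when cost is 0
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many extra words, so it is an error rate rather than a percentage bounded by 100.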
To deploy the model testing framework and ensure it could be used by others, I adopted several strategies:
- Open-source Distribution: I made the source code available on GitHub, including comprehensive documentation and setup instructions.
- Dockerization: To ensure the environment was consistent across different systems, I containerized the entire framework using Docker, allowing others to run the project seamlessly.
- Continuous Integration (CI): Integrated automated tests using GitHub Actions to ensure that changes made to the codebase did not break the core functionality.
The goal was to make the deployment process as simple as possible, allowing other researchers or developers to replicate or extend the project. The deployment plan is fully documented in the Step 9: Deployment Method document.
The final deployment involved packaging the application as a Docker container, pushing it to Docker Hub for easy access, and providing clear instructions for installation and usage. I also ensured that the code was version-controlled in a GitHub repository, allowing for open collaboration and contributions.
This repository handles the downloading and preprocessing of YouTube videos. It provides tools to:
- Download videos asynchronously
- Extract audio and metadata
- Clean and prepare data for further processing
Key features include:
- Asynchronous data collection
- Metadata extraction and filtering
- Integration with external databases for storage
For more information, visit the yt_dlp_async repository.
This repository is focused on evaluating and testing ASR models, particularly OpenAI's Whisper. It allows for:
- Audio segmentation and transcription testing
- Dynamic prompt generation to supply domain-specific vocabulary to the model
- Logging and reporting of model performance metrics
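One way dynamic prompt generation might look is sketched below: a list of domain terms is joined into a single string that could then be passed to Whisper as its `initial_prompt` argument. The function name, the prefix text, and the character budget are assumptions for illustration. Whisper limits prompt length in tokens rather than characters, so the character budget here is only a rough stand-in.

```python
def build_prompt(
    terms: list[str],
    max_chars: int = 600,
    prefix: str = "Baseball broadcast. Names and terms: ",
) -> str:
    """Join unique terms into one prompt string, stopping before max_chars is exceeded."""
    seen: set[str] = set()
    chosen: list[str] = []
    length = len(prefix)
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        added = len(term) + (2 if chosen else 0)  # account for the ", " separator
        if length + added > max_chars:
            break
        seen.add(key)
        chosen.append(term)
        length += added
    return prefix + ", ".join(chosen)
```

Because terms are added in order until the budget runs out, the caller should place the most important names (for example, the players in that game's lineup) first.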
Key features include:
- Transcription chunking and segment handling
- Integration with Whisper for ASR
- Performance evaluation based on WER and accuracy
For more information, visit the whisper_chunk_transcribe repository.
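The chunking idea can be sketched as a function that computes (start, end) time spans for a recording of a given length. This is a minimal illustration rather than the repository's code: the function name and the optional overlap parameter are assumptions, and the 30-second default is chosen because Whisper operates on 30-second windows.

```python
def chunk_spans(
    total_seconds: float,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 0.0,
) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds that cover the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    step = chunk_seconds - overlap_seconds
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the final (possibly shorter) chunk reached the end
        start += step
    return spans
```

A small overlap between consecutive chunks can reduce the chance of cutting a word in half at a boundary, at the cost of transcribing some audio twice.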
Throughout the development of the capstone, I submitted various interim steps to track my progress and ensure I was adhering to best practices. These included:
- Step 1: Planning - Read the planning document
- Step 3: Project Proposal - Read the project proposal
- Step 4: Survey Existing Research and Reproduce Solutions - Read the research document
- Step 5: Data Wrangling & Exploration - Read the data wrangling document
- Step 7: Experiment With Various Models - Read the experiment document
- Step 9: Deployment Method - Read the deployment strategy document
- Step 10: Deployment Solution Architecture - Read the solution architecture document
- Step 11: Deployment Implementation - Read the implementation document
- Step 12: Share Your Project with the World - Read the sharing document
Each of these steps documents the iterative process of building, testing, and deploying the model testing framework.
This capstone project represents a comprehensive effort to enhance real-time transcription of baseball game commentary using ASR models. Through iterative development, I created a robust framework for collecting and preprocessing audio data, fine-tuning ASR models, and evaluating their performance. The project is fully documented, with clear instructions for installation and deployment.
By sharing this project with the community, I hope to contribute to the ongoing development of ASR systems, particularly in the sports domain, and provide a valuable resource for other developers and researchers looking to improve transcription accuracy.
Note: This README serves as an overview of the entire capstone project. To dive deeper into any aspect of the development, please refer to the relevant markdown files and repositories linked throughout.