In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., *e to the power of i x equals cosine of x plus i sine of x*), rather than the concise LaTeX format ($e^{ix} = \cos(x) + i\sin(x)$), which hampers clear understanding and communication.
Here, you can find the benchmark dataset, experimental code, and fine-tuned model checkpoints for MathSpeech, which we have developed for our research.
If you want detailed information about the dataset used in this study, or additional experimental results such as the latency measurements included in the appendix, please refer to the version uploaded on arXiv.
The MathSpeech benchmark dataset is available on Hugging Face 🤗 or through the following link.
| Statistic | Value |
|---|---|
| Number of files | 1,101 |
| Total duration | 5,583.2 seconds |
| Average duration per file | 5.07 seconds |
| Number of speakers | 10 |
| Number of male speakers | 8 |
| Number of female speakers | 2 |
| Source | [MIT OpenCourseWare] |
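As a quick-start sketch, the benchmark can be loaded with the 🤗 `datasets` library. The dataset ID below is an assumption; substitute the actual ID from the link above.

```python
# A minimal sketch: load the MathSpeech benchmark from the Hugging Face Hub.
# NOTE: the dataset ID and split names are assumptions; use the ID from the
# link above.
from datasets import load_dataset

ds = load_dataset("hyeonsieun/MathSpeech")  # hypothetical dataset ID
print(ds)  # inspect the available splits and columns
```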
| Developer | Model | Params | WER (%) (Leaderboard) | WER (%) (Formula) |
|---|---|---|---|---|
| OpenAI | Whisper-base | 74M | 10.3 | 34.7 |
| OpenAI | Whisper-small | 244M | 8.59 | 29.5 |
| OpenAI | Whisper-largeV2 | 1550M | 7.83 | 31.0 |
| OpenAI | Whisper-largeV3 | 1550M | 7.44 | 33.3 |
| NVIDIA | Canary-1B | 1B | 6.5 | 35.2 |
The Leaderboard WER comes from the Hugging Face Open ASR Leaderboard, while the Formula WER was measured on our MathSpeech benchmark. These values reflect results as of 2024-08-16.
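For reference, a WER figure like those in the table can be computed with the `jiwer` package. This is only a sketch of the metric; the exact evaluation pipeline is in the evaluation code linked below.

```python
# Sketch: computing Word Error Rate (WER) between reference transcriptions
# and ASR hypotheses, using the jiwer package as an illustration. The actual
# evaluation pipeline is in the linked MathSpeech evaluation code.
import jiwer

references = ["e to the power of i x equals cosine of x plus i sine of x"]
hypotheses = ["e to the power of x equals cosine of x plus i side of x"]

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```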
You can download the MathSpeech checkpoint from the following link.
You can find the MathSpeech evaluation code and the prompts used for the LLMs in our experiments at the following link.
You can find the code used in our Ablation Study at the following link.
- Clone this repository using the web URL.
```bash
git clone https://github.com/hyeonsieun/MathSpeech.git
```
- To set up the environment, run:
```bash
pip install -r requirements.txt
```
- Place the audio dataset and the transcription Excel file inside the ASR folder.
- Run the following command:
```bash
python ASR.py
```
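For orientation, the following is roughly the kind of transcription loop `ASR.py` runs, assuming the `openai-whisper` package. The file layout, Excel column names, and output path are illustrative assumptions; see `ASR.py` for the actual implementation.

```python
# Sketch of an ASR pass over the benchmark audio, assuming the openai-whisper
# package. The file layout, column names, and output path are illustrative
# assumptions; see ASR.py for the actual implementation.
import glob

import pandas as pd
import whisper

model = whisper.load_model("base")  # or "small"; see the note at the end

rows = []
for path in sorted(glob.glob("ASR/audio/*.wav")):  # hypothetical layout
    result = model.transcribe(path)
    rows.append({"file": path, "transcription": result["text"]})

pd.DataFrame(rows).to_excel("ASR/asr_results.xlsx", index=False)
```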
- Go to the Experiments folder.
- Download 'MathSpeech_checkpoint.pth' from the following link and place it in the Experiments folder.
- Run the following command:
```bash
python MathSpeech_eval.py
```
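Before running the evaluation, the downloaded checkpoint can be sanity-checked with PyTorch. This sketch only verifies that the file loads; the model architecture itself is defined in this repository.

```python
# Sanity-check sketch: confirm the downloaded checkpoint loads and inspect
# its top-level keys. This does not run the evaluation itself.
import torch

state = torch.load("MathSpeech_checkpoint.pth", map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:10])  # e.g. parameter names or nested sections
```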
- If you want to run LLMs like GPT-4o or Gemini, you'll need to configure the environment settings such as the API key and endpoint.
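As an example, an OpenAI-style configuration might look like the following. The `OPENAI_API_KEY` variable and client usage are standard for the `openai` package, but whether our scripts read them exactly this way is an assumption; check the evaluation code and prompts linked above.

```python
# Sketch of configuring an LLM client via environment variables, using the
# openai package as an example. Whether the evaluation scripts read these
# exact variables is an assumption; check the evaluation code for details.
import os

from openai import OpenAI

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # set your real key here
client = OpenAI()  # reads OPENAI_API_KEY from the environment
```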
- You can also run the Ablation Study code from the Ablation_Study folder.
Note: the example code provided here performs ASR with whisper-base and whisper-small. If you want to use a different ASR model, you can modify that part of the code and still apply our MathSpeech to its output.