
hyeonsieun/MathSpeech


MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula

Abstract

In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x), instead of the concise LaTeX format (i.e., $e^{ix} = \cos(x) + i\sin(x)$), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured LaTeX representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates LaTeX generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for LaTeX translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.

This study has been accepted to the AAAI-25 Main Technical Track.

This repository provides the benchmark dataset, experimental code, and fine-tuned model checkpoints developed for MathSpeech.

For detailed information about the dataset and for additional experimental results, such as the latency measurements in the appendix, please refer to the version uploaded on arXiv.


Benchmark Dataset

The MathSpeech benchmark dataset is available on Hugging Face 🤗 or through the following link.

Dataset statistics

| Statistic | Value |
|---|---|
| Number of files | 1,101 |
| Total duration | 5583.2 seconds |
| Average duration per file | 5.07 seconds |
| Number of speakers | 10 |
| Number of male speakers | 8 |
| Number of female speakers | 2 |
| Source | MIT OpenCourseWare |
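As a quick sanity check on the statistics above, the average duration per file follows from the total duration and the file count. The snippet below is only an illustrative calculation; the variable names are ours and do not come from the repository.

```python
# Values copied from the dataset statistics above.
num_files = 1101
total_duration_s = 5583.2

# Average clip length in seconds.
average_duration_s = total_duration_s / num_files
print(f"{average_duration_s:.2f} seconds per file")  # 5.07 seconds per file

# The same total expressed in minutes, for a sense of scale.
total_duration_min = total_duration_s / 60
print(f"{total_duration_min:.1f} minutes in total")
```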

WERs of various ASR models on the MathSpeech benchmark

| Model | Params | WER (%) (Leaderboard) | WER (%) (Formula) |
|---|---|---|---|
| OpenAI Whisper-base | 74M | 10.3 | 34.7 |
| OpenAI Whisper-small | 244M | 8.59 | 29.5 |
| OpenAI Whisper-largeV2 | 1550M | 7.83 | 31.0 |
| OpenAI Whisper-largeV3 | 1550M | 7.44 | 33.3 |
| NVIDIA Canary-1B | 1B | 6.5 | 35.2 |
The Leaderboard WER is taken from the Hugging Face Open ASR Leaderboard, while the Formula WER was measured on our MathSpeech benchmark. These values reflect results as of 2024-08-16.
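For readers unfamiliar with the metric, the sketch below shows how a word error rate (and the character error rate mentioned in the abstract) can be computed from an edit distance. It is a minimal illustration, not the evaluation code used for the table: the function names are ours, and it applies no text normalization, which real ASR evaluations usually do.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: the same idea applied to characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# The spoken Euler's formula example, with "sine" misrecognized as "side".
reference = "e to the power of i x equals cosine of x plus i sine of x"
hypothesis = "e to the power of i x equals cosine of x plus i side of x"
print(wer(reference, hypothesis))  # 0.0625 (1 substitution over 16 words)
```

A single misrecognized word costs little in WER here, yet it changes the meaning of the formula, which is why formula-specific evaluation is reported separately from the leaderboard figure.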

MathSpeech Checkpoint

You can download the MathSpeech checkpoint from the following link.

Experiment code

You can find the MathSpeech evaluation code and the prompts used for the LLMs in the experiments at the following link.

Ablation study code

You can find the code used in our Ablation Study at the following link.


How to Use

  1. Clone this repository using the web URL.
    git clone https://github.com/hyeonsieun/MathSpeech.git
  2. To build the environment, run the following command.
    pip install -r requirements.txt
  3. Place the audio dataset and the transcription Excel file inside the ASR folder.
  4. Run the following command.
    python ASR.py
  5. Go to the Experiments folder.
  6. Move 'MathSpeech_checkpoint.pth' from the following link into the Experiments folder.
  7. Run the following command.
    python MathSpeech_eval.py
  8. If you want to run LLMs such as GPT-4o or Gemini, you'll need to configure environment settings such as the API key and endpoint.
  9. You can also run the ablation study code from the Ablation_Study folder.
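For the LLM step, one common way to supply an API key and endpoint is through environment variables. The sketch below is an assumption about how this could be done, not the repository's actual configuration: the `LLMConfig` class, the `load_llm_config` helper, and the `<PREFIX>_API_KEY` / `<PREFIX>_ENDPOINT` variable names are hypothetical, so check the experiment scripts for the names they really expect.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMConfig:
    """Hypothetical container for the settings an LLM client needs."""
    api_key: str
    endpoint: Optional[str]


def load_llm_config(prefix: str) -> LLMConfig:
    """Read <PREFIX>_API_KEY and (optionally) <PREFIX>_ENDPOINT from the environment.

    The variable naming scheme is an assumption made for this example.
    """
    api_key = os.environ.get(f"{prefix}_API_KEY")
    if not api_key:
        raise RuntimeError(f"Set {prefix}_API_KEY before running the LLM experiments")
    return LLMConfig(api_key=api_key, endpoint=os.environ.get(f"{prefix}_ENDPOINT"))
```

Keeping the key in the environment rather than in the source file avoids committing credentials to the repository by accident.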

Note: The provided example code performs ASR with whisper-base and whisper-small. To use MathSpeech with a different ASR model, modify that part of the code.
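As a rough illustration of swapping in another ASR model, the sketch below transcribes a folder of audio files with the Hugging Face `transformers` speech recognition pipeline. It is not taken from `ASR.py`, whose internal structure we do not assume: the helper names, the list of audio suffixes, and the choice of `openai/whisper-large-v3` are ours, and decoding audio from file paths typically requires `ffmpeg` to be installed.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed set of audio extensions; adjust to match your dataset.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".m4a"}


def list_audio_files(folder: str) -> List[Path]:
    """Return the audio files directly inside `folder`, in sorted order."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def transcribe_files(paths: Iterable[Path],
                     model_id: str = "openai/whisper-large-v3") -> Dict[str, str]:
    """Transcribe each file and map its file name to the recognized text."""
    # Imported lazily so the file-listing helper works without transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return {p.name: asr(str(p))["text"].strip() for p in paths}


if __name__ == "__main__":
    # "ASR" is the folder named in the steps above; change it if yours differs.
    for name, text in transcribe_files(list_audio_files("ASR")).items():
        print(f"{name}\t{text}")
```

The resulting transcripts would then need to be saved in whatever format the downstream MathSpeech evaluation expects, which this sketch does not cover.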
