In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., *e to the power of i x equals cosine of x plus i sine of x*), rather than the concise LaTeX format ($e^{ix} = \cos(x) + i\sin(x)$), which hampers clear understanding and communication.
Here, you can find the benchmark dataset, experimental code, and fine-tuned model checkpoints for MathSpeech, which we have developed for our research.
If you want detailed information about the dataset used in this study, or additional experimental results such as the latency measurements included in the appendix, please refer to the version uploaded on arXiv.
The MathSpeech benchmark dataset is available on Hugging Face 🤗 or through the following link.
| Statistic | Value |
|---|---|
| Number of files | 1,101 |
| Total duration | 5,583.2 seconds |
| Average duration per file | 5.07 seconds |
| Number of speakers | 10 |
| Number of male speakers | 8 |
| Number of female speakers | 2 |
| Source | [MIT OpenCourseWare] |
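As a quick-start sketch, the benchmark can be loaded with the 🤗 `datasets` library. The dataset ID below is an assumption; substitute the actual ID from the link above.

```python
# A minimal sketch: load the MathSpeech benchmark from the Hugging Face Hub.
# NOTE: the dataset ID and split names are assumptions; use the ID from the
# link above.
from datasets import load_dataset

ds = load_dataset("hyeonsieun/MathSpeech")  # hypothetical dataset ID
print(ds)  # inspect the available splits and columns
```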
| Developer | Model | Params | WER (%) (Leaderboard) | WER (%) (Formula) |
|---|---|---|---|---|
| OpenAI | Whisper-base | 74M | 10.3 | 34.7 |
| OpenAI | Whisper-small | 244M | 8.59 | 29.5 |
| OpenAI | Whisper-largeV2 | 1550M | 7.83 | 31.0 |
| OpenAI | Whisper-largeV3 | 1550M | 7.44 | 33.3 |
| NVIDIA | Canary-1B | 1B | 6.5 | 35.2 |
The Leaderboard WER comes from the Hugging Face Open ASR Leaderboard, while the Formula WER was measured on our MathSpeech benchmark. These values reflect results as of 2024-08-16.
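For reference, a WER figure like those in the table can be computed with the `jiwer` package. This is only a sketch of the metric; the exact evaluation pipeline is in the evaluation code linked below.

```python
# Sketch: computing Word Error Rate (WER) between reference transcriptions
# and ASR hypotheses, using the jiwer package as an illustration. The actual
# evaluation pipeline is in the linked MathSpeech evaluation code.
import jiwer

references = ["e to the power of i x equals cosine of x plus i sine of x"]
hypotheses = ["e to the power of x equals cosine of x plus i side of x"]

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```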
You can download the MathSpeech checkpoint from the following link.
You can find the MathSpeech evaluation code and the prompts used for the LLMs in our experiments at the following link.
You can find the code used in our Ablation Study at the following link.
- Clone this repository using the web URL.
```bash
git clone https://github.com/hyeonsieun/MathSpeech.git
```
- To set up the environment, run:
```bash
pip install -r requirements.txt
```
- Place the audio dataset and the transcription Excel file inside the ASR folder.
- Run the following command:
```bash
python ASR.py
```
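For orientation, the following is roughly the kind of transcription loop `ASR.py` runs, assuming the `openai-whisper` package. The file layout, Excel column names, and output path are illustrative assumptions; see `ASR.py` for the actual implementation.

```python
# Sketch of an ASR pass over the benchmark audio, assuming the openai-whisper
# package. The file layout, column names, and output path are illustrative
# assumptions; see ASR.py for the actual implementation.
import glob

import pandas as pd
import whisper

model = whisper.load_model("base")  # or "small"; see the note at the end

rows = []
for path in sorted(glob.glob("ASR/audio/*.wav")):  # hypothetical layout
    result = model.transcribe(path)
    rows.append({"file": path, "transcription": result["text"]})

pd.DataFrame(rows).to_excel("ASR/asr_results.xlsx", index=False)
```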
- Go to the Experiments folder.
- Download 'MathSpeech_checkpoint.pth' from the following link and place it in the Experiments folder.
- Run the following command:
```bash
python MathSpeech_eval.py
```
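Before running the evaluation, the downloaded checkpoint can be sanity-checked with PyTorch. This sketch only verifies that the file loads; the model architecture itself is defined in this repository.

```python
# Sanity-check sketch: confirm the downloaded checkpoint loads and inspect
# its top-level keys. This does not run the evaluation itself.
import torch

state = torch.load("MathSpeech_checkpoint.pth", map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:10])  # e.g. parameter names or nested sections
```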
- If you want to run LLMs like GPT-4o or Gemini, you'll need to configure the environment settings such as the API key and endpoint.
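As an example, an OpenAI-style configuration might look like the following. The `OPENAI_API_KEY` variable and client usage are standard for the `openai` package, but whether our scripts read them exactly this way is an assumption; check the evaluation code and prompts linked above.

```python
# Sketch of configuring an LLM client via environment variables, using the
# openai package as an example. Whether the evaluation scripts read these
# exact variables is an assumption; check the evaluation code for details.
import os

from openai import OpenAI

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # set your real key here
client = OpenAI()  # reads OPENAI_API_KEY from the environment
```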
- You can also run the Ablation Study code from the Ablation_Study folder.
Note: the example code provided here performs ASR with whisper-base and whisper-small. If you want to use a different ASR model, you can modify that part of the code and still apply our MathSpeech to its output.