
Blog on W2V2-BERT finetuning #1752

Merged 17 commits into huggingface:main on Jan 23, 2024

Conversation

ylacombe (Contributor)

#28165 introduced a new W2V2-based model that uses a different feature extractor than classic CTC-based models. It yields really interesting WER performance on low-resource languages when fine-tuned with little effort.

cc @sanchit-gandhi
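
As a rough illustration of the feature-extractor difference, here is a minimal sketch (the checkpoint names `facebook/wav2vec2-base-960h` and `facebook/w2v-bert-2.0` and the dummy waveform are assumptions for illustration, not part of this PR):

```python
import numpy as np
from transformers import AutoFeatureExtractor

speech = np.zeros(16_000, dtype=np.float32)  # 1 second of silence at 16 kHz

# Classic CTC-style Wav2Vec2 checkpoints consume the raw waveform as "input_values".
w2v2_fe = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
print(w2v2_fe(speech, sampling_rate=16_000).keys())      # includes 'input_values'

# W2V2-BERT ships a SeamlessM4T-style extractor that computes log-mel
# filter-bank features instead, exposed as "input_features".
w2v2_bert_fe = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
print(w2v2_bert_fe(speech, sampling_rate=16_000).keys()) # includes 'input_features'
```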

@sanchit-gandhi (Contributor) left a comment


Looking good! Will be complete once we have the motivations for using this model nailed down (running Whisper fine-tuning for comparison now). Worth checking consistency of hyphenation (pre-trained vs. pretrained, fine-tuned vs. finetuned, pre-processing vs. preprocessing, etc.).


**Wav2Vec2** is a pre-trained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by *Alexei Baevski, Michael Auli, and Alex Conneau*. Soon after, Wav2Vec2's superior performance was demonstrated on one of the most popular English datasets for ASR, [LibriSpeech](https://huggingface.co/datasets/librispeech_asr).

Following a series of improvements ([XLSR](https://arxiv.org/abs/2006.13979), [XLS-R](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) and [**MMS**](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)), MetaAI released their own version of [W2v-BERT](https://arxiv.org/abs/2108.06209) as a building block of [Seamless Communication](https://ai.meta.com/research/seamless-communication/), a family of AI translation models.

One thing that's missing is an explanation of what the Wav2Vec2-BERT model actually is:

MetaAI have released [Wav2Vec2-BERT](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert), as a building block of their [Seamless Communication](https://ai.meta.com/research/seamless-communication/), a family of AI translation models. Wav2Vec2-BERT is a...



This new 580M-parameter version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.

For consistency with "pre-trained". Also left some bridging sentence ideas:

Suggested change
This new 580M-parameter version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.
This new 580M-parameter version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. In this pre-training phase, the model is trained to... In doing so, it learns... Since it is only pre-trained on unlabeled audio data, it requires fine-tuning to be used for downstream tasks such as ASR or Audio Classification.
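
To make the pre-trained vs. fine-tuned distinction concrete, a minimal loading sketch (assumptions: the `facebook/w2v-bert-2.0` checkpoint name and an illustrative vocabulary size; in practice the vocabulary would come from a tokenizer built on the target language):

```python
from transformers import Wav2Vec2BertForCTC

# Load the pre-trained encoder and attach a (randomly initialised) CTC head.
# vocab_size=40 is purely illustrative.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    vocab_size=40,
)
# At this point the model cannot transcribe anything useful yet: the encoder
# was only pre-trained on unlabeled audio, so the freshly initialised CTC head
# must be fine-tuned on labeled (audio, transcription) pairs before the model
# can be used for ASR.
```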

@ylacombe merged commit f8d86d4 into huggingface:main on Jan 23, 2024
1 check passed