
Blog on W2V2-BERT finetuning #1752

Merged 17 commits into huggingface:main on Jan 23, 2024

Conversation

ylacombe (Contributor)

#28165 introduced a new W2V2-based model that uses a different feature extractor than classic CTC-based models. It yields really interesting WER performance on low-resource languages when fine-tuned with little effort.

cc @sanchit-gandhi
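
As a rough illustration of the feature-extractor difference, here is a minimal sketch (the checkpoint names `facebook/wav2vec2-base-960h` and `facebook/w2v-bert-2.0` and the dummy waveform are assumptions for illustration, not part of this PR):

```python
import numpy as np
from transformers import AutoFeatureExtractor

speech = np.zeros(16_000, dtype=np.float32)  # 1 second of silence at 16 kHz

# Classic CTC-style Wav2Vec2 checkpoints consume the raw waveform as "input_values".
w2v2_fe = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
print(w2v2_fe(speech, sampling_rate=16_000).keys())      # includes 'input_values'

# W2V2-BERT ships a SeamlessM4T-style extractor that computes log-mel
# filter-bank features instead, exposed as "input_features".
w2v2_bert_fe = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
print(w2v2_bert_fe(speech, sampling_rate=16_000).keys()) # includes 'input_features'
```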

@sanchit-gandhi (Contributor) left a comment


Looking good! Will be complete once we have the motivations for using this model nailed down (running Whisper fine-tuning for comparison now). Worth checking consistency of hyphenation (pre-trained vs. pretrained, fine-tuned vs. finetuned, pre-processing vs. preprocessing, etc.).


**Wav2Vec2** is a pre-trained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by *Alexei Baevski, Michael Auli, and Alex Conneau*. Soon after, Wav2Vec2's superior performance was demonstrated on one of the most popular English datasets for ASR, [LibriSpeech](https://huggingface.co/datasets/librispeech_asr).

Following a series of improvements ([XLSR](https://arxiv.org/abs/2006.13979), [XLS-R](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) and [**MMS**](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)), MetaAI released their own version of [W2v-BERT](https://arxiv.org/abs/2108.06209) as a building block of [Seamless Communication](https://ai.meta.com/research/seamless-communication/), a family of AI translation models.

One thing that's missing is an explanation of what the Wav2Vec2-BERT model actually is:

MetaAI have released [Wav2Vec2-BERT](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert), as a building block of their [Seamless Communication](https://ai.meta.com/research/seamless-communication/), a family of AI translation models. Wav2Vec2-BERT is a...



This new 580M-parameter version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.

For consistency with "pre-trained". Also left some bridging sentence ideas:

Suggested change
This new 580M-parameter version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.
This new 580M-parameter version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. In this pre-training phase, the model is trained to... In doing so, it learns... Since it is only pre-trained on unlabeled audio data, it requires fine-tuning to be used for downstream tasks such as ASR or Audio Classification.
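
To make the pre-trained vs. fine-tuned distinction concrete, a minimal loading sketch (assumptions: the `facebook/w2v-bert-2.0` checkpoint name and an illustrative vocabulary size; in practice the vocabulary would come from a tokenizer built on the target language):

```python
from transformers import Wav2Vec2BertForCTC

# Load the pre-trained encoder and attach a (randomly initialised) CTC head.
# vocab_size=40 is purely illustrative.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    vocab_size=40,
)
# At this point the model cannot transcribe anything useful yet: the encoder
# was only pre-trained on unlabeled audio, so the freshly initialised CTC head
# must be fine-tuned on labeled (audio, transcription) pairs before the model
# can be used for ASR.
```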

@ylacombe merged commit f8d86d4 into huggingface:main on Jan 23, 2024
1 check passed