Blog on W2V2-BERT finetuning #1752
Looking good! Will be complete once we have the motivations for using this model nailed down (running Whisper fine-tuning for comparison now). Worth checking consistency of hyphenations (pre-trained vs pretrained, fine-tuned vs finetuned, pre-processing vs preprocessing, etc).
fine-tune-w2v2-bert.md
**Wav2Vec2** is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by *Alexei Baevski, Michael Auli, and Alex Conneau*. Soon after, the superior performance of Wav2Vec2 was demonstrated on one of the most popular English datasets for ASR, called [LibriSpeech](https://huggingface.co/datasets/librispeech_asr).
Following a series of improvements ([XLSR](https://arxiv.org/abs/2006.13979), [XLS-R](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) and [**MMS**](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)), MetaAI released their own version of [W2v-BERT](https://arxiv.org/abs/2108.06209), as a building block of their [Seamless Communication](https://ai.meta.com/research/seamless-communication/), a family of AI translation models.
One thing that's missing is an explanation of what the Wav2Vec2-BERT model actually is:
MetaAI have released [Wav2Vec2-BERT](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert), as a building block of their [Seamless Communication](https://ai.meta.com/research/seamless-communication/), a family of AI translation models. Wav2Vec2-BERT is a...
fine-tune-w2v2-bert.md
This new 580M-parameters version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.
For consistency with "pre-trained". Also left some bridging sentence ideas:
This new 580M-parameters version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.
This new 580M-parameters version was pre-trained on **4.5M** hours of unlabeled audio data covering **more than 143 languages**. In this pre-training phase, the model is trained to... In doing so, it learns... Since it is only pre-trained on un-labeled audio data, it requires fine-tuning to be used for downstream tasks such as ASR or Audio Classification.
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
PR 28165 introduced a new W2V2-based model that uses a different feature extractor than classic CTC-based models. It yields really interesting WER performance on low-resource languages when fine-tuned with little effort.
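For readers skimming the PR, here's a minimal sketch of what that feature-extractor difference looks like in practice. It assumes the `facebook/w2v-bert-2.0` checkpoint and the `transformers` classes `AutoFeatureExtractor` / `Wav2Vec2BertForCTC` (available from `transformers` v4.37 onwards); the blog post itself may use slightly different names:

```python
# Minimal sketch: Wav2Vec2-BERT consumes log-mel `input_features` produced by its
# feature extractor, rather than the raw-waveform `input_values` that classic
# CTC-based Wav2Vec2 models expect.
# Assumptions: checkpoint "facebook/w2v-bert-2.0" and transformers >= 4.37.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2BertForCTC

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    vocab_size=32,  # hypothetical vocab size; in practice taken from the fine-tuning tokenizer
)

# one second of dummy 16 kHz audio standing in for a real dataset sample
waveform = torch.randn(16000).numpy()
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(input_features=inputs["input_features"]).logits

print(logits.shape)  # (batch, frames, vocab_size)
```

The key detail is that the model takes `input_features` (filterbank frames) rather than `input_values` (raw audio), which is what the blog's preprocessing section should reflect.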
cc @sanchit-gandhi