Add AutoFeatureExtractor support to Wav2Vec2ProcessorWithLM #28706

Merged

Changes from 2 commits
@@ -70,15 +70,15 @@ class Wav2Vec2ProcessorWithLM(ProcessorMixin):
with language model support into a single processor for language model boosted speech recognition decoding.

Args:
- feature_extractor ([`Wav2Vec2FeatureExtractor`]):
- An instance of [`Wav2Vec2FeatureExtractor`]. The feature extractor is a required input.
+ feature_extractor ([`Wav2Vec2FeatureExtractor`] or [`SeamlessM4TFeatureExtractor`]):
+ An instance of [`Wav2Vec2FeatureExtractor`] or [`SeamlessM4TFeatureExtractor`]. The feature extractor is a required input.
tokenizer ([`Wav2Vec2CTCTokenizer`]):
An instance of [`Wav2Vec2CTCTokenizer`]. The tokenizer is a required input.
decoder (`pyctcdecode.BeamSearchDecoderCTC`):
An instance of [`pyctcdecode.BeamSearchDecoderCTC`]. The decoder is a required input.
"""

feature_extractor_class = "Wav2Vec2FeatureExtractor"
feature_extractor_class = "AutoFeatureExtractor"
@sanchit-gandhi (Contributor) commented on Jan 26, 2024:

Perhaps explicitly defining the valid feature extractor classes is less vulnerable to unexpected behaviour? Or do you think we should accept all feature extractors as valid attributes of the Wav2Vec2 processor? I would be in favour of the former

Suggested change:
- feature_extractor_class = "AutoFeatureExtractor"
+ feature_extractor_class = ("Wav2Vec2FeatureExtractor", "SeamlessM4TFeatureExtractor")

Contributor Author (the PR author):

Because of the tokenizer behaviour, when a tuple is passed to `..._class`, the second class is chosen by default if `use_fast=True` (and the first class otherwise). So in that case, it won't work.

This is the first time we've had to handle a processor with two possible classes. I'd be happy to modify the `ProcessorMixin` behaviour, but I believe passing `AutoFeatureExtractor` is an easier workaround, especially since I don't really see a case in which a user would pass another feature extractor yet!

cc @amyeroberts and @ArthurZucker, what do you think of this?
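To make the concern concrete, here is a minimal standalone sketch of the (slow, fast) tuple semantics described above. It is an illustration only, not the actual ProcessorMixin code; the function name and signature are invented for the example.

```python
# Illustration only: mimics the described behaviour where a tuple passed to a
# `*_class` attribute is treated as (slow_class, fast_class) and resolved via
# `use_fast`, rather than by inspecting the checkpoint.
def resolve_attribute_class(class_names, use_fast=True):
    """Pick which class name a processor attribute would be loaded as."""
    if isinstance(class_names, tuple):
        slow_name, fast_name = class_names
        # With use_fast=True (the default), the second entry wins, so a
        # ("Wav2Vec2FeatureExtractor", "SeamlessM4TFeatureExtractor") tuple
        # would always resolve to SeamlessM4TFeatureExtractor.
        return fast_name if use_fast and fast_name is not None else slow_name
    return class_names


print(resolve_attribute_class(("Wav2Vec2FeatureExtractor", "SeamlessM4TFeatureExtractor")))
# -> SeamlessM4TFeatureExtractor
```

Under these semantics, the tuple suggested above would always resolve to the second class instead of accepting either one, which is the problem the comment describes.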

Contributor:

I see. Good with me to use AutoFeatureExtractor in this case, since changing the ProcessorMixin would add unnecessary complexity to handle this unique example.

Collaborator:

This isn't the first processor to handle the case of two objects -- that's Instruct BLIP -- although it's slightly different, as that processor has two tokenizers rather than accepting either of two tokenizers.

I'm not a fan of using AutoFeatureExtractor here, as the processor doesn't accept any feature extractor, so the type is misleading. Ideally the processor should be adapted so that, e.g. for tokenizers, it can accept a list of lists, e.g. [[TokenizerA, TokenizerAFast], [TokenizerB, TokenizerBFast]], and accept lists of objects for the other types, e.g. [FeatureExtractorA, FeatureExtractorB, FeatureExtractorC].

At the very least, the processor doc should indicate that only these two types are accepted and that type verification happens in the init.
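For illustration, the attribute shapes being proposed might look roughly like the following. This is purely hypothetical; ProcessorMixin does not support these shapes today, and the class below is only a sketch of the idea.

```python
# Hypothetical shape of the proposal above; not supported by ProcessorMixin today.
class Wav2Vec2ProcessorWithLMProposalSketch:
    # A flat list enumerating every accepted feature extractor class.
    feature_extractor_class = ["Wav2Vec2FeatureExtractor", "SeamlessM4TFeatureExtractor"]
    # A list of (slow, fast) pairs; this processor only has a slow tokenizer.
    tokenizer_class = [["Wav2Vec2CTCTokenizer", None]]
```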

@NielsRogge (Contributor) commented on Mar 3, 2024:

I am hitting the same issue at #29414. However, I don't really see the problem with using auto classes in the processors; I'd just like it to be consistent (use auto classes for both the feature extractor and the tokenizer, or for neither). If auto classes are used, it would indeed be great to explicitly check for the supported classes.

This PR currently uses AutoFeatureExtractor for the feature extractor but a specific tokenizer class for the tokenizer. Hence I've opened #29414 to make the discussion more general; feel free to comment there!

tokenizer_class = "Wav2Vec2CTCTokenizer"

def __init__(
@@ -93,6 +93,11 @@ def __init__(
if not isinstance(decoder, BeamSearchDecoderCTC):
raise ValueError(f"`decoder` has to be of type {BeamSearchDecoderCTC.__class__}, but is {type(decoder)}")

if feature_extractor.__class__.__name__ not in ["Wav2Vec2FeatureExtractor", "SeamlessM4TFeatureExtractor"]:
raise ValueError(
f"`feature_extractor` has to be of type `Wav2Vec2FeatureExtractor` or `SeamlessM4TFeatureExtractor`, but is {type(feature_extractor)}"
)

Comment on lines +96 to +100
Contributor Author (the PR author):

Not sure if this is the most transformers-like way of checking this, WDYT?

Collaborator:

LGTM!

@NielsRogge (Contributor) commented on Mar 3, 2024:

Perhaps it would be more general to add this verification to the ProcessorMixin class rather than manually defining this ValueError (we have this case for various models). The attribute could be called supported_feature_extractors, for instance, and the mixin class would then automatically raise a ValueError if the class is not supported.

Alternatively, this could be checked automatically based on feature_extraction_auto.py, where wav2vec2_with_lm would map to both Wav2Vec2FeatureExtractor and SeamlessM4TFeatureExtractor.
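A rough sketch of how such a hook could look, purely for illustration. The supported_feature_extractors attribute and the helper below are hypothetical and are not part of ProcessorMixin or of this PR.

```python
# Hypothetical sketch of the proposed mixin-level check; not real transformers code.
class ProcessorValidationSketch:
    # Subclasses list the feature extractor class names they accept;
    # None means "no restriction".
    supported_feature_extractors = None

    def _validate_feature_extractor(self, feature_extractor):
        supported = self.supported_feature_extractors
        if supported is not None and feature_extractor.__class__.__name__ not in supported:
            raise ValueError(
                f"`feature_extractor` has to be one of {supported}, but is {type(feature_extractor)}"
            )


class Wav2Vec2ProcessorWithLMSketch(ProcessorValidationSketch):
    supported_feature_extractors = ("Wav2Vec2FeatureExtractor", "SeamlessM4TFeatureExtractor")
```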

Collaborator:

I think Auto classes are only used for two processors at the moment (neither merged)? If we enable using auto classes and then add validation in the ProcessorMixin, we're just doing a more round-about alternative to simply being able to specify more than one class for feature_extractor_class, which I think would be preferable.

# make sure that decoder's alphabet and tokenizer's vocab match in content
missing_decoder_tokens = self.get_missing_alphabet_tokens(decoder, tokenizer)
if len(missing_decoder_tokens) > 0:
@@ -117,7 +122,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):

<Tip>

- This class method is simply calling Wav2Vec2FeatureExtractor's
+ This class method is simply calling the feature extractor's
[`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], Wav2Vec2CTCTokenizer's
[`~tokenization_utils_base.PreTrainedTokenizerBase.from_pretrained`], and
[`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].
@@ -214,8 +219,8 @@ def get_missing_alphabet_tokens(decoder, tokenizer):

def __call__(self, *args, **kwargs):
"""
- When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
- [`~Wav2Vec2FeatureExtractor.__call__`] and returns its output. If used in the context
+ When used in normal mode, this method forwards all its arguments to the feature extractor's
+ [`~FeatureExtractionMixin.__call__`] and returns its output. If used in the context
[`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.__call__`]. Please refer to the docstring of the above two
methods for more information.
@@ -253,8 +258,8 @@ def __call__(self, *args, **kwargs):

def pad(self, *args, **kwargs):
"""
- When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
- [`~Wav2Vec2FeatureExtractor.pad`] and returns its output. If used in the context
+ When used in normal mode, this method forwards all its arguments to the feature extractor's
+ [`~FeatureExtractionMixin.pad`] and returns its output. If used in the context
[`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.pad`]. Please refer to the docstring of the above two methods
for more information.
23 changes: 22 additions & 1 deletion tests/models/wav2vec2_with_lm/test_processor_wav2vec2_with_lm.py
@@ -25,7 +25,7 @@
from datasets import load_dataset
from parameterized import parameterized

- from transformers import AutoProcessor
+ from transformers import AutoFeatureExtractor, AutoProcessor
from transformers.models.wav2vec2 import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor
from transformers.models.wav2vec2.tokenization_wav2vec2 import VOCAB_FILES_NAMES
from transformers.testing_utils import require_pyctcdecode, require_torch, require_torchaudio, slow
@@ -157,6 +157,27 @@ def test_feature_extractor(self):
for key in input_feat_extract.keys():
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

def test_another_feature_extractor(self):
Collaborator:

We should also test that a feature extractor that isn't SeamlessM4T or Wav2Vec2 raises an error on instantiation (see the sketch after this test for one possible version).

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
tokenizer = self.get_tokenizer()
decoder = self.get_decoder()

processor = Wav2Vec2ProcessorWithLM(tokenizer=tokenizer, feature_extractor=feature_extractor, decoder=decoder)

raw_speech = floats_list((3, 1000))

input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
input_processor = processor(raw_speech, return_tensors="np")

for key in input_feat_extract.keys():
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

self.assertListEqual(
processor.model_input_names,
feature_extractor.model_input_names,
msg="`processor` and `feature_extractor` model input names do not match",
)
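Following the reviewer's suggestion above, a negative test might look roughly like the sketch below. It is hypothetical and not part of this PR: it assumes the existing get_tokenizer and get_decoder helpers of this test class, and openai/whisper-tiny is used only as an example of a checkpoint whose feature extractor is neither of the two supported classes.

```python
# Hypothetical negative test (not in this PR): constructing the processor with
# an unsupported feature extractor should raise a ValueError.
def test_unsupported_feature_extractor_raises(self):
    # WhisperFeatureExtractor is neither Wav2Vec2FeatureExtractor nor
    # SeamlessM4TFeatureExtractor, so the __init__ check should reject it.
    feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-tiny")
    tokenizer = self.get_tokenizer()
    decoder = self.get_decoder()

    with self.assertRaises(ValueError):
        Wav2Vec2ProcessorWithLM(tokenizer=tokenizer, feature_extractor=feature_extractor, decoder=decoder)
```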

def test_tokenizer(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()