add return_token_timestamps to WhisperProcessor #30812
Conversation
`return_num_frames` in `WhisperProcessor`
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Think this is indeed the cleanest and most reliable approach for computing `num_frames`. The alternative method we discussed offline is detailed below. Leaving it here for the next reviewer to consider, in case they believe it's a superior strategy.

Anything beyond `len(input_speech)` is padded with zeros to 30 seconds in the feature extractor. If we know what zeros correspond to in log-mel space, then we can work out how many padded frames we have in our spectrogram, and thus what the original input length was.

Note that this won't be perfect: the last frame where the audio stops is going to be affected by the end of the audio, so we'll be looking for the first frame that is entirely padding (rather than finding the frame in which the audio stops).

However, the original method by OpenAI (and the one implemented in this PR) is also imperfect: if a user took a 10-second audio and padded it by hand to 15 seconds with zeros, then `num_frames` would be computed on the length of the padded input, not the original one.
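A rough sketch of that padding-detection alternative (illustrative only, not what this PR implements; it assumes the padded region maps to a constant per-bin value in log-mel space, which we estimate by featurizing pure silence):

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# estimate the value that zero-padding takes in the normalised log-mel spectrogram
silence = np.zeros(16_000, dtype=np.float32)  # 1 second of silence at 16 kHz
silence_features = feature_extractor(silence, return_tensors="np").input_features[0]
pad_value = float(silence_features[0, -1])  # every frame of pure silence shares this value


def estimate_num_frames(input_features: np.ndarray) -> int:
    """Return the index of the first frame that is entirely padding.

    `input_features` is a single (num_mels, num_frames) log-mel spectrogram.
    """
    is_pad_frame = np.all(np.isclose(input_features, pad_value), axis=0)
    pad_frames = np.nonzero(is_pad_frame)[0]
    # note: the frame in which the audio actually stops is partly affected by real
    # audio, so this can over-estimate the true length by up to one frame
    return int(pad_frames[0]) if len(pad_frames) else input_features.shape[-1]
```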
@@ -474,6 +475,13 @@ def generate(
        "The input name `inputs` is deprecated. Please make sure to use `input_features` instead.",
        FutureWarning,
    )

    if input_features is not None and isinstance(input_features, BatchFeature):
Not sure why this has crept in? `input_features` should be a tensor of shape `(bsz, num_mels, num_frames)`, not a `BatchFeature` encoding. Thus, this new logic isn't required.

The correct way of using the feature extractor should be:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(16_000))
sample = next(iter(dataset))
inputs = processor(sample["audio"]["array"], return_tensors="pt")
# note here how we un-pack the batch feature encoding
pred_ids = model.generate(**inputs, language="english")
The output of the processor would be a `BatchFeature`, as indicated here, no?
Yes, but then we un-pack the `BatchFeature` when we pass it to the model, i.e. we do:
pred_ids = model.generate(**inputs)
Not:
pred_ids = model.generate(inputs)
In this case it will work with both packed and unpacked inputs. Isn't that better?
I'm aligned with @sanchit-gandhi here - handling packed and unpacked inputs isn't something any of our other processing classes handle, so it's not something we need to introduce here
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Looks good! Mostly formatting now, then we can get a final review
@@ -474,6 +475,13 @@ def generate(
        "The input name `inputs` is deprecated. Please make sure to use `input_features` instead.",
        FutureWarning,
    )

    if input_features is not None and isinstance(input_features, BatchFeature):
Yes, but then we un-pack the `BatchFeature` when we pass it to the model, i.e. we do:
pred_ids = model.generate(**inputs)
Not:
pred_ids = model.generate(inputs)
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Thanks for adding this feature and tests!
All looks good to me - just the handling of unpacked features to remove
@@ -474,6 +475,13 @@ def generate(
        "The input name `inputs` is deprecated. Please make sure to use `input_features` instead.",
        FutureWarning,
    )

    if input_features is not None and isinstance(input_features, BatchFeature):
I'm aligned with @sanchit-gandhi here - handling packed and unpacked inputs isn't something any of our other processing classes handle, so it's not something we need to introduce here
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Thanks for adding!
…b.com:kamilakesbi/transformers into timestamps_whisper_for_conditional_generation
@@ -1927,7 +1927,117 @@ def test_large_timestamp_generation(self):

        generated_ids = model.generate(input_features, max_length=448, return_timestamps=True).to("cpu")

        EXPECTED_OUTPUT = torch.tensor([50258, 50259, 50360, 50365, 2221, 13, 2326, 388, 391, 307, 264, 50244, 295, 264, 2808, 5359, 11, 293, 321, 366, 5404, 281, 2928, 702, 14943, 13, 50629, 50682, 6966, 307, 2221, 13, 2326, 388, 391, 311, 9060, 1570, 1880, 813, 702, 1871, 13, 50870, 50911, 634, 5112, 505, 300, 412, 341, 42729, 3196, 295, 264, 1064, 11, 365, 5272, 293, 12904, 9256, 450, 10539, 949, 505, 11, 51245, 51287, 1034, 4680, 10117, 490, 3936, 293, 1080, 3542, 5160, 881, 26336, 281, 264, 1575, 13, 51494, 51523, 634, 575, 12525, 22618, 1968, 6144, 35617, 1456, 397, 266, 311, 589, 307, 534, 10281, 934, 439, 11, 51799, 51815, 50257])
        EXPECTED_OUTPUT = torch.tensor(
I don't think we want to split across the lines like this. You can wrap `EXPECTED_OUTPUT` in `# fmt: off` and `# fmt: on` comments to avoid this.
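For example, a minimal sketch of that suggestion (token values truncated here for illustration):

```python
import torch

# fmt: off
EXPECTED_OUTPUT = torch.tensor([50258, 50259, 50360, 50365, 2221, 13, 2326, 388, 391, 307, 264, 50244])
# fmt: on
```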
Ok, thanks for the tips! Will be useful ;)
You can also use `# fmt: skip` for single lines, c.f. the previous comment #30812 (comment)
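A sketch of that single-line variant (again with truncated token values):

```python
import torch

EXPECTED_OUTPUT = torch.tensor([50258, 50259, 50360, 50365, 2221, 13, 2326, 388, 391, 307, 264, 50244])  # fmt: skip
```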
cc @amyeroberts @sanchit-gandhi Could you please merge this PR as I don't have the rights to do so?
What does this PR do?

This PR fixes #30433 by making sure we can compute timestamps with both `WhisperForConditionalGeneration` and `AutomaticSpeechRecognitionPipeline`.

We add a `return_timestamps` hyperparameter to `WhisperProcessor.feature_extractor` to be used when we want to compute timestamps. When True, the processor will return a `num_frames` parameter containing the number of frames of the input audios. `num_frames` is then passed to `generate` and used to compute timestamps.

Prior to that, timestamps were broken for whisper-large-v3 when used with `WhisperForConditionalGeneration`.
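A rough end-to-end sketch of the intended usage (the exact flag names below are assumptions based on this PR's title and description, and may not match the final API):

```python
from datasets import Audio, load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(16_000))
sample = next(iter(dataset))

# ask the processor to also return the number of frames of the (unpadded) input audio
inputs = processor(
    sample["audio"]["array"],
    return_tensors="pt",
    return_token_timestamps=True,  # flag name assumed from the PR title
)

# forward num_frames to generate so per-token timestamps can be computed
outputs = model.generate(
    inputs.input_features,
    return_token_timestamps=True,
    num_frames=inputs.num_frames,  # returned by the processor per this PR's description
)
print(outputs.token_timestamps)
```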
Who can review?
cc @sanchit-gandhi