Support generating with fallback for short form audio in Whisper #30984

Merged
merged 72 commits into from
Jul 19, 2024


@kamilakesbi kamilakesbi commented May 23, 2024

What does this PR do?

The aim of this PR is to refactor the Whisper generate method so that short form and long form audio generation go through the same code path. It adds support for short form audio generation with fallback (as requested in #29508).

Here's what I've done:

Removed previous short-form scripts:

I've removed the part of the code used for short form generation. This involves lines 562 to 603 and lines 498 to 505 in main. Now when a short form audio (or a batch of short form audios) is passed to generate, it's processed by the part of the code previously used for long form generation.

Use is_shortform to still distinguish between short form and long form in some cases:

  • In the _postprocess_outputs method we only return past_key_values if the audios are short form. For long form audios it is too expensive. (cf. this line).

  • In _retrieve_max_frames_and_seek: long form audios require an attention mask, but short form audios do not. We can thus compute max_frames and seek without relying on the attention mask for short form audios.

  • I've also updated the split_by_batch_index method: the previous method was broken when return_dict_in_generate was set to True for different short form audio cases. Now it handles both short form and long form audios.

  • I've removed the is_shortform parameter from the inputs to the _retrieve_logit_processors method to allow the use of generation_config.no_speech_threshold for short form audios.

  • I've removed the is_shortform parameter from the inputs to the _set_return_outputs method to allow the use of logprob_threshold for short form audios.
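
The attention-mask point above can be sketched as follows. This is a minimal pure-Python illustration, not the code from this PR: the function name `retrieve_max_frames_and_seek_sketch`, its arguments, and the use of plain lists instead of tensors are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def retrieve_max_frames_and_seek_sketch(
    batch_size: int,
    total_input_frames: int,
    is_shortform: bool,
    attention_mask: Optional[List[List[int]]] = None,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: per-item frame budget and starting offsets."""
    if is_shortform:
        # Short form: every item spans the whole input, so no mask is needed.
        max_frames = [total_input_frames] * batch_size
    else:
        if attention_mask is None:
            raise ValueError("long-form inputs need an attention mask in this sketch")
        # Long form: the mask tells us how many frames of each item are real audio.
        max_frames = [sum(row) for row in attention_mask]
    # Decoding always starts at frame 0 for every item in the batch.
    seek = [0] * batch_size
    return max_frames, seek
```

In this sketch the short form branch never reads `attention_mask`, which is the property the bullet describes.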

Make num_return_sequences>1 compatible with generate_with_fallback:

  • This is a bit tricky because generate_with_fallback can't handle num_return_sequences>1 by design. I've added a new method, called _expand_variables_for_generation , which expands the different variables before passing into generate_with_fallback when generation_config.num_return_sequences>1. After expansion it will set generation_config.num_return_sequences to 1 for compatibility with generate_with_fallback.
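
The expansion step can be illustrated with a small sketch. It is not the `_expand_variables_for_generation` implementation: `ToyGenerationConfig`, `repeat_interleave`, and `expand_for_generation` are hypothetical names, and plain lists stand in for the tensors the real method would handle.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ToyGenerationConfig:
    """Hypothetical stand-in for a generation config."""
    num_return_sequences: int = 1


def repeat_interleave(items: Sequence[T], repeats: int) -> List[T]:
    # [a, b] with repeats=2 becomes [a, a, b, b]: copies of one item stay adjacent.
    return [item for item in items for _ in range(repeats)]


def expand_for_generation(
    input_features: Sequence[T],
    seek: Sequence[int],
    config: ToyGenerationConfig,
) -> Tuple[List[T], List[int], ToyGenerationConfig]:
    """Repeat every per-item variable, then ask for one sequence per (expanded) item."""
    n = config.num_return_sequences
    if n <= 1:
        return list(input_features), list(seek), config
    expanded_features = repeat_interleave(input_features, n)
    expanded_seek = repeat_interleave(seek, n)
    # The fallback loop only ever sees num_return_sequences == 1.
    return expanded_features, expanded_seek, replace(config, num_return_sequences=1)
```

The design choice illustrated here is that the batch grows by a factor of `num_return_sequences`, so the downstream fallback logic does not need to know that several sequences per input were requested.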

Ensure that the output format for short form audio is compatible with the output format in main:

The output format for long-form audio is different from that for short-form audio. To ensure that the output for short form audio matches what main returns, we need to add a few post-processing steps. This is what is done in lines 721 to 765. In particular:

  • We add an EOS token to the output sequence as it was removed during generation with fallback.
  • We return the token timestamps if return_token_timestamps is True in the correct format (see here).
  • If return_dict_in_generate is True, we use the new method _stack_split_outputs to get the output dict (containing all attributes, such as scores and encoder_attentions) in the right format. _stack_split_outputs basically performs the opposite operation to split_by_batch_index.
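
A toy sketch of the split and stack relationship, and of re-appending the EOS token, is given below. It is only an illustration under assumptions: the functions `split_outputs`, `stack_outputs`, and `append_eos` are hypothetical, and nested lists replace the tensors and output objects used by the real methods.

```python
from typing import Any, Dict, List


def split_outputs(outputs: Dict[str, Any], batch_size: int) -> List[Dict[str, Any]]:
    """Turn one batched dict into one dict per batch item.

    Assumed layout: outputs["sequences"][batch][token] and
    outputs["scores"][step][batch].
    """
    return [
        {
            "sequences": outputs["sequences"][i],
            "scores": [step[i] for step in outputs["scores"]],
        }
        for i in range(batch_size)
    ]


def stack_outputs(per_item: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Inverse of split_outputs: rebuild the batched layout."""
    num_steps = len(per_item[0]["scores"])
    return {
        "sequences": [item["sequences"] for item in per_item],
        "scores": [[item["scores"][t] for item in per_item] for t in range(num_steps)],
    }


def append_eos(sequence: List[int], eos_token_id: int) -> List[int]:
    # Re-attach the EOS token that was stripped during generation with fallback.
    return sequence + [eos_token_id]
```

Splitting and then stacking returns the original structure in this sketch, which is the sense in which the two operations are opposites.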

Make failing slow tests pass:

  • I've updated some failing slow tests and made them pass (see here).

Add new tests to make sure generation with fallback works for short form audios:

I've added two tests: test_whisper_shortform_single_batch_prev_cond and test_whisper_shortform_multi_batch_hard_prev_cond.
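
For readers unfamiliar with the behaviour these tests exercise, here is a minimal sketch of a temperature fallback loop. It is not the library's `generate_with_fallback`: the function name, the `decode` callback, and the default threshold values are assumptions chosen for illustration, and the compression ratio is computed with `zlib` on the decoded text.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    # Highly repetitive text compresses well, giving a large ratio.
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def generate_with_fallback_sketch(
    decode: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
) -> Tuple[str, float, float]:
    """Retry decoding at increasing temperatures until the result looks acceptable.

    `decode(temperature)` is a hypothetical callback returning
    (text, average log-probability). Returns (text, avg_logprob, temperature).
    """
    if not temperatures:
        raise ValueError("at least one temperature is required")
    result = ("", float("-inf"), temperatures[0])
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # accept the first attempt that passes both checks
    # If every attempt fails, the last attempt is returned.
    return result
```

With this PR, a short form input goes through the same kind of retry loop that was previously reserved for long form inputs.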

Who can review:

@sanchit-gandhi

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kamilakesbi kamilakesbi force-pushed the fallback_short_form branch from 956cfb4 to 07e7db3 Compare May 24, 2024 14:02
Contributor

@sanchit-gandhi sanchit-gandhi left a comment


Looks like a good start @kamilakesbi. Two biggest suggestions are related to the designs of i) assisted generation, and ii) num return sequences. Think both can be simplified and assisted generation made more rigorous.

Two further design questions:

  1. Should we return the original decoder_input_ids and EOS tokens in the sequences for long-form generation as well? IMO it is an inconsistency that we return them for short-form but not long-form, and I would be in favour of unifying the two in this PR.
  2. Is it correct to de-activate beam search when temperature>0? We currently don't do this for long-form generation, but given the original Whisper repo does, it would be good to determine whether this is a 'bug' or an intended design decision

kamilakesbi and others added 8 commits May 29, 2024 19:02
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@kamilakesbi
Contributor Author

@ArthurZucker thanks for your review! I took your remarks into account :)

Failing tests are unrelated to this PR. If this is ok for you we can perhaps merge or wait for the CI to be green...

@kamilakesbi kamilakesbi requested a review from ArthurZucker July 17, 2024 09:38
@ArthurZucker
Collaborator

Let's wait for the full CI, seems alright now!

@ArthurZucker
Collaborator

Also a question not answered!

@kamilakesbi
Contributor Author

The CI is green yes :) if it's ok for you I can merge!

@kamilakesbi kamilakesbi force-pushed the fallback_short_form branch from a00d2e8 to 6b7b3d6 Compare July 18, 2024 14:45
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks! Last two nits and you can merge!


5 participants