[FLAX] Whisper #19512

Closed
wants to merge 11 commits into from

Conversation

kamalkraj
Contributor

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@kamalkraj
Contributor Author

Hi,

I need a little clarification about implementing the FlaxWhisperDecoder module.

What would be the best way to pass past_key_values_length to the module?

For reference, the PyTorch implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py#L863-L873
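
For example, would something along these lines be reasonable: passing explicit position_ids (offset by the caller during cached decoding) instead of a scalar past_key_values_length, similar to how the existing Flax seq2seq models such as FlaxBart handle it? A rough sketch (module name, dimensions, and defaults below are only placeholders):

import flax.linen as nn

class FlaxWhisperDecoderEmbedSketch(nn.Module):
    # Hypothetical sketch: only the positional-embedding lookup is shown.
    d_model: int = 384               # placeholder hidden size
    max_target_positions: int = 448  # placeholder max decoder length

    def setup(self):
        self.embed_positions = nn.Embed(self.max_target_positions, self.d_model)

    def __call__(self, inputs_embeds, position_ids):
        # position_ids has shape (batch, seq_len); during cached decoding the
        # caller offsets it by the number of tokens already generated, which
        # plays the role of past_key_values_length in the PyTorch code.
        return inputs_embeds + self.embed_positions(position_ids.astype("i4"))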

@patrickvonplaten @ydshieh @patil-suraj

@patrickvonplaten
Contributor

Whisper on TPU will make 🔥 colab demos

@kamalkraj
Contributor Author

[Screenshot attached, 2022-10-16]

@ArthurZucker
Collaborator

Awesome work here! Feel free to ping me for a review once it is ready 😄

@kamalkraj
Contributor Author

kamalkraj commented Oct 19, 2022

Hi,

I have finished the model and am working on the test cases now.
The PT<->Flax equivalence test is failing, even though model.generate produces exactly the same speech-to-text output as the PyTorch model.

[Screenshot of the failing PT<->Flax equivalence test output, 2022-10-19]

I have attached steps to reproduce the issue in this notebook - https://colab.research.google.com/drive/1KmO8OBUpHfs1uYA_eSwamQAXnjsdbkRS?usp=sharing
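
For anyone who cannot open the notebook, the core of the check is roughly the following (the checkpoint name and dummy input shapes are placeholders; FlaxWhisperForConditionalGeneration is the class added in this branch):

import numpy as np
import torch
from transformers import WhisperForConditionalGeneration
# Available on this PR's branch:
from transformers import FlaxWhisperForConditionalGeneration

model_id = "openai/whisper-tiny.en"  # placeholder checkpoint
pt_model = WhisperForConditionalGeneration.from_pretrained(model_id)
fx_model = FlaxWhisperForConditionalGeneration.from_pretrained(model_id, from_pt=True)

# Whisper expects log-mel features of shape (batch, 80 mel bins, 3000 frames).
input_features = np.random.randn(1, 80, 3000).astype("float32")
decoder_input_ids = np.array([[pt_model.config.decoder_start_token_id]])

with torch.no_grad():
    pt_logits = pt_model(
        torch.tensor(input_features),
        decoder_input_ids=torch.tensor(decoder_input_ids),
    ).logits.numpy()

fx_logits = np.asarray(
    fx_model(input_features, decoder_input_ids=decoder_input_ids).logits
)

print("max abs diff in logits:", np.abs(pt_logits - fx_logits).max())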

Any pointers will be helpful.

Thanks

@patrickvonplaten @patil-suraj @ydshieh @ArthurZucker

@ydshieh
Collaborator

ydshieh commented Oct 20, 2022

Hi @kamalkraj, first of all, thank you for this awesome PR!

Regarding the PT/Flax tests, I probably need to improve the PT/Flax equivalence test to make it (a bit) easier to find out which layers give the largest differences.

In the meantime, I have to say there is no easy way to debug such an issue. We need patience to find the first layer(s) with a large difference (greater than the tolerance) and then see what's wrong inside that layer.

This is usually a tedious process that involves manual debugging.

Anyway, I can open a PR to make the process (a bit) easier, if you want to wait a bit. But note that a similar process will still be needed even after that PR is merged.

@ArthurZucker
Collaborator

Will try to get #18420 merged so that we can maybe use the find_pt_fx_differences(pt_outputs, fx_outputs) function! But in the meantime, you should set output_hidden_states=True and check where the lists differ 🤗
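
A rough sketch of that kind of check (the helper name and tolerance below are only illustrative, not an existing utility):

import numpy as np
import torch

def report_hidden_state_diffs(pt_model, fx_model, input_features, decoder_input_ids, atol=1e-3):
    """Run both models with output_hidden_states=True and print the per-layer
    max absolute difference, so the first diverging layer stands out."""
    with torch.no_grad():
        pt_out = pt_model(
            torch.tensor(input_features),
            decoder_input_ids=torch.tensor(decoder_input_ids),
            output_hidden_states=True,
        )
    fx_out = fx_model(
        input_features,
        decoder_input_ids=decoder_input_ids,
        output_hidden_states=True,
    )
    for i, (pt_h, fx_h) in enumerate(
        zip(pt_out.decoder_hidden_states, fx_out.decoder_hidden_states)
    ):
        diff = np.abs(pt_h.numpy() - np.asarray(fx_h)).max()
        marker = "  <-- above tolerance" if diff > atol else ""
        print(f"decoder hidden state {i}: max abs diff = {diff:.3e}{marker}")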

@ydshieh
Collaborator

ydshieh commented Oct 20, 2022

Hi @kamalkraj, actually that test is already quite good, but we need to change it a bit to debug further.

The last 2 commits in this branch log more information.

If you run the tests like

RUN_PT_FLAX_CROSS_TESTS=true python3 -m pytest -v tests/models/whisper/test_modeling_flax_whisper.py -k "test_equivalence_pt_to_flax"

it logs something like

max diff. in outputs.logits: 0.0020506680011749268

but it doesn't fail the test; it continues. So far, I got:

E   AssertionError: <class 'list'> != <class 'tuple'> : outputs.decoder_hidden_states: Output types differ between Flax and PyTorch

so you will have to look at the output type of decoder_hidden_states and make sure the type is the same as on the PyTorch side.
Continuing this process will eventually show you all the differences and give you a better idea of where to debug in the modeling code.
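
For example, if the Flax module collects the per-layer states in a Python list, casting it before building the output object makes the type match (just an illustration, not code from this PR):

import jax.numpy as jnp
from transformers.modeling_flax_outputs import FlaxBaseModelOutput

# Stand-in per-layer hidden states; in the real module these come from the decoder layers.
all_hidden_states = [jnp.zeros((1, 4, 8)) for _ in range(3)]

outputs = FlaxBaseModelOutput(
    last_hidden_state=all_hidden_states[-1],
    hidden_states=tuple(all_hidden_states),  # tuple, not list, to mirror PyTorch
)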

Also, it seems that when running the tests from tests/models/whisper/test_modeling_whisper.py, we have a shape issue. This is another thing to debug.

Hopefully this gives you some idea of how we can debug here 🤗

@kamalkraj
Contributor Author

Thanks, @ydshieh and @ArthurZucker

@andyehrenberg
Contributor

andyehrenberg commented Nov 25, 2022

To make for a more consistent API across models, couldn't we swap out past_key_values_length and instead compute position_ids to get the current positional embeddings for the decoder? It feels like this would make it easier to fit Whisper in with other fine-tuning codebases (no need to create custom logic for computing past_key_values_length when dealing with Whisper). As the code currently stands, I think it would actually give incorrect outputs when decoding a batch in which each element has different decoder prefix/prompt tokens. Computing position_ids from the attention mask would also allow for either left or right padding.
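
Concretely, the kind of computation I mean is something like this (a rough sketch, not tied to any particular model class):

import jax.numpy as jnp

def position_ids_from_attention_mask(attention_mask):
    # Cumulative count of real (non-padding) tokens, shifted to start at 0;
    # padding positions are clamped to 0 and masked out elsewhere anyway.
    position_ids = jnp.cumsum(attention_mask, axis=-1) - 1
    return jnp.clip(position_ids, 0)

# Example: two prompts of different lengths, left-padded.
mask = jnp.array([[0, 0, 1, 1, 1],
                  [1, 1, 1, 1, 1]])
print(position_ids_from_attention_mask(mask))
# [[0 0 0 1 2]
#  [0 1 2 3 4]]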

I have another Flax Whisper implementation with .from_pretrained(..., from_pt=True) working correctly and giving correct outputs for variable-length prompts, which I'd be happy to share (or create a separate PR for). It also adds some things to the generation utilities to support prompt tokens for the decoder that already exist in the PyTorch utilities (using the prompt tokens instead of model.config.decoder_start_token_id if specified).

@ydshieh
Collaborator

ydshieh commented Nov 25, 2022

I haven't looked into this yet. But @andyehrenberg, are you suggesting a different way of computing this in Flax Whisper than the one implemented in our PyTorch/TensorFlow Whisper?

It would also be better for @kamalkraj to say whether he would like to continue this PR before we go ahead.

@kamalkraj
Contributor Author

@ydshieh @andyehrenberg

If there is already a working implementation, please go ahead with it.
I am closing this one.

Thanks

@kamalkraj kamalkraj closed this Nov 25, 2022
@andyehrenberg
Contributor

@ydshieh I guess what I'm suggesting here could also be helpful for the PyTorch/TF implementations, to improve flexibility/compatibility with existing codebases that use position_ids for other models (such as when fine-tuning).

For example, the use case I'm working on is fine-tuning Whisper with RL (trying to expose it to its own outputs to reduce hallucinations). At each step when collecting rollouts, it is given a batch of audio features and decoder prompts (from previous audio snippets) - these prompts are of varying lengths, so padding/attention masks are needed, and the position embeddings need to adjust accordingly. Then, when doing PPO updates on these steps, the position embeddings need to be computed correctly based on which timesteps (tokens) are padding.

The implementation in this PR wouldn't accommodate this scenario, as it assumes the same past_key_values_length for each sequence in the batch, whereas the implementation I've worked on uses position_ids to keep track of where we are in each sequence of the batch. Earlier I used a different method that relied only on the attention mask along with another caching method in the decoder, but using position_ids is much simpler and accommodates multiple padding schemes.
