Speculative Decoding for Whisper #1704
Conversation
Nice!
Very cool 🔥
_blog.yml
Outdated
@@ -3181,3 +3181,13 @@
- nlp
- llm
- transformers
- local: whisper-spec-dec
I wonder if whisper-speculative-decoding would be better for SEO (not sure tbh)
Kept it short as per the instructions here, but agree that the full words are probably better for indexing!
Suggested change:
- local: whisper-spec-dec
- local: whisper-speculative-decoding
Maybe @osanseviero has better knowledge of this :)
I renamed to whisper-speculative-decoding since I agree it'll probably be more visible this way: 08903e4
whisper-spec-dec.md
Outdated
output, gen_time = generate_with_time(model, inputs, language="nl", task="transcribe")
all_time += gen_time
predictions.append(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])
references.append(processor.tokenizer._normalize(sample["normalized_text"]))
Do we need to call a "private" method here?
Opened a PR to make the method public here: huggingface/transformers#28136 (comment)
Otherwise, we can instantiate the normalizer separately, but that's a bit more convoluted.
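For reference, the "instantiate the normalizer separately" route could look roughly like the sketch below. It assumes the multilingual BasicTextNormalizer from transformers (since the sample in the excerpt is Dutch) and reuses the predictions/references lists from the snippet above; it's an illustration, not the code from the post.

```python
# Rough sketch of instantiating the normalizer directly instead of calling
# processor.tokenizer._normalize (an assumption, not the code from the post).
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# decode without tokenizer-side normalization, then normalize explicitly
prediction = processor.batch_decode(output, skip_special_tokens=True)[0]
predictions.append(normalizer(prediction))
references.append(normalizer(sample["normalized_text"]))
```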
whisper-spec-dec.md
Outdated
It is worth noting that the largest speed gains with speculative decoding come with a batch size of 1. For batched
speculative decoding, all candidate tokens **across the batch** must match the validation tokens in order for the tokens
to be accepted. If a token in the batch at a given position does not agree, all candidate tokens that follow that position
are discarded. Consequently, speculative decoding favours lower batch sizes. In practice, we find that speculative decoding
Interesting! I fail to visualize why we can't accept irregular sequences; I'll look at the code to get a better understanding.
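A toy illustration of the constraint (my own sketch, not the transformers implementation): the generated batch is kept as one rectangular tensor, so the number of candidate tokens accepted at each step is the minimum match length over the batch, and any sequence that matched further ahead is truncated back to that minimum.

```python
import torch

# Toy sketch of batched acceptance in speculative decoding:
# candidate tokens from the assistant vs. the tokens the main model verifies.
candidates = torch.tensor([[5, 9, 2, 7],
                           [5, 9, 2, 7]])
verified   = torch.tensor([[5, 9, 2, 7],    # sequence 0: all 4 candidates agree
                           [5, 9, 4, 7]])   # sequence 1: disagreement at position 2

# Match length per sequence: number of leading positions where candidate == verified
match_lengths = (candidates == verified).int().cumprod(dim=-1).sum(dim=-1)

# The batch stays rectangular, so only the minimum match length can be accepted
n_accepted = match_lengths.min().item()
print(match_lengths.tolist(), n_accepted)  # [4, 2] -> both sequences keep only 2 tokens
```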
Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
0c2c9c4 to 1e22a48 (Compare)
Blog post and accompanying Google Colab for speculative decoding with the Whisper model.
The blog post provides a more in-depth explanation of speculative decoding, along with some nice animations. The Google Colab is a more streamlined version that can be run end-to-end. Now that we have PyTorch SDPA in Transformers, we can also leverage flash attention to get the reported 2x speed-up on a Google Colab free-tier T4 GPU.
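For anyone skimming this thread, a minimal end-to-end sketch of the setup described above: Whisper large-v2 verified against a Distil-Whisper assistant, in fp16 with SDPA attention. The model and dataset names are assumptions based on the usual Distil-Whisper pairing, not taken from the Colab itself.

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

# fp16 assumes a CUDA device such as the free-tier T4
device = "cuda" if torch.cuda.is_available() else "cpu"

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16, attn_implementation="sdpa"
).to(device)
assistant_model = WhisperForConditionalGeneration.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=torch.float16, attn_implementation="sdpa"
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features.to(device, dtype=torch.float16)

# Passing assistant_model to generate() enables speculative decoding
output = model.generate(input_features, assistant_model=assistant_model)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```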