Dynamic number of speculative tokens in order to accelerate speculative decoding #33258
Conversation
Thank you for opening the PR @jmamou 🤗 The idea in the paper is really cool, much better than my scrappy heuristic! 🔥
I've added a few nits in the PR review that should be trivial to solve.
A proposal for a follow-up PR: different flavors of speculative decoding/assisted generation are (mostly) invisible to the user other than execution speed. According to your paper, DISCO should be faster than the current default. Do you think you can find a single default value for `assistant_confidence_threshold` such that it beats the current default in most situations? If so, I'd be more than happy to make DISCO the default 🤗
Thanks @gante for your feedback. Concerning the follow-up PR, I have run experiments with vicuna-13b/vicuna-68m as target/assistant on the Alpaca and CNN-DM datasets, with do_sample set to both True and False. The values of …
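For reference, a minimal sketch of how such a threshold sweep could be timed with `generate`, assuming this PR's `assistant_confidence_threshold` argument. The checkpoints follow the vicuna-13b/vicuna-68m pair mentioned above, but the exact checkpoint names, prompts, and candidate thresholds are illustrative assumptions, not the experimental setup:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target/assistant pair along the lines of the discussion above; the exact
# checkpoint names are assumptions for illustration.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.3")
assistant_model = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m")

prompts = [
    "Summarize the following article: ...",
    "Write a short story about a robot learning to paint.",
]

# Time assisted generation under each candidate threshold.
for threshold in (0.2, 0.3, 0.4, 0.5):
    start = time.perf_counter()
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        model.generate(
            **inputs,
            assistant_model=assistant_model,
            assistant_confidence_threshold=threshold,
            max_new_tokens=128,
            do_sample=False,
        )
    print(f"threshold={threshold}: {time.perf_counter() - start:.2f}s")
```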
Thank you for iterating! To make our CI happy, run `make fixup` and push the changes :)
implicit default value (None) Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
sure!
Thank you for the contribution! The failing test is unrelated to this PR and should now be fixed on main.
@jmamou feel free to open a PR to update the defaults, I'd be glad to accept it! 🤗
Dynamic number of speculative tokens in order to accelerate speculative decoding (huggingface#33258)

* optimal Speculation Lookahead based on probability
* update peer finished condition
* add support to do_sample True
* add stopping criteria
* gitignore
* add print
* remove prints
* minor
* minor
* git ignore
* adding test to stopping ConfidenceCriteria
* doc + format
* add doc
* Update .gitignore
* update docstring and default value of assistant_confidence_threshold
* add docstring
* Update src/transformers/generation/configuration_utils.py: implicit default value (None). Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* style fix

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
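The commit list above mentions a `ConfidenceCriteria` stopping criterion. A minimal sketch of what such a criterion could look like, inferred from the commits rather than copied from the merged code (the actual implementation may differ):

```python
import torch
from transformers import StoppingCriteria

class ConfidenceCriteria(StoppingCriteria):
    """Sketch: stop the assistant's drafting loop once its confidence in the
    most recently generated token falls below a threshold."""

    def __init__(self, assistant_confidence_threshold: float):
        self.assistant_confidence_threshold = assistant_confidence_threshold

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # scores[-1] holds the logits of the last generated token; look up the
        # probability the model assigned to the token that was actually chosen.
        probs = scores[-1].softmax(dim=-1)
        confidence = probs[0, input_ids[0, -1]].item()
        return confidence < self.assistant_confidence_threshold
```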
What does this PR do?
This PR adds support for a dynamic number of speculative tokens in order to accelerate speculative decoding.
It is an unsupervised version of the dynamic speculation lookahead from *Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models*.
We add a new argument, `assistant_confidence_threshold`, to the generation configuration of the model. It is a confidence threshold for the assistant model: if the assistant model's confidence in its prediction for the current token is lower than this threshold, the assistant model stops the current token generation iteration, even if the number of speculative tokens (defined by `num_assistant_tokens`) is not yet reached.
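A minimal usage sketch, assuming this PR is merged; the checkpoints (a small target/assistant pair of the kind used in the assisted-generation docs) and the threshold value are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b-deduped")
assistant_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m-deduped")

inputs = tokenizer("Alice and Bob", return_tensors="pt")

# With assistant_confidence_threshold set, the assistant stops drafting as soon
# as its per-token confidence drops below the threshold, instead of always
# producing num_assistant_tokens candidate tokens per iteration.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    assistant_confidence_threshold=0.4,  # illustrative value, not a tuned default
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```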
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@gante @amyeroberts