
Dynamic number of speculative tokens in order to accelerate speculative decoding #33258

Merged · 21 commits · Sep 11, 2024

Conversation

@jmamou (Contributor) commented Sep 2, 2024

What does this PR do?

This PR adds support for a dynamic number of speculative tokens in order to accelerate speculative decoding.
It is an unsupervised version of the dynamic speculation lookahead approach from Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models.
 
We add the argument assistant_confidence_threshold to the model's generation configuration. It is a confidence threshold for the assistant model: if the assistant model's confidence in its prediction for the current token is lower than this threshold, it stops the current token-generation iteration, even before the number of speculative tokens (defined by num_assistant_tokens) has been reached.
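As a rough illustration of the stopping rule (a minimal sketch, not the actual transformers implementation; `draft_tokens` and `sample_next` are hypothetical names):

```python
# Illustrative sketch of the confidence-based early stop: drafting halts once
# the assistant's confidence in a drafted token falls below
# assistant_confidence_threshold, even before num_assistant_tokens drafts
# have been produced.

def draft_tokens(sample_next, num_assistant_tokens, assistant_confidence_threshold=None):
    """Draft up to num_assistant_tokens tokens from the assistant model.

    sample_next is a stand-in for one assistant forward pass: it returns a
    (token, confidence) pair, where confidence is the probability the
    assistant assigns to the token it just drafted.
    """
    drafted = []
    for _ in range(num_assistant_tokens):
        token, confidence = sample_next()
        drafted.append(token)
        if (assistant_confidence_threshold is not None
                and confidence < assistant_confidence_threshold):
            break  # low confidence: end this drafting iteration early
    return drafted

# Toy run: the third draft falls below the 0.4 threshold, so drafting stops
# after 3 tokens instead of the full 20.
stream = iter([("the", 0.9), ("quick", 0.7), ("fox", 0.3), ("jumps", 0.8)])
print(draft_tokens(lambda: next(stream), 20, 0.4))  # → ['the', 'quick', 'fox']
```

The drafted tokens are then verified by the target model as in standard speculative decoding; only the drafting length changes.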

Before submitting

  • [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

@gante @amyeroberts

@gante (Member) left a comment


Thank you for opening the PR @jmamou 🤗 The idea in the paper is really cool, much better than my scrappy heuristic! 🔥

I've added a few nits in the PR review that should be trivial to solve.

A proposal for a follow-up PR: Different flavors of speculative decoding/assisted generation are (mostly) invisible to the user other than execution speed. According to your paper, DISCO should be faster than the current default. Do you think you can find a single default value for assistant_confidence_threshold such that it beats the current default in most situations? If so, I'd be more than happy to make DISCO the default 🤗

@jmamou (Contributor, Author) commented Sep 8, 2024

> Thank you for opening the PR @jmamou 🤗 The idea in the paper is really cool, much better than my scrappy heuristic! 🔥
>
> I've added a few nits in the PR review that should be trivial to solve.
>
> A proposal for a follow-up PR: Different flavors of speculative decoding/assisted generation are (mostly) invisible to the user other than execution speed. According to your paper, DISCO should be faster than the current default. Do you think you can find a single default value for assistant_confidence_threshold such that it beats the current default in most situations? If so, I'd be more than happy to make DISCO the default 🤗

Thanks @gante for your feedback.

Concerning the follow-up PR, I have run experiments with vicuna-13b/vicuna-68m as target/assistant models on the Alpaca and CNN-DM datasets, with do_sample set to both True and False. The values assistant_confidence_threshold=0.4 and num_assistant_tokens=20 are consistently near-optimal.
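A hedged sketch of how those reported values could be passed via the generation config (assumes a transformers version with this PR merged; the checkpoint names are illustrative stand-ins for the vicuna-13b/vicuna-68m setup, not tested here):

```python
# Configuration sketch: enable dynamic speculation with the values reported
# above. Attribute names match this PR; checkpoint names are illustrative.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    num_assistant_tokens=20,             # upper bound on drafted tokens per iteration
    assistant_confidence_threshold=0.4,  # stop drafting below this confidence
)

# Then, with target and assistant models already loaded:
# target.generate(**inputs, assistant_model=assistant,
#                 generation_config=generation_config)
```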

@gante (Member) left a comment


Thank you for iterating! To make our CI happy, run make fixup and push the changes :)

jmamou and others added 2 commits September 11, 2024 12:02
implicit default value (None)

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
@jmamou (Contributor, Author) commented Sep 11, 2024

> Thank you for iterating! To make our CI happy, run make fixup and push the changes :)

Sure!
I am done, except for the "examples_tensorflow" test, which does not seem to be related to my code.

@jmamou jmamou closed this Sep 11, 2024
@jmamou jmamou reopened this Sep 11, 2024
@LysandreJik (Member) left a comment


Thank you for the contribution! The failing test is unrelated to this PR and should now be fixed on main

@LysandreJik LysandreJik merged commit 7a51cbc into huggingface:main Sep 11, 2024
18 of 20 checks passed
@gante (Member) commented Sep 19, 2024

> The values of assistant_confidence_threshold=0.4 and num_assistant_tokens=20 are consistently almost optimal.

@jmamou feel free to open a PR to update the defaults, I'd be glad to accept it! 🤗

@jmamou jmamou deleted the SL branch October 14, 2024 07:28
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
Dynamic number of speculative tokens in order to accelerate speculative decoding (huggingface#33258)

* optimal Speculation Lookahead based on probability

* update peer finished condition

* add support to do_sample True

* add stopping criteria

* gitignore

* add print

* remove prints

* minor

* minor

* git ignore

* adding test to stopping ConfidenceCriteria

* doc + format

* add doc

* Update .gitignore

* update docstring and default value of assistant_confidence_threshold

* add docstring

* Update src/transformers/generation/configuration_utils.py

implicit default value (None)

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* style fix

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
4 participants