Dynamic number of speculative tokens in order to accelerate speculative decoding #33258
Conversation
Thank you for opening the PR @jmamou 🤗 The idea in the paper is really cool, much better than my scrappy heuristic! 🔥
I've added a few nits in the PR review that should be trivial to solve.
A proposal for a follow-up PR: different flavors of speculative decoding/assisted generation are (mostly) invisible to the user other than execution speed. According to your paper, DISCO should be faster than the current default. Do you think you can find a single default value for `assistant_confidence_threshold` such that it beats the current default in most situations? If so, I'd be more than happy to make DISCO the default 🤗
Thanks @gante for your feedback. Concerning the follow-up PR, I have run experiments with vicuna-13b/vicuna-68m as target/assistant on the Alpaca and CNN-DM datasets, with do_sample set to both True and False. The values of …
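For reference, a minimal sketch of how such a threshold sweep could be timed with `generate`, assuming this PR's `assistant_confidence_threshold` argument. The checkpoints follow the vicuna-13b/vicuna-68m pair mentioned above, but the exact checkpoint names, prompts, and candidate thresholds are illustrative assumptions, not the experimental setup:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target/assistant pair along the lines of the discussion above; the exact
# checkpoint names are assumptions for illustration.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.3")
assistant_model = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m")

prompts = [
    "Summarize the following article: ...",
    "Write a short story about a robot learning to paint.",
]

# Time assisted generation under each candidate threshold.
for threshold in (0.2, 0.3, 0.4, 0.5):
    start = time.perf_counter()
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        model.generate(
            **inputs,
            assistant_model=assistant_model,
            assistant_confidence_threshold=threshold,
            max_new_tokens=128,
            do_sample=False,
        )
    print(f"threshold={threshold}: {time.perf_counter() - start:.2f}s")
```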
Thank you for iterating! To make our CI happy, run `make fixup` and push the changes :)
implicit default value (None) Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
sure!
Thank you for the contribution! The failing test is unrelated to this PR and should now be fixed on main.
@jmamou feel free to open a PR to update the defaults, I'd be glad to accept it! 🤗
Dynamic number of speculative tokens in order to accelerate speculative decoding (huggingface#33258)

* optimal Speculation Lookahead based on probability
* update peer finished condition
* add support to do_sample True
* add stopping criteria
* gitignore
* add print
* remove prints
* minor
* minor
* git ignore
* adding test to stopping ConfidenceCriteria
* doc + format
* add doc
* Update .gitignore
* update docstring and default value of assistant_confidence_threshold
* add docstring
* Update src/transformers/generation/configuration_utils.py: implicit default value (None). Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* style fix

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
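The commit list above mentions a `ConfidenceCriteria` stopping criterion. A minimal sketch of what such a criterion could look like, inferred from the commits rather than copied from the merged code (the actual implementation may differ):

```python
import torch
from transformers import StoppingCriteria

class ConfidenceCriteria(StoppingCriteria):
    """Sketch: stop the assistant's drafting loop once its confidence in the
    most recently generated token falls below a threshold."""

    def __init__(self, assistant_confidence_threshold: float):
        self.assistant_confidence_threshold = assistant_confidence_threshold

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # scores[-1] holds the logits of the last generated token; look up the
        # probability the model assigned to the token that was actually chosen.
        probs = scores[-1].softmax(dim=-1)
        confidence = probs[0, input_ids[0, -1]].item()
        return confidence < self.assistant_confidence_threshold
```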
What does this PR do?
This PR adds support for a dynamic number of speculative tokens in order to accelerate speculative decoding.
It is an unsupervised version of the dynamic speculation lookahead from *Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models*.
We add a new argument, `assistant_confidence_threshold`, to the generation configuration of the model. It is a confidence threshold for the assistant model: if the assistant model's confidence in its prediction for the current token is lower than this threshold, the assistant model stops the current token generation iteration, even if the number of speculative tokens (defined by `num_assistant_tokens`) is not yet reached.
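A minimal usage sketch, assuming this PR is merged; the checkpoints (a small target/assistant pair of the kind used in the assisted-generation docs) and the threshold value are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b-deduped")
assistant_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m-deduped")

inputs = tokenizer("Alice and Bob", return_tensors="pt")

# With assistant_confidence_threshold set, the assistant stops drafting as soon
# as its per-token confidence drops below the threshold, instead of always
# producing num_assistant_tokens candidate tokens per iteration.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    assistant_confidence_threshold=0.4,  # illustrative value, not a tuned default
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```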
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@gante @amyeroberts