[Discussion] Will vLLM consider using Speculative Sampling to accelerating LLM decoding? #1171
Comments
I'm actually trying to integrate Medusa into vLLM. After some quick and dirty experiments, I found that what really makes the integration hard is PagedAttention itself, which, by the way, is one of the core features of vLLM. The whole vLLM system is built on the assumption that in the decoding phase there will be ONE newly generated token, which ties to ONE KV cache block, for each sequence. With speculative decoding methods like Medusa, this assumption no longer holds. Though it's hard, I believe it's totally doable. Here are some places we might want to tweak:
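To make the one-token-per-step assumption concrete, here is a minimal sketch of a per-sequence block table. It assumes a fixed block size; `BlockTable`, `append_tokens`, and `truncate` are hypothetical names for illustration and are not vLLM's API. With one new token per step, at most one new block is ever needed. With several speculative tokens per step, a single step may need multiple blocks, and rejected tokens have to be rolled back afterwards.

```python
class BlockTable:
    """Toy per-sequence KV block table (hypothetical, not vLLM's implementation)."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.num_tokens = 0
        self.blocks = []          # ids of the blocks owned by this sequence
        self._next_block_id = 0   # stand-in for a real block allocator

    def _blocks_needed(self, num_tokens):
        # Ceiling division: how many blocks are needed to hold num_tokens.
        return -(-num_tokens // self.block_size)

    def append_tokens(self, n):
        """Reserve KV slots for n new tokens; return the newly allocated block ids."""
        self.num_tokens += n
        new_blocks = []
        while len(self.blocks) < self._blocks_needed(self.num_tokens):
            self.blocks.append(self._next_block_id)
            new_blocks.append(self._next_block_id)
            self._next_block_id += 1
        return new_blocks

    def truncate(self, n):
        """Drop the last n tokens (e.g. rejected drafts) and release unused blocks."""
        self.num_tokens -= n
        del self.blocks[self._blocks_needed(self.num_tokens):]
```

A speculative step would call `append_tokens(k)` before the forward pass and `truncate(k - accepted)` after verification, whereas ordinary decoding only ever calls `append_tokens(1)`.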
Working on a PoC demo (only supports single running query), hopefully could share some new thoughts and findings later. |
How is your testing going @void-main ? |
Still working on it. It needs some modification to PagedAttention, which is the core of vLLM. @Data-drone |
Is there a branch I can have a look at? |
Currently the code change is in a private repo in my company. Later we'd like to release the working version. |
With input_token_len = 450 and output_token_len = 150, the time for the first (prompt) step and the time for the second (generation) step are about 1:1. So when the draft model's accuracy is 96%, the speedup is 20%; only at an accuracy of 30% or more is the additional cost covered. |
Thanks @void-main for sharing the progress on porting Medusa. I am porting Speculative Decoding into vLLM. After some quick and dirty code, I also found that the main blocker is PagedAttention. More precisely, in speculative decoding mode, more than ONE token needs to be taken as input when a KV cache already exists. However, PagedAttention only supports two situations:
The main blocker for the porting work is at the decoding stage. I need to create a new kernel that can take more than one newly generated token plus the existing KV cache as input. The new kernel may differ substantially from the paged_attention_v1/v2 kernels. HuggingFace Transformers and llama.cpp already support Speculative Decoding or similar methods, and I have found more and more vLLM users considering Speculative Decoding to accelerate LLM inference. I hope we can discuss freely here; any suggestions are welcome. @WoosukKwon @zhuohan123 @casper-hansen @Yard1 |
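What such a kernel has to compute can be sketched in a few lines of plain Python. This is only a reference for the math, assuming a single head and an unpaged cache held in ordinary lists; `multi_token_decode_attention` is a hypothetical name and has nothing to do with the actual paged_attention_v1/v2 kernels. Each new token attends to every cached position and, causally, to the new tokens up to and including itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_token_decode_attention(queries, cached_k, cached_v, new_k, new_v):
    """Single-head attention for several new tokens on top of an existing KV cache.

    New token i sees all cached positions plus new tokens 0..i.
    All arguments are lists of equal-length float vectors.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        keys = cached_k + new_k[: i + 1]
        values = cached_v + new_v[: i + 1]
        weights = softmax([dot(q, k) * scale for k in keys])
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs
```

With a single query this reduces to ordinary one-token decoding; with the whole prompt as queries and an empty cache it reduces to the prompt phase.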
You're right on the decoding stage, but one more thing, when doing medusa SpS, the |
Recent update: I got a working demo version with vLLM + Medusa, for a single sentence, the average accept length is Here's the demo video: Left part is vLLM + Medusa, right side part is pure vLLM. The result is pretty interesting, |
Notes on the PagedAttention only works for decoding stages where you generate 1 token for each sequence, so you could use CUDA core to calculate attention score for each sequence. But with tree candidates from medusa, for each sequence, you need to process ~7-30 candidates. Sticking to CUDA core would make it too slow to get any benefits. But, wait a sec, processing many tokens at once, that's what I'd like to say it's a pretty fun journey to implement |
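For readers unfamiliar with tree candidates: in tree attention, each candidate token may attend only to itself and its ancestors in the candidate tree (plus the already cached prefix, which is not shown here). A minimal sketch of building that mask, assuming the tree is given as a list of parent indices; `tree_attention_mask` is a hypothetical helper, not code from Medusa or vLLM.

```python
def tree_attention_mask(parents):
    """Build a boolean mask over candidate tokens arranged in a tree.

    parents[i] is the index of candidate i's parent, or -1 if candidate i
    hangs directly off the last accepted token.
    mask[i][j] is True when candidate i may attend to candidate j.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask
```

For example, `tree_attention_mask([-1, 0, 0, 1])` describes a root candidate with two children, one of which has a child of its own; the last candidate sees candidates 0, 1, and 3 but not its "uncle", candidate 2.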
And I totally agree with @zhaoyang-star , vLLM is a great framework, but the whole framework is based on the assumption that each forward pass generates 1 token. Maybe later we should propose an RFC (maybe named |
Hi, does Medusa or speculative decoding support top-p or top-k sampling? |
Speculative decoding supports temperature/top-k/top-p. |
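The reason this works is that the accept/reject rule of speculative sampling is defined for any pair of distributions, so temperature, top-k, and top-p can simply be applied to the target and draft distributions before verification. Below is a minimal sketch of that rule for one drafted token; `verify_draft_token` is a hypothetical name, and the two probability lists are assumed to have already been adjusted by whatever sampling parameters are in use.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng=random):
    """Speculative sampling check for one drafted token.

    p_target / q_draft: probabilities over the same vocabulary from the
    target and draft models (after temperature / top-k / top-p).
    Returns (accepted, token_to_emit).
    """
    p, q = p_target[token], q_draft[token]
    # Accept with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the renormalized residual max(p - q, 0).
    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        residual = list(p_target)  # degenerate case: fall back to the target
    new_token = rng.choices(range(len(residual)), weights=residual)[0]
    return False, new_token
```

When the draft and target distributions agree, every drafted token is accepted; the more they disagree, the more often the residual resampling path is taken.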
I'm impressed with your excellent work. May I inquire about the current progress? Has speculative decoding been implemented in vLLM? |
As far as I understand, the medusa approach requires training of the added attention heads to be used for look-ahead. This makes it much harder to support a wide variety of models. Starting with simple n-gram may be best, just to get the feature out and give some speedup. |
Thank you. Currently, I am only aware that the performance bottleneck of vLLM lies in the decoding stage. Based on your experience, if I, as an individual, want to enhance the performance of vLLM specifically for Llama, are there any feasible solutions to achieve better results? |
@Lvjinhong it's possible to store n-grams, taken either from the prompt or from the generated text. On every forward pass, you can check whether any stored n-gram matches and helps you guess the following n tokens. You then include those tokens in the forward pass and can keep all of them if they are correct (or some of them if they are partly correct). If none are correct, you can at least use those tokens to add to your n-gram list. I believe this is what TGI does with the --speculate flag. |
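A minimal sketch of the idea described above, assuming tokens are plain integers and the "n-gram store" is just the sequence so far (prompt plus generated text). `propose_from_ngrams` and `count_accepted` are hypothetical names; this is not how TGI or any other framework actually implements it.

```python
def propose_from_ngrams(tokens, ngram_size=2, max_draft=3):
    """Guess the next tokens by finding an earlier occurrence of the last n-gram.

    tokens: prompt + generated token ids so far.
    Returns up to max_draft tokens that followed the most recent earlier match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            end = start + ngram_size
            return tokens[end:end + max_draft]
    return []


def count_accepted(draft, target_tokens):
    """Number of leading draft tokens that match what the model actually chose."""
    accepted = 0
    for guessed, actual in zip(draft, target_tokens):
        if guessed != actual:
            break
        accepted += 1
    return accepted
```

The drafted tokens are appended to the input of the next forward pass; `count_accepted` then decides how many of them can be kept, which covers both the "all correct" and "partly correct" cases.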
Hi, I was wondering why speculative decoding supports temperature/top-k/top-p sampling? |
Hi, how is the performance for multiple sentences (e.g. batch size = 32/64)? |
Hi @Moran232 , Medusa performs worse for large batch sizes, here's my test result: Medusa beats vLLM on small batches (BS < 8), but fails on larger batches. |
You might want to know that @cadedaniel is working on a PR that introduces a framework to score and verify draft tokens. That would allow vLLM to benefit from speculative decoding from Medusa or directly from your target model's ngrams w/ vLLM in #2188 🔥 |
Hello @RonanKMcGovern! Do the stored n-grams you mentioned mean the same thing as the n-grams mentioned in Lookahead Decoding? In other words, assuming n=3, abc, def, and xyz are stored in the n-gram list. When my prompt is forwarded and I get 123789a, can I directly guess the output as 123789abc based on the 3-gram list? |
I believe the TGI implementation does not use the Jacobi method. It is a bland build of ngrams using both the prompt AND tokens generated to date. I have to admit I don't grasp exactly how they build the ngrams. It may be simple pattern matching of past sequences to the latest token.
|
Thank you very much for your answer! |
Feature request for this #1023 |
Sampling is an already known bottleneck of vLLM (see #421 and #670). Last weekend I saw a project named Medusa; its blog introduces a new, simple decoding method to accelerate LLM generation, and it reaches good performance. As far as I know, lepton.ai is already using this method.
Adopting Medusa heads is not difficult, since there is no separate model. But tree attention and the typical acceptance scheme are not standard in most LLM inference frameworks and would take a huge effort to support.
Any advice or comments?