speculative : PoC for speeding-up inference via speculative sampling #2926
I tested this PR with 70b f16 and 7b q8_0. When using CPU only and the default sampling parameters the average t/s increases from 0.44 to 0.52:
|
With 70b q6_K and 7b q8_0 on 3x P40 the performance is 3.63 t/s, which is only ~half of what I get with regular inference. The problem is most likely that the CUDA code that I wrote has not been optimized for this use case. I would expect the performance to end up better given the right optimizations though. |
Yes, so far I observe that this strategy is most effective for code generation with ~2x speedup for 34B/7B and ~1.5x speedup for 13B/7B pairs using If you try to generate free-form text, then the acceptance rate drops significantly and the method does not offer any benefit. I'm still tweaking, but my gut feeling is that this might be very efficient for cases where we have a very constrained grammar. |
Even for free-form text I would expect there to be quite a large speedup if you have a weak GPU and the CLI allows you to set the GPU layers for the draft and the target model separately. If you can fully offload the draft model it's essentially being evaluated instantaneously compared to the larger model on the CPU so even an acceptance rate of only 33% should translate to +50% t/s.
I would expect this technique to also work very well for cases where you have a lot of unconventional terms that consist of multiple tokens: in those situations the first token of such a term is almost always followed by the other tokens of the term. So I would expect large performance gains for program code and non-English languages. |
Out of curiosity I tested LLaMA 2 7b q8_0 with itself as a draft model. With free-form text and the default sampling parameters it had an acceptance rate of 37%. Meanwhile, when I used 7b q8_0 as the draft model for 70b f16 with If I understand the current implementation correctly, the draft model always chooses the token with the highest probability for creating the draft. But maybe you could get a higher acceptance rate by sampling from the draft and the target model in the exact same way (including the same RNG seed)? @charliexchen did you investigate this? |
Yes, that is the case now. There is room for experimentation, although it makes more sense to me to always draft the best token. Btw, when we add batched inference support, we should be able to implement Staged Speculative Decoding which might give some extra boost. Basically, instead of sampling 1 draft sequence, we sample N and then the target sampling can accept from either one of them. Would be an interesting experiment |
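A minimal sketch of the "sample N drafts, accept from any of them" idea, assuming greedy verification and assuming the target's picks for each draft are already available (in practice they would come from a batched evaluation). `Tokens`, `accepted_prefix` and `best_draft` are hypothetical names and not part of this PR.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Tokens = std::vector<int>;

// Number of leading draft tokens that match the target's picks.
std::size_t accepted_prefix(const Tokens& draft, const Tokens& target) {
    std::size_t n = 0;
    while (n < draft.size() && n < target.size() && draft[n] == target[n]) {
        ++n;
    }
    return n;
}

// drafts[i]  : i-th draft sequence proposed by the small model.
// targets[i] : tokens the large model picks at each position of drafts[i].
// Returns {index of the draft with the longest accepted prefix, its length}.
// Ties go to the earliest draft; {0, 0} means nothing was accepted.
std::pair<std::size_t, std::size_t> best_draft(const std::vector<Tokens>& drafts,
                                               const std::vector<Tokens>& targets) {
    std::pair<std::size_t, std::size_t> best{0, 0};
    for (std::size_t i = 0; i < drafts.size() && i < targets.size(); ++i) {
        const std::size_t n = accepted_prefix(drafts[i], targets[i]);
        if (n > best.second) {
            best = {i, n};
        }
    }
    return best;
}
```

The extra drafts only pay off if evaluating them in one batch is not much slower than evaluating a single draft.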
@ggerganov I think an earlier and better paper from CMU called SpecInfer first studied the idea of using multiple models to speculate and tree-like verification. They have an implementation in FlexFlow https://github.com/flexflow/FlexFlow/tree/inference Worth looking at.
|
@JohannesGaessler The random seed doesn't actually matter. However, you should definitely apply the same kind of sampling to both models (temp + top-k) along with the modified rejection scheme from the paper. (EDIT: To clarify, for both vanilla Speculative Sampling and SpecInfer, there is a stochastic resampling algorithm. This should have a higher acceptance rate than greedily sampling the draft) |
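For reference, the modified rejection step can be sketched as follows: a token `x` sampled from the draft distribution `q` is accepted with probability `min(1, p[x] / q[x])`, where `p` is the target distribution. On rejection a replacement is drawn from the normalized residual `max(0, p - q)`. The function name and signature are hypothetical, and `p` and `q` are assumed to be already-normalized probabilities over the same vocab after temp/top-k.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <vector>

// p: target-model probabilities, q: draft-model probabilities (same vocab, both sum to 1).
// x: token that was sampled from q.
// Returns the token to emit and sets `accepted` to whether the draft token was kept.
int accept_or_resample(const std::vector<double>& p, const std::vector<double>& q,
                       int x, std::mt19937& rng, bool& accepted) {
    if (p.size() != q.size() || x < 0 || static_cast<std::size_t>(x) >= p.size() || q[x] <= 0.0) {
        throw std::invalid_argument("p and q must match and x must have q[x] > 0");
    }
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    // Accept the draft token with probability min(1, p[x] / q[x]).
    if (unif(rng) < std::min(1.0, p[x] / q[x])) {
        accepted = true;
        return x;
    }

    // Rejected: resample from the residual max(0, p - q).
    // discrete_distribution normalizes the weights for us. The residual has
    // positive mass here because rejection implies p[x] < q[x].
    accepted = false;
    std::vector<double> residual(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
        residual[i] = std::max(0.0, p[i] - q[i]);
    }
    std::discrete_distribution<int> dist(residual.begin(), residual.end());
    return dist(rng);
}
```

When the two distributions are identical every draft token is accepted, which is why applying the same sampling settings to both models matters more than sharing a seed.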
@zhisbug Thanks for mentioning our work! @ggerganov I'm one of the authors of the SpecInfer paper (https://arxiv.org/abs/2305.09781) and a lead contributor of FlexFlow Serve, a distributed framework for LLM inference. I'm really glad to see so much interest in speculative decoding techniques from the community, both in terms of new arXiv paper uploads and integrations with existing open-source projects. If you or someone else wants to take a look at how we implemented the key ideas in our paper, this is the file to look at in our repo: request_manager.cc. Overall, FlexFlow Serve, which is also implemented in C++, is currently 1.3-2.4× faster than existing distributed LLM inference systems and 2.6-3.5× faster than offloading-based inference frameworks |
Pardon my ignorance, but, you need two whole models to be loaded in for this to work yea? And I assume the second 'draft' model can't be fully outside of VRAM if it's gonna provide decent speed ups... |
The point is that one of the models is much smaller than the main model and can be used to avoid running a full generation on the large model. The more tokens you can skip running the big model for, the bigger the speedup. |
Unless I'm misunderstanding something you don't actually skip any tokens for the large model. Instead you first write a draft with the small model one token at a time. Then you pass all of those tokens at once to the larger model to validate the draft and use as many tokens from the draft as were correctly predicted. |
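The draft-then-validate step described above can be sketched like this for the greedy case. `verify_draft` is a hypothetical helper, and the target's per-position picks are assumed to come from a single batched evaluation of the prompt plus the draft.

```cpp
#include <cstddef>
#include <vector>

// draft  : tokens proposed by the small model, one at a time.
// target : target[i] is the token the large model picks at draft position i,
//          given the prompt plus draft[0..i-1]; it may hold one extra entry
//          for the position right after the draft.
// Returns the tokens to append to the output: the correctly predicted prefix
// of the draft, followed by the large model's own token at the first mismatch
// (or at the extra position if the whole draft was accepted).
std::vector<int> verify_draft(const std::vector<int>& draft, const std::vector<int>& target) {
    std::vector<int> out;
    std::size_t i = 0;
    while (i < draft.size() && i < target.size() && draft[i] == target[i]) {
        out.push_back(draft[i]);
        ++i;
    }
    if (i < target.size()) {
        out.push_back(target[i]);  // the large model always contributes one token
    }
    return out;
}
```

Every call produces at least one token whenever `target` is non-empty, so in this sketch the output is always the same as plain greedy decoding with the large model; only the number of large-model evaluations changes.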
My explanation wasn't the best but the overall effect is that you don't have to run a full generation from the large model per token, like you would without the speculative sampling. The effects are pretty much the same as skipping running the evaluation of the big model for some of the tokens. |
May I suggest a wild idea? How feasible is it to train the speculative little model with the answers of the large model, ON THE FLY, with the weight differences cached on disk after every use? Say, you run a 7B float16 (does it fit in 24GB?) on a 4090 and a 70B 8bit on CPU. Unrelated to the above, question : I'm really excited for this feature, it will bridge the gap for us 'GPU-poor', and it's something that will set apart this project in performance and capacity for bigger models on the same system versus exllama that's hard capped by GPU VRAM capacities. Speculative execution I think is more important on PC side, where a 3060 12GB might really boost a 34B or even 70B model into usable speeds on the CPU. Top notch work on llamacpp guys. |
@ejones Adding grammar support to this example almost works, but we are missing a way to restore the grammar state to a previous state. To clarify this, we need 2 grammar contexts - one for the small "draft" model and one for the big "target" model. I think it's an easy fix - I can add a Is this code correct, or am I misunderstanding:

    struct llama_grammar * llama_grammar_copy(const struct llama_grammar * grammar) {
        llama_grammar * result = new llama_grammar{ grammar->rules, grammar->stacks, grammar->partial_utf8 };

        // redirect elements in stacks to point to new rules
        for (size_t is = 0; is < result->stacks.size(); is++) {
            for (size_t ie = 0; ie < result->stacks[is].size(); ie++) {
                for (size_t ir0 = 0; ir0 < grammar->rules.size(); ir0++) {
                    for (size_t ir1 = 0; ir1 < grammar->rules[ir0].size(); ir1++) {
                        if (grammar->stacks[is][ie] == &grammar->rules[ir0][ir1]) {
                            result->stacks[is][ie] = &result->rules[ir0][ir1];
                        }
                    }
                }
            }
        }

        return result;
    }
|
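To show why the redirect loop is needed, here is a self-contained toy version of the same copy. `Elem`, `ToyGrammar` and `copy_grammar` are hypothetical stand-ins and not the llama.cpp types. Copying the stacks copies raw pointers, so without the fix-up the copy would still point into the original's rules.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Elem {
    int type;
    int value;
};

struct ToyGrammar {
    std::vector<std::vector<Elem>>        rules;
    std::vector<std::vector<const Elem*>> stacks;  // pointers into `rules`
};

// Deep-copies `src` so the copy can be advanced or discarded independently.
std::unique_ptr<ToyGrammar> copy_grammar(const ToyGrammar& src) {
    auto dst = std::make_unique<ToyGrammar>();
    dst->rules  = src.rules;
    dst->stacks = src.stacks;  // still pointing into src.rules at this point

    // Redirect every stack element to the matching element of the copied rules.
    for (std::size_t is = 0; is < dst->stacks.size(); ++is) {
        for (std::size_t ie = 0; ie < dst->stacks[is].size(); ++ie) {
            for (std::size_t ir0 = 0; ir0 < src.rules.size(); ++ir0) {
                for (std::size_t ir1 = 0; ir1 < src.rules[ir0].size(); ++ir1) {
                    if (src.stacks[is][ie] == &src.rules[ir0][ir1]) {
                        dst->stacks[is][ie] = &dst->rules[ir0][ir1];
                    }
                }
            }
        }
    }
    return dst;
}
```

With a copy like this, the draft side could snapshot the grammar before drafting and fall back to the snapshot when drafted tokens are rejected. That usage pattern is an assumption about how the copy would be applied here.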
@ggerganov I came across this repo, https://github.com/FasterDecoding/Medusa, which discusses their approach vs speculative decoding |
Quick question: |
This is to prevent using drastically incompatible vocabs - you can increase the limit if you know what you are doing |
I'm wondering if this PR supports batched speculative decoding? What if each sequence in a batch has a different length of accepted draft tokens? |
ref: #2030
Initial results with the following config indicate a 2x speed-up:
Code Llama 34B F16
Code Llama 7B Q4_10
Todo:
- main and speculative
- n_draft parameter to CLI
Usage:
In some cases (e.g. low temperature code generation), this clocks at ~25 t/s for a full-precision F16 34B model on M2 Ultra.
speculative-1.mp4
speculative-2.mp4
speculative-0.mp4