[Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561
Comments
@Yard1 has a good point: the north star here is that we can CUDA-graph the proposal method and use an on-device mechanism to prepare inputs for the next forward pass. We should get there iteratively, as it should also work for other proposal types (e.g. EAGLE, which uses a different flavor of prepare inputs).
Thanks for writing it up! I'll get started on this.
I'll be working on (2).
Hi @cadedaniel @alugowski, I have been looking into the CPU serialization overhead part of this. Has any work already started on it? I'm available to collaborate, so let me know how I can help.
#6338 is covering this now :)
Closed by @alexm-neuralmagic and @comaniac |
Background
Speculative decoding leverages the ability to cheaply generate proposals and cheaply verify them to achieve a speedup for memory-bound inference. Different methods of speculative decoding explore the frontier between cost of proposal, alignment with the target model, and cost of verification.
For example, Medusa produces very cheap proposals, but their quality is strictly lower than EAGLE's because the heads do not have access to the previous proposals. EAGLE, on the other hand, pays more for its proposals by sampling autoregressively instead of in one shot, but it brings the benefit of higher-quality proposals.
At the end of the day, what the user cares about will dictate which speculative technique is used. vLLM's job is to provide the option that gives the best speedup for their use case.
Draft-model, EAGLE, and MLPSpeculator rely on autoregressive proposals. This means their top-1 proposals are higher quality than Medusa's, which gives vLLM an ITL reduction that is more FLOPs-efficient than Medusa's. This is what our speculative decoding efforts focus on first; afterward, we can support top-k proposals with Medusa so that users who care more about ITL reduction can use vLLM.
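For concreteness, here is a minimal sketch of the propose-then-verify loop that all of these methods share, assuming greedy sampling and toy draft_model / target_model callables that return logits of shape [batch, seq, vocab]; this is an illustration, not vLLM's implementation.

```python
import torch


def speculative_step(draft_model, target_model, input_ids: torch.Tensor, k: int) -> torch.Tensor:
    """One propose/verify iteration of greedy speculative decoding (illustrative only)."""
    # Propose: autoregressively sample k cheap draft tokens. This is the part that
    # draft-model / EAGLE / MLPSpeculator perform, and the part this issue speeds up.
    ctx = input_ids
    for _ in range(k):
        logits = draft_model(ctx)[:, -1, :]                       # [batch, vocab]
        ctx = torch.cat([ctx, logits.argmax(-1, keepdim=True)], dim=-1)
    draft = ctx[:, input_ids.shape[1]:]                           # [batch, k]

    # Verify: a single target forward pass scores all k draft positions at once.
    target_logits = target_model(ctx)[:, input_ids.shape[1] - 1:-1, :]
    target_pred = target_logits.argmax(-1)                        # [batch, k]

    # Accept the longest prefix of draft tokens matching the target's greedy choice
    # (a simplified, batch-wide acceptance rule instead of rejection sampling).
    matches = (target_pred == draft).int().cumprod(dim=-1)
    n_accept = int(matches.sum(dim=-1).min())
    return torch.cat([input_ids, draft[:, :n_accept]], dim=-1)
```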
Speed up autoregressive proposal methods
This issue is to speed up autoregressive proposal methods by optimizing the sampler. Specifically, the sampler performs wasted work by copying sampled values to CPU and serializing them into Python objects. In speculative decoding, we never use the Python objects because we consume the raw sampled token ids / probabilities from their GPU tensors. This means that the copy and CPU serialization are pure overhead in speculative decoding.
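To make "pure overhead" concrete, here is a hedged sketch of the two output paths; the flag and field names are hypothetical, not vLLM's actual sampler API.

```python
import torch


def finalize_sampler_output(sampled_token_ids: torch.Tensor, skip_cpu_serialization: bool):
    """Illustrative only. `sampled_token_ids` is a [num_seqs] tensor living on the GPU."""
    if skip_cpu_serialization:
        # Spec-decode path: hand back the raw GPU tensor; the proposer consumes it
        # directly on-device, so there is no device-to-host copy and no Python objects.
        return {"sampled_token_ids_gpu": sampled_token_ids}

    # Default path: device-to-host copy (which forces a sync with the GPU) followed by
    # per-sequence Python object construction -- wasted work for speculative decoding.
    token_ids = sampled_token_ids.cpu().tolist()
    return {"per_seq_outputs": [{"token_id": t} for t in token_ids]}
```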
How much overhead?
In profiling vLLM, I found that the copy + serialization in the draft model takes ~441µs (cell J30). Note that the actual forward pass and sampling math of the draft model take (220µs + 639µs) = 859µs, so each draft step currently costs ~1300µs with serialization versus ~860µs without it. This means that by removing the unnecessary copy and serialization, we can get ~50% more draft tokens in the same amount of time.
This difference has a massive impact on the overall performance of speculative decoding.
Furthermore, the subsequent draft model forward pass must consume the output of the previous step. This allows us to reduce time spent in `prepare_inputs`. I don't have numbers here, but I expect a further ~150µs reduction per draft model step from this (~300µs to ~150µs).

The work
This issue is to:
1. Make the CPU serialization in the sampler optional.
2. Allow the on-GPU sampled token ids to be consumed by the `prepare_inputs` of the next draft model forward pass.

1. Make CPU serialization optional
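As a rough sketch of how such an option could be exposed (the names SamplerOptions, disable_logprobs, and skip_cpu_serialization are invented for illustration; the real change would thread a flag from the spec-decode Worker through to the Sampler, and the logprobs knob corresponds to the warm-up task noted below):

```python
from dataclasses import dataclass

import torch


@dataclass
class SamplerOptions:
    """Hypothetical per-worker sampler knobs; not vLLM's actual config."""
    disable_logprobs: bool = False        # warm-up task: skip logprob computation/serialization
    skip_cpu_serialization: bool = False  # this issue: keep sampled token ids on the GPU


def sample(logits: torch.Tensor, options: SamplerOptions):
    """Illustrative control flow only (greedy stand-in for the real sampling math)."""
    sampled = logits.argmax(dim=-1)
    logprobs = logits.log_softmax(dim=-1) if not options.disable_logprobs else None
    if options.skip_cpu_serialization:
        return sampled, logprobs          # both stay on the GPU for the proposer
    return sampled.cpu().tolist(), logprobs
```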
Warm-up task: a good warm-up task to get familiar with the Sampler is to add an option to disable `logprobs` for a given Worker. This will also provide some speedup to spec decode (~2ms e2e step time), but isn't part of this issue.

Code pointers:
2. Allow the `prepare_inputs` method to work on-device

The on-GPU sampled token ids should be appended to the next `prepare_inputs` batch. The `prepare_inputs` method will need to consume inputs from GPU in this case.
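A minimal sketch of what "appended on-device" could look like, with toy shapes and a hypothetical helper name; in vLLM the real `prepare_inputs` also maintains positions and slot mappings, so the actual change is more involved than this concat.

```python
import torch


def prepare_inputs_on_device(prev_input_ids: torch.Tensor,
                             sampled_token_ids: torch.Tensor) -> torch.Tensor:
    """Build the next draft-model batch entirely on the GPU (illustrative only).

    prev_input_ids:    [num_seqs, seq_len] token ids already on the GPU
    sampled_token_ids: [num_seqs] ids sampled by the previous draft step, still on the GPU
    """
    # Append the freshly sampled token to each sequence with a device-side concat;
    # no .cpu(), .tolist(), or Python-object round trip is required.
    return torch.cat([prev_input_ids, sampled_token_ids.unsqueeze(-1)], dim=-1)
```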