
[Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561

Closed
cadedaniel opened this issue Jun 14, 2024 · 6 comments
Labels: performance, speculative-decoding

@cadedaniel (Collaborator) commented Jun 14, 2024

Background

Speculative decoding leverages the ability to cheaply generate proposals and cheaply verify them to achieve speedups for memory-bound inference. Different speculative decoding methods explore the frontier between the cost of proposals, their alignment with the target model, and the cost of verification.

For example, Medusa produces very cheap proposals, but their quality is strictly lower than Eagle's because the heads do not have access to the previous proposals. Eagle, on the other hand, pays more for its proposals by sampling autoregressively instead of in one shot, but it gains higher-quality proposals in return.

At the end of the day, what the user cares about dictates which speculative technique is used. vLLM's job is to offer the option that gives the best speedup for their use case.

Draft-model, EAGLE, and MLPSpeculator proposals are autoregressive. This means their top-1 proposals are higher-quality than Medusa's, which gives vLLM an inter-token latency (ITL) reduction that is more FLOPs-efficient than Medusa. This is what our speculative decoding efforts focus on first; afterward, we can support top-k proposals with Medusa so that users who care more about raw ITL reduction can use vLLM as well.

Speed up autoregressive proposal methods

This issue is about speeding up autoregressive proposal methods by optimizing the sampler. Specifically, the sampler performs wasted work by copying sampled values from GPU to CPU and serializing them into Python objects. In speculative decoding we never use those Python objects, because we consume the raw sampled token ids / probabilities directly as GPU tensors. This means the copy and CPU serialization are pure overhead in speculative decoding.
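
To make the wasted work concrete, here is a minimal sketch of the two paths. The function and flag names (finalize_sampler_output, skip_cpu_serialization) are hypothetical, not vLLM's actual sampler API:

```python
import torch

def finalize_sampler_output(sampled_token_ids: torch.Tensor,
                            sampled_probs: torch.Tensor,
                            skip_cpu_serialization: bool = False):
    """Illustrative post-sampling step; names are hypothetical, not vLLM's API."""
    if skip_cpu_serialization:
        # Spec decode consumes the raw GPU tensors directly, so no
        # device-to-host copy or Python object construction is needed.
        return sampled_token_ids, sampled_probs

    # Legacy path: .tolist() forces a GPU->CPU sync and builds Python ints,
    # which is pure overhead when the caller only reads the GPU tensors.
    token_ids_cpu = sampled_token_ids.tolist()
    probs_cpu = sampled_probs.cpu()
    # Stand-in for constructing per-sequence Python output objects.
    return [{"token_id": t, "probs": probs_cpu[i]}
            for i, t in enumerate(token_ids_cpu)]
```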

How much overhead?

In profiling vLLM, I found that the copy + serialization in the draft model take ~441µs (cell J30). Note that the actual forward pass and sampling math of the draft model take (220µs + 639µs) = 859µs. This means that by removing the unnecessary copy and serialization, we can get ~50% more draft tokens in the same time it currently takes with the copy and serialization enabled.
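
To spell out the arithmetic behind that ~50% figure:

$$
\frac{220\,\mu\text{s} + 639\,\mu\text{s} + 441\,\mu\text{s}}{220\,\mu\text{s} + 639\,\mu\text{s}} = \frac{1300\,\mu\text{s}}{859\,\mu\text{s}} \approx 1.51
$$

i.e. roughly 1.5x as many draft steps fit in the same wall-clock budget once the copy and serialization are removed.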

This difference has a massive impact on the overall performance of speculative decoding.

Furthermore, the subsequent draft-model forward pass must consume the output of the previous step. Keeping that output on GPU also lets us reduce the time spent in prepare_inputs. I don't have numbers here, but I expect a further ~150µs reduction per draft-model step from this (~300µs down to ~150µs).

The work

This issue is to:

  1. Make the CPU copy and CPU serialization optional in vLLM's sampler (thus leaving sampled token ids on GPU), and then
  2. Pass those sampled token ids to prepare_inputs of the next draft-model forward pass.

1. Make CPU serialization optional

Warm-up task: a good warm-up task for getting familiar with the Sampler is to add an option to disable logprobs for a given Worker. This would also provide some speedup to spec decode (~2ms off the e2e step time), but it isn't part of this issue.
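
As a rough illustration of that warm-up task (the helper and flag names below are hypothetical, not vLLM's actual config), a per-worker switch could look like:

```python
import torch

def compute_logprobs_if_needed(probs: torch.Tensor,
                               token_ids: torch.Tensor,
                               disable_logprobs: bool):
    """Hypothetical helper: skip logprobs for workers that never read them."""
    if disable_logprobs:
        # Spec-decode workers consume only the GPU token ids / probs, so
        # skipping the gather and its serialization saves ~2ms e2e per step.
        return None
    # Gather the logprob of each sampled token; still on GPU here, any
    # serialization would happen later in the sampler output path.
    return torch.log(probs.gather(-1, token_ids.unsqueeze(-1))).squeeze(-1)
```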

Code pointers:

2. Allow prepare_inputs method to work on-device

The on-GPU sampled token ids should be appended to the next prepare_inputs batch.
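
A minimal sketch of what an on-device prepare_inputs could look like for the draft model; the name prepare_next_draft_inputs and the tensor shapes are illustrative assumptions, not vLLM's actual prepare_inputs signature:

```python
import torch

def prepare_next_draft_inputs(prev_input_ids: torch.Tensor,    # [batch, seq_len]
                              prev_positions: torch.Tensor,    # [batch, seq_len]
                              sampled_token_ids: torch.Tensor  # [batch]
                              ):
    """Hypothetical on-device input preparation for the next draft step."""
    # Append each sequence's newly sampled token entirely on GPU,
    # so no host sync is needed between draft steps.
    next_input_ids = torch.cat(
        [prev_input_ids, sampled_token_ids.unsqueeze(-1)], dim=-1)
    # Positions advance by one for every sequence in the batch.
    next_positions = torch.cat(
        [prev_positions, prev_positions[:, -1:] + 1], dim=-1)
    return next_input_ids, next_positions
```

Keeping the whole draft loop on-device in this way is also a prerequisite for eventually CUDA-graphing the proposal method.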

cadedaniel added the speculative-decoding and performance labels and removed the misc label on Jun 14, 2024
@cadedaniel (Collaborator, Author) commented

@Yard1 has a good point: the north star here is that we can CUDA-graph the proposal method and use an on-device mechanism to prepare the input for the next forward pass. We should get there iteratively, since it should also work for other proposal types (e.g. EAGLE, which uses a different flavor of prepare_inputs).

@alugowski (Contributor) commented

Thanks for writing it up! I'll get started on this.

@comaniac (Collaborator) commented

I'll be working on (2)

@ShantanuVichare commented

Hi @cadedaniel @alugowski, I have been looking into the CPU serialization overhead part of this. Has any work already started on it? I'm available to collaborate, so let me know how I can help.

@comaniac (Collaborator) commented

#6338 is covering this now :)
cc @alexm-neuralmagic

@cadedaniel (Collaborator, Author) commented

Closed by @alexm-neuralmagic and @comaniac
