[Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561
Comments
@Yard1 has a good point: the north star here is that we can CUDA-graph the proposal method and use an on-device mechanism to prepare inputs for the next forward pass. We should get there iteratively, as it should also work for other proposal types (e.g. EAGLE, which uses a different flavor of prepare inputs).
Thanks for writing it up! I'll get started on this.
I'll be working on (2).
Hi @cadedaniel @alugowski, I have been looking into the CPU serialization overhead part of this. Has any work already started on it? I'm available to collaborate, so let me know how I can help.
#6338 is covering this now :)
Closed by @alexm-neuralmagic and @comaniac |
Background
Speculative decoding leverages the ability to cheaply generate proposals and cheaply verify them to achieve a speedup for memory-bound inference. Different methods of speculative decoding explore the frontier between cost of proposal, alignment with the target model, and cost of verification.
For example, Medusa produces very cheap proposals, but their quality is strictly lower than EAGLE's because the heads do not have access to the previous proposals. EAGLE, on the other hand, pays more for its proposals by sampling autoregressively instead of in one shot, but it brings the benefit of higher-quality proposals.
At the end of the day, what the user cares about will dictate which speculative technique is used. vLLM's job is to provide the option that gives the best speedup for their use case.
Draft-model, EAGLE, and MLPSpeculator rely on autoregressive proposals. This means their top-1 proposals are higher quality than Medusa's, which gives vLLM an ITL reduction that is more FLOPs-efficient than Medusa's. This is what our speculative decoding efforts focus on first; afterward, we can support top-k proposals with Medusa so that users who care more about ITL reduction can use vLLM.
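For concreteness, here is a minimal sketch of the propose-then-verify loop that all of these methods share, assuming greedy sampling and toy draft_model / target_model callables that return logits of shape [batch, seq, vocab]; this is an illustration, not vLLM's implementation.

```python
import torch


def speculative_step(draft_model, target_model, input_ids: torch.Tensor, k: int) -> torch.Tensor:
    """One propose/verify iteration of greedy speculative decoding (illustrative only)."""
    # Propose: autoregressively sample k cheap draft tokens. This is the part that
    # draft-model / EAGLE / MLPSpeculator perform, and the part this issue speeds up.
    ctx = input_ids
    for _ in range(k):
        logits = draft_model(ctx)[:, -1, :]                       # [batch, vocab]
        ctx = torch.cat([ctx, logits.argmax(-1, keepdim=True)], dim=-1)
    draft = ctx[:, input_ids.shape[1]:]                           # [batch, k]

    # Verify: a single target forward pass scores all k draft positions at once.
    target_logits = target_model(ctx)[:, input_ids.shape[1] - 1:-1, :]
    target_pred = target_logits.argmax(-1)                        # [batch, k]

    # Accept the longest prefix of draft tokens matching the target's greedy choice
    # (a simplified, batch-wide acceptance rule instead of rejection sampling).
    matches = (target_pred == draft).int().cumprod(dim=-1)
    n_accept = int(matches.sum(dim=-1).min())
    return torch.cat([input_ids, draft[:, :n_accept]], dim=-1)
```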
Speed up autoregressive proposal methods
This issue is to speed up autoregressive proposal methods by optimizing the sampler. Specifically, the sampler performs wasted work by copying sampled values to CPU and serializing them into Python objects. In speculative decoding, we never use the Python objects because we consume the raw sampled token ids / probabilities from their GPU tensors. This means that the copy and CPU serialization are pure overhead in speculative decoding.
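To make "pure overhead" concrete, here is a hedged sketch of the two output paths; the flag and field names are hypothetical, not vLLM's actual sampler API.

```python
import torch


def finalize_sampler_output(sampled_token_ids: torch.Tensor, skip_cpu_serialization: bool):
    """Illustrative only. `sampled_token_ids` is a [num_seqs] tensor living on the GPU."""
    if skip_cpu_serialization:
        # Spec-decode path: hand back the raw GPU tensor; the proposer consumes it
        # directly on-device, so there is no device-to-host copy and no Python objects.
        return {"sampled_token_ids_gpu": sampled_token_ids}

    # Default path: device-to-host copy (which forces a sync with the GPU) followed by
    # per-sequence Python object construction -- wasted work for speculative decoding.
    token_ids = sampled_token_ids.cpu().tolist()
    return {"per_seq_outputs": [{"token_id": t} for t in token_ids]}
```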
How much overhead?
In profiling vLLM, I found that the copy + serialization in the draft model takes ~441µs (cell J30). Note that the actual forward pass and sampling math of the draft model take (220µs + 639µs) = 859µs, so each draft step currently costs ~1300µs with serialization versus ~860µs without it. This means that by removing the unnecessary copy and serialization, we can get ~50% more draft tokens in the same amount of time.
This difference has a massive impact on the overall performance of speculative decoding.
Furthermore, the subsequent draft model forward pass must consume the output of the previous step. This allows us to reduce time spent in `prepare_inputs`. I don't have numbers here, but I expect a further ~150µs reduction per draft model step from this (~300µs to ~150µs).

The work
This issue is to:
1. Make the CPU serialization in the sampler optional.
2. Allow the on-GPU sampled token ids to be consumed by the `prepare_inputs` of the next draft model forward pass.

1. Make CPU serialization optional
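As a rough sketch of how such an option could be exposed (the names SamplerOptions, disable_logprobs, and skip_cpu_serialization are invented for illustration; the real change would thread a flag from the spec-decode Worker through to the Sampler, and the logprobs knob corresponds to the warm-up task noted below):

```python
from dataclasses import dataclass

import torch


@dataclass
class SamplerOptions:
    """Hypothetical per-worker sampler knobs; not vLLM's actual config."""
    disable_logprobs: bool = False        # warm-up task: skip logprob computation/serialization
    skip_cpu_serialization: bool = False  # this issue: keep sampled token ids on the GPU


def sample(logits: torch.Tensor, options: SamplerOptions):
    """Illustrative control flow only (greedy stand-in for the real sampling math)."""
    sampled = logits.argmax(dim=-1)
    logprobs = logits.log_softmax(dim=-1) if not options.disable_logprobs else None
    if options.skip_cpu_serialization:
        return sampled, logprobs          # both stay on the GPU for the proposer
    return sampled.cpu().tolist(), logprobs
```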
Warm-up task: a good warm-up task to get familiar with the Sampler is to add an option to disable `logprobs` for a given Worker. This will also provide some speedup to spec decode (~2ms e2e step time), but isn't part of this issue.

Code pointers:
2. Allow the `prepare_inputs` method to work on-device

The on-GPU sampled token ids should be appended to the next `prepare_inputs` batch. The `prepare_inputs` method will need to consume inputs from GPU in this case.
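A minimal sketch of what "appended on-device" could look like, with toy shapes and a hypothetical helper name; in vLLM the real `prepare_inputs` also maintains positions and slot mappings, so the actual change is more involved than this concat.

```python
import torch


def prepare_inputs_on_device(prev_input_ids: torch.Tensor,
                             sampled_token_ids: torch.Tensor) -> torch.Tensor:
    """Build the next draft-model batch entirely on the GPU (illustrative only).

    prev_input_ids:    [num_seqs, seq_len] token ids already on the GPU
    sampled_token_ids: [num_seqs] ids sampled by the previous draft step, still on the GPU
    """
    # Append the freshly sampled token to each sequence with a device-side concat;
    # no .cpu(), .tolist(), or Python-object round trip is required.
    return torch.cat([prev_input_ids, sampled_token_ids.unsqueeze(-1)], dim=-1)
```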