I'm finding a surprising bottleneck in beam search generation in vllm 0.2.1.post1. I have one CPU process pegged at 100% while GPU utilization stays below 25%. When I use py-spy to see where the time is going, it shows that vllm/sequence.py:fork calls deepcopy(), and over 80% of my CPU time is spent there. So deepcopy() is clearly the bottleneck for this use case.
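For context, this is roughly what the hot path looks like (paraphrased from memory of vllm/sequence.py in 0.2.x, so treat it as a sketch rather than the exact source): beam search forks a Sequence for every candidate beam, and each fork deep-copies the whole object, including its accumulated token and logprob state, on the CPU.

```python
import copy

class Sequence:
    def fork(self, new_seq_id: int) -> "Sequence":
        # This deepcopy of the full sequence state is where py-spy
        # says the CPU time goes during beam search.
        new_seq = copy.deepcopy(self)
        new_seq.seq_id = new_seq_id
        return new_seq
```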
FWIW, this is with llama2-7b on an A100 80GB. I'm not yet sure whether this is a regression or whether this bottleneck has always been there in vLLM.
Replacing the deepcopy() call with a pickle round-trip, new_seq = pickle.loads(pickle.dumps(self, -1)), doubles GPU utilization, but I think any further improvements will require overriding the copy methods, making the classes serializable with Pydantic, or refactoring this step to avoid copies altogether.
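Concretely, the workaround I tried looks like the sketch below. It assumes Sequence stays picklable; -1 is pickle.HIGHEST_PROTOCOL, which is what makes the round-trip faster than deepcopy for this object graph.

```python
import pickle

class Sequence:
    def fork(self, new_seq_id: int) -> "Sequence":
        # Pickle round-trip instead of copy.deepcopy(); noticeably
        # cheaper for these objects, but still a full copy.
        new_seq = pickle.loads(pickle.dumps(self, pickle.HIGHEST_PROTOCOL))
        new_seq.seq_id = new_seq_id
        return new_seq
```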
Here's a simple example which reproduces the issue: https://gist.github.com/physicsrob/f7bc0be046c01cd6f959966e24022bba
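For anyone who doesn't want to open the gist, the workload is a beam search generation along these lines (this is not the gist itself; the model name, beam width, and prompt count are my own placeholders). Watching CPU vs. GPU utilization while it runs shows the imbalance.

```python
from vllm import LLM, SamplingParams

# Hypothetical repro: beam search with beam width 4 over a batch of prompts.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(
    n=4,
    best_of=4,
    use_beam_search=True,
    temperature=0.0,  # beam search requires temperature 0
    max_tokens=256,
)
outputs = llm.generate(["Summarize the history of GPUs."] * 32, params)
```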