I'm finding a surprising bottleneck in beam search generation in vllm 0.2.1.post1. I have one CPU process pegged at 100% while GPU utilization stays below 25%. When I use py-spy to see where the time is going, it shows that vllm/sequence.py:fork calls deepcopy(), and over 80% of my CPU time is spent there. So deepcopy() is clearly the bottleneck for this use case.
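For context, this is roughly what the hot path looks like (paraphrased from memory of vllm/sequence.py in 0.2.x, so treat it as a sketch rather than the exact source): beam search forks a Sequence for every candidate beam, and each fork deep-copies the whole object, including its accumulated token and logprob state, on the CPU.

```python
import copy

class Sequence:
    def fork(self, new_seq_id: int) -> "Sequence":
        # This deepcopy of the full sequence state is where py-spy
        # says the CPU time goes during beam search.
        new_seq = copy.deepcopy(self)
        new_seq.seq_id = new_seq_id
        return new_seq
```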
FWIW, this is with llama2-7b on an A100 80GB. I'm not yet sure whether this is a regression or whether this bottleneck has always been there in vLLM.
Replacing the deepcopy() call with a pickle round-trip, new_seq = pickle.loads(pickle.dumps(self, -1)), doubles GPU utilization, but I think any further improvements will require overriding the copy methods, making the classes serializable with Pydantic, or refactoring this step to avoid copies altogether.
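Concretely, the workaround I tried looks like the sketch below. It assumes Sequence stays picklable; -1 is pickle.HIGHEST_PROTOCOL, which is what makes the round-trip faster than deepcopy for this object graph.

```python
import pickle

class Sequence:
    def fork(self, new_seq_id: int) -> "Sequence":
        # Pickle round-trip instead of copy.deepcopy(); noticeably
        # cheaper for these objects, but still a full copy.
        new_seq = pickle.loads(pickle.dumps(self, pickle.HIGHEST_PROTOCOL))
        new_seq.seq_id = new_seq_id
        return new_seq
```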
Here's a simple example which reproduces the issue: https://gist.github.com/physicsrob/f7bc0be046c01cd6f959966e24022bba
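For anyone who doesn't want to open the gist, the workload is a beam search generation along these lines (this is not the gist itself; the model name, beam width, and prompt count are my own placeholders). Watching CPU vs. GPU utilization while it runs shows the imbalance.

```python
from vllm import LLM, SamplingParams

# Hypothetical repro: beam search with beam width 4 over a batch of prompts.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(
    n=4,
    best_of=4,
    use_beam_search=True,
    temperature=0.0,  # beam search requires temperature 0
    max_tokens=256,
)
outputs = llm.generate(["Summarize the history of GPUs."] * 32, params)
```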