[Bug][v1][rocm] cuda graph gets stuck in case padding is used to meet a captured input size #13418
Comments
I'll take a look at this today and try to repro.
I'll try again as well on MI300, maybe it's different there.
I was able to reproduce this on an MI300X machine. I also confirmed that the script does not hang on an H100 machine, so the issue does seem to be rocm specific. I've also confirmed that the issue goes away if I disable cudagraphs/torch.compile with `--enforce-eager`.
Well... do you know which part of the network the multiple cuda graphs replayed for a single forward represent?
I'll dig into this a bit more today and post my findings here. It's tough to say exactly what the problem is at this point.
Every time I reproduce the hang I see the following:
It looks like we are hitting some kind of memory corruption. It seems somewhat plausible that the hang is just a symptom of waiting on a stream that's crashed. Here's the callstack from my hung process.
Obviously, this isn't very parsable, but it does look like it's getting stuck waiting on a memory copy.
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Hi,
This is a more detailed report for #12568 (comment). Essentially, a CUDA graph recorded through torch.compile using `VllmBackend` with v1 gets stuck when replayed if the number of scheduled tokens in `GPUModelRunner.execute_model` is below the largest of the `cudagraph_capture_sizes`, in the specific case where the first `num_scheduled_tokens` is NOT a multiple of 8 (and hence gets padded to fit a captured size). There is no issue if the first `num_scheduled_tokens` does not get padded.
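For context, the padding in question rounds the scheduled token count up to the nearest captured graph size. A minimal sketch of that rounding, assuming an illustrative list of capture sizes (this is not vLLM's actual code; names and sizes are made up for the example):

```python
# Illustrative only, NOT vLLM's implementation: round a scheduled token
# count up to the smallest captured CUDA graph size that can hold it.
from bisect import bisect_left

# Hypothetical capture sizes (multiples of 8 up to some maximum).
CUDAGRAPH_CAPTURE_SIZES = [8, 16, 24, 32, 64, 128, 256, 512]

def padded_num_tokens(num_scheduled_tokens: int) -> int:
    """Smallest captured size >= num_scheduled_tokens; if the count exceeds
    the largest captured size, no graph is replayed and no padding happens."""
    if num_scheduled_tokens > CUDAGRAPH_CAPTURE_SIZES[-1]:
        return num_scheduled_tokens
    idx = bisect_left(CUDAGRAPH_CAPTURE_SIZES, num_scheduled_tokens)
    return CUDAGRAPH_CAPTURE_SIZES[idx]

# 15 scheduled tokens get padded to the captured size 16 (the failing case),
# while 16 needs no padding (the working case).
assert padded_num_tokens(15) == 16
assert padded_num_tokens(16) == 16
```

With e.g. 15 live sequences each scheduling one decode token, the runner replays the graph captured for 16 tokens, which is the case that hangs here.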
I am running within `rocm/dev-ubuntu-22.04:6.3` with vllm-project/vllm at ce77eb9 installed from source.

Reproduction: run `VLLM_USE_V1=1 vllm serve meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1 -O3`, and run the following script:
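The original reproduction script is not included in this copy of the issue. The sketch below is a hypothetical client along the lines described above: it fires a number of concurrent requests that is not a multiple of 8, so the decode steps schedule a padded token count. The endpoint, port, prompt, and request count are assumptions (default `vllm serve` OpenAI-compatible server), not the author's exact script:

```python
# Hypothetical client sketch, not the original reproduction script.
# Sends a number of concurrent requests that is NOT a multiple of 8 (15),
# so each decode step schedules 15 tokens and gets padded to a captured size.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # default vllm serve endpoint
MODEL = "meta-llama/Llama-2-7b-chat-hf"
NUM_REQUESTS = 15  # not a multiple of 8 -> padding to a captured size

def send_request(i: int) -> str:
    payload = {
        "model": MODEL,
        "prompt": f"Hello, this is request {i}. Tell me a short story.",
        "max_tokens": 128,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

with ThreadPoolExecutor(max_workers=NUM_REQUESTS) as pool:
    for text in pool.map(send_request, range(NUM_REQUESTS)):
        print(text[:80])
```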
If we add a sync after the model forward in `GPUModelRunner.execute_model`, for example a `print("hidden_states", hidden_states)` here: `vllm/vllm/v1/worker/gpu_model_runner.py`, line 948 in ce77eb9
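As a toy illustration of why adding such a print changes the behaviour (this is not vLLM code): printing a GPU tensor, or calling `torch.cuda.synchronize()`, blocks the host on the device stream, so an asynchronous failure on that stream surfaces at the sync point instead of as a hang much later.

```python
# Toy example, not vLLM code: force a host/device sync right after a
# compiled forward pass so that asynchronous stream errors surface here.
import torch

model = torch.compile(torch.nn.Linear(16, 16).cuda())
x = torch.randn(8, 16, device="cuda")

hidden_states = model(x)
print("hidden_states", hidden_states)  # printing copies to host -> implicit sync
torch.cuda.synchronize()               # explicit sync, surfaces pending errors
```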
In case no padding tokens are used (e.g. with 16 sequences, a multiple of 8), the script runs fine and does not get stuck, even in later decode steps where some sequences have hit EOS and we do pad:
I do not have NVIDIA devices at hand unfortunately, so I can't confirm whether this is a rocm-only issue.
Probably running with dynamo debug flags would help to debug here.
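For reference, one way to get more verbose torch.compile/Dynamo output from Python is the `torch._logging` API (the flags below exist in recent PyTorch releases; setting the `TORCH_LOGS` environment variable is an equivalent alternative):

```python
# Enable verbose Dynamo logging plus graph-break and recompile reports.
# Flag names are from recent PyTorch releases; adjust to the installed version.
import logging
import torch._logging

torch._logging.set_logs(dynamo=logging.DEBUG, graph_breaks=True, recompiles=True)
```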
Python backtrace in gdb:
C backtrace in gdb:
Actually, it seems there are multiple cuda graph replays in a single `execute_model` call (which probably makes sense as `VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE` defaults to false). At `vllm/vllm/compilation/backends.py`, line 702 in 4c21ce9
Before submitting a new issue...