During training, I assigned 4 GPUs to the model and 1 GPU, "cuda:4", to vLLM, but I am getting a device mismatch error when vLLM constructs the CUDA graphs. How can I overcome this issue?
...
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
...
What version of vLLM are you using? I had a similar issue when using vLLM for inference — for me, the issue was that I was using vLLM v0.6, and upgrading to 0.7.1 resolved this error.
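You can check the installed version with:

```python
import vllm

print(vllm.__version__)  # 0.7.x or later includes the fix
```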
Basically, vllm/worker/model_runner.py was using .cuda() (which moves tensors to the current device, cuda:0 by default) to change tensor devices instead of setting device=[correct device name], until a bugfix on Jan 4, 2025, which is included in the 0.7 release.
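As a rough sketch of the failure mode (illustrative only, not the actual vLLM code):

```python
import torch

# Assumes cuda:4 exists, as in this issue.
device = torch.device("cuda:4")               # the device handed to vLLM
weights = torch.randn(10, 8, device=device)
indices = torch.arange(10)

# Buggy pattern: .cuda() with no argument moves the tensor to the *current*
# CUDA device (cuda:0 by default), not the device the model lives on.
bad = indices.cuda()

# Fixed pattern: target the configured device explicitly.
good = indices.to(device=device)

weights.index_select(0, good)   # OK: both tensors live on cuda:4
# weights.index_select(0, bad)  # RuntimeError: ... cuda:4 and cuda:0!
```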
Perhaps a brief line on vLLM version requirements could be added to the docs, if it isn't present already?
Hi @tchang1997, I was actually using vLLM==0.6.6.post1. The updated version works! Thanks!
Also, I would like to note that the model needs to be loaded with flash-attention to work flawlessly with vLLM. Adding this to the GRPO documentation would also benefit people who are new to this.
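For anyone new to this, loading the model with flash-attention looks roughly like the following (the model name is just an example, and the flash-attn package must be installed):

```python
import torch
from transformers import AutoModelForCausalLM

# Example model name only; flash-attention also requires loading the
# weights in fp16/bf16.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```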
Thanks for the help!
ctjlewis added a commit to ctjlewis/trl that referenced this issue on Feb 4, 2025.
Reproduction
export CUDA_VISIBLE_DEVICES=0,1,2,3,4
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file \
    --main_process_port 29501 \
    grpo.py \
    --model_name_or_path \
    --dataset_name \
    --output_dir \
    --bf16 True \
    --bf16_full_eval True \
    --per_device_train_batch_size=1 \
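For context, the vLLM-related part of the training setup looked roughly like this (illustrative values; use_vllm and vllm_device were GRPOConfig fields in TRL releases from this period):

```python
from trl import GRPOConfig

# Sketch only: train on cuda:0-3 and keep vLLM generation on cuda:4.
training_args = GRPOConfig(
    output_dir="outputs",            # placeholder path
    use_vllm=True,
    vllm_device="cuda:4",
    bf16=True,
    per_device_train_batch_size=1,
)
```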
Deepspeed Configuration File
outputs:
System Info
Checklist