cuda runtime error (101) : invalid device ordinal #2407
Comments
And I think #2420 is related too.
I am also experiencing this issue.
The fix described above seems to fix not only this particular issue for me, but also a secondary problem whereby, during the "Validation Check", GPU memory would steadily climb to be significantly higher on the first visible device than on any of the others.
Here's a fix that doesn't require editing Lightning code: `num_gpus = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))` followed by `os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(f'{i}' for i in range(num_gpus))`. If you run this before setting up your models, it should work; a self-contained sketch follows below.
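A self-contained version of that workaround, as a sketch (the `os.environ.get` guard for an unset variable is an addition, not part of the comment above):

```python
import os

# Remap whatever GPUs are currently visible to indices 0..N-1 before any
# CUDA / Lightning initialization, mirroring the workaround quoted above.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if visible:
    num_gpus = len(visible.split(","))
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_gpus))
```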
When will this be fixed in master?
Stuck on the same issue. I recently started using PyTorch Lightning on an HPC cluster. The node I am using has 4 Nvidia V100 GPUs. An ImageNet classification job is running on the first 2 GPUs [0,1], and I am trying to run a segmentation job on the next 2 GPUs [2,3]. Running with the configuration given below causes the error.
Error message:
Please help!
* fix #2407
* Update pytorch_lightning/trainer/distrib_data_parallel.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Hi, I just updated the PyTorch Lightning module on the cluster to the latest master, but unfortunately the error still exists.
My sbatch file:
My trainer code:
What gpus flag are you passing to the model? Since you set visible devices to 2,3, your gpus flag should just be `--gpus 2`.
Sorry for the incomplete comment. I have updated the previous comment to reflect the correct arguments. I also tried the specification
There is one more mistake, I think; the GPUs in the HPC look like below: The PCI_BUS_ID is wrong for some reason. So when I use
Interestingly, when I change the above code to
The V100s get initialized, but along with that the P100s are grabbed as well. So finally 4 GPUs are considered, i.e. GPUs 0,1,2,3. Is this expected? Thank you.
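For anyone debugging a similar ordering mismatch, a small check like the following (a sketch, not from this thread) prints the devices the current process actually enumerates, so they can be compared against nvidia-smi's PCI-bus ordering:

```python
import os
import torch

# Print the GPUs visible to this process; useful when nvidia-smi's ordering
# (PCI bus order) differs from CUDA's default enumeration order.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```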
Hi, any updates on this? Thank you.
Try master?
Tried, still not fixed. Using CUDA_VISIBLE_DEVICES with ddp, if CUDA_VISIBLE_DEVICES does not start with 0, it fails with
Sorry, you should use `--gpus=2,3` instead of `CUDA_VISIBLE_DEVICES=2,3` with `--gpus 2`, i.e. don't set CUDA_VISIBLE_DEVICES.
So we should specify the GPU indices in the --gpus flag?
Yes, exactly. @williamFalcon, right?
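For reference, a minimal sketch of that suggestion (the Trainer arguments reflect the 0.9/1.0-era API discussed in this thread; the model and dataloader are placeholders assumed to exist):

```python
import pytorch_lightning as pl

# Select physical GPUs 2 and 3 via the gpus argument instead of
# CUDA_VISIBLE_DEVICES, as suggested above.
trainer = pl.Trainer(gpus=[2, 3], distributed_backend="ddp")
# trainer.fit(model, train_dataloader)  # placeholders, assumed to exist
```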
Hi, I tried your suggestion after making the following changes.
The trainer is initialized as
After doing this I get the following error:
Am I doing something wrong? I've been trying for weeks now, but no luck! Any help is appreciated.
I have not fully isolated the issue, but it seems to start with the call to get_all_available_gpus in sanitize_gpu_ids here: What happens is that the initial process gets the right counts with the correct GPUs, etc., but then the spawned processes get told which GPU to use along with an incorrect list of available GPUs. For example, if I say
If you change get_all_available_gpus to first check for
where the wrong gpu_idx is set in non-master processes. You could try ameliorating that just below with the
However, at this point I do not trust that I know enough about what is going on in this system to make this change; it could very well break other dependencies. That being said, it would be really great if this could get addressed. This works fine when using SLURM, but breaks when not.
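A rough sketch of the change described above, i.e. having the "all available GPUs" lookup honor CUDA_VISIBLE_DEVICES so that spawned workers validate against the same ids the parent saw (the function name comes from the comment; the body here is an assumption, not the actual Lightning code):

```python
import os
import torch

def get_all_available_gpus():
    # Assumed patch: if CUDA_VISIBLE_DEVICES is set, treat those ids as the
    # available GPUs so that sanitize_gpu_ids in spawned processes accepts
    # them; otherwise fall back to counting devices as before.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        return [int(idx) for idx in visible.split(",") if idx.strip()]
    return list(range(torch.cuda.device_count()))
```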
Try master? We pushed fixes this week.
Which commit fixes this? I was testing on 0.9.0rc12.
OK, can you share a simple example to reproduce? I can test this now. This AM we merged a ddp fix, but it might only affect ddp_cpu. I can run this on a 2-8 GPU machine.
I don't have time in the next few hours to make a quick example, but I suspect almost anything will fail if you use the ddp backend with something like
* fix gpus index error
Training with commit c64520, I am still getting the error...
Hi, I have opened another issue on the same topic, since I found that the problem is still there in master.
Just commenting here because I am really interested to see this solved (and I want the notifications on this issue). My team and I ported to 1.0.1 and then everything crashes in multi-GPU with ddp... Had to roll back to 0.9.0 and everything works again 🤷♂️. On a Friday (yes, today) that is a critical day for us to launch experiments (so they run over the weekend).
@JVGD did you try 1.0.0? That solved my error.
CUDA_VISIBLE_DEVICES and ddp are not compatible.
https://github.com/PyTorchLightning/pytorch-lightning/blob/25ee51bc570503f331dceecc610d0eb355e22327/pytorch_lightning/trainer/distrib_data_parallel.py#L504
PyTorch respects CUDA_VISIBLE_DEVICES.
gpu_idx should be the same as local_rank.
If I set CUDA_VISIBLE_DEVICES to 1, it will break.
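For context, a minimal illustration of the setup described here (the Trainer arguments are assumptions based on this thread, not a confirmed reproduction script):

```python
import os
import pytorch_lightning as pl

# Only physical GPU 1 is visible to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# With the ddp backend, the spawned worker ends up being handed a device
# ordinal that does not exist in its remapped visible set, which is what
# raises "cuda runtime error (101): invalid device ordinal".
trainer = pl.Trainer(gpus=1, distributed_backend="ddp")
```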