🐛 Bug
My cluster does not set `CUDA_VISIBLE_DEVICES` to a list of integers; instead it is a list of hashes. Therefore my code crashes on this line: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/distrib_data_parallel.py#L509
To Reproduce
Can be reproduced by setting `CUDA_VISIBLE_DEVICES` to a list of non-integers, as in the sketch below.
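Here is a minimal sketch of the failure (the UUID strings are made-up placeholders, and the parsing only paraphrases what the linked line does):

```python
import os

# Made-up placeholder UUIDs; some cluster schedulers export device
# identifiers like these instead of integer indices.
os.environ["CUDA_VISIBLE_DEVICES"] = "GPU-1a2b3c4d,GPU-5e6f7a8b"

# Paraphrase of the linked line: index into the visible-device list
# with the local rank and cast the entry to int.
available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
local_rank = 0
gpu_idx = int(available_gpus[local_rank])
# ValueError: invalid literal for int() with base 10: 'GPU-1a2b3c4d'
```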
Expected behavior
Shouldn't `local_rank` already be a number in `[0, num_gpus-1]`? Is there a reason to actually choose from this list of devices (a use case I'm not aware of)? Maybe just set `local_rank = global_rank % num_gpus`, as sketched below.
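A minimal sketch of what I have in mind (the function name is illustrative; in Lightning the ranks would come off the trainer):

```python
import torch

def compute_local_rank(global_rank: int, num_gpus: int) -> int:
    """Map a global rank to a device index in [0, num_gpus - 1],
    independent of how CUDA_VISIBLE_DEVICES is formatted."""
    return global_rank % num_gpus

# Example: global rank 5 on nodes with 4 GPUs each -> device index 1.
local_rank = compute_local_rank(global_rank=5, num_gpus=4)
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
```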
Environment
- CUDA:
  - GPU:
    - Tesla V100-PCIE-32GB
    - Tesla V100-PCIE-32GB
  - available: True
  - version: 10.2
- numpy: 1.18.1
- pyTorch_debug: False
- pyTorch_version: 1.5.0
- pytorch-lightning: 0.8.2
- tensorboard: 2.2.1
- tqdm: 4.46.0
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.7
- version: #1 SMP Fri May 29 11:57:47 EDT 2020