🐛 Bug
My cluster does not set `CUDA_VISIBLE_DEVICES` to a list of integers; instead it is a list of hashes. Therefore my code crashes on this line: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/distrib_data_parallel.py#L509
To Reproduce
Can be reproduced by setting `CUDA_VISIBLE_DEVICES` to a list of non-integers, as in the sketch below.
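Here is a minimal sketch of the failure (the UUID strings are made-up placeholders, and the parsing only paraphrases what the linked line does):

```python
import os

# Made-up placeholder UUIDs; some cluster schedulers export device
# identifiers like these instead of integer indices.
os.environ["CUDA_VISIBLE_DEVICES"] = "GPU-1a2b3c4d,GPU-5e6f7a8b"

# Paraphrase of the linked line: index into the visible-device list
# with the local rank and cast the entry to int.
available_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
local_rank = 0
gpu_idx = int(available_gpus[local_rank])
# ValueError: invalid literal for int() with base 10: 'GPU-1a2b3c4d'
```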
Expected behavior
Shouldn't `local_rank` already be a number in `[0, num_gpus-1]`? Is there a reason to actually choose from this list of devices (a use case I'm not aware of)? Maybe just set `local_rank = global_rank % num_gpus`, as sketched below.
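A minimal sketch of what I have in mind (the function name is illustrative; in Lightning the ranks would come off the trainer):

```python
import torch

def compute_local_rank(global_rank: int, num_gpus: int) -> int:
    """Map a global rank to a device index in [0, num_gpus - 1],
    independent of how CUDA_VISIBLE_DEVICES is formatted."""
    return global_rank % num_gpus

# Example: global rank 5 on nodes with 4 GPUs each -> device index 1.
local_rank = compute_local_rank(global_rank=5, num_gpus=4)
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
```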
Environment
- CUDA:
  - GPU:
    - Tesla V100-PCIE-32GB
    - Tesla V100-PCIE-32GB
  - available: True
  - version: 10.2
- numpy: 1.18.1
- pyTorch_debug: False
- pyTorch_version: 1.5.0
- pytorch-lightning: 0.8.2
- tensorboard: 2.2.1
- tqdm: 4.46.0
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.7
- version: #1 SMP Fri May 29 11:57:47 EDT 2020