cuda runtime error (101) : invalid device ordinal #2407

Closed
ruotianluo opened this issue Jun 29, 2020 · 26 comments · Fixed by #2739

@ruotianluo
Contributor

ruotianluo commented Jun 29, 2020

CUDA_VISIBLE_DEVICES and ddp are not compatible.

https://github.com/PyTorchLightning/pytorch-lightning/blob/25ee51bc570503f331dceecc610d0eb355e22327/pytorch_lightning/trainer/distrib_data_parallel.py#L504

PyTorch respects CUDA_VISIBLE_DEVICES, so the visible devices are re-indexed from 0 inside each process; gpu_idx should therefore be the same as local_rank.

If I set CUDA_VISIBLE_DEVICES to 1, it breaks.
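To illustrate the re-indexing (a minimal sketch, not Lightning code; set_worker_device is a made-up helper):

    import os
    import torch

    # With CUDA_VISIBLE_DEVICES=1 the single visible card becomes device 0 inside
    # the process: torch.cuda.device_count() == 1 and only ordinal 0 is valid, so
    # calling torch.cuda.set_device(1) raises "invalid device ordinal".
    def set_worker_device(local_rank: int) -> None:
        # local_rank already indexes into the *visible* devices,
        # so it can be used directly as the device ordinal.
        torch.cuda.set_device(local_rank)

    print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # e.g. '1'
    print(torch.cuda.device_count())               # 1 -> only cuda:0 exists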

@ruotianluo ruotianluo added bug Something isn't working help wanted Open to be worked on labels Jun 29, 2020
@Borda Borda added this to the 0.8.x milestone Jun 29, 2020
@ruotianluo
Contributor Author

BTW, simply changing https://github.com/PyTorchLightning/pytorch-lightning/blob/25ee51bc570503f331dceecc610d0eb355e22327/pytorch_lightning/trainer/distrib_data_parallel.py#L509

to

gpu_idx = self.local_rank

fixes my issue for now.

And I think #2420 is related too.

@markdjwilliams

I am experiencing this issue, also.

@markdjwilliams

The fix described above seems to resolve not only this particular issue for me, but also a secondary problem whereby, during the validation check, GPU memory would steadily climb to be significantly higher on the first visible device than on any of the others.

@jgbos
Contributor

jgbos commented Jul 11, 2020

Here's a fix that doesn't require editing Lightning code:

import os

num_gpus = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))
os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in range(num_gpus))

If you run this before constructing your model and Trainer, it should work.
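For context, a sketch of how this workaround might sit in a training script (MyLightningModule is a placeholder, not from this thread):

    import os
    import pytorch_lightning as pl

    # Remap e.g. CUDA_VISIBLE_DEVICES="2,3" to "0,1" before the Trainer is built,
    # so the re-indexed ordinals used inside each DDP process stay valid.
    num_gpus = 1
    if os.environ.get('CUDA_VISIBLE_DEVICES'):
        num_gpus = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))
        os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in range(num_gpus))

    model = MyLightningModule()  # placeholder for your own LightningModule
    trainer = pl.Trainer(gpus=num_gpus, distributed_backend='ddp')
    trainer.fit(model)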

@thepowerfuldeez

When will this be fixed in master?

ibeltagy added a commit to ibeltagy/pytorch-lightning that referenced this issue Jul 18, 2020
ibeltagy added a commit to ibeltagy/pytorch-lightning that referenced this issue Jul 28, 2020
ibeltagy added a commit to ibeltagy/pytorch-lightning that referenced this issue Jul 28, 2020
@shreyaskamathkm

shreyaskamathkm commented Jul 31, 2020

Stuck on the same issue.

I recently started using PyTorch Lightning on an HPC cluster. The node I am using has 4 NVIDIA V100 GPUs. An ImageNet classification job is running on the first 2 GPUs [0,1], and I am trying to run a segmentation job on the other 2 GPUs [2,3]. Running with the configuration given below causes the error.

trainer = pl.Trainer(
    default_root_dir=hparams.save_dir,
    auto_select_gpus=True,
    gpus=hparams.gpu,
    logger=[testTube_logger],
    max_epochs=hparams.epochs,
    distributed_backend=hparams.distributed_backend,
    use_amp=False,
    checkpoint_callback=checkpoint_callback,
    fast_dev_run=hparams.debug,
    row_log_interval=hparams.log_iter,
    progress_bar_refresh_rate=0)

Error Message:

Traceback (most recent call last):
  File "./../../lightning_scipts/segmentation/distributed_main.py", line 340, in <module>
    main(args)
  File "./../../lightning_scipts/segmentation/distributed_main.py", line 335, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1095, in fit
    results = self.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 482, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 554, in ddp_train
    torch.cuda.set_device(self.root_gpu)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428266983/work/torch/csrc/cuda/Module.cpp:59
Traceback (most recent call last):
  File "/cluster/kappa/90-days-archive/panettalab/skamat04/Nutrition/code_/lightning_scipts/segmentation/distributed_main.py", line 340, in <module>
    main(args)
  File "/cluster/kappa/90-days-archive/panettalab/skamat04/Nutrition/code_/lightning_scipts/segmentation/distributed_main.py", line 335, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in fit
    self.ddp_train(process_idx=task, mp_queue=None, model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 577, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 887, in configure_ddp
    model = LightningDistributedDataParallel(model, device_ids=device_ids, find_unused_parameters=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: Broken pipe

Please help!
Thank you!

ananyahjha93 pushed a commit to ibeltagy/pytorch-lightning that referenced this issue Jul 31, 2020
williamFalcon pushed a commit that referenced this issue Aug 2, 2020
* fix #2407

* Update pytorch_lightning/trainer/distrib_data_parallel.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@shreyaskamathkm

shreyaskamathkm commented Aug 2, 2020

Hi,

I just updated PyTorch Lightning on the cluster to the latest master, but unfortunately the error still exists.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428266983/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "./../../lightning_scipts/segmentation/distributed_main.py", line 340, in <module>
    main(args)
  File "./../../lightning_scipts/segmentation/distributed_main.py", line 335, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerator_backends/ddp_backend.py", line 118, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerator_backends/ddp_backend.py", line 193, in ddp_train
    torch.cuda.set_device(self.trainer.root_gpu)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428266983/work/torch/csrc/cuda/Module.cpp:59
Traceback (most recent call last):
  File "/cluster/kappa/90-days-archive/panettalab/skamat04/Nutrition/code_/lightning_scipts/segmentation/distributed_main.py", line 340, in <module>
    main(args)
  File "/cluster/kappa/90-days-archive/panettalab/skamat04/Nutrition/code_/lightning_scipts/segmentation/distributed_main.py", line 335, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1007, in fit
    self.accelerator_backend.train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerator_backends/ddp_backend.py", line 56, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerator_backends/ddp_backend.py", line 214, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 888, in configure_ddp
    model = LightningDistributedDataParallel(model, device_ids=device_ids, find_unused_parameters=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: Broken pipe

My sbatch file:

#!/bin/sh
#SBATCH -N 1
#SBATCH --mem=48G  # memory
#SBATCH --gres=gpu:v100:2
#SBATCH -t 72:00:00
#SBATCH --job-name="distributedqopramidal_50_imagenet"
#SBATCH -e error_%j_%x.dat   # send stderr to errfile
#SBATCH -o output_%j_%x.dat # send stdout to outfile


module purge
module load singularity/3.1.0
module load cuda/10.2
module load cudnn/7.1
export CUDA_VISIBLE_DEVICES=2,3


MODEL=qicnet
BACKBONE=qoresnet50_v1s
DATASET=unimib
RUNDIR=./../../../Trained_Model_Segmentation/Lightning/"$MODEL"_"$DATASET"_"$BACKBONE"_test/
mkdir -p "$RUNDIR"
cp test_lightning.sh "$RUNDIR"


singularity exec --nv  /cluster/kappa/90-days-archive/panettalab/singularity/avocado_pytorch_lightning_v_1_5_0.sif \
python  ./../../lightning_scipts/segmentation/distributed_main.py \
--model "$MODEL" \
--pretrained-base True \
--pretrained False \
--backbone "$BACKBONE" \
--dataset "$DATASET" \
--gpus 2 \
--distributed-backend 'ddp' \
--save_dir "$RUNDIR"

My trainer code:


    checkpoint_callback = ModelCheckpoint(filepath=os.path.join(ckpt_dir, '{epoch}-{val_mIOU:.2f}'),
                                                save_top_k=1,
                                                verbose=True,
                                                monitor='val_mIOU',
                                                mode='max',
                                                save_last = True,
                                                prefix='')

    trainer = pl.Trainer(
        default_root_dir=hparams.save_dir,
        auto_select_gpus=True,
        gpus=hparams.gpus,
        logger=[testTube_logger],
        max_epochs=hparams.epochs,
        distributed_backend=hparams.distributed_backend,
        callbacks=[Steps_Per_LR_Scheduler(), # This callback has to come first
                   pl.callbacks.lr_logger.LearningRateLogger()],
        checkpoint_callback=checkpoint_callback,
        fast_dev_run=hparams.debug,
        progress_bar_refresh_rate=0,
        resume_from_checkpoint = hparams.resume,
        amp_level = 'O0' )

@williamFalcon
Contributor

What gpus flag are you passing to the Trainer? Since you set visible devices to 2,3, your gpus flag should just be --gpus 2.

@shreyaskamathkm

Sorry for the incomplete comment. I have updated the previous comment to reflect the correct arguments sent. I also tried the specification --ntasks-per-node=1 mentioned in horovod/horovod#1586 (comment). However, the error doesn't seem to change. Is there any particular flag I need to set in the SBATCH file?

@shreyaskamathkm

I think there is one more issue: the GPUs on the HPC node look like this:

Screenshot from 2020-08-02 20-31-27

The PCI bus ID ordering is wrong for some reason. When I use export CUDA_VISIBLE_DEVICES=0,1, the V100s (GPUs 2,3 in nvidia-smi) get initialized, and when I use export CUDA_VISIBLE_DEVICES=2,3, the first two P100s (GPUs 0,1) get initialized.

Interestingly, when I change the above code to

module purge
module load singularity/3.1.0
module load cuda/10.2
module load cudnn/7.1
export SINGULARITY_BINDPATH="/cluster/kappa/90-days-archive/panettalab/skamat04/Nutrition/:/home --pwd home/code/bash/classification/"
export CUDA_VISIBLE_DEVICES=0,1

the V100s get initialized, but the P100s are grabbed as well, so all 4 GPUs (0,1,2,3) end up being used. Is this expected?

Screenshot from 2020-08-02 20-39-16
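One note on the ordering itself: by default CUDA enumerates devices fastest-first, while nvidia-smi lists them in PCI bus order, which would explain why CUDA_VISIBLE_DEVICES=0,1 selects the V100s here. A minimal sketch of forcing the two orderings to match (assuming that mismatch is the cause):

    import os

    # Must be set before anything initializes CUDA (e.g. before the first torch.cuda call).
    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'   # use nvidia-smi's numbering
    os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'       # now really the cards shown as 2 and 3

    import torch
    print(torch.cuda.device_count())  # expected: 2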

Thank you,
Best,
Shreyas

@shreyaskamathkm

Hi,

Any updates on this?

Thank you,
Best,
Shreyas

@williamFalcon
Contributor

try master?

@thepowerfuldeez

try master?

Tried it; still not fixed. Using CUDA_VISIBLE_DEVICES with ddp, when CUDA_VISIBLE_DEVICES does not start with 0, fails with
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1595629427478/work/torch/csrc/cuda/Module.cpp:59

@thepowerfuldeez

Sorry, you should use --gpus=2,3 instead of CUDA_VISIBLE_DEVICES=2,3 with --gpus 2.

Not setting CUDA_VISIBLE_DEVICES and setting only --gpus works.
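A minimal sketch of that approach (MyLightningModule is a placeholder):

    import pytorch_lightning as pl

    model = MyLightningModule()  # placeholder for your own LightningModule

    # Ask the Trainer for physical GPUs 2 and 3 directly instead of exporting
    # CUDA_VISIBLE_DEVICES=2,3 and passing --gpus 2.
    trainer = pl.Trainer(gpus=[2, 3], distributed_backend='ddp')
    trainer.fit(model)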

@shreyaskamathkm

So we should mention the GPU number in the --gpus flag?

@thepowerfuldeez

So we should mention the GPU number in the --gpus flag?

yes, exactly. @williamFalcon right?

@shreyaskamathkm

Hi,

I tried your suggestion after making the following changes.
Note: I haven't set CUDA_VISIBLE_DEVICES in the sbatch file, since gres allocates specific GPUs for the job, so I am reading os.environ['CUDA_VISIBLE_DEVICES'] to figure out which GPUs were allocated.

        num_gpus = len(os.environ['CUDA_VISIBLE_DEVICES'].split(','))
        logger.info("Number of GPUs available: {}".format(num_gpus))
        gpus = [int(x) for x in os.environ['CUDA_VISIBLE_DEVICES'].split(',')]

The trainer is initialized as

    trainer = pl.Trainer(
        default_root_dir=hparams.save_dir,
        gpus=gpus,
        logger=[testTube_logger, wandb_logger],
        max_epochs=hparams.epochs,
        distributed_backend=hparams.distributed_backend,
        callbacks=[Steps_Per_LR_Scheduler(),  # This callback has to come first
                   pl.callbacks.lr_logger.LearningRateLogger(),
                   Custom_Logging_Callback_with_Metrics(logger, meters)],
        checkpoint_callback=checkpoint_callback,
        fast_dev_run=hparams.debug,
        progress_bar_refresh_rate=0,
        resume_from_checkpoint=hparams.resume,
        amp_level='O0')

After doing this I get the following error:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1595629427478/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal

wandb: Waiting for W&B process to finish, PID 21991
Traceback (most recent call last):
  File "./../../lightning_scipts/encoder/distributed_imagenet.py", line 288, in <module>
    main(args)
  File "./../../lightning_scipts/encoder/distributed_imagenet.py", line 283, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 34, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1039, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 118, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 197, in ddp_train
    torch.cuda.set_device(self.trainer.root_gpu)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1595629427478/work/torch/csrc/cuda/Module.cpp:59
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Process crashed early, not syncing files

Am I doing something wrong? I've been trying for weeks now with no luck! Any help is appreciated.

@cinjon
Contributor

cinjon commented Aug 14, 2020

I have not fully isolated the issue, but it seems to start with the call to get_all_available_gpus:

https://github.com/PyTorchLightning/pytorch-lightning/blob/4354690e55704f0c1742e5a95411b63c6055ddc7/pytorch_lightning/trainer/distrib_parts.py#L259

in sanitize_gpu_ids here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/4354690e55704f0c1742e5a95411b63c6055ddc7/pytorch_lightning/trainer/distrib_parts.py#L304

What happens is that the initial process gets the right counts with the correct GPUs, etc., but the spawned processes are told which GPU to use along with an incorrect list of available GPUs. For example, if I pass --gpus 3,4,5, the first spawned process will see gpus as [3] but the available GPUs as [0,1,2] in the sanitize step.
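A rough model of what I think is happening in the sanitize step (names approximate, not the actual Lightning source):

    import torch

    # In the spawned child, CUDA_VISIBLE_DEVICES is restricted to the requested
    # cards, so PyTorch only sees 3 devices and the "available" list is the
    # remapped [0, 1, 2], while the child is still asked for physical id 3.
    def get_all_available_gpus():
        return list(range(torch.cuda.device_count()))   # e.g. [0, 1, 2]

    def sanitize_gpu_ids(gpus):
        available = get_all_available_gpus()
        for g in gpus:
            if g not in available:
                raise RuntimeError(f"GPU {g} not in available set {available}")
        return gpus

    # sanitize_gpu_ids([3])  # raises, even though physical GPU 3 was requested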

If you change get_all_available_gpus to first check for os.environ.get('CUDA_VISIBLE_DEVICES'), then there is an issue in ddp_backend:

https://github.com/PyTorchLightning/pytorch-lightning/blob/4354690e55704f0c1742e5a95411b63c6055ddc7/pytorch_lightning/accelerators/ddp_backend.py#L179

where the wrong gpu_idx is set in non-master processes. You could try ameliorating that just below with the is_master check:

https://github.com/PyTorchLightning/pytorch-lightning/blob/4354690e55704f0c1742e5a95411b63c6055ddc7/pytorch_lightning/accelerators/ddp_backend.py#L185

However, at that point, I do not trust that I know enough about what is going on in this system to make this change. It could very well break other dependencies.

That being said, it would be really great if this could get addressed. This works without a problem under SLURM, but breaks when SLURM isn't used.

(@williamFalcon)

@williamFalcon
Contributor

try master? we pushed fixes this week

@cinjon
Contributor

cinjon commented Aug 14, 2020

Which commit fixes this? I was testing on 0.9.0rc12.

@williamFalcon
Contributor

williamFalcon commented Aug 14, 2020

OK, can you share a simple example to reproduce? I can test this now. This morning we merged a DDP fix, but it might only affect ddp_cpu.

I can run this on a 2-8 GPU machine.

@cinjon
Contributor

cinjon commented Aug 14, 2020

I don't have time in the next few hours to put together a quick example, but I suspect almost anything will fail if you use the ddp backend with something like python ... --gpus 3,4,5. I'm doing that on a DGX and it's failing.

williamFalcon added a commit that referenced this issue Aug 14, 2020
* fix gpus index error
ameliatqy pushed a commit to ameliatqy/pytorch-lightning that referenced this issue Aug 17, 2020
@ruotianluo
Contributor Author

Training with commit c64520.

I am still getting the error. I am using CUDA_VISIBLE_DEVICES=3 (set by Slurm), and in the code I have Trainer(..., gpus=-1, ...).

@rotabulo

rotabulo commented Oct 2, 2020

Hi, I have opened another issue on the same topic since I found that the problem is still there in master.
#3791 (comment)
In the issue I opened I explain the source of the problem and provide a possible fix.
(CC @williamFalcon )

@JVGD

JVGD commented Oct 16, 2020

Just commenting here because I am really interested to see this solved (and I want the notifications on this issue).

My team and I upgraded to 1.0.1 and everything crashed in multi-GPU with ddp... We had to roll back to 0.9.0 and everything works again 🤷‍♂️ On a Friday (yes, today), which is a critical day for us to launch experiments (so they run over the weekend).

@shreyaskamathkm

@JVGD did you try 1.0.0? That solved my error.
