Error while Using Multiple GPUs #127

Open
Sandeep0076 opened this issue Oct 24, 2020 · 1 comment

Comments

@Sandeep0076

I am trying to train a model and it works fine with 1 GPU. But if I use more than 1 GPU, or set num_workers greater than 0, I get the following error:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 152, in ddp_train
    results = self.train_or_test()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in run_training_batch
    opt_closure_result = self.training_step_and_backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 769, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 783, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 89, in backward
    closure_loss = self.trainer.precision_connector.backend.backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/plugins/native_amp.py", line 32, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1105, in backward
    loss.backward()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
Exception raised from mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:453 (most recent call first):
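For illustration only (hypothetical, not the reporter's code): cause (2) named in the RuntimeError is reusing the same checkpointed submodule, so that its parameters take part in more than one reentrant backward pass. A minimal sketch of that pattern, with a made-up ReusedBlock module; on a single process the backward succeeds, but wrapping such a model in DistributedDataParallel is exactly what the reducer refuses to support:

```python
# Hypothetical sketch of cause (2) from the RuntimeError above, not code from the issue:
# the same checkpointed submodule is used twice in one forward pass, so its parameters
# participate in two reentrant backward passes. Plain autograd handles this, but DDP's
# reducer would mark the shared parameters "ready" more than once and raise the error.
import torch
from torch.utils.checkpoint import checkpoint


class ReusedBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block = torch.nn.Linear(8, 8)  # parameters shared by both checkpoint calls

    def forward(self, x):
        x = checkpoint(self.block, x)  # first reentrant backward uses self.block
        x = checkpoint(self.block, x)  # second reentrant backward reuses the same parameters
        return x


if __name__ == "__main__":
    model = ReusedBlock()
    out = model(torch.randn(4, 8, requires_grad=True))
    out.sum().backward()  # fine on one process; under DistributedDataParallel this pattern fails
```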

If I use multiple GPUs with

trainer = Trainer(gpus=4, auto_lr_find=True, distributed_backend='dp')

it gives:

TypeError: cannot unpack non-iterable NoneType object
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered

My code: https://colab.research.google.com/drive/1uj3obOeVysbAxpdAuqXWSScx0EOTefDB?usp=sharing
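For reference, a self-contained sketch of the Trainer configurations being compared, assuming PyTorch Lightning around 1.0 (the version range where distributed_backend and auto_lr_find are Trainer arguments). ToyModel and make_loader are stand-ins for the model and data in the Colab notebook, and the commented-out 'ddp' variant is only a possible alternative to try, not a confirmed fix:

```python
# Hedged sketch, not the notebook code: a minimal LightningModule plus the Trainer
# configurations under discussion, assuming PyTorch Lightning ~1.0.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    """Stand-in for the actual model; only here to make the sketch runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def make_loader(num_workers=0):
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    return DataLoader(data, batch_size=8, num_workers=num_workers)


if __name__ == "__main__":
    model = ToyModel()

    # Reported to work: a single GPU.
    # trainer = pl.Trainer(gpus=1, auto_lr_find=True)

    # Reported to fail: 'dp' raises the TypeError / CUDA device-side assert above,
    # and the spawn-based DDP path hits "Expected to mark a variable ready only once".
    trainer = pl.Trainer(gpus=4, auto_lr_find=True, distributed_backend='dp')

    # Possible alternative to try (unverified): the non-spawn 'ddp' backend.
    # trainer = pl.Trainer(gpus=4, auto_lr_find=True, distributed_backend='ddp')

    trainer.fit(model, make_loader())
```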

@ibeltagy
Collaborator

This seems like a pytorch-lightning issue, maybe the same issue as #128?
