Error while Using Multiple GPUs #127

Open
Sandeep0076 opened this issue Oct 24, 2020 · 1 comment

Comments

@Sandeep0076

I am trying to train a model and it works fine with 1 GPU. But if I use more than 1 GPU, or set num_workers greater than 0, I get the following error:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 152, in ddp_train
    results = self.train_or_test()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in run_training_batch
    opt_closure_result = self.training_step_and_backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 769, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 783, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 89, in backward
    closure_loss = self.trainer.precision_connector.backend.backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/plugins/native_amp.py", line 32, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1105, in backward
    loss.backward()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
Exception raised from mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:453 (most recent call first):
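For illustration only (hypothetical, not the reporter's code): cause (2) named in the RuntimeError is reusing the same checkpointed submodule, so that its parameters take part in more than one reentrant backward pass. A minimal sketch of that pattern, with a made-up ReusedBlock module; on a single process the backward succeeds, but wrapping such a model in DistributedDataParallel is exactly what the reducer refuses to support:

```python
# Hypothetical sketch of cause (2) from the RuntimeError above, not code from the issue:
# the same checkpointed submodule is used twice in one forward pass, so its parameters
# participate in two reentrant backward passes. Plain autograd handles this, but DDP's
# reducer would mark the shared parameters "ready" more than once and raise the error.
import torch
from torch.utils.checkpoint import checkpoint


class ReusedBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block = torch.nn.Linear(8, 8)  # parameters shared by both checkpoint calls

    def forward(self, x):
        x = checkpoint(self.block, x)  # first reentrant backward uses self.block
        x = checkpoint(self.block, x)  # second reentrant backward reuses the same parameters
        return x


if __name__ == "__main__":
    model = ReusedBlock()
    out = model(torch.randn(4, 8, requires_grad=True))
    out.sum().backward()  # fine on one process; under DistributedDataParallel this pattern fails
```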

If I use multiple GPUs with

trainer = Trainer(gpus=4, auto_lr_find=True, distributed_backend='dp')

it gives:

TypeError: cannot unpack non-iterable NoneType object
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered

My code: https://colab.research.google.com/drive/1uj3obOeVysbAxpdAuqXWSScx0EOTefDB?usp=sharing
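For reference, a self-contained sketch of the Trainer configurations being compared, assuming PyTorch Lightning around 1.0 (the version range where distributed_backend and auto_lr_find are Trainer arguments). ToyModel and make_loader are stand-ins for the model and data in the Colab notebook, and the commented-out 'ddp' variant is only a possible alternative to try, not a confirmed fix:

```python
# Hedged sketch, not the notebook code: a minimal LightningModule plus the Trainer
# configurations under discussion, assuming PyTorch Lightning ~1.0.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    """Stand-in for the actual model; only here to make the sketch runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def make_loader(num_workers=0):
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    return DataLoader(data, batch_size=8, num_workers=num_workers)


if __name__ == "__main__":
    model = ToyModel()

    # Reported to work: a single GPU.
    # trainer = pl.Trainer(gpus=1, auto_lr_find=True)

    # Reported to fail: 'dp' raises the TypeError / CUDA device-side assert above,
    # and the spawn-based DDP path hits "Expected to mark a variable ready only once".
    trainer = pl.Trainer(gpus=4, auto_lr_find=True, distributed_backend='dp')

    # Possible alternative to try (unverified): the non-spawn 'ddp' backend.
    # trainer = pl.Trainer(gpus=4, auto_lr_find=True, distributed_backend='ddp')

    trainer.fit(model, make_loader())
```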

@ibeltagy
Collaborator

This seems like a pytorch-lightning issue, maybe the same issue as #128?
