I am trying to train a model and it works fine with a single GPU. But if I use more than one GPU, or set num_workers greater than 0 in the DataLoader, I get the following error:
```
-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_spawn_accelerator.py", line 152, in ddp_train
    results = self.train_or_test()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in run_training_batch
    opt_closure_result = self.training_step_and_backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 769, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 783, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 89, in backward
    closure_loss = self.trainer.precision_connector.backend.backward(
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/plugins/native_amp.py", line 32, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1105, in backward
    loss.backward()
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pathania/data-private/longformer/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.
Exception raised from mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:453 (most recent call first):
```
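For context, the second cause listed in the error message can be illustrated with the toy sketch below (this is not the code from my notebook): when the same parameters are wrapped in `torch.utils.checkpoint` more than once per forward pass, each checkpointed segment runs its own reentrant backward pass over those parameters, and DDP then sees the same variable marked ready twice.

```python
# Toy illustration of cause (2) from the error above; not the notebook's code.
# Under DistributedDataParallel, checkpointing the same parameters more than once
# per forward pass puts them in two reentrant backward passes, which DDP rejects
# with "Expected to mark a variable ready only once".
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SharedCheckpointModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(16, 16)  # the shared parameters

    def forward(self, x):
        # The same `self.block` is checkpointed twice, so its parameters are reused
        # across two reentrant backward passes. This only fails once the model is
        # wrapped in DistributedDataParallel (e.g. via the ddp_spawn accelerator).
        x = checkpoint(self.block, x)
        x = checkpoint(self.block, x)
        return x
```

If the Longformer model in my notebook has gradient checkpointing enabled on shared layers, it may be hitting this DDP limitation.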
If I use multiple GPUs with `trainer = Trainer(gpus=4, auto_lr_find=True, distributed_backend='dp')`, it gives:

```
TypeError: cannot unpack non-iterable NoneType object
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
```
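For completeness, here is a minimal, self-contained sketch of the configurations discussed above. Only the Trainer arguments come from this report; the tiny model and random dataset are placeholders, not the code from the Colab notebook.

```python
# Minimal sketch of the configurations described in this issue. ToyModel and the
# random dataset are placeholders; only the Trainer arguments match the report.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)),
                          batch_size=8, num_workers=0)

# Works as reported: a single GPU with num_workers=0.
trainer = pl.Trainer(gpus=1, auto_lr_find=True)

# With the reporter's model, gpus=4 fails with the "mark a variable ready only once"
# RuntimeError; the traceback above shows that run going through the ddp_spawn accelerator.
# trainer = pl.Trainer(gpus=4, auto_lr_find=True)

# With the reporter's model, the 'dp' backend fails with the TypeError and the
# CUDA device-side assert quoted above.
# trainer = pl.Trainer(gpus=4, auto_lr_find=True, distributed_backend='dp')

trainer.fit(ToyModel(), train_loader)
```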
My code: https://colab.research.google.com/drive/1uj3obOeVysbAxpdAuqXWSScx0EOTefDB?usp=sharing