CUDA OOM when initializing DDP #4705
Comments
Hi @dedeswim, can you try to use
Hey @justusschock, thanks for getting back! When running the script I am getting the following output + exception:
When running in the notebook I get this (also trying a second time as I did with
@dedeswim the pickling issue in ddp_spawn is because the class TestModel is defined within a function, which makes it local, and local classes cannot be pickled. Can you try again, defining it outside the function?
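For example, a minimal sketch of that change, with the class at module scope so ddp_spawn can pickle it (the model body here is an illustrative assumption in the spirit of the BoringModel, not the exact code from the script):

import torch
import pytorch_lightning as pl


class TestModel(pl.LightningModule):
    # Defined at module level (not inside a function), so it can be pickled by ddp_spawn.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        out = self(batch)
        return torch.nn.functional.mse_loss(out, torch.ones_like(out))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def run():
    # The function only builds and fits the model; the class definition stays at module scope.
    trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn", max_epochs=1)
    trainer.fit(TestModel(), torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8))


if __name__ == "__main__":
    run()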
Oh sorry, didn't think about that. I changed the model to BoringModel directly, and got this output + exception:
Hi @dedeswim, I tried to reproduce your issue using your script, which I only modified to enable pickling as discussed above. I don't have access to the exact GPUs you have, but on my machine with two 2080 Ti GPUs the script ran without any problem with both accelerators. Here is the example output from my workstation:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Missing logger folder: /home/staff/schock/lightning_logs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
override
Epoch 0: 0%| | 0/2 [00:00<?, ?it/s]override
Epoch 0: 100%|███████████████| 2/2 [00:00<00:00, 15.18it/s, loss=1.216, v_num=0]
UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Testing: 0it [00:00, ?it/s]--------------------------------------------------------------------------------
Testing: 100%|████████████████████████████████| 32/32 [00:00<00:00, 2025.99it/s]

Can you maybe try to reboot your machine? Could this also be related to anything else? Unfortunately, I don't know how to debug this.
Hi @justusschock, thanks for trying to reproduce the issue! Unfortunately, I have no direct control over the cluster, but I could try to ask the admin to reboot it. However, there is a long-running job which I fear I cannot interrupt. Meanwhile, I tried to use
The first thing I noticed in the outputs (also as a difference from your output) is the line with
And, as a matter of fact, GPU 0 is busy and being used (the memory being used is

However, this issue also happens with 0.8.1 <= PL <= 0.10.0, where the printed CUDA_VISIBLE_DEVICES shows just the selected devices. The only idea I have for debugging is to check whether there was some major change in the GPU management between 0.7.x and 0.8.x.

I hope this information can be helpful for debugging! Of course, let me know if I can do anything else.
@dedeswim Usually this should be fine with

For debugging purposes, can you do the following:

That way I want to find out if the devices are set by us (which seems to be okay) or if they are already preset. If at that time

before running any other code (even imports). Let me know if that helps.
Hi @justusschock, thanks for your answer. I tried putting, at the very beginning of the script, this:

import os
print(os.environ.get('CUDA_VISIBLE_DEVICES', []))
import torch
print(torch.cuda.device_count())
print(torch.cuda.get_device_name())

The output I get is
Plus the usual exception. So I guess, as you say, that the devices are set by you. Unfortunately, even using

However, if I launch the script as

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9 python boring_model.py

it works, with the following output:
(FYI, I had to change the GPUs to use from [8, 9] to [7, 8], because of course only 9 GPUs are visible now.) I guess as a temporary workaround I can

Is there anything else I can do to help debug?
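The same workaround can also be applied from inside the script, as long as the variable is set before torch is imported (a minimal sketch; the device list just mirrors the command above and is otherwise an assumption):

import os

# Must happen before `import torch`; once CUDA has enumerated the devices,
# changing CUDA_VISIBLE_DEVICES has no effect on the current process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1,2,3,4,5,6,7,8,9")

import torch  # noqa: E402

print(torch.cuda.device_count())  # should now report 9 visible devices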
This seems to be related to #958. Can you also try to run the script with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9, gpus=[7,8] and ddp_spawn?
With

With
@dedeswim this should be fixed by @awaelchli in #4297. Mind trying on master?
Tried with the version installed in a fresh environment via

pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade
OK, it seems this has nothing to do with the selection of the backend. If I take, however, a pure PyTorch script like this

That's all I have in terms of analysis for now. Don't yet have a clue where to look...
@awaelchli I tried with 1.1.0rc1 and the following script:

import torch
import pytorch_lightning as pl


class RandomDataset(torch.utils.data.Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(pl.LightningModule):
    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

    def train_dataloader(self):
        return torch.utils.data.DataLoader(RandomDataset(32, 64))

    def val_dataloader(self):
        return torch.utils.data.DataLoader(RandomDataset(32, 64))

    def test_dataloader(self):
        return torch.utils.data.DataLoader(RandomDataset(32, 64))


if __name__ == '__main__':
    pl.Trainer(gpus=None, max_epochs=20).fit(BoringModel(), torch.utils.data.DataLoader(RandomDataset(32, 500)))

And I could not observe any memory leaking on the GPU when training without.
So, on my side, I did a couple of trials.
However, if I set the random dataset dimensions to 64 * 64 * 3 (the same as CelebA, the dataset I am working with), I do get the OOM. This happens whether I set the batch size to 1 or to 128. See this gist for reference. Interestingly, though, if I set

Edit: if I train on GPU 1 only (so without DDP) and with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9, I can observe on
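Concretely, the dataset change described above amounts to something like this (a minimal sketch based on the RandomDataset from the script earlier in the thread; the 64 * 64 * 3 size is the one mentioned here, while the class name and lengths are assumptions, and the model's input layer would have to be resized to match):

import torch


class RandomImageDataset(torch.utils.data.Dataset):
    # Random data with CelebA-sized flattened samples (64 * 64 * 3 = 12288 features).
    def __init__(self, length=256):
        self.data = torch.randn(length, 64 * 64 * 3)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


# Reportedly OOMs at DDP init with batch size 1 as well as 128.
train_loader = torch.utils.data.DataLoader(RandomImageDataset(), batch_size=128)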
@dedeswim That is probably
I forgot to mention that it also takes memory on GPU 1, so it is taking memory on both GPUs (around 10 GiB on GPU 1 and 500 MiB on GPU 0). In the afternoon (CET timezone) I'll do my best to create a reproducible script.
Are there any updates on this? I believe this is somehow related to the issue here: https://discuss.pytorch.org/t/strange-number-of-processes-per-gpu/116927

It looks to me like there are N additional CUDA context processes for every GPU in use, where N is the number of GPUs used. Is there any solution to this?

EDIT: when starting with "ddp_spawn" I can see a single context process per GPU, but the training is basically stuck at the very beginning. I guess that's due to the fact I'm using
@aleSuglia so far we are clueless about it. I'll try to pick this up again.
It is possible that it gets stuck for a while because of all the num_workers; this is normal for ddp_spawn. An alternative is just regular ddp, maybe try that too (a minimal sketch follows).
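A minimal sketch of switching to regular DDP (the GPU count and epoch count here are placeholders, not values from the thread):

import pytorch_lightning as pl

# Regular DDP launches one process per GPU as subprocesses instead of using
# torch.multiprocessing.spawn, which tends to cooperate better with many DataLoader workers.
trainer = pl.Trainer(gpus=4, accelerator="ddp", max_epochs=1)
# trainer.fit(model)  # `model` being your LightningModule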
Does it have anything to do with the fact that I'm using a custom function for splitting batches for truncated backprop?
|
No, it should not be, unless you have a memory leak in this logic. But you don't observe memory growing; you get OOM at initialization.

That's normal, because in ddp each process sees just one GPU. In nvidia-smi you should see N processes and a few MB per GPU at the very beginning, before training starts.
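For reference, if the custom splitting is an override of LightningModule.tbptt_split_batch (an assumption; the thread doesn't show the actual code), a purely CPU-side version looks roughly like this minimal sketch:

import pytorch_lightning as pl


class TBPTTModel(pl.LightningModule):  # hypothetical name
    def tbptt_split_batch(self, batch, split_size):
        # Slice the time dimension (dim=1) into chunks of `split_size`.
        # No .to(device)/.cuda() calls here, so this cannot allocate GPU memory by itself.
        x, y = batch
        return [
            (x[:, t:t + split_size], y[:, t:t + split_size])
            for t in range(0, x.size(1), split_size)
        ]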
How would I know this?
Shouldn't I have a single CUDA process per GPU instead of 4 per GPU? I have 4 GPUs and there seem to be 4 CUDA processes for each of them!?
Yes, you should have 4 processes, not 16 :(
@awaelchli this is the output of
Would it be because I'm using 4 workers for my data loader?
@aleSuglia No, these shouldn't be listed there, since they only operate on CPU and don't require GPU memory/access. This has to be something related to where we spawn processes.
@justusschock well actually this happens with
Your environment is free of any ddp-related env variables; that's good.
Also, it would be really helpful for us to have a complete minimal example to reproduce this, along with an output of nvidia-smi to see the processes.
Yes, that's correct.
I'm using the default PyTorch data loader. The only things that might be using CUDA are some operations like
But do you load them on CPU, and are all your tensors on CPU as well? Then there should be no CUDA ops involved.
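One quick way to verify that (a minimal sketch; `train_dataloader` stands for whatever DataLoader you pass to the Trainer):

import torch


def assert_all_cpu(obj):
    # Recursively check that nothing coming out of the DataLoader already lives on a CUDA device.
    if torch.is_tensor(obj):
        assert obj.device.type == "cpu", f"found tensor on {obj.device}"
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            assert_all_cpu(item)
    elif isinstance(obj, dict):
        for value in obj.values():
            assert_all_cpu(value)


# assert_all_cpu(next(iter(train_dataloader)))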
Sorry, I missed this. Yes, everything is on CPU for sure. I'll try with the BoringModel as well. In the meantime, I can confirm that

EDIT: I can't reproduce it with the BoringModel. I'm trying to understand whether it depends on my custom Dataset...
@justusschock so I think the reason why it works with
@justusschock ping :)
I'm able to reproduce the same issue with lightning == 1.3.2 with

Edit: This happens only in versions >= 1.3.0. The script works fine in 1.2.9.
@aleSuglia This shouldn't be the case. In the

@aleSuglia @scarecrow1123 Could you try to add

print(os.environ.get("PL_TRAINER_GPUS", None))

to the beginning of your script? This is the environment variable we set when calling the script multiple times. That said, the variable should be empty the first time the script is originally called, but should be filled afterwards.
This error disappeared after I rebooted the machine. However, until I rebooted, this kept happening even when
Hi
Closing this for now. |
Same issue as this:
For future reference, I was able to solve this by rebooting the machine that was having the issue. This may not be the root of the problem, but it could be a solution for some particular cases. |
Run

sudo fuser /dev/nvidia*

and kill the processes it lists via their PIDs. On my computer, the process was nvidia-smi. I don't know why the issue happens, but this really solved it on my computer. Hope it is helpful.
In my case there was a |
🐛 Bug
Hey everyone,
I am trying to train a model on the GPU workstation of our lab (which has 10 GPUs, of which usually only 1 is in use) using Lightning and DDP. I have tried several models (including the BoringModel) without success. In particular, I get a CUDA OOM error when DDP initializes. I tried the BoringModel with the following Trainer configuration:

And the output I get is the following:
The script with the BoringModel I run on our workstation is in this gist.
However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).
I also tried to run the same notebook locally (as on Colab), and the result at the first attempt is the following:
On the second attempt, though, it works as expected (i.e. the model trains with no errors, even with multiple GPUs)! So in the script, I tried the following to attempt the fit twice, as in the notebook:
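(The exact snippet is not reproduced here; a minimal sketch of what "attempt the fit twice" could look like, with the GPU indices and Trainer arguments assumed rather than taken from the gist:)

import pytorch_lightning as pl

# Hypothetical retry logic; BoringModel is the module from the gist/script above.
trainer = pl.Trainer(gpus=[8, 9], accelerator="ddp", max_epochs=1)
try:
    trainer.fit(BoringModel())
except RuntimeError:
    # The first attempt raises CUDA OOM; retry once with a fresh Trainer.
    trainer = pl.Trainer(gpus=[8, 9], accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel())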
As a result, I get this stack trace:
Expected behavior
The models should train without issues.
Environment
Additional context
I tried installing torch, torchvision and pytorch-lightning with both conda and pip in fresh environments, and still no solution to this problem.

This also happens if I select (free) GPUs manually by specifying them in the gpus flag as a List[int]. Also interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. Final interesting fact: setting accelerator="dp", I have no issues.

Thanks in advance!
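For reference, the kind of setup described in this report would look roughly like the following minimal sketch (the GPU indices and other values are assumptions, since the original configuration block is not reproduced above):

import pytorch_lightning as pl

# Selecting free GPUs explicitly as a List[int]; this fails with CUDA OOM at DDP init,
# while accelerator="dp" reportedly works fine.
trainer = pl.Trainer(gpus=[8, 9], accelerator="ddp", max_epochs=1)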