DistributedDataParallel with nccl backend produces zombie processes #2913
Comments
Hi! Thanks for your contribution, great first issue!
Hi, I had a PR here #2165 in which I attempted to add proper signal handling that would kill the processes. Did you simply set the NCCL_BLOCKING_WAIT env variable in init_ddp_connection?
Hi awaelchli, yes, I only set NCCL_BLOCKING_WAIT to 1 in my overridden init_ddp_connection, which enables the timeout handling in the sub-processes. It's not as clean as your signal handling, but maybe a good fallback in case of a crash. Best regards
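For illustration only (this is not the code from #2165, just a generic sketch of the signal-handling idea): the main process can forward termination signals to its process group so spawned DDP workers are cleaned up with it. SIGKILL cannot be caught, which is why a timeout on the workers remains useful as a fallback.

```python
import os
import signal


def _teardown_process_group(signum, frame):
    # Restore the default handler first, so the killpg below also
    # terminates the main process instead of re-entering this handler.
    signal.signal(signum, signal.SIG_DFL)
    # Forward the signal to the whole process group, which includes the
    # DDP worker processes spawned by the main process.
    os.killpg(os.getpgid(0), signum)


# Only catchable signals can be handled this way; SIGKILL bypasses any
# handler, so the workers still need their own timeout to notice that
# the main process is gone.
signal.signal(signal.SIGTERM, _teardown_process_group)
signal.signal(signal.SIGINT, _teardown_process_group)
```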
Hey Lightning community,
First, I want to thank you for this nice project. It has helped me a lot to improve my research code, and I'm happy to recommend it to my colleagues whenever they complain about their code mess.
I have a problem with a race condition that is reproducible in a SLURM environment with multiple GPUs. I use DistributedDataParallel with the 'nccl' backend.
The default implementation in PyTorch Lightning can produce zombie processes that keep GPU memory reserved and block further use of it. This happens mainly when the main process is stopped or crashes.
You can reproduce this with the code below. Once training has started, if the main process is killed with a hard signal like SIGKILL, the child processes remain alive in most cases.
I was able to work around the problem by overriding init_ddp_connection. The overridden method sets NCCL_BLOCKING_WAIT=1 and additionally reduces the timeout. (The PyTorch documentation mentions that the timeout is only used for the NCCL backend if NCCL_BLOCKING_WAIT is set to 1.)
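Roughly, such an override could look like the sketch below. This is a minimal illustration assuming a Lightning version from around the time of this issue, where init_ddp_connection is a LightningModule hook; the exact signature, the 30-second timeout, and the assumption that MASTER_ADDR/MASTER_PORT are already set by the cluster environment are mine, not the author's actual code.

```python
import os
from datetime import timedelta

import torch.distributed as dist
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    ...

    def init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks=True):
        # Without this, NCCL collectives ignore the timeout and can block
        # forever, leaving zombie workers that keep GPU memory reserved.
        os.environ["NCCL_BLOCKING_WAIT"] = "1"

        # MASTER_ADDR / MASTER_PORT are assumed to be set already
        # (e.g. by SLURM or the default Lightning cluster setup).
        if not dist.is_initialized():
            dist.init_process_group(
                backend="nccl",
                rank=global_rank,
                world_size=world_size,
                # Much shorter than the default; only honored by NCCL
                # when NCCL_BLOCKING_WAIT=1.
                timeout=timedelta(seconds=30),
            )
```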
Is there a better way to get rid of this problem?
Or should we adjust the default behavior of Lightning for the NCCL backend?
Best regards
Leon Varga
Code
What's your environment?