DistributedDataParallel with nccl backend produces zombie processes #2913

Closed
leonvarga opened this issue Aug 11, 2020 · 3 comments · Fixed by #3819
Labels
bug Something isn't working

Comments

@leonvarga

Hey lightning community,

First of all, I want to thank you for this nice project. It has helped me a lot to improve my research code, and I'm happy to recommend it to my colleagues whenever they complain about their code mess.

I have a problem with a race condition that is reproducible in a SLURM environment with multiple GPUs, using DistributedDataParallel with the 'nccl' backend.
The default PyTorch Lightning implementation can produce zombie processes that keep GPU memory reserved and prevent any further use of it. This happens mainly when the main process is stopped or crashes.

You can reproduce this with the code below. Once training has started, if the main process is killed with a hard signal like SIGKILL, the child processes stay alive in most cases.

I was able to work around the problem by overriding init_ddp_connection. The overridden method sets NCCL_BLOCKING_WAIT=1 and additionally reduces the timeout. (The torch documentation mentions that the timeout is only honored by the nccl backend if NCCL_BLOCKING_WAIT is set to 1.)
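
Roughly, the override looks like the sketch below. This is only a sketch: the exact init_ddp_connection signature varies between Lightning versions, the 5-minute timeout is an illustrative value, and it assumes MASTER_ADDR / MASTER_PORT are already exported (e.g. by SLURM).

import os
from datetime import timedelta

import pytorch_lightning as lightning
import torch.distributed as torch_distrib


class MyModule(lightning.LightningModule):
    # ... same model / train_dataloader / training_step as in the example below ...

    def init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks=True):
        # The nccl backend only honors the init_process_group timeout when
        # NCCL_BLOCKING_WAIT=1, so set it before creating the process group.
        os.environ['NCCL_BLOCKING_WAIT'] = '1'
        # Assumes MASTER_ADDR / MASTER_PORT are already set (e.g. by SLURM / Lightning).
        torch_distrib.init_process_group(
            'nccl',
            rank=global_rank,
            world_size=world_size,
            timeout=timedelta(minutes=5),  # illustrative; the default is 30 minutes
        )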

Is there a better way to get rid of this problem?
Or should we adjust the default behavior of lightning for the nccl backend?

Best regards
Leon Varga

Code

import pytorch_lightning as lightning
import torch
from torch.utils.data import DataLoader


class MyModule(lightning.LightningModule):
    def __init__(self):
        super(MyModule, self).__init__()

        self.model = torch.nn.Linear(1000, 1000)
        self.criterion = torch.nn.MSELoss()

    def train_dataloader(self) -> DataLoader:
        data = torch.randn((int(1e5), 2, 1000))

        training_generator = DataLoader(data)
        return training_generator

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.model.parameters())

        return [optimizer]

    def training_step(self, batch, batch_idx):
        pred = self.model(batch[:, 0])
        loss = self.criterion(pred, batch[:, 1])

        return {
            'loss': loss,
        }

    def forward(self, x):
        return self.model(x)



if __name__ == "__main__":
    model = MyModule()
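    # gpus=-1 uses all available GPUs; distributed_backend='ddp' runs one training process per GPU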
    trainer = lightning.Trainer(default_root_dir='/tmp/test',
                                max_epochs=100, gpus=-1,
                                distributed_backend='ddp')
    trainer.fit(model)

What's your environment?

  • Linux
  • PyTorch 1.5 / 1.6
  • PyTorch Lightning 0.8.5
@leonvarga added the question (Further information is requested) label Aug 11, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@awaelchli
Contributor

Hi, I had a PR here #2165 in which I attempted to add proper signal handling that would kill the processes. Did you simply set the NCCL_BLOCKING_WAIT env variable in the init_ddp_connection?
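
(For illustration only, this is not the actual code from #2165, and install_cleanup_handler / child_pids are made-up names: the general idea of such signal handling is to trap termination signals in the launching process and forward them to the DDP workers so they exit and free their GPU memory.)

import os
import signal


def install_cleanup_handler(child_pids):
    """Forward termination signals to the DDP worker processes so they
    do not linger and keep GPU memory reserved."""

    def _handler(signum, frame):
        for pid in child_pids:
            try:
                os.kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # worker already gone
        raise SystemExit(1)

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)

(Note that SIGKILL cannot be trapped, so a handler like this does not cover the hard-kill case described above; a reduced NCCL timeout still acts as the fallback there.)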

@leonvarga
Author

Hi awaelchli,
your signal handling seems to be the cleanup part that is currently missing. I also see there are some similar issues linked to your PR.

Yes, I only set NCCL_BLOCKING_WAIT to 1 in my overridden init_ddp_connection, which enables the timeout handling in the sub-processes. It's not as clean as your signal handling, but it may be a good fallback in case of a crash.

Best regards
Leon Varga
