DistributedDataParallel with nccl backend produces zombie processes #2913
Comments
Hi! Thanks for your contribution, great first issue!
Hi, I had a PR here #2165 in which I attempted to add proper signal handling that would kill the processes. Did you simply set the NCCL_BLOCKING_WAIT env variable in init_ddp_connection?
Hi awaelchli, yes, I only set NCCL_BLOCKING_WAIT to 1 in my overridden init_ddp_connection, which enables the timeout handling in the sub-processes. It's not as clean as your signal handling, but maybe a good fallback in case of a crash. Best regards
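For illustration only (this is not the code from #2165, just a generic sketch of the signal-handling idea): the main process can forward termination signals to its process group so spawned DDP workers are cleaned up with it. SIGKILL cannot be caught, which is why a timeout on the workers remains useful as a fallback.

```python
import os
import signal


def _teardown_process_group(signum, frame):
    # Restore the default handler first, so the killpg below also
    # terminates the main process instead of re-entering this handler.
    signal.signal(signum, signal.SIG_DFL)
    # Forward the signal to the whole process group, which includes the
    # DDP worker processes spawned by the main process.
    os.killpg(os.getpgid(0), signum)


# Only catchable signals can be handled this way; SIGKILL bypasses any
# handler, so the workers still need their own timeout to notice that
# the main process is gone.
signal.signal(signal.SIGTERM, _teardown_process_group)
signal.signal(signal.SIGINT, _teardown_process_group)
```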
Hey Lightning community,
First, I want to thank you for this nice project. It has helped me a lot to improve my research code, and I'm happy to recommend it to my colleagues whenever they complain about their code mess.
I have a problem with a race condition that is reproducible in a SLURM environment with multiple GPUs. I use DistributedDataParallel with the 'nccl' backend.
The default implementation in PyTorch Lightning can produce zombie processes that keep GPU memory reserved and block further use of it. This happens mainly when the main process is stopped or crashes.
You can reproduce this with the code below. Once training has started, if the main process is killed with a hard signal like SIGKILL, the child processes remain alive in most cases.
I was able to work around the problem by overriding init_ddp_connection. The overridden method sets NCCL_BLOCKING_WAIT=1 and additionally reduces the timeout. (The PyTorch documentation mentions that the timeout is only used for the NCCL backend if NCCL_BLOCKING_WAIT is set to 1.)
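Roughly, such an override could look like the sketch below. This is a minimal illustration assuming a Lightning version from around the time of this issue, where init_ddp_connection is a LightningModule hook; the exact signature, the 30-second timeout, and the assumption that MASTER_ADDR/MASTER_PORT are already set by the cluster environment are mine, not the author's actual code.

```python
import os
from datetime import timedelta

import torch.distributed as dist
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    ...

    def init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks=True):
        # Without this, NCCL collectives ignore the timeout and can block
        # forever, leaving zombie workers that keep GPU memory reserved.
        os.environ["NCCL_BLOCKING_WAIT"] = "1"

        # MASTER_ADDR / MASTER_PORT are assumed to be set already
        # (e.g. by SLURM or the default Lightning cluster setup).
        if not dist.is_initialized():
            dist.init_process_group(
                backend="nccl",
                rank=global_rank,
                world_size=world_size,
                # Much shorter than the default; only honored by NCCL
                # when NCCL_BLOCKING_WAIT=1.
                timeout=timedelta(seconds=30),
            )
```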
Is there a better way to get rid of this problem?
Or should we adjust the default behavior of Lightning for the NCCL backend?
Best regards
Leon Varga
Code
What's your environment?