[tune] PyTorch Lightning 1.7 with Ray Tune hangs #28197
Comments
Does this always happen for you? Can you use
Hi Kai, please download this. I tried 10-trial tuning several times but wasn't able to reproduce it. Then the problem occurred on every run of 50 trials that I did.
I can reproduce the error (sometimes...). The main training thread of a hanging process looks like this:
This hangs in PyTorch Lightning's device lookup here: https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/utilities/device_parser.py#L334 The respective process looks like this:
Generally, it seems the multiprocessing pool interacts with Ray (i.e. it triggers a worker exit), which blocks a GCS update forever because the main process does not exit. I'll try to create a minimal repro for this.
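An illustrative sketch of the pattern described above, assuming the culprit is a fork-context multiprocessing pool being opened inside a Ray worker; the names `_device_count_stub` and `trial` are made up for the sketch and are not from the original thread:

```python
# Illustrative-only sketch (an assumption, not the actual repro) of the
# interaction described above: a fork-context multiprocessing pool is
# created inside a Ray worker, mirroring what PL 1.7's device lookup does.
import multiprocessing

import ray


def _device_count_stub() -> int:
    # Stand-in for torch.cuda.device_count(), which the real code runs in a
    # forked child so that CUDA is never initialized in the parent process.
    return 0


@ray.remote
def trial() -> int:
    # Fork pool inside a Ray worker; per the comment above, the forked
    # child's exit can be picked up by Ray and leave a GCS update blocked.
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(_device_count_stub)


if __name__ == "__main__":
    ray.init()
    # The reporter saw the hang with ~50 trials rather than 10.
    print(ray.get([trial.remote() for _ in range(50)]))
```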
In the meantime, you should be able to use this as a workaround:
(add somewhere at the top)
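The original workaround snippet isn't preserved in this copy of the thread. The following is a plausible sketch, assuming the intent is to monkey-patch PyTorch Lightning's `device_parser` helpers so the CUDA device count is queried in-process instead of through a forked pool; the function names are taken from PL 1.7's `device_parser.py` and should be verified against your installed version:

```python
# Hedged workaround sketch: replace PL 1.7's fork-based CUDA lookup with a
# direct query so no multiprocessing pool is created inside the Ray worker.
# Add this at the top of the training script, before any Trainer is built.
import torch
from pytorch_lightning.utilities import device_parser


def _num_cuda_devices() -> int:
    # Query CUDA in the current process instead of a forked child.
    return torch.cuda.device_count()


def _is_cuda_available() -> bool:
    return torch.cuda.is_available()


device_parser.num_cuda_devices = _num_cuda_devices
device_parser.is_cuda_available = _is_cuda_available
```

Whether this covers every call site depends on how Lightning imports these helpers internally, so treat it as a starting point rather than the exact fix that was posted.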
Yes, it works!
What happened + What you expected to happen
Ray Tune stops launching new trials even though all computational resources are free; the program seems dead. The RUNNING trial doesn't appear to be making progress, as no metrics have been updated for a very long time (completing a full trial wouldn't even take that long).

Versions / Dependencies
Reproduction script
Just run the official example https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch_lightning.py
python mnist_pytorch_lightning.py
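For readers without the file at hand, here is a condensed, self-contained sketch of what that example does; it is not the official script. Random tensors replace MNIST, `TinyClassifier` is a stand-in for the example's `LightningMNISTClassifier`, and the Tune and Lightning calls follow the Ray 2.x / PL 1.7 APIs:

```python
# Condensed sketch of the repro: Ray Tune runs many trials, each of which
# builds a PyTorch Lightning Trainer, which is where PL 1.7 trials hang.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


class TinyClassifier(pl.LightningModule):
    """Stand-in for the example's LightningMNISTClassifier."""

    def __init__(self, lr: float):
        super().__init__()
        self.lr = lr
        self.layer = nn.Linear(784, 10)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("ptl/val_loss", loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def _random_loader(n: int = 256) -> DataLoader:
    # Random tensors instead of MNIST to keep the sketch self-contained.
    return DataLoader(
        TensorDataset(torch.randn(n, 784), torch.randint(0, 10, (n,))),
        batch_size=32,
    )


def train_fn(config):
    trainer = pl.Trainer(
        max_epochs=2,
        enable_progress_bar=False,
        callbacks=[TuneReportCallback({"loss": "ptl/val_loss"}, on="validation_end")],
    )
    trainer.fit(TinyClassifier(config["lr"]), _random_loader(), _random_loader())


if __name__ == "__main__":
    # ~50 trials is where the hang was reported to become reproducible.
    tune.run(train_fn, config={"lr": tune.loguniform(1e-4, 1e-1)}, num_samples=50)
```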
Issue Severity
High: It blocks me from completing my task.