Distributed Training not starting #3117

Closed
bluesky314 opened this issue Aug 24, 2020 · 4 comments · Fixed by #3819
Labels: bug, help wanted, waiting on author

Comments


bluesky314 commented Aug 24, 2020

My code works and training starts with the progress bar on 1 GPU. However, when I switch to 2 GPUs, using either ddp or ddp_spawn, the process gets stuck at:

[image]

Nothing happens, but around 1.5 GB of memory is taken on each GPU (the process takes around 8 GB on a single GPU).

Here is my code:

Confidential Code

System

Python 3.6
PyTorch 1.6
Linux
Lightning installed via pip
CUDA version: 10.1
NVIDIA V100
bluesky314 added the bug and help wanted labels Aug 24, 2020
Member

Borda commented Aug 25, 2020

The example seems fine to me. Would you mind sharing a complete example, even with dummy data, e.g. a Colab notebook?

@bluesky314
Author

I read in the docs that Colab does not allow ddp. How can I give you a workable example, then?

Member

Borda commented Aug 25, 2020

True, so just share the script so we can run it...

edenlightning added the waiting on author label Sep 1, 2020
Contributor

awaelchli commented Sep 20, 2020

@bluesky314 Yes please, if you can share the script, I will try to run it and help debug.
How do you launch your script? Do you have the CUDA_VISIBLE_DEVICES environment variable set?
Also, could you run with NCCL_DEBUG=INFO and post the info that is printed?
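
For anyone following along, the two checks asked for above could be done roughly like this. It is only a sketch: the helper names `debug_env`, `report_gpu_env`, and `launch` are made up for illustration and are not part of Lightning or PyTorch, and the training script path is whatever you pass on the command line. The only things taken from the thread are the two environment variables, `CUDA_VISIBLE_DEVICES` and `NCCL_DEBUG`.

```python
import os
import subprocess
import sys


def debug_env(base_env=None):
    """Return a copy of the environment with NCCL_DEBUG=INFO added."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # asks NCCL to print its setup information
    return env


def report_gpu_env(env):
    """Collect the two variables asked about in the comment above."""
    return {
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES", "<not set>"),
        "NCCL_DEBUG": env.get("NCCL_DEBUG", "<not set>"),
    }


def launch(script_args, env):
    """Run `python <script_args...>` as a child process with the given environment."""
    return subprocess.run([sys.executable, *script_args], env=env, check=False)


if __name__ == "__main__":
    env = debug_env()
    print(report_gpu_env(env))
    # Any remaining arguments are treated as the training script and its
    # own arguments (hypothetical; substitute your own script).
    if len(sys.argv) > 1:
        launch(sys.argv[1:], env)
```

Saved under a name of your choice and run with your training script as its argument, this prints both variables and then starts the script with `NCCL_DEBUG=INFO` set, so the NCCL output ends up in the same console log that can be pasted into the issue.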

4 participants