Distributed Training not starting #3117
Labels: bug, help wanted, waiting on author
My code works and training starts, with the progress bar shown, on 1 GPU. However, when I switch to 2 GPUs, using either `ddp` or `ddp_spawn`, the process gets stuck right after launch: nothing happens, but around 1.5 GB is taken on each GPU (the process takes around 8 GB on a single GPU).
Here is my code: it is confidential, so I can't paste it, but a minimal stand-in sketch of the setup follows.
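This sketch is hypothetical, not the reporter's actual code: `LitModel`, the random dataset, and the flag values are stand-ins. It only illustrates, under those assumptions, the kind of setup described above, where the sole change between the working single-GPU run and the hanging multi-GPU run is the `Trainer` configuration (Lightning's `distributed_backend` flag from the 0.9-era API).

```python
# Minimal hypothetical repro sketch; LitModel and the random dataset are
# stand-ins, not the confidential training code from the report.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    loader = DataLoader(dataset, batch_size=16)
    model = LitModel()

    # gpus=1: progress bar appears and training runs normally.
    # gpus=2 with "ddp" or "ddp_spawn": ~1.5 GB is allocated per GPU,
    # then the process stalls before the progress bar appears.
    trainer = pl.Trainer(gpus=2, distributed_backend="ddp", max_epochs=1)
    trainer.fit(model, loader)


if __name__ == "__main__":
    # The main guard matters for ddp, which relaunches the script in
    # subprocesses (one per GPU).
    main()
```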
System