
Distributed Data Training #3034

Closed · bluesky314 opened this issue Aug 18, 2020 · 15 comments

Labels: help wanted (Open to be worked on), priority: 0 (High priority task)

Comments
bluesky314 commented Aug 18, 2020

(My code is confidential, so I can't share it in full.)

I want to do distributed training with 2 GPUs using data parallel.
This works just fine with one GPU, but when I do trainer = pl.Trainer(gpus=2) or trainer = pl.Trainer(gpus=[0,1]) I get an error:
[Screenshot: error traceback]
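For reference, a minimal sketch of the kind of setup being described (the model and data below are placeholders, since the actual code is confidential and not shown in the thread):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Placeholder LightningModule standing in for the confidential model.
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Dummy data in place of the real loader.
train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8)

# trainer = pl.Trainer(gpus=1)        # works, per the report
trainer = pl.Trainer(gpus=2)          # fails as reported; gpus=[0, 1] behaves the same
trainer.fit(LitModel(), train_loader)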

System

  • Python version: 3.6
  • OS: Linux
  • Install method: pip
  • CUDA version: 10.1
  • GPU: Nvidia-100
bluesky314 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 18, 2020
github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

ananyahjha93 (Contributor)

@bluesky314 I am not able to reproduce this bug. Can you provide the Lightning version and some code to reproduce it exactly?

ananyahjha93 added the priority: 0 (High priority task) label on Aug 18, 2020
bluesky314 (Author) commented Aug 18, 2020

I just installed Lightning via the pip install command given in the docs. My PyTorch version is 1.6. The code worked when gpus was set to 1. As you can see in the messages above and below, the error traces internally to multiprocessing files when trainer.fit is called and has little to do with my code. The runtime error also cannot be caused by any of my definitions.

Before this I was using wandb to log experiments on a single GPU; after I changed to multiple GPUs I got a wandb error saying wandb.init was not called. (wandb was not actually used to log anything, just to view the terminal log from the wandb page, so it did not interact with Lightning.) I think copying the program to multiple processes is leading to a problem.

Here is the extended error:
[Screenshot: extended error traceback]

ananyahjha93 removed their assignment on Aug 18, 2020
awaelchli (Contributor) commented Aug 18, 2020

> I just installed lightning via pip install command given in the docs

You will probably get 0.8.5. Could you try to install the 0.9.0rc16 release, or confirm that you have it already?

pip install pytorch-lightning==0.9.0rc16 --upgrade
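To confirm which version actually got picked up after the upgrade, a quick generic check (not something from this thread):

# Print the installed pytorch-lightning and torch versions.
import pytorch_lightning as pl
import torch

print(pl.__version__)     # should show 0.9.0rc16 after the upgrade
print(torch.__version__)  # the reporter mentions 1.6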

edenlightning added this to the 0.9.0 milestone on Aug 18, 2020
awaelchli self-assigned this on Aug 18, 2020
bluesky314 (Author)

I updated it as above but am getting the same error. The difference is that the print commands at the top of my files are now printed twice, meaning that it is making two copies of the code, but the runtime error remains.

awaelchli (Contributor)

I think you need to put

if __name__ == "__main__":
    ...

around the entry point of your script. This is a Python multiprocessing thing: it imports the module before it runs it, so you need to guard the entry point of the program. Closing this, as my suspicion is very high that this is the problem here. Let me know if this fixes it for you.

awaelchli removed the bug (Something isn't working) label on Aug 19, 2020
bluesky314 (Author) commented Aug 19, 2020

Thanks, adding:

def main():
    train_loader = get_loader([size, size])['train']
    model = LightningModel()

    trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
    trainer.fit(model, train_loader)

if __name__ == '__main__':
    main()
    

has removed the previous error, but now the process gets stuck at:
[Screenshot: console output where the run hangs]

The training progress does not start showing, but 1.6 GB of memory on the 2 GPUs gets used up.

awaelchli reopened this on Aug 19, 2020
awaelchli (Contributor) commented Aug 19, 2020

And what about distributed_backend="ddp_spawn"?
What is dso_loader.cc? Make sure this actually supports running on multi-GPU; it seems to come from the TensorBoard library.
Have you trained this code before in plain PyTorch on multiple GPUs?
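For reference, switching to the spawn-based backend in the 0.9-era API would look roughly like this (a sketch reusing the names from the snippet above):

# Same setup as before, but with the backend suggested here; ddp_spawn starts
# worker processes via torch.multiprocessing.spawn instead of re-launching the script.
trainer = pl.Trainer(gpus=2, distributed_backend='ddp_spawn')
trainer.fit(model, train_loader)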

bluesky314 (Author) commented Aug 19, 2020

I am using PyTorch, so I am not sure why TensorBoard is getting involved, but it seems those operations were successful anyway. dso_loader.cc is not my file. This is an AWS instance with 2 GPUs.

Yes, I have used this code with nn.DataParallel and it worked there. To remove the single-CPU bottleneck I wanted to use Distributed DataParallel, so someone recommended I try Lightning as it is easier to set up.
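For context, the previously working single-process setup presumably looked something like this in plain PyTorch (a hypothetical sketch; the real model and data are not shown in the thread):

import torch
from torch import nn

# Hypothetical model standing in for the real one.
model = nn.Sequential(nn.Linear(32, 1)).cuda()

# nn.DataParallel drives both GPUs from one process, which is the single-CPU
# bottleneck that DistributedDataParallel (and Lightning's ddp backend) avoids.
model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(8, 32).cuda()
out = model(x)  # inputs are scattered across the GPUs, outputs gathered on GPU 0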

awaelchli (Contributor)

ok, is distributed_backend="ddp_spawn" also getting stuck?

bluesky314 (Author)

Let me try and get back

williamFalcon (Contributor)

Yeah, the original issue was not using the __main__ guard, which is a PyTorch requirement.

pgg1610 commented Aug 22, 2020

I am having an issue with multi-GPU training:

mnist_model = MNISTModel()
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))

This does not start training, but setting gpus=1 trains fine on a single GPU.
System:

  • PyTorch: 1.4.0
  • Pytorch Lightning: 0.9.0rc2
  • CUDA: 10.0
  • GPU: Tesla V100-SXM2-16GB

Frozen output image attached.
[Screenshot: frozen console output]

williamFalcon (Contributor)

As the docs state, multi-GPU is not supported on Jupyter or Colab. This is a limitation of those platforms, not Lightning.
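One common workaround (an assumption based on how ddp launches processes, not something stated by the maintainers here) is to move the training code into a standalone script with the __main__ guard and run it with python instead of inside a notebook cell:

# train_mnist.py -- hypothetical standalone script; MNISTModel, train and val
# are the objects from the comment above and are assumed to be defined or imported here.
import pytorch_lightning as pl
from torch.utils.data import DataLoader

def main():
    mnist_model = MNISTModel()
    trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
    trainer.fit(mnist_model,
                DataLoader(train, num_workers=10),
                DataLoader(val, num_workers=10))

if __name__ == '__main__':
    main()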

bluesky314 (Author)

@awaelchli @williamFalcon I have created a new issue for the new error: #3117
