
Distributed Data Training #3034

Closed · bluesky314 opened this issue Aug 18, 2020 · 15 comments

Labels: help wanted (Open to be worked on), priority: 0 (High priority task)

Comments
bluesky314 commented Aug 18, 2020

(My code is confidential, so I can't share it in full.)

I want to do distributed training with 2 GPUs using data parallel.
This works just fine with one GPU, but when I do trainer = pl.Trainer(gpus=2) or trainer = pl.Trainer(gpus=[0,1]) I get an error:
[Screenshot: error traceback]
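For reference, a minimal sketch of the kind of setup being described (the model and data below are placeholders, since the actual code is confidential and not shown in the thread):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Placeholder LightningModule standing in for the confidential model.
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Dummy data in place of the real loader.
train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8)

# trainer = pl.Trainer(gpus=1)        # works, per the report
trainer = pl.Trainer(gpus=2)          # fails as reported; gpus=[0, 1] behaves the same
trainer.fit(LitModel(), train_loader)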

System

  • Python version: 3.6
  • OS: Linux
  • Install method: pip
  • CUDA version: 10.1
  • GPU: Nvidia-100
bluesky314 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 18, 2020
github-actions (Contributor)

Hi! Thanks for your contribution, great first issue!

ananyahjha93 (Contributor)

@bluesky314 I am not able to reproduce this bug. Can you provide the Lightning version and some code to reproduce it exactly?

ananyahjha93 added the priority: 0 (High priority task) label on Aug 18, 2020
bluesky314 (Author) commented Aug 18, 2020

I just installed Lightning via the pip install command given in the docs. My PyTorch version is 1.6. The code worked when gpus was set to 1. As you can see in the messages above and below, the error traces internally to multiprocessing files when trainer.fit is called and has little to do with my code. The runtime error also cannot be caused by any of my definitions.

Before this I was using wandb to log experiments on a single GPU; after I changed to multiple GPUs I got a wandb error saying wandb.init was not called. (wandb was not actually used to log anything, just to view the terminal log from the wandb page, so it did not interact with Lightning.) I think copying the program to multiple processes is leading to a problem.

Here is the extended error:
[Screenshot: extended error traceback]

ananyahjha93 removed their assignment on Aug 18, 2020
awaelchli (Contributor) commented Aug 18, 2020

> I just installed lightning via pip install command given in the docs

You will probably get 0.8.5. Could you try to install the 0.9.0rc16 release, or confirm that you have it already?

pip install pytorch-lightning==0.9.0rc16 --upgrade
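To confirm which version actually got picked up after the upgrade, a quick generic check (not something from this thread):

# Print the installed pytorch-lightning and torch versions.
import pytorch_lightning as pl
import torch

print(pl.__version__)     # should show 0.9.0rc16 after the upgrade
print(torch.__version__)  # the reporter mentions 1.6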

edenlightning added this to the 0.9.0 milestone on Aug 18, 2020
awaelchli self-assigned this on Aug 18, 2020
bluesky314 (Author)

I updated it as above but am getting the same error. The difference is that the print commands at the top of my files are now printed twice, meaning that it is making two copies of the code, but the runtime error remains.

awaelchli (Contributor)

I think you need to put

if __name__ == "__main__":
    ...

around the entry point of your script. This is a Python multiprocessing thing: it imports the module before it runs it, so you need to guard the entry point of the program. Closing this, as my suspicion is very high that this is the problem here. Let me know if this fixes it for you.

awaelchli removed the bug (Something isn't working) label on Aug 19, 2020
bluesky314 (Author) commented Aug 19, 2020

Thanks, adding:

def main():
    train_loader = get_loader([size, size])['train']
    model = LightningModel()

    trainer = pl.Trainer(gpus=2, distributed_backend='ddp')
    trainer.fit(model, train_loader)

if __name__ == '__main__':
    main()
    

has removed the previous error, but now the process gets stuck at:
[Screenshot: console output where the run hangs]

The training progress does not start showing, but 1.6 GB of memory on the 2 GPUs gets used up.

awaelchli reopened this on Aug 19, 2020
awaelchli (Contributor) commented Aug 19, 2020

And what about distributed_backend="ddp_spawn"?
What is dso_loader.cc? Make sure this actually supports running on multi-GPU; it seems to come from the TensorBoard library.
Have you trained this code before in plain PyTorch on multiple GPUs?
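For reference, switching to the spawn-based backend in the 0.9-era API would look roughly like this (a sketch reusing the names from the snippet above):

# Same setup as before, but with the backend suggested here; ddp_spawn starts
# worker processes via torch.multiprocessing.spawn instead of re-launching the script.
trainer = pl.Trainer(gpus=2, distributed_backend='ddp_spawn')
trainer.fit(model, train_loader)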

bluesky314 (Author) commented Aug 19, 2020

I am using PyTorch, so I am not sure why TensorBoard is getting involved, but it seems those operations were successful anyway. dso_loader.cc is not my file. This is an AWS instance with 2 GPUs.

Yes, I have used this code with nn.DataParallel and it worked there. To remove the single-CPU bottleneck I wanted to use Distributed DataParallel, so someone recommended I try Lightning as it is easier to set up.
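For context, the previously working single-process setup presumably looked something like this in plain PyTorch (a hypothetical sketch; the real model and data are not shown in the thread):

import torch
from torch import nn

# Hypothetical model standing in for the real one.
model = nn.Sequential(nn.Linear(32, 1)).cuda()

# nn.DataParallel drives both GPUs from one process, which is the single-CPU
# bottleneck that DistributedDataParallel (and Lightning's ddp backend) avoids.
model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(8, 32).cuda()
out = model(x)  # inputs are scattered across the GPUs, outputs gathered on GPU 0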

awaelchli (Contributor)

ok, is distributed_backend="ddp_spawn" also getting stuck?

bluesky314 (Author)

Let me try and get back

williamFalcon (Contributor)

Yeah, the original issue was not using the __main__ guard, which is a PyTorch requirement.

pgg1610 commented Aug 22, 2020

I am having an issue with multi-GPU training:

mnist_model = MNISTModel()
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))

This does not start training, but setting gpus=1 trains fine on a single GPU.
System:

  • PyTorch: 1.4.0
  • Pytorch Lightning: 0.9.0rc2
  • CUDA: 10.0
  • GPU: Tesla V100-SXM2-16GB

Frozen output image attached.
[Screenshot: frozen console output]

williamFalcon (Contributor)

As the docs state, multi-GPU is not supported on Jupyter or Colab. This is a limitation of those platforms, not Lightning.
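One common workaround (an assumption based on how ddp launches processes, not something stated by the maintainers here) is to move the training code into a standalone script with the __main__ guard and run it with python instead of inside a notebook cell:

# train_mnist.py -- hypothetical standalone script; MNISTModel, train and val
# are the objects from the comment above and are assumed to be defined or imported here.
import pytorch_lightning as pl
from torch.utils.data import DataLoader

def main():
    mnist_model = MNISTModel()
    trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
    trainer.fit(mnist_model,
                DataLoader(train, num_workers=10),
                DataLoader(val, num_workers=10))

if __name__ == '__main__':
    main()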

bluesky314 (Author)

@awaelchli @williamFalcon I have created a new issue for the new error: #3117
