Distributed Data Training #3034
Hi! Thanks for your contribution! Great first issue!
@bluesky314 I am not able to reproduce the same bug. Can you provide the Lightning version and some code to reproduce this exactly?
I just installed Lightning via the pip install command given in the docs. My PyTorch version is 1.6. The code worked when gpus was set to 1. As you can see in the messages above and below, the error traces internally to multiprocessing files when trainer.fit is called and has little to do with my code; the runtime error cannot be caused by any of my definitions. Before this I was using wandb to log experiments with a single GPU. After I changed to multiple GPUs I got a wandb error saying wandb.init was not called. (wandb was not actually used to log anything, just to view the terminal log from the wandb page, so it did not interact with Lightning.) I think copying the program to multiple processes is leading to the problem.
You will probably get 0.8.5. Could you try to install the 0.9.0rc16 release, or confirm that you have it already?
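For anyone else checking, a quick way to confirm which version pip actually installed (a minimal sketch, nothing beyond the package's own version string):

```python
# Print the installed PyTorch Lightning version to confirm
# whether you are on 0.8.5 or the 0.9.0rc16 release candidate.
import pytorch_lightning as pl
print(pl.__version__)
```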
I updated it as above but am getting the same error. The difference is that the print commands at the top of the files are printed twice, meaning it is making two copies of the code, but the RuntimeError remains.
I think you need to put `if __name__ == "__main__":` before the start of your script. This is a Python multiprocessing thing: it imports the module before it runs it, so you need to guard the entry point of the program. Closing this, as my suspicion is very high that this is the problem here. Let me know if this fixes it for you.
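For reference, a minimal sketch of the guarded script; MNISTModel, train, and val stand in for the poster's own definitions from the thread and are not part of Lightning itself:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# MNISTModel, train, and val are the poster's own model and datasets;
# they are placeholders here, defined elsewhere in the script.

def main():
    mnist_model = MNISTModel()
    trainer = pl.Trainer(max_epochs=1, gpus=2, distributed_backend="ddp")
    trainer.fit(mnist_model,
                DataLoader(train, num_workers=10),
                DataLoader(val, num_workers=10))

# Guard the entry point: each DDP worker re-imports this module,
# so anything at module level would otherwise run once per process.
if __name__ == "__main__":
    main()
```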
And what about distributed_backend="ddp_spawn"?
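(For context, switching the backend is a single Trainer flag in the 0.9.x API; as I understand it, ddp_spawn starts workers via torch.multiprocessing.spawn instead of relaunching the script:)

```python
# Same Trainer as before, but with the spawn-based DDP backend.
trainer = pl.Trainer(gpus=2, distributed_backend="ddp_spawn")
```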
I am using PyTorch, so I'm not sure why TensorBoard is getting involved, but it seems those operations were successful anyway; dso_loader.cc is not my file. This is an AWS instance with 2 GPUs. Yes, I have used this code with nn.DataParallel and it worked there. To remove the single-CPU bottleneck I wanted to use DistributedDataParallel, so someone recommended I try Lightning as it is easier to set up.
ok, is
Let me try and get back |
Yeah, this original issue was not using the `__main__` guard, which is a PyTorch requirement.
I am having an issue with multiple-GPU training:

```python
mnist_model = MNISTModel()
trainer = pl.Trainer(max_epochs=1, gpus=-1, distributed_backend='ddp')
trainer.fit(mnist_model, DataLoader(train, num_workers=10), DataLoader(val, num_workers=10))
```

This does not start the training, but setting gpus=1 works.
As the docs state, multi-GPU is not supported on Jupyter or Colab. This is a limitation of those platforms, not Lightning.
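A possible workaround sketch, assuming the 0.9.x Trainer API: the dp backend (DataParallel) stays in a single process, so it can run inside a notebook:

```python
import pytorch_lightning as pl

# "dp" wraps the model in DataParallel within one process, so it
# works in Jupyter/Colab; "ddp" must relaunch the script, which
# notebook kernels cannot do.
trainer = pl.Trainer(gpus=2, distributed_backend="dp")
```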
@awaelchli @williamFalcon I have created a new issue for the new error: #3117 |
confidential code

I want to do Distributed Training with 2 GPUs with data parallel. This works just fine with one GPU, but when I do

```python
trainer = pl.Trainer(gpus=2)
```

or

```python
trainer = pl.Trainer(gpus=[0,1])
```

I get an error.