Hi! Thank you for your great contribution to this repo. I found it really convenient to reproduce various GANs with the help of mimicry.
However, when I read the training code in the examples, I found that it works well for the single-GPU case but is not quite suitable for distributed training. I had a glance at the source code of Logger, which handles visualization with TensorBoard, and it turns out that if one adopts distributed training with torch.nn.parallel.DistributedDataParallel, each process (one process per GPU) will create a new event file recording the information of that GPU/process. Apparently, this is not what we want. A possible solution is to create the TensorBoard file only in the rank-0 process and to record only the averaged metrics, e.g. along the lines of the sketch below. If you are going to improve this, refer to torch.distributed and see the ImageNet training example provided by the official PyTorch repository.
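To make the idea concrete, here is a minimal sketch (not taken from mimicry's actual Logger, and the helper names `is_rank_zero`/`log_scalar` are purely illustrative): the writer is created only on rank 0, and metrics are all-reduced across processes before being written. It assumes the process group has already been set up with torch.distributed.init_process_group.

```python
import torch
import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def is_rank_zero():
    # True for single-GPU runs, or for rank 0 in a distributed run.
    return not dist.is_initialized() or dist.get_rank() == 0

# Create the writer only on rank 0 so that only one event file is produced.
writer = SummaryWriter(log_dir="./log") if is_rank_zero() else None

def log_scalar(name, value, step, device):
    # Average `value` across all processes, then log it from rank 0 only.
    tensor = torch.tensor(float(value), device=device)
    if dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor /= dist.get_world_size()
    if writer is not None:
        writer.add_scalar(name, tensor.item(), step)
```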
Hi @WingsleyLui, thank you for the comments! Indeed this is something I'm looking at, especially since there are some models (e.g. SAGAN) that seem to only work if I use a large batch size, which is only possible with multiple GPUs (or a really big one). Your suggestions are fantastic, and I will keep them in mind while I work on it -- will keep this issue open and update when it is done!