Hi! Thank you for your great contribution to this repo. I found it really convenient to reproduce various GANs with the help of mimicry.
However, when I read the training code in the examples, I found that it works well for the single-GPU case but is not quite suitable for distributed training. I had a glance at the source code of Logger, which handles visualization with TensorBoard, and it turns out that if one adopts distributed training with torch.nn.parallel.DistributedDataParallel, each process (one process per GPU) will create a new event file recording the information of that GPU/process. Apparently, this is not what we want. A possible solution is to create the TensorBoard file only in the rank-0 process and to record only the averaged metrics, e.g. along the lines of the sketch below. If you are going to improve this, refer to torch.distributed and see the ImageNet training example provided by the official PyTorch repository.
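To make the idea concrete, here is a minimal sketch (not taken from mimicry's actual Logger, and the helper names `is_rank_zero`/`log_scalar` are purely illustrative): the writer is created only on rank 0, and metrics are all-reduced across processes before being written. It assumes the process group has already been set up with torch.distributed.init_process_group.

```python
import torch
import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def is_rank_zero():
    # True for single-GPU runs, or for rank 0 in a distributed run.
    return not dist.is_initialized() or dist.get_rank() == 0

# Create the writer only on rank 0 so that only one event file is produced.
writer = SummaryWriter(log_dir="./log") if is_rank_zero() else None

def log_scalar(name, value, step, device):
    # Average `value` across all processes, then log it from rank 0 only.
    tensor = torch.tensor(float(value), device=device)
    if dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor /= dist.get_world_size()
    if writer is not None:
        writer.add_scalar(name, tensor.item(), step)
```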
Hi @WingsleyLui, thank you for the comments! Indeed this is something I'm looking at, especially since there are some models (e.g. SAGAN) that seem to only work if I use a large batch size, which is only possible with multiple GPUs (or a really big one). Your suggestions are fantastic, and I will keep them in mind while I work on it -- will keep this issue open and update when it is done!