Using DistributedDataParallel for multi GPU training #2536
Comments
we recently refactored the so probably my recommendation would be to implement the simplest possible does that make sense? |
Thanks for the heads up @joelgrus . I did a preliminary analysis on how something like a
|
So I've been fiddling with the code and hacked around to make Dataset: Librispeech 100 hour
I'll post my WIP code over the weekend; I realize I can only expect detailed insights once it is up. However, my main concern here is that DP works so well, while I've seen people complaining about no speedups with DP. One major difference I've observed in AllenNLP's DP is that the loss is calculated on each GPU, unlike the standard PyTorch way of calculating the criterion on a single GPU. This might save some tensor reductions and hence give better performance even with DP? I would also like advice on which benchmark experiment I should run so that the numbers are meaningful. This is my first time working with multiple GPUs and I guess I need to get the load right so as to report valid numbers. Would BiDAF be a suitable workload to gather the numbers? Or a language model perhaps? Would like to hear from folks @2200 too! |
@scarecrow1123 Did you ever get this working completely? I'd be curious to hear whether you were able to get the speedups you had hoped for or any strategies you used for implementation. |
@mihail911 I have a working version in my fork here. This tries to patch the existing trainer with distributed & fp16 cases. Following are the broad changes that I've done to make it work:
Logging: Modify the current logging setup so that the workers also can log to
|
@brendan-ai2, @DeNeutoy, I think you two have more context on this one than @joelgrus or me. What do you think? |
@scarecrow1123 I took a look at this - looks good. Here are my top level comments:
One thing that I didn't quite follow in your diff was the logging. Could you give some examples of what that looks like? It might be nice to be able to stream the logs from different workers to different files, or something like that.
I'm in two minds about whether this needs to wait until we switch to the torch The
After we've discussed this a little more, I might suggest the following PRs:
PR 1: Distributed logging and metric aggregation
PR 2: Changes to trainer and
PR 3: additional changes to support AMP
@brendan-ai2, it would be good to get your review of @scarecrow1123's diff too! Just as a note, I am on holiday from 31st Sept - 6th Oct, so I might be a bit slow replying here. However, @scarecrow1123, your work is very much welcome and I'd love to find a way to merge it into allennlp. |
@scarecrow1123, thanks for working on this! The diff will take me a bit of time to go over carefully, but I'll try to have some high level feedback tomorrow. |
Thanks for the detailed feedback @DeNeutoy .
I think I was not clear when I mentioned the number of batches. From my understanding, model forward/backward are synchronization points in
So let's take this case of a 2 GPU setup and say Another option to make this work would be to duplicate a few training examples to even out the number of batches. This is what PyTorch's
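The duplication idea above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the fork: `shard_indices` is a hypothetical helper, and padding by wrapping around to the start of the dataset is just one reasonable choice.

```python
import math


def shard_indices(num_examples, rank, world_size):
    """Return the example indices that worker `rank` should read.

    Hypothetical helper: pads by repeating examples from the start so that
    every worker ends up with the same number of examples.
    """
    per_worker = math.ceil(num_examples / world_size)
    total = per_worker * world_size
    # Wrap around so the tail is filled with duplicated early examples.
    padded = [i % num_examples for i in range(total)]
    # Strided slice: worker r takes positions r, r + world_size, ...
    return padded[rank::world_size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(10, rank, world_size=4))
```

With 10 examples and 4 workers, each worker gets 3 indices and two examples are seen twice per epoch, so with equal batch sizes every worker runs the same number of forward/backward steps.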
Workers logging to their respective log files would be pretty straightforward with AllenNLP's existing logging setup. However, it could be more useful if they logged to the same stdout/stderr logs. Here's a gist with an example program to test the logging setup present in the fork. I've written a mini tutorial of sorts to explain this case here. Basically if the workers need to log to the same file,
Refer to the original docs for QueueHandler, QueueListener and a list of useful handlers.
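A minimal sketch of the queue-based pattern, using only the standard library's `QueueHandler` and `QueueListener`. The helper names (`RankFilter`, `configure_worker_logging`, `start_listener`) and the log format are assumptions made for illustration; they are not taken from the fork.

```python
import logging
import logging.handlers


class RankFilter(logging.Filter):
    """Attach the worker's rank to every record (hypothetical helper)."""

    def __init__(self, rank):
        super().__init__()
        self.rank = rank

    def filter(self, record):
        record.rank = self.rank
        return True


def configure_worker_logging(queue, rank, level=logging.INFO):
    """Called once in each worker: send all records to the shared queue."""
    handler = logging.handlers.QueueHandler(queue)
    handler.addFilter(RankFilter(rank))
    root = logging.getLogger()
    root.handlers.clear()  # workers do no file or console I/O themselves
    root.addHandler(handler)
    root.setLevel(level)


def start_listener(queue, *handlers):
    """Called once in the parent: drain the queue into the real handlers."""
    fmt = logging.Formatter("%(asctime)s [rank %(rank)s] %(name)s: %(message)s")
    for handler in handlers:
        handler.setFormatter(fmt)
    listener = logging.handlers.QueueListener(queue, *handlers)
    listener.start()
    return listener
```

In a real run the queue would be a `multiprocessing.Queue` created by the parent and passed to each worker process. The parent would call `listener.stop()` after the workers exit so that any records still in the queue are written out.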
My only qualm is the current
I'll try to run existing AllenNLP experiments and share the results with you in a few days. Your PR suggestions make sense. I'll wait for your further comments and Brendan's comments too. |
@brendan-ai2 @DeNeutoy Were you able to have a closer look at the code? I'm looking forward to starting on some PRs for this. I've also started doing some comparisons between the current implementation and the one in the fork. You can find the numbers and the corresponding experiments in this repo. The numbers are in no way complete and I'll be adding more of them along with accuracy stats. |
@scarecrow1123 Looking good! I'll do another review of this tomorrow before we go ahead. In the meantime, would you mind getting the single GPU number filled out for Esim in your repo, so we have that baseline? Sorry for the delay on this! I'm back from holiday now so I should be more responsive. |
@DeNeutoy I've added single GPU numbers and also comparison for Bidaf experiment. |
@scarecrow1123 sweet! Looks like a good speedup. Just a sanity check - what numbers are you getting for Acc/F1 for those runs? I'm assuming they are similar to the single GPU numbers. If so, we can start the PRs I think, starting with the logging, then the changes to the trainer/dataset readers. In particular, we should leave out the |
@scarecrow1123, thanks again for this! Could you upload the model output folder for those runs, so we can take a peek at the models and vocabs produced? Also, just to confirm, the numbers from https://github.com/scarecrow1123/allennlp-distributed-training/blob/master/README.md under "4x Data Parallel" are produced using AllenNLP's current (limited) multi-GPU support, correct? (That is, the support available simply by using a list of device ids for |
@DeNeutoy Right now I'm seeing a bit of overfitting happening in the 4x experiments, presumably because of 4x larger batches. I'll have to do single GPU experiments with 4x larger batches to get a proper comparison. I'll try to get the accuracy stats for all these combinations and get back to you along with the model outputs as @brendan-ai2 has asked.
Yes my runs do not include amp. I omitted it mainly because of
Yes that's right. Those numbers are using upstream AllenNLP HEAD. Only the distributed version is run from the fork.
Fair point. Let me try running some experiments with multiprocess reader and get back. |
@scarecrow1123 I did some benchmarking with the new torch dataset loaders vs our current multiprocess implementation, and it doesn't look like the multiprocessing in allennlp will provide a speedup for your code. So don't worry about running those experiments which include using the @brendan-ai2 and I will set up an upstream branch that we will collect these big changes into, before we merge them into master. This branch will also include the changes I make to support the |
Thanks for running those benchmarks, @DeNeutoy! In case anyone is curious about the details (and for my own memory), what Mark showed with master...DeNeutoy:benchmark was that the time spent indexing and tensorizing with the |
Full details of that benchmark available in #3079 💯 |
@scarecrow1123, please open your PRs against https://github.com/allenai/allennlp/tree/torch-distributed when ready. :) It's even with |
@brendan-ai2 @DeNeutoy Thanks for the clarification (one item cleared from my todo! ;) ). I'm still seeing an accuracy dip with |
The dip in accuracy/loss metrics has been entirely due to inconsistent vocabulary indices across workers. To explain it in detail, I had modified the
    # distributed snli reader: each worker keeps only its own share of the lines
    def _read(self):
        ...
        for idx, line in enumerate(snli_file):
            # skip lines that are assigned to other workers
            if idx % world_size != rank:
                continue
            yield instance
This works well during the actual training. However, this selective filtering is also applied during vocabulary creation, before the training actually starts, and hence the workers end up dealing with different vocabularies. I'm not sure how to actually get over this issue. As
To test the trainer code, I just created the vocabulary beforehand using |
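To make the failure mode concrete, here is a small self-contained sketch in plain Python. It does not use AllenNLP's vocabulary or reader classes; `shard`, `build_vocab` and the toy corpus are hypothetical stand-ins chosen only to show why indices diverge when each worker builds a vocabulary from its own shard.

```python
from collections import Counter


def shard(lines, rank, world_size):
    """Keep only the lines assigned to this worker."""
    return [line for idx, line in enumerate(lines) if idx % world_size == rank]


def build_vocab(lines):
    """Map tokens to indices, most frequent first (ties broken alphabetically)."""
    counts = Counter(token for line in lines for token in line.split())
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


corpus = [
    "a man is sleeping",
    "a dog is running",
    "two women are talking",
    "a child is laughing",
]

# Vocabulary built after sharding: each worker sees different tokens.
per_worker = [build_vocab(shard(corpus, rank, 2)) for rank in range(2)]

# Vocabulary built once from the full corpus, then shared with every worker.
shared = build_vocab(corpus)
```

Here the token "is" gets a different index on each worker, and "dog" is missing from worker 0's vocabulary altogether, so embedding rows would not line up across replicas when gradients are averaged. Building the vocabulary once from the unsharded data, or loading a pre-built one, avoids this.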
We may have to move vocabulary creation out of |
This has landed in #3529. Huge thanks to @scarecrow1123!! |
I'm looking into the codebase to possibly use torch's DistributedDataParallel for multi GPU training. Based on the docs, it certainly would improve the training speed when compared with DataParallel. I'm only looking into single node - multi GPU case for now. I would like to get a heads up on possible caveats that I could face when integrating it into the current Trainer class. I am only worried about goofing up any stateful parts of the code that could result in leaks. Would be great if someone could give pointers to such instances if any.