Add Gradient accumulation support to the default trainer #2721
Conversation
Sync with allennlp master
Pull from AllenNLP Master
Pull from allennlp master
`time.strftime` does not account for durations longer than 31 days. Changing it to `datetime.timedelta` and using its `str` representation to print epoch duration as well as total training duration.
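A quick illustration of the bug (a standalone sketch, not the trainer's code; the 40-day figure is just an example):

```python
import datetime
import time

seconds = 40 * 24 * 3600  # a 40-day training run

# time.strftime formats a calendar date, so the "day" field wraps past 31:
print(time.strftime("%d:%H:%M:%S", time.gmtime(seconds)))  # 10:00:00:00 (wrong)

# str(datetime.timedelta) handles arbitrary durations correctly:
print(str(datetime.timedelta(seconds=seconds)))  # 40 days, 0:00:00
```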
Gradient accumulation means computing multiple mini-batches before performing a gradient update, to accommodate larger effective batch sizes. This commit adds a new key to the trainer config, `num_steps_to_accumulate`. The trainer performs an optimizer step only after the specified number of mini-batch computations.
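For context, a minimal PyTorch sketch of the technique with a toy model (not the trainer's actual code; names and values are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_steps_to_accumulate = 4  # the key this commit adds to the trainer config

batches = [torch.randn(8, 10) for _ in range(8)]  # toy mini-batches
for i, batch in enumerate(batches):
    # Scale the loss so the accumulated update matches one large batch.
    loss = model(batch).pow(2).mean() / num_steps_to_accumulate
    loss.backward()  # gradients accumulate in .grad across backward() calls
    if (i + 1) % num_steps_to_accumulate == 0:
        optimizer.step()       # one update per num_steps_to_accumulate batches
        optimizer.zero_grad()
```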
I have a few questions / comments
- this needs a test
- is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed"
- I feel like the if/then where `self.batch_loss` is called is not as clean as it could be, but that depends to some degree on the answer to 2
There is nothing that stops us from combining multi-GPU training with gradient accumulation. I only doubt whether people use it in practice when they are going to train with multiple GPUs anyway. But that is only an assumption, and I agree that the two are not really incompatible. Perhaps a note could just be logged before training.
This removes the checks that previously allowed gradient accumulation only for single-GPU training. The condition was not really necessary.
I've added a simple test.
I've addressed these points in the recent commits. Please review them @joelgrus
allennlp/training/trainer.py (Outdated)

```diff
@@ -290,14 +305,14 @@ def _train_epoch(self, epoch: int) -> Dict[str, float]:
         # Set the model to "train" mode.
         self.model.train()

-        num_gpus = len(self._cuda_devices)
+        num_batch_groups = self._num_gradient_accumulation_steps * len(self._cuda_devices)
```
can you add a comment explaining what `num_batch_groups` is and why it's computed the way it is, it's not obvious to me
I realized that `num_batch_groups` is a misnomer. It denotes the length of each `batch_group` that gets into the training loop, so I've renamed it to `batch_group_length`. This is just an extension of the flow that is already used for the `DataParallel` case. I've added the explanation as comments.
```python
# will be 8. This batch_group is split into 2 chunks each for a step, with each
# chunk consisting of 4 batches, 1 for each gpu.
batch_group_for_stepwise_accumulation = lazy_groups_of(iter(batch_group), len(self._cuda_devices))
for batch_for_step in batch_group_for_stepwise_accumulation:
```
I am having a really tough time understanding what this code is doing here and why it's doing it. Is there a way to clarify it / add comments?
Apologies if it is not succinct enough to understand. I've added a few comments there. Hope it clarifies things a bit.

The basic idea is this: so far, `batch_group` has been used to do a forward pass in both the single- and multi-GPU cases. In the multi-GPU case, the tensors in `batch_group` are aggregated to form a single batch. I just wanted to extend the same flow to gradient accumulation too. The inner `for` loop in `_train_epoch` calls `batch_loss` `num_steps_to_accumulate` times. In each iteration, we use a chunk of the original `batch_group` to call `batch_loss`. As usual, the length of the chunk that gets passed in each iteration equals the number of GPUs configured for training. Let me know if this makes sense.
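To make the chunking concrete, here is a small standalone sketch (the helper is redefined inline so it runs without allennlp; `allennlp.common.util.lazy_groups_of` behaves the same way, and the GPU and step counts are example values):

```python
from itertools import islice
from typing import Iterator, List, TypeVar

A = TypeVar("A")

def lazy_groups_of(iterator: Iterator[A], group_size: int) -> Iterator[List[A]]:
    # Yield lists of up to group_size items until the iterator is exhausted.
    return iter(lambda: list(islice(iterator, group_size)), [])

num_gpus = 4
num_steps_to_accumulate = 2
batch_group_length = num_steps_to_accumulate * num_gpus  # 8 batches read per group

batch_group = [f"batch_{i}" for i in range(batch_group_length)]
for batch_for_step in lazy_groups_of(iter(batch_group), num_gpus):
    # Each chunk holds one batch per GPU; batch_loss runs once per chunk,
    # and the optimizer steps only after all chunks in the group.
    print(batch_for_step)
# ['batch_0', 'batch_1', 'batch_2', 'batch_3']
# ['batch_4', 'batch_5', 'batch_6', 'batch_7']
```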
```python
assert trainer._num_gradient_accumulation_steps == 2
assert trainer._accumulate_gradients

trainer.train()
```
it would really be better if this test checked that the gradient accumulation was doing the right thing; this is really just testing that it doesn't crash
I've added conditions to check the effective number of batches trained. Gradient accumulation reduces the effective number of batches (optimizer steps) below what is configured, and the added check tests exactly that.
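The shape of such a check, sketched with a mock optimizer (illustrative only; the real test exercises the trainer itself):

```python
from unittest.mock import MagicMock

optimizer = MagicMock()
num_batches = 8
num_steps_to_accumulate = 2

for i in range(num_batches):
    # forward/backward would happen here on each mini-batch
    if (i + 1) % num_steps_to_accumulate == 0:
        optimizer.step()

# 8 configured batches yield only 4 effective (optimizer-stepped) batches.
assert optimizer.step.call_count == num_batches // num_steps_to_accumulate
```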
Accumulating for a fixed number of batches is one way to implement gradient accumulation. An alternative, more efficient way when the batch size varies is to accumulate up to a given total effective batch size, over a possibly different number of sub-batches for each gradient update. This way, it can be combined with something like the bucket iterator, which automatically adjusts the batch size depending on sequence length for maximum efficiency. It's especially helpful for transformer-based models (e.g. BERT), where the O(N^2) self-attention requires a much smaller batch size for long sequences. I have an example implementation of this approach here: https://github.com/matt-peters/allennlp/blob/fp16_e/allennlp/training/trainer.py#L335
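A rough sketch of that variable-size strategy (toy model and batch sizes; see the linked trainer for the real implementation):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
target_effective_batch_size = 64

accumulated = 0
optimizer.zero_grad()
for batch in [torch.randn(n, 10) for n in (8, 24, 40, 16, 48)]:  # variable sizes
    # Weight each sub-batch by its share of the target so the summed
    # gradient approximates one large batch of that size.
    loss = model(batch).pow(2).mean() * (len(batch) / target_effective_batch_size)
    loss.backward()
    accumulated += len(batch)
    if accumulated >= target_effective_batch_size:
        optimizer.step()   # step once the effective batch size is reached
        optimizer.zero_grad()
        accumulated = 0
```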
@joelgrus, it looks like the changes you requested have been made. Is this good to merge? How does it interact with the callback trainer stuff you've been working on?
@joelgrus is going to take a look.
Because @scarecrow1123 deleted their repo, I had to make another PR to get this in: #3512
This adds support for accumulating gradients for a specified number of steps during training. The number of steps to accumulate can be configured in the trainer using the key `num_steps_to_accumulate`. This also fixes #2112
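A hypothetical trainer section showing where the key would sit (the surrounding values are placeholders, not taken from this PR):

```python
trainer_config = {
    "optimizer": {"type": "adam", "lr": 0.001},
    "num_epochs": 10,
    "cuda_device": 0,
    "num_steps_to_accumulate": 4,  # one optimizer step per 4 mini-batches
}
```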