This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add Gradient accumulation support to the default trainer #2721

Closed
wants to merge 17 commits into from

Conversation

scarecrow1123
Contributor

@scarecrow1123 scarecrow1123 commented Apr 17, 2019

This adds support for accumulating gradients over a specified number of steps during training. The number of steps to accumulate can be configured in the trainer using the key `num_steps_to_accumulate`. This also fixes #2112.

EDIT: Typo
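For concreteness, a minimal sketch of how the key might look in a trainer config, written here as a Python dict (the surrounding values are placeholders chosen for illustration, not part of this PR):

# Hypothetical trainer section; only `num_steps_to_accumulate` comes from this PR.
trainer_config = {
    "optimizer": {"type": "adam", "lr": 0.001},
    "num_epochs": 10,
    # Accumulate gradients over 4 mini-batches before each optimizer step.
    "num_steps_to_accumulate": 4,
}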

Sync with allennlp master
Pull from AllenNLP Master
Pull from allennlp master
This fixes issue [#2717](#2717) by including the day count in the `training_duration` key in metrics.
`time.strftime` does not account for a number of days greater than 31. This changes it to `datetime.timedelta` and uses its `str` representation for printing the epoch duration as well as the training duration.
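For illustration (not code from this PR), the difference in behavior:

import datetime
import time

elapsed_seconds = 3 * 24 * 3600 + 4 * 3600 + 5 * 60 + 6   # 3 days, 4:05:06

# strftime formats a point in time, so the day count is silently dropped here.
print(time.strftime("%H:%M:%S", time.gmtime(elapsed_seconds)))   # 04:05:06

# timedelta's str representation keeps the day count.
print(str(datetime.timedelta(seconds=elapsed_seconds)))          # 3 days, 4:05:06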
Gradient accumulation means computing multiple mini-batches before doing a gradient update, to accommodate larger effective batches. This commit adds a new key to the trainer config, `num_steps_to_accumulate`. The trainer performs an optimizer step only after the specified number of mini-batch computations.
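A minimal sketch of the idea in plain PyTorch terms (illustrative only; the names and the loss scaling are my own, not the trainer code in this PR):

def train_epoch_with_accumulation(model, batches, optimizer, num_steps_to_accumulate):
    """Step the optimizer only once per `num_steps_to_accumulate` mini-batches."""
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        # Scale the loss so the accumulated update matches one large batch.
        loss = model(**batch)["loss"] / num_steps_to_accumulate
        loss.backward()   # gradients keep accumulating in the parameters' .grad
        if (i + 1) % num_steps_to_accumulate == 0:
            optimizer.step()
            optimizer.zero_grad()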
@matt-gardner matt-gardner requested a review from joelgrus April 17, 2019 02:33
Contributor

@joelgrus joelgrus left a comment


I have a few questions / comments

  1. this needs a test
  2. is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed")
  3. I feel like the if/then where self.batch_loss is called is not as clean as it could be, but that depends to some degree on the answer to 2

@scarecrow1123
Contributor Author

scarecrow1123 commented Apr 18, 2019

is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed")

There is nothing that stops one from combining multi-GPU training with gradient accumulation. I only doubt whether people use it in practice when they are already training with multiple GPUs. But that is only an assumption, and I agree the two are not really incompatible. Perhaps a note could just be logged before training.

This removes checks that previously allowed gradient accumulation
only for single GPU training. This condition was not really necessary.
@scarecrow1123
Contributor Author

  1. this needs a test

I've added a simple test.

  2. is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed")
  3. I feel like the if/then where self.batch_loss is called is not as clean as it could be, but that depends to some degree on the answer to 2

I've addressed these points in the recent commits. Please review them @joelgrus

@@ -290,14 +305,14 @@ def _train_epoch(self, epoch: int) -> Dict[str, float]:
# Set the model to "train" mode.
self.model.train()

num_gpus = len(self._cuda_devices)
num_batch_groups = self._num_gradient_accumulation_steps * len(self._cuda_devices)
Contributor


Can you add a comment explaining what num_batch_groups is and why it's computed the way it is? It's not obvious to me.

Contributor Author


I realized that num_batch_groups is a misnomer. It actually denotes the length of each batch_group that goes into the training loop, so I've renamed it to batch_group_length. This is just an extension of the flow that is already used for the DataParallel case. I've added the explanation as comments.

# will be 8. This batch_group is split into 2 chunks each for a step, with each
# chunk consisting of 4 batches, 1 for each gpu.
batch_group_for_stepwise_accumulation = lazy_groups_of(iter(batch_group), len(self._cuda_devices))
for batch_for_step in batch_group_for_stepwise_accumulation:
Contributor


I am having a really tough time understanding what this code is doing here and why it's doing it. Is there a way to clarify it / add comments?

Contributor Author


Apologies if it is not clear enough. I've added a few comments there. Hope that clarifies things a bit.

The basic idea is this: so far, batch_group has been used to do a forward pass for both the single- and multi-GPU cases. In the multi-GPU case, the tensors in batch_group are aggregated to form a single batch. I just wanted to extend the same flow to gradient accumulation. The inner for loop in _train_epoch calls batch_loss num_steps_to_accumulate times, and in each iteration it passes a chunk of the original batch_group to batch_loss. As usual, the length of the chunk passed in each iteration equals the number of GPUs configured for training. Let me know if this makes sense.
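A rough sketch of just the chunking step (illustrative only; real batches would be tensor dicts, with batch_loss, backward, and optimizer.step called around this):

from allennlp.common.util import lazy_groups_of

num_gpus = 4
num_steps_to_accumulate = 2

# A batch_group in this flow holds one batch per GPU per accumulation step.
batch_group = [f"batch_{i}" for i in range(num_steps_to_accumulate * num_gpus)]   # length 8

# Each chunk has num_gpus batches; batch_loss is called once per chunk, and the
# optimizer steps once after the whole batch_group has been processed.
for batch_for_step in lazy_groups_of(iter(batch_group), num_gpus):
    print(batch_for_step)
# ['batch_0', 'batch_1', 'batch_2', 'batch_3']
# ['batch_4', 'batch_5', 'batch_6', 'batch_7']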

assert trainer._num_gradient_accumulation_steps == 2
assert trainer._accumulate_gradients

trainer.train()
Contributor


It would really be better if this test were checking that the gradient accumulation is doing the right thing; this is really just testing that it doesn't crash.

Contributor Author


I've added conditions to check the effective number of batches trained.

Gradient accumulation reduces the effective number of batches from
what is configured. The added check tests just that.
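For intuition (not the PR's actual test code), the arithmetic such a check is getting at:

import math

num_training_batches = 100
num_steps_to_accumulate = 2

# One optimizer step per accumulated group of batches (assuming, in this sketch,
# that a trailing partial group would also get a step).
effective_optimizer_steps = math.ceil(num_training_batches / num_steps_to_accumulate)
assert effective_optimizer_steps == 50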
@matt-peters
Contributor

Accumulating for a fixed number of batches is one way to implement gradient accumulation. An alternative, more efficient way when the batch size is changing is to accumulate up to a given total effective batch size over a possibly different number of sub-batches for each gradient update. This way, we can combine it with something like the bucket iterator, which automatically adjusts the batch size depending on sequence length for maximum efficiency. It's especially helpful for transformer-based models (e.g. BERT), where the O(N^2) self-attention requires a much smaller batch size for long sequences.

I have an example implementation of this approach here: https://github.com/matt-peters/allennlp/blob/fp16_e/allennlp/training/trainer.py#L335
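A rough sketch of that alternative (my own illustration, not taken from the linked branch): accumulate sub-batches until a target effective batch size is reached, then step.

def train_with_dynamic_accumulation(model, batches, optimizer, target_effective_batch_size):
    """`batches` is assumed to yield (inputs, batch_size) pairs, e.g. from a bucket
    iterator whose batch size shrinks for long sequences."""
    optimizer.zero_grad()
    accumulated = 0
    for inputs, batch_size in batches:
        loss = model(**inputs)["loss"]
        # Weight each sub-batch by its share of the target so the combined update
        # approximates a single mean over the effective batch.
        (loss * batch_size / target_effective_batch_size).backward()
        accumulated += batch_size
        if accumulated >= target_effective_batch_size:
            optimizer.step()
            optimizer.zero_grad()
            accumulated = 0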

@matt-gardner
Contributor

@joelgrus, it looks like the changes you requested have been made. Is this good to merge? How does it interact with the callback trainer stuff you've been working on?

@schmmd
Member

schmmd commented Jul 12, 2019

@joelgrus is going to take a look.

eladsegal referenced this pull request in eladsegal/allennlp Nov 29, 2019
@dirkgr dirkgr self-assigned this Dec 11, 2019
@dirkgr dirkgr mentioned this pull request Dec 11, 2019
@dirkgr
Member

dirkgr commented Dec 11, 2019

Because @scarecrow1123 deleted their repo, I had to make another PR to get this in: #3512

@dirkgr dirkgr closed this Dec 11, 2019