This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Add Gradient accumulation support to the default trainer #2721

Closed
wants to merge 17 commits into from

Conversation

scarecrow1123
Contributor

@scarecrow1123 scarecrow1123 commented Apr 17, 2019

This adds support for accumulating gradients over a specified number of steps during training. The number of steps to accumulate can be configured in the trainer using the key `num_steps_to_accumulate`. This also fixes #2112.

EDIT: Typo
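For concreteness, a minimal sketch of how the key might look in a trainer config, written here as a Python dict (the surrounding values are placeholders chosen for illustration, not part of this PR):

# Hypothetical trainer section; only `num_steps_to_accumulate` comes from this PR.
trainer_config = {
    "optimizer": {"type": "adam", "lr": 0.001},
    "num_epochs": 10,
    # Accumulate gradients over 4 mini-batches before each optimizer step.
    "num_steps_to_accumulate": 4,
}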

Sync with allennlp master
Pull from AllenNLP Master
Pull from allennlp master
This fixes issue [#2717](#2717) by including the day count in the `training_duration` key in metrics.
`time.strftime` does not account for a number of days greater than 31. This changes it to `datetime.timedelta` and uses its `str` representation for printing the epoch duration as well as the training duration.
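For illustration (not code from this PR), the difference in behavior:

import datetime
import time

elapsed_seconds = 3 * 24 * 3600 + 4 * 3600 + 5 * 60 + 6   # 3 days, 4:05:06

# strftime formats a point in time, so the day count is silently dropped here.
print(time.strftime("%H:%M:%S", time.gmtime(elapsed_seconds)))   # 04:05:06

# timedelta's str representation keeps the day count.
print(str(datetime.timedelta(seconds=elapsed_seconds)))          # 3 days, 4:05:06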
Gradient accumulation means computing multiple mini-batches before doing a gradient update, to accommodate larger effective batches. This commit adds a new key to the trainer config, `num_steps_to_accumulate`. The trainer performs an optimizer step only after the specified number of mini-batch computations.
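A minimal sketch of the idea in plain PyTorch terms (illustrative only; the names and the loss scaling are my own, not the trainer code in this PR):

def train_epoch_with_accumulation(model, batches, optimizer, num_steps_to_accumulate):
    """Step the optimizer only once per `num_steps_to_accumulate` mini-batches."""
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        # Scale the loss so the accumulated update matches one large batch.
        loss = model(**batch)["loss"] / num_steps_to_accumulate
        loss.backward()   # gradients keep accumulating in the parameters' .grad
        if (i + 1) % num_steps_to_accumulate == 0:
            optimizer.step()
            optimizer.zero_grad()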
@matt-gardner matt-gardner requested a review from joelgrus April 17, 2019 02:33
Contributor

@joelgrus joelgrus left a comment


I have a few questions / comments

  1. this needs a test
  2. is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed")
  3. I feel like the if/then where self.batch_loss is called is not as clean as it could be, but that depends to some degree on the answer to 2

@scarecrow1123
Contributor Author

scarecrow1123 commented Apr 18, 2019

is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed")

There is nothing that stops one from combining multi-GPU training with gradient accumulation. I only doubt whether people use it in practice when they are already training with multiple GPUs. But that is only an assumption, and I agree the two are not really incompatible. Perhaps a note could just be logged before training.

This removes checks that previously allowed gradient accumulation
only for single GPU training. This condition was not really necessary.
@scarecrow1123
Contributor Author

  1. this needs a test

I've added a simple test.

  2. is it really the case that gradient accumulation is incompatible with multiple GPUs? you raise a configuration error (which means it will crash if you try to use both) but the message just says "this may not be needed")
  3. I feel like the if/then where self.batch_loss is called is not as clean as it could be, but that depends to some degree on the answer to 2

I've addressed these points in the recent commits. Please review them @joelgrus

@@ -290,14 +305,14 @@ def _train_epoch(self, epoch: int) -> Dict[str, float]:
# Set the model to "train" mode.
self.model.train()

num_gpus = len(self._cuda_devices)
num_batch_groups = self._num_gradient_accumulation_steps * len(self._cuda_devices)
Contributor


Can you add a comment explaining what num_batch_groups is and why it's computed the way it is? It's not obvious to me.

Contributor Author


I realized that num_batch_groups is a misnomer. It actually denotes the length of each batch_group that goes into the training loop, so I've renamed it to batch_group_length. This is just an extension of the flow that is already used for the DataParallel case. I've added the explanation as comments.

# will be 8. This batch_group is split into 2 chunks each for a step, with each
# chunk consisting of 4 batches, 1 for each gpu.
batch_group_for_stepwise_accumulation = lazy_groups_of(iter(batch_group), len(self._cuda_devices))
for batch_for_step in batch_group_for_stepwise_accumulation:
Contributor


I am having a really tough time understanding what this code is doing here and why it's doing it. Is there a way to clarify it / add comments?

Contributor Author


Apologies if it is not clear enough. I've added a few comments there. Hope that clarifies things a bit.

The basic idea is this: so far, batch_group has been used to do a forward pass for both the single- and multi-GPU cases. In the multi-GPU case, the tensors in batch_group are aggregated to form a single batch. I just wanted to extend the same flow to gradient accumulation. The inner for loop in _train_epoch calls batch_loss num_steps_to_accumulate times, and in each iteration it passes a chunk of the original batch_group to batch_loss. As usual, the length of the chunk passed in each iteration equals the number of GPUs configured for training. Let me know if this makes sense.
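A rough sketch of just the chunking step (illustrative only; real batches would be tensor dicts, with batch_loss, backward, and optimizer.step called around this):

from allennlp.common.util import lazy_groups_of

num_gpus = 4
num_steps_to_accumulate = 2

# A batch_group in this flow holds one batch per GPU per accumulation step.
batch_group = [f"batch_{i}" for i in range(num_steps_to_accumulate * num_gpus)]   # length 8

# Each chunk has num_gpus batches; batch_loss is called once per chunk, and the
# optimizer steps once after the whole batch_group has been processed.
for batch_for_step in lazy_groups_of(iter(batch_group), num_gpus):
    print(batch_for_step)
# ['batch_0', 'batch_1', 'batch_2', 'batch_3']
# ['batch_4', 'batch_5', 'batch_6', 'batch_7']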

assert trainer._num_gradient_accumulation_steps == 2
assert trainer._accumulate_gradients

trainer.train()
Contributor


It would really be better if this test were checking that the gradient accumulation is doing the right thing; this is really just testing that it doesn't crash.

Contributor Author


I've added conditions to check the effective number of batches trained.

Gradient accumulation reduces the effective number of batches from
what is configured. The added check tests just that.
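For intuition (not the PR's actual test code), the arithmetic such a check is getting at:

import math

num_training_batches = 100
num_steps_to_accumulate = 2

# One optimizer step per accumulated group of batches (assuming, in this sketch,
# that a trailing partial group would also get a step).
effective_optimizer_steps = math.ceil(num_training_batches / num_steps_to_accumulate)
assert effective_optimizer_steps == 50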
@matt-peters
Contributor

Accumulating for a fixed number of batches is one way to implement gradient accumulation. An alternative, more efficient way when the batch size is changing is to accumulate up to a given total effective batch size over a possibly different number of sub-batches for each gradient update. This way, we can combine it with something like the bucket iterator, which automatically adjusts the batch size depending on sequence length for maximum efficiency. It's especially helpful for transformer-based models (e.g. BERT), where the O(N^2) self-attention requires a much smaller batch size for long sequences.

I have an example implementation of this approach here: https://github.com/matt-peters/allennlp/blob/fp16_e/allennlp/training/trainer.py#L335
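A rough sketch of that alternative (my own illustration, not taken from the linked branch): accumulate sub-batches until a target effective batch size is reached, then step.

def train_with_dynamic_accumulation(model, batches, optimizer, target_effective_batch_size):
    """`batches` is assumed to yield (inputs, batch_size) pairs, e.g. from a bucket
    iterator whose batch size shrinks for long sequences."""
    optimizer.zero_grad()
    accumulated = 0
    for inputs, batch_size in batches:
        loss = model(**inputs)["loss"]
        # Weight each sub-batch by its share of the target so the combined update
        # approximates a single mean over the effective batch.
        (loss * batch_size / target_effective_batch_size).backward()
        accumulated += batch_size
        if accumulated >= target_effective_batch_size:
            optimizer.step()
            optimizer.zero_grad()
            accumulated = 0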

@matt-gardner
Contributor

@joelgrus, it looks like the changes you requested have been made. Is this good to merge? How does it interact with the callback trainer stuff you've been working on?

@schmmd
Member

schmmd commented Jul 12, 2019

@joelgrus is going to take a look.

eladsegal referenced this pull request in eladsegal/allennlp Nov 29, 2019
@dirkgr dirkgr self-assigned this Dec 11, 2019
@dirkgr dirkgr mentioned this pull request Dec 11, 2019
@dirkgr
Member

dirkgr commented Dec 11, 2019

Because @scarecrow1123 deleted their repo, I had to make another PR to get this in: #3512

@dirkgr dirkgr closed this Dec 11, 2019