This repository has been archived by the owner on Dec 16, 2022. It is now read-only.
Add Gradient accumulation support to the default trainer #2721
Status: Closed
scarecrow1123 wants to merge 17 commits into allenai:master from scarecrow1123:gradient-accumulation
Commits (17)
9573e55  Added default predictor for bimpm model (scarecrow1123)
563647a  Merge pull request #1 from allenai/master (scarecrow1123)
5ccf8ad  Merge pull request #2 from allenai/master (scarecrow1123)
aaeddef  Merge pull request #3 from allenai/master (scarecrow1123)
96db26f  Fix #2717: Add day count in training duration (scarecrow1123)
64552a0  Modify elapsed time format to use `timedelta` (scarecrow1123)
d0ac4ca  Merge branch 'master' of git://github.com/allenai/allennlp (scarecrow1123)
383bf6d  Add gradient accumulation support (scarecrow1123)
ad94b1f  Add doc for gradient accumulation (scarecrow1123)
a33865b  Fix linter errors (scarecrow1123)
f7a8ff7  Fix gradient accumulation to work for multi GPU (scarecrow1123)
6f45aa3  Add test for gradient accumulation (scarecrow1123)
955e2c4  Merge 'upstream/master' into gradient-accumulation (scarecrow1123)
e2e48e3  Rename `num_batch_groups` and clarify usage (scarecrow1123)
12a87ca  Add comments to clarify gradient accumulation (scarecrow1123)
71398f1  Add more checks in gradient accumulation test (scarecrow1123)
07cd9b2  Fix linter error (scarecrow1123)
Conversations
I am having a really tough time understanding what this code is doing here and why it's doing it. Is there a way to clarify it or add comments?
Apologies if it is not succinct enough to understand. I've added a few comments there; hope that clarifies it a bit.

The basic idea is this: so far, `batch_group` has been used to do a forward pass in both the single- and multi-GPU cases. In the multi-GPU case, the tensors in `batch_group` are aggregated to form a single batch. I just wanted to extend the same flow to gradient accumulation. The inner `for` loop in `train_epoch` calls `batch_loss` `num_steps_to_accumulate` times. In each iteration, we use a chunk of the original `batch_group` to call `batch_loss`. As usual, the length of the chunk passed in each iteration equals the number of GPUs configured for training. Let me know if this makes sense.
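To make that flow concrete, here is a minimal sketch of the inner accumulation loop described above. Only the names `batch_group`, `batch_loss`, and `num_steps_to_accumulate` come from this discussion; the chunking helper, the loss scaling, and the optimizer handling are illustrative assumptions, not the trainer's actual code.

```python
# A minimal sketch of the accumulation loop discussed above. Only the names
# `batch_group`, `batch_loss`, and `num_steps_to_accumulate` come from this
# thread; the chunking helper, loss scaling, and optimizer handling are
# illustrative assumptions, not the trainer's actual implementation.
from typing import Callable, Dict, Iterable, List

import torch

TensorDict = Dict[str, torch.Tensor]


def chunks(batch_group: List[TensorDict], size: int) -> Iterable[List[TensorDict]]:
    """Yield consecutive chunks of `batch_group`, each with at most `size` batches."""
    for start in range(0, len(batch_group), size):
        yield batch_group[start:start + size]


def accumulate_step(batch_loss: Callable[[List[TensorDict]], torch.Tensor],
                    optimizer: torch.optim.Optimizer,
                    batch_group: List[TensorDict],
                    num_gpus: int,
                    num_steps_to_accumulate: int) -> float:
    """One optimizer step over a `batch_group` holding
    `num_gpus * num_steps_to_accumulate` batches.

    Each inner iteration forwards a chunk of `num_gpus` batches (one per
    device, mirroring the existing multi-GPU path) and calls `backward()`
    without stepping, so gradients add up across the iterations.
    """
    optimizer.zero_grad()
    total_loss = 0.0
    for sub_group in chunks(batch_group, num_gpus):
        loss = batch_loss(sub_group)
        # Scale the loss so the accumulated gradient matches one large batch.
        (loss / num_steps_to_accumulate).backward()
        total_loss += loss.item()
    optimizer.step()
    return total_loss
```

The point of keeping the chunk length equal to the number of GPUs is that each `batch_loss` call can aggregate its chunk across devices exactly as the existing, non-accumulating code path does.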