
Allow GradientAccumulationPlugin to be configured from AcceleratorConfig #29589

Merged: 11 commits into huggingface:main from feature-disable-no-sync (Mar 28, 2024)

Conversation

fabianlim
Contributor

What does this PR do?

Fixes #29425. Please also refer to the accompanying PR huggingface/accelerate#2531, which implements an extra sync_each_batch control for GradientAccumulationPlugin. Before these changes, GradientAccumulationPlugin was configured by Trainer with a fixed set of hardcoded values. This PR allows the user to set sync_each_batch if memory issues arise when using FSDP with no_sync.
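For illustration (not part of the original description), a minimal sketch of the intended usage from the Trainer side, assuming accelerate>=0.28 is installed; the argument values are placeholders:

from transformers import TrainingArguments

# sync_each_batch trades some throughput for lower peak memory by synchronizing
# gradients on every micro-batch instead of relying on no_sync under FSDP.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    accelerator_config={
        "gradient_accumulation_kwargs": {"sync_each_batch": True},
    },
)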

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Contributor

@muellerzr muellerzr left a comment

Overall this is a good step; however, we have some issues with versioning that need to be taken care of first.

Contributor

@pacman100 pacman100 left a comment

Hello @fabianlim, thank you for the Accelerate PR and for this one, which reduce memory usage with FSDP by forcing gradient synchronization at each step.
An overall comment: what if we change the default in Accelerate to not enable no_sync when preparing an FSDP model? That way the user would not have to pass an extra argument.

@fabianlim
Contributor Author

> Hello @fabianlim, thank you for the Accelerate PR and for this one, which reduce memory usage with FSDP by forcing gradient synchronization at each step.
> An overall comment: what if we change the default in Accelerate to not enable no_sync when preparing an FSDP model? That way the user would not have to pass an extra argument.

@pacman100 thanks for the message. I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in peak memory consumption. This is probably because the memory was activation-dominated.

Thus I concluded that this is use-case dependent and not a good idea to set to True by default. It wouldn't be great for the llama7b user to now have to explicitly set sync_each_batch=False to recover the throughput (and it's probably more insidious, because the user wouldn't be aware that it's turned on, and that training could actually be faster, at no cost, if it were turned off).

@pacman100
Contributor

pacman100 commented Mar 12, 2024

> thanks for the message. I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in peak memory consumption. This is probably because the memory was activation-dominated.
>
> Thus I concluded that this is use-case dependent and not a good idea to set to True by default. It wouldn't be great for the llama7b user to now have to explicitly set sync_each_batch=False to recover the throughput (and it's probably more insidious, because the user wouldn't be aware that it's turned on, and that training could actually be faster, at no cost, if it were turned off).

Got it, that makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate and Transformers docs would help explain when this flag makes a difference. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.

@fabianlim fabianlim force-pushed the feature-disable-no-sync branch 4 times, most recently from 5ef33f2 to 37413ea on March 12, 2024 10:55
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fabianlim fabianlim force-pushed the feature-disable-no-sync branch from 37413ea to 3d1a3d5 on March 12, 2024 12:39
Contributor

@muellerzr muellerzr left a comment

Thanks! We're getting closer.

Also, no need to force-push your commits, we squash the commit history (and force-push makes it harder for us to track things)

@fabianlim
Contributor Author

> Thanks! We're getting closer.
> Also, no need to force-push your commits, we squash the commit history (and force-push makes it harder for us to track things)

Got it. Yes, I mostly force-push out of habit, to avoid bloating my commits. However, I did it once to rebase, because I noticed tests_torch was failing due to daily PR merges.

After my latest changes the tests are failing again, but I shall refrain from rebasing for now.

@fabianlim
Contributor Author

fabianlim commented Mar 13, 2024

> Got it, that makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate and Transformers docs would help explain when this flag makes a difference. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.

@pacman100 how about this addition to the Accelerate docs? I decided to make the example about llama13b instead of 7b. The PR has already been merged, so this will have to be another PR.
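For context (a sketch, not the proposed docs text), enabling the same flag directly on the Accelerate side, assuming accelerate>=0.28 where GradientAccumulationPlugin gained sync_each_batch:

from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# Synchronize gradients on every micro-batch instead of only at the end of each
# accumulation window; lowers peak memory under FSDP at some throughput cost.
plugin = GradientAccumulationPlugin(num_steps=4, sync_each_batch=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)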

Contributor

@muellerzr muellerzr left a comment

I'm not sure how I feel about mixing these conditionals into tests; generally, tests should be fully independent and test very specific things, so I'm not sure I'm a fan of having if self.GRAD_ACCUM_KWARGS_VERSION_AVAILABLE and then testing X.

For the most part it's fine; however, I did leave a note on one specific test that is built to change just one default and leave everything else as normal.

Also left some nits on simplifying the require_version logic.

We're almost there 🤗

@@ -791,6 +793,9 @@ def test_tf32(self):
@require_sentencepiece
@require_tokenizers
class TrainerIntegrationTest(TestCasePlus, TrainerIntegrationCommon):
GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")
require_accelerate_version = partial(require_accelerate, min_version="0.28")
Contributor

No need for this logic, let's just use @require_accelerate(min_version="0.28.0") on the tests that need it

Contributor Author

@muellerzr sorry, can I clarify this remark?

  1. You highlighted both lines 796 and 797. Are you saying you want to remove the boolean GRAD_ACCUM_KWARGS_VERSION_AVAILABLE and use is_accelerate_available("0.28") directly in the conditional?
  2. Or are you saying you want to skip the partial bind and directly decorate the test as @require_accelerate(min_version="0.28")? If so, then I do not think we can follow the require_fsdp function style; we would need to rewire require_accelerate into a "builder":

  def require_accelerate(min_version: str = "0.28"):
      def _require(test_case):
          return unittest.skipUnless(
              is_accelerate_available(min_version), f"test requires accelerate>={min_version}"
          )(test_case)
      return _require

But the issue with this is that the null pattern @require_accelerate doesn't work anymore; we would need to instead write @require_accelerate(), i.e., with the extra () brackets.
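For reference, not from this thread: a minimal sketch of the usual workaround for that concern, a decorator that supports both the bare @require_accelerate form and @require_accelerate(min_version=...). The is_accelerate_available helper here is a simplified stand-in, not the transformers implementation:

import unittest
from functools import partial

from packaging import version


def is_accelerate_available(min_version: str = "0.21.0") -> bool:
    # Simplified stand-in: checks that accelerate is importable and new enough.
    try:
        import accelerate
    except ImportError:
        return False
    return version.parse(accelerate.__version__) >= version.parse(min_version)


def require_accelerate(test_case=None, min_version: str = "0.21.0"):
    # Bare usage (@require_accelerate) receives the test callable directly, while
    # parameterized usage (@require_accelerate(min_version="0.28")) passes only a
    # keyword argument, so we return a partially bound decorator in that case.
    if test_case is None:
        return partial(require_accelerate, min_version=min_version)
    return unittest.skipUnless(
        is_accelerate_available(min_version), f"test requires accelerate>={min_version}"
    )(test_case)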

Contributor

@muellerzr muellerzr Mar 14, 2024

Just mimic what's done in test_fsdp.py, make the partial decorator in the test file:

require_fsdp_version = require_fsdp
if is_accelerate_available():
    ...
    require_fsdp_version = partial(require_fsdp, min_version=FSDP_PYTORCH_VERSION)

Contributor

Here, you can then just use @require_fsdp_version on the test
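Concretely, a sketch of how that pattern could map onto this PR's version gate (the class and test names here are hypothetical; the version string and helper names come from the diff above, and the import paths are assumed):

from functools import partial
import unittest

from transformers.testing_utils import require_accelerate
from transformers.utils import is_accelerate_available

# Module-level gate and decorator, mirroring the require_fsdp_version pattern.
GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")
require_accelerate_version = partial(require_accelerate, min_version="0.28")


class TrainerGradAccumKwargsTest(unittest.TestCase):
    @require_accelerate_version
    def test_accelerator_config_sync_each_batch(self):
        # Skipped automatically when the installed accelerate is older than 0.28.
        ...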

Comment on lines 2631 to 2635
accelerator_config = {
    "split_batches": True,
}
if self.GRAD_ACCUM_KWARGS_VERSION_AVAILABLE:
    accelerator_config["gradient_accumulation_kwargs"] = {"sync_each_batch": True}
Contributor

This changes what the test does. Let's not do this please.

Contributor Author

@fabianlim fabianlim Mar 13, 2024

Yes, agreed. In this case I think it's best to revert this test to its previous state and not introduce gradient_accumulation_kwargs in this particular test at all.

Since specifying gradient_accumulation_kwargs is already non-default, we have a host of other tests that cover it.

Do you agree?

Contributor

Yup

@muellerzr
Contributor

(You may also need to rebase from main for the test failures)

@fabianlim
Contributor Author

fabianlim commented Mar 13, 2024

> (You may also need to rebase from main for the test failures)

@muellerzr should I rebase now, or wait until we resolve most of the changes? If I rebase I will need to force-push, and that makes it harder to track.
I pulled main's changes in, but probably one more pull is needed after #29647 is merged.

I agree that conditionals are not preferred in tests, and there are other ways, like using parameterized, but we would need to rework the self.assert lines to check conditionally (e.g., if grad_accum_kwargs is not specified, etc.). We would also need some conditionals on how to populate the dicts going into the @parameterized decorator, so it comes with its own complexities as well (a sketch of that alternative follows below).

Yes, conditionals are generally not best practice, but in this case I feel the usage is quite minor and does not affect readability very much.
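A minimal sketch of that parameterized alternative (not part of the original comment; the class, helper, and case names are hypothetical). The version gate decides which accelerator_config cases are generated:

import unittest

from parameterized import parameterized

from transformers.utils import is_accelerate_available

GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")


def _accelerator_config_cases():
    # Baseline case that runs on every supported accelerate version.
    cases = [("split_batches_only", {"split_batches": True})]
    # Only generate the gradient-accumulation case when accelerate >= 0.28 is installed.
    if GRAD_ACCUM_KWARGS_VERSION_AVAILABLE:
        cases.append(
            ("sync_each_batch", {"split_batches": True, "gradient_accumulation_kwargs": {"sync_each_batch": True}})
        )
    return cases


class TrainerAcceleratorConfigParamTest(unittest.TestCase):
    @parameterized.expand(_accelerator_config_cases())
    def test_accelerator_config(self, name, accelerator_config):
        # Assertions would branch on which keys are present in accelerator_config.
        ...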

I have incorporated your suggestions to follow the FSDP style of require_fsdp. In an attempt to be more consistent, I have also moved GRAD_ACCUM_KWARGS_VERSION_AVAILABLE to the top of the file and added comments. Note there is one more use of require_accelerate in the same test_trainer.py file that I cannot replace, because doing so would change the logic.

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
@muellerzr
Contributor

@fabianlim you'll need to run make style; make quality; to fix the style tests

Contributor

@muellerzr muellerzr left a comment

Thanks for the back and forth! This is looking great now! Handing off to @amyeroberts for a final review 🤗

@muellerzr muellerzr requested a review from amyeroberts March 14, 2024 11:43
@fabianlim
Contributor Author

It's been a pleasure, @muellerzr! I have just pulled the latest changes from main!

@fabianlim
Contributor Author

Hi @amyeroberts looking forward to your review! If there is anything I can address please feel free to let me know. FYI: @muellerzr

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding!

Main comment is about the version handling in the tests

@fabianlim fabianlim force-pushed the feature-disable-no-sync branch from 8888c52 to 968d415 on March 22, 2024 04:26
@fabianlim
Contributor Author

fabianlim commented Mar 22, 2024

@amyeroberts I pulled main again and have updated the code to conform to @muellerzr's changes in #29779.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding and iterating on this!

Just two small nits

@amyeroberts amyeroberts merged commit 4df5b9b into huggingface:main Mar 28, 2024
20 checks passed

Successfully merging this pull request may close these issues.

Allow Trainer to Sync Gradients Each Batch When Performing Gradient Accumulation
5 participants