
Allow GradientAccumulationPlugin to be configured from AcceleratorConfig #29589

Merged: 11 commits into huggingface:main from feature-disable-no-sync (Mar 28, 2024)

Conversation

fabianlim
Contributor

What does this PR do?

Fixes #29425. Please also refer to the accompanying PR huggingface/accelerate#2531, which implements an extra sync_each_batch control for GradientAccumulationPlugin. Before these changes, GradientAccumulationPlugin was configured by Trainer with a fixed set of hardcoded values. This PR allows the user to set sync_each_batch if memory issues arise when using FSDP with no_sync.
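For illustration (not part of the original description), a minimal sketch of the intended usage from the Trainer side, assuming accelerate>=0.28 is installed; the argument values are placeholders:

from transformers import TrainingArguments

# sync_each_batch trades some throughput for lower peak memory by synchronizing
# gradients on every micro-batch instead of relying on no_sync under FSDP.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    accelerator_config={
        "gradient_accumulation_kwargs": {"sync_each_batch": True},
    },
)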

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Contributor

@muellerzr muellerzr left a comment

Overall this is a good step; however, we have some issues with versioning that need to be taken care of first.

Contributor

@pacman100 pacman100 left a comment

Hello @fabianlim, thank you for the Accelerate PR and for this one, which reduce memory usage with FSDP by forcing gradient synchronization at each step.
An overall comment: what if we change the default in Accelerate to not enable no_sync when preparing an FSDP model? That way the user would not have to pass an extra argument.

@fabianlim
Contributor Author

> Hello @fabianlim, thank you for the Accelerate PR and for this one, which reduce memory usage with FSDP by forcing gradient synchronization at each step.
> An overall comment: what if we change the default in Accelerate to not enable no_sync when preparing an FSDP model? That way the user would not have to pass an extra argument.

@pacman100 thanks for the message. I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in peak memory consumption. This is probably because the memory was activation-dominated.

Thus I concluded that this is use-case dependent and not a good idea to set to True by default. It wouldn't be great for the llama7b user to now have to explicitly set sync_each_batch=False to recover the throughput (and it's probably more insidious, because the user wouldn't be aware that it's turned on, and that training could actually be faster, at no cost, if it were turned off).

@pacman100
Contributor

pacman100 commented Mar 12, 2024

> thanks for the message. I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in peak memory consumption. This is probably because the memory was activation-dominated.
>
> Thus I concluded that this is use-case dependent and not a good idea to set to True by default. It wouldn't be great for the llama7b user to now have to explicitly set sync_each_batch=False to recover the throughput (and it's probably more insidious, because the user wouldn't be aware that it's turned on, and that training could actually be faster, at no cost, if it were turned off).

Got it, that makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate and Transformers docs would help explain when this flag makes a difference. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.

@fabianlim fabianlim force-pushed the feature-disable-no-sync branch 4 times, most recently from 5ef33f2 to 37413ea on March 12, 2024 10:55
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fabianlim fabianlim force-pushed the feature-disable-no-sync branch from 37413ea to 3d1a3d5 on March 12, 2024 12:39
Contributor

@muellerzr muellerzr left a comment

Thanks! We're getting closer.

Also, no need to force-push your commits, we squash the commit history (and force-push makes it harder for us to track things)

@fabianlim
Contributor Author

> Thanks! We're getting closer.
> Also, no need to force-push your commits, we squash the commit history (and force-push makes it harder for us to track things)

Got it. Yes, I mostly force-push out of habit, to avoid bloating my commits. However, I did it once to rebase, because I noticed tests_torch was failing due to daily PR merges.

After my latest changes the tests are failing again, but I shall refrain from rebasing for now.

@fabianlim
Contributor Author

fabianlim commented Mar 13, 2024

> Got it, that makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate and Transformers docs would help explain when this flag makes a difference. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.

@pacman100 how about this addition to the Accelerate docs? I decided to make the example about llama13b instead of 7b. The PR has already been merged, so this will have to be another PR.
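For context (a sketch, not the proposed docs text), enabling the same flag directly on the Accelerate side, assuming accelerate>=0.28 where GradientAccumulationPlugin gained sync_each_batch:

from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# Synchronize gradients on every micro-batch instead of only at the end of each
# accumulation window; lowers peak memory under FSDP at some throughput cost.
plugin = GradientAccumulationPlugin(num_steps=4, sync_each_batch=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)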

Contributor

@muellerzr muellerzr left a comment

I'm not sure how I feel about mixing these conditionals into tests; generally, tests should be fully independent and test very specific things, so I'm not sure I'm a fan of having if self.GRAD_ACCUM_KWARGS_VERSION_AVAILABLE and then testing X.

For the most part it's fine; however, I did leave a note on one specific test that is built to change just one default and leave everything else as normal.

Also left some nits on simplifying the require_version logic.

We're almost there 🤗

@@ -791,6 +793,9 @@ def test_tf32(self):
@require_sentencepiece
@require_tokenizers
class TrainerIntegrationTest(TestCasePlus, TrainerIntegrationCommon):
GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")
require_accelerate_version = partial(require_accelerate, min_version="0.28")
Contributor

No need for this logic, let's just use @require_accelerate(min_version="0.28.0") on the tests that need it

Contributor Author

@muellerzr sorry, can I clarify this remark?

  1. You highlighted both lines 796 and 797. Are you saying you want to remove the boolean GRAD_ACCUM_KWARGS_VERSION_AVAILABLE and use is_accelerate_available("0.28") directly in the conditional?
  2. Or are you saying you want to skip the partial bind and directly decorate the test as @require_accelerate(min_version="0.28")? If so, then I do not think we can follow the require_fsdp function style; we would need to rewire require_accelerate into a "builder":

  def require_accelerate(min_version: str = "0.28"):
      def _require(test_case):
          return unittest.skipUnless(
              is_accelerate_available(min_version), f"test requires accelerate>={min_version}"
          )(test_case)
      return _require

But the issue with this is that the null pattern @require_accelerate doesn't work anymore; we would need to instead write @require_accelerate(), i.e., with the extra () brackets.
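For reference, not from this thread: a minimal sketch of the usual workaround for that concern, a decorator that supports both the bare @require_accelerate form and @require_accelerate(min_version=...). The is_accelerate_available helper here is a simplified stand-in, not the transformers implementation:

import unittest
from functools import partial

from packaging import version


def is_accelerate_available(min_version: str = "0.21.0") -> bool:
    # Simplified stand-in: checks that accelerate is importable and new enough.
    try:
        import accelerate
    except ImportError:
        return False
    return version.parse(accelerate.__version__) >= version.parse(min_version)


def require_accelerate(test_case=None, min_version: str = "0.21.0"):
    # Bare usage (@require_accelerate) receives the test callable directly, while
    # parameterized usage (@require_accelerate(min_version="0.28")) passes only a
    # keyword argument, so we return a partially bound decorator in that case.
    if test_case is None:
        return partial(require_accelerate, min_version=min_version)
    return unittest.skipUnless(
        is_accelerate_available(min_version), f"test requires accelerate>={min_version}"
    )(test_case)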

Contributor

@muellerzr muellerzr Mar 14, 2024

Just mimic what's done in test_fsdp.py, make the partial decorator in the test file:

require_fsdp_version = require_fsdp
if is_accelerate_available():
    ...
    require_fsdp_version = partial(require_fsdp, min_version=FSDP_PYTORCH_VERSION)

Contributor

Here, you can then just use @require_fsdp_version on the test
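Concretely, a sketch of how that pattern could map onto this PR's version gate (the class and test names here are hypothetical; the version string and helper names come from the diff above, and the import paths are assumed):

from functools import partial
import unittest

from transformers.testing_utils import require_accelerate
from transformers.utils import is_accelerate_available

# Module-level gate and decorator, mirroring the require_fsdp_version pattern.
GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")
require_accelerate_version = partial(require_accelerate, min_version="0.28")


class TrainerGradAccumKwargsTest(unittest.TestCase):
    @require_accelerate_version
    def test_accelerator_config_sync_each_batch(self):
        # Skipped automatically when the installed accelerate is older than 0.28.
        ...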

Comment on lines 2631 to 2635
accelerator_config = {
    "split_batches": True,
}
if self.GRAD_ACCUM_KWARGS_VERSION_AVAILABLE:
    accelerator_config["gradient_accumulation_kwargs"] = {"sync_each_batch": True}
Contributor

This changes what the test does. Let's not do this please.

Contributor Author

@fabianlim fabianlim Mar 13, 2024

Yes, agreed. In this case I think it's best to revert this test to its previous state and not introduce gradient_accumulation_kwargs in this particular test at all.

Since specifying gradient_accumulation_kwargs is already non-default, we have a host of other tests that cover it.

Do you agree?

Contributor

Yup

@muellerzr
Contributor

(You may also need to rebase from main for the test failures)

@fabianlim
Contributor Author

fabianlim commented Mar 13, 2024

> (You may also need to rebase from main for the test failures)

@muellerzr should I rebase now, or wait until we resolve most of the changes? If I rebase I will need to force-push, and that makes it harder to track.
I pulled main's changes in, but probably one more pull is needed after #29647 is merged.

I agree that conditionals are not preferred in tests, and there are other ways, like using parameterized, but we would need to rework the self.assert lines to check conditionally (e.g., if grad_accum_kwargs is not specified, etc.). We would also need some conditionals on how to populate the dicts going into the @parameterized decorator, so it comes with its own complexities as well (a sketch of that alternative follows below).

Yes, conditionals are generally not best practice, but in this case I feel the usage is quite minor and does not affect readability very much.
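A minimal sketch of that parameterized alternative (not part of the original comment; the class, helper, and case names are hypothetical). The version gate decides which accelerator_config cases are generated:

import unittest

from parameterized import parameterized

from transformers.utils import is_accelerate_available

GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")


def _accelerator_config_cases():
    # Baseline case that runs on every supported accelerate version.
    cases = [("split_batches_only", {"split_batches": True})]
    # Only generate the gradient-accumulation case when accelerate >= 0.28 is installed.
    if GRAD_ACCUM_KWARGS_VERSION_AVAILABLE:
        cases.append(
            ("sync_each_batch", {"split_batches": True, "gradient_accumulation_kwargs": {"sync_each_batch": True}})
        )
    return cases


class TrainerAcceleratorConfigParamTest(unittest.TestCase):
    @parameterized.expand(_accelerator_config_cases())
    def test_accelerator_config(self, name, accelerator_config):
        # Assertions would branch on which keys are present in accelerator_config.
        ...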

I have incorporated your suggestions to follow the FSDP style of require_fsdp. In an attempt to be more consistent, I have also moved GRAD_ACCUM_KWARGS_VERSION_AVAILABLE to the top of the file and added comments. Note there is one more use of require_accelerate in the same test_trainer.py file that I cannot replace, because doing so would change the logic.

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
@muellerzr
Contributor

@fabianlim you'll need to run make style; make quality; to fix the style tests

Contributor

@muellerzr muellerzr left a comment

Thanks for the back and forth! This is looking great now! Handing off to @amyeroberts for a final review 🤗

@muellerzr muellerzr requested a review from amyeroberts March 14, 2024 11:43
@fabianlim
Contributor Author

It's been a pleasure, @muellerzr! I have just pulled the latest changes from main!

@fabianlim
Contributor Author

Hi @amyeroberts looking forward to your review! If there is anything I can address please feel free to let me know. FYI: @muellerzr

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding!

Main comment is about the version handling in the tests

@fabianlim fabianlim force-pushed the feature-disable-no-sync branch from 8888c52 to 968d415 on March 22, 2024 04:26
@fabianlim
Contributor Author

fabianlim commented Mar 22, 2024

@amyeroberts I pulled main again and have updated the code to conform to @muellerzr's changes in #29779.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding and iterating on this!

Just two small nits

@amyeroberts amyeroberts merged commit 4df5b9b into huggingface:main Mar 28, 2024
20 checks passed

Successfully merging this pull request may close these issues.

Allow Trainer to Sync Gradients Each Batch When Performing Gradient Accumulation
5 participants