Allow GradientAccumulationPlugin to be configured from AcceleratorConfig #29589
Conversation
Overall this is a good step; however, we have some issues with versioning that need to be taken care of first.
Hello @fabianlim, thank you for the Accelerate PR and for this one, which reduces memory usage with FSDP by forcing gradient synchronization at each step.
An overall comment: what if we change the default in Accelerate to not enable no_sync when preparing the FSDP model? That way the user would not have to pass an extra argument.
@pacman100 thanks for the message. I considered this, but decided not to, as I did some experiments with smaller models like llama7b and saw that turning the flag on or off made very little difference in terms of peak memory consumption. This is probably because memory was activation dominated. Thus I concluded that this is use-case dependent, and not a good idea to set by default.
Got it, makes sense, and thank you for all the details. Then maybe more documentation in the Accelerate docs and Transformers docs would help clarify when this flag makes a difference. I think the info about the negligible peak-memory change for the 7B model should be added to the Accelerate docs.
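For context, a minimal Accelerate-level sketch of the flag being discussed; this assumes Accelerate >= 0.28, where huggingface/accelerate#2531 adds sync_each_batch, and is illustrative rather than taken from either PR:

# Minimal sketch, assuming Accelerate >= 0.28 (where sync_each_batch was added).
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# sync_each_batch=True forces gradient synchronization on every batch instead of only
# on the last accumulation step, trading extra communication for lower peak gradient
# memory when FSDP would otherwise run the intermediate steps under no_sync.
plugin = GradientAccumulationPlugin(num_steps=4, sync_each_batch=True)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)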
force-pushed from 5ef33f2 to 37413ea
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
force-pushed from 37413ea to 3d1a3d5
Thanks! We're getting closer.
Also, no need to force-push your commits; we squash the commit history (and force-pushing makes it harder for us to track things).
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Got it. Yes, I mostly have a force-push habit to avoid bloating my commits. However, I did it once to rebase because I noticed the tests were failing. After my latest changes the tests are failing again, but I shall refrain from rebasing for now.
@pacman100 how about this addition to the Accelerate docs? I decided to make the example about llama13b instead of 7b. The PR has already been merged, so this will have to be another PR.
I'm not sure how I feel about mixing conditionals into tests; generally tests should be fully independent and test very specific things, so I'm not sure I'm a fan of having if self.GRAD_ACCUM_KWARGS_VERSION_AVAILABLE and then test X.
For the most part it's fine; however, I did leave a note on one specific test that is built to just change one default and leave everything else normal.
Also left some nits on simplifying the require_version logic.
We're almost there 🤗
tests/trainer/test_trainer.py (outdated)
@@ -791,6 +793,9 @@ def test_tf32(self):
@require_sentencepiece
@require_tokenizers
class TrainerIntegrationTest(TestCasePlus, TrainerIntegrationCommon):
    GRAD_ACCUM_KWARGS_VERSION_AVAILABLE = is_accelerate_available("0.28")
    require_accelerate_version = partial(require_accelerate, min_version="0.28")
No need for this logic; let's just use @require_accelerate(min_version="0.28.0") on the tests that need it.
@muellerzr sorry, can I clarify this remark?
- You highlighted both lines 796 and 797. Are you saying you want to remove the boolean GRAD_ACCUM_KWARGS_VERSION_AVAILABLE and use is_accelerate_available("0.28") directly in the conditional?
- Or are you saying you want to skip the partial bind and directly decorate the test as @require_accelerate(min_version="0.28")? If so, then I do not think we can follow the require_fsdp function style; we will need to rewrite require_accelerate into a "builder":
def require_accelerate(min_version: str = "0.28"):
    def _require(test_case):
        return unittest.skipUnless(is_accelerate_available(min_version), "test requires accelerate")(test_case)
    return _require
But the issue with this is that the bare @require_accelerate pattern doesn't work anymore; we would instead need to write @require_accelerate(), i.e., with the extra () brackets.
Just mimic what's done in test_fsdp.py and make the partial decorator in the test file:
require_fsdp_version = require_fsdp
if is_accelerate_available():
    ...
    require_fsdp_version = partial(require_fsdp, min_version=FSDP_PYTORCH_VERSION)
Here, you can then just use @require_fsdp_version on the test.
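For illustration, a rough sketch of the analogous pattern for the Accelerate version gate; it assumes require_accelerate accepts a min_version keyword (as discussed above), and the require_accelerate_version name mirrors the FSDP one rather than coming from this PR's diff:

from functools import partial

from transformers.testing_utils import require_accelerate
from transformers.utils import is_accelerate_available

# Fall back to the plain decorator when Accelerate is absent; otherwise bind the
# minimum version that ships gradient_accumulation_kwargs support (0.28).
require_accelerate_version = require_accelerate
if is_accelerate_available():
    require_accelerate_version = partial(require_accelerate, min_version="0.28")

Tests that exercise gradient_accumulation_kwargs can then be decorated with @require_accelerate_version, while the bare @require_accelerate keeps working everywhere else.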
tests/trainer/test_trainer.py (outdated)
accelerator_config = {
    "split_batches": True,
}
if self.GRAD_ACCUM_KWARGS_VERSION_AVAILABLE:
    accelerator_config["gradient_accumulation_kwargs"] = {"sync_each_batch": True}
This changes what the test does. Let's not do this please.
Yes, agreed. In this case I think it's best to revert this test to its previous state and not introduce gradient_accumulation_kwargs at all in this particular test. Since specifying gradient_accumulation_kwargs is already non-default, we have a host of other tests that cover it. Do you agree?
Yup
(You may also need to rebase from main for the test failures)
I agree that conditionals are not preferred in tests, and there are other ways around them. Yes, conditionals are generally not best practice, but in this case I feel the usage is quite minor and does not affect readability very much. I have incorporated your suggestions to follow the FSDP style of require_fsdp_version.
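For illustration only, a hypothetical sketch of how such a dedicated, version-gated test could sit inside TrainerIntegrationTest (the test name and config values are assumptions, not the final code; assertions are elided):

@require_accelerate_version
def test_accelerator_config_gradient_accumulation_kwargs(self):
    # The original split_batches test stays untouched; the new kwargs get their own test.
    accelerator_config = {"gradient_accumulation_kwargs": {"sync_each_batch": True}}
    args = TrainingArguments(
        output_dir=self.get_auto_remove_tmp_dir(),
        accelerator_config=accelerator_config,
    )
    ...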
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
@fabianlim you'll need to run
Thanks for the back and forth! This is looking great now! Handing off to @amyeroberts for a final review 🤗
It's been a pleasure @muellerzr! I have just pulled the latest changes from main.
Hi @amyeroberts, looking forward to your review! If there is anything I can address, please feel free to let me know. FYI: @muellerzr
Thanks for adding!
Main comment is about the version handling in the tests
force-pushed from 8888c52 to 968d415
@amyeroberts I pulled main again and have updated the code to conform to @muellerzr's changes in #29779.
Thanks for adding and iterating on this!
Just two small nits
What does this PR do?
Fixes #29425. Also please refer to the accompanying PR huggingface/accelerate#2531, which implements an extra control, sync_each_batch, for GradientAccumulationPlugin. Before these changes, GradientAccumulationPlugin was configured by Trainer with a fixed set of hardcoded values. This PR allows the user to set sync_each_batch if memory issues are faced when using FSDP with no_sync.
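As a usage illustration (a sketch under the assumption that a plain dict is accepted for accelerator_config, as in the tests above; it is not taken from this PR's diff):

# Sketch: opting into sync_each_batch from the Trainer side; requires accelerate >= 0.28.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    accelerator_config={
        # Forwarded by Trainer into GradientAccumulationPlugin.
        "gradient_accumulation_kwargs": {"sync_each_batch": True},
    },
)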
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@muellerzr