Allow Trainer to Sync Gradients Each Batch When Performing Gradient Accumulation #29425
Comments
Hi! This solution does indeed make sense to me; let's start with a PR to accelerate and then the upstream to transformers? :) Note: for the …
@muellerzr thanks for looking at the issue. I understand; I will add the … Will have an accelerate PR to review soon. :)
@muellerzr As discussed, I have first begun to draft an accelerate PR. While fixing the tests, I noticed that one of the old tests … In the PR I have raised, I have the … Finally, I have yet to update the docs; if you have any pointers on which documentation I should focus on, please let me know.
There seems to be a bug. If I set sync_each_batch=True, the optimizer updates every batch, even if I set gradient_accumulation_steps=4.
Thanks for reporting. We do have unit tests, but maybe we overlooked something.
To help me understand better, do you have a reproduction of what you are seeing? Update: also, just to make sure, you are not using CPU offload with FSDP and …
Thanks for your clarification. Here is some part of my code:
I only use Accelerator in my code and do not use Trainer (it seems to be OK). This code will work smoothly when … I think the …
@Nightmare-n thanks for sharing the above code snippet. I see that you follow the gradient accumulation concept guide, but now I'm confused about what Accelerator.accumulate does, so let me clarify with the maintainers first. @muellerzr, in the grad accum concept guide it does say that one can remove the …
So did something change in the implementation? If what I said above is correct, then the concept guide is inaccurate, and I have an explanation for @Nightmare-n's observation.
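For reference, here is a minimal sketch of the concept-guide pattern being discussed; the toy model, data, and hyperparameters are placeholders purely for illustration:

```python
# Sketch of the Accelerate gradient-accumulation pattern referenced above.
# The toy model, data and hyperparameters are placeholders for illustration only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    # accumulate() tracks the step count and decides whether this batch syncs gradients
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        # No explicit "if step % accumulation_steps == 0" guard: the wrapped
        # optimizer is expected to skip the real step on non-sync batches.
        optimizer.step()
        optimizer.zero_grad()
```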
@Nightmare-n were you trying with DDP or FSDP?
Yes, I am trying both DDP and FSDP. I use AcceleratedOptimizer, and the …
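For context on the stepping behaviour being discussed, here is a simplified paraphrase (not the actual accelerate source) of how the wrapped optimizer gates its step on `sync_gradients`:

```python
# Simplified paraphrase of the wrapped-optimizer behaviour discussed above
# (not the actual accelerate source): the real step/zero_grad only run when
# the shared gradient state says gradients should sync this batch.
class SketchedAcceleratedOptimizer:
    def __init__(self, optimizer, gradient_state):
        self.optimizer = optimizer
        self.gradient_state = gradient_state  # shared state toggled by accumulate()/_do_sync()

    def step(self, closure=None):
        if self.gradient_state.sync_gradients:
            # end of an accumulation window: perform the real parameter update
            self.optimizer.step(closure)
        # otherwise skip the update and keep accumulating gradients

    def zero_grad(self, set_to_none=True):
        if self.gradient_state.sync_gradients:
            self.optimizer.zero_grad(set_to_none=set_to_none)
        # likewise skipped mid-accumulation so accumulated gradients are not cleared
```

This is why, if `sync_gradients` were (incorrectly) set to True on every batch, the optimizer would step on every batch even with accumulation enabled.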
@muellerzr @Nightmare-n Oh no, my bad. I completely overlooked this. That means this PR, as @Nightmare-n said, is incorrect. I drafted out something quickly here: huggingface/accelerate#2790, but I haven't had time to test; let me try to find some time. @Nightmare-n gave a suggestion to fix it inside …
Hi, is this config used in transformers/src/transformers/trainer.py (lines 2475 to 2479 at 052e652)? When I pass accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': {'sync_each_batch': True}, 'use_configured_state': False} to TrainingArguments, I get the following error:
```
Traceback (most recent call last):
  File "main.py", line 70, in <module>
    main()
  File "main.py", line 90, in main
    trainer.train()
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2480, in _inner_training_loop
    with context():
  File "/usr/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 973, in no_sync
    with context():
  File "/usr/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1995, in no_sync
    assert not self.zero_optimization_partition_gradients(), \
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
```

I'm sorry if I'm doing something wrong. I'll create a new issue if necessary. Best regards,
@nzw0301 I see, thanks for reporting this. No, you are not doing anything wrong. I didn't realize that DeepSpeed does not allow the no_sync context manager. It means that this feature is incompatible with DeepSpeed, and we should maybe have a check to give a warning somewhere. While it's not a big deal, it will be good to have. If you would like to raise an issue, go ahead, or if I manage to get around to doing this I will reference your comment.
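A minimal sketch of the kind of guard being suggested (a hypothetical helper, not existing transformers or accelerate code), assuming the DeepSpeed plugin is reachable from the accelerator state:

```python
# Hypothetical guard of the kind discussed above: warn and drop sync_each_batch
# when DeepSpeed gradient partitioning is active, since its engine asserts against
# no_sync in that case. This is a sketch, not existing library code.
import warnings


def maybe_disable_sync_each_batch(accelerator, grad_accum_kwargs):
    ds_plugin = getattr(accelerator.state, "deepspeed_plugin", None)
    if ds_plugin is not None and grad_accum_kwargs.get("sync_each_batch", False):
        # ZeRO stages >= 2 partition gradients, which DeepSpeed's no_sync rejects
        if ds_plugin.zero_stage >= 2:
            warnings.warn(
                f"sync_each_batch=True is incompatible with DeepSpeed ZeRO stage "
                f"{ds_plugin.zero_stage}; ignoring the flag."
            )
            grad_accum_kwargs = {**grad_accum_kwargs, "sync_each_batch": False}
    return grad_accum_kwargs
```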
@fabianlim Thank you for your comment! I've created an issue.
Feature request
We propose a feature to allow:
- `_do_sync` to take a `force` boolean flag, where `_do_sync(force=True)` forces a gradient sync.
- `Trainer` / `Accelerate` to appropriately pass the `force` flag if the user requests the gradients to sync during accumulation.

During the main `_inner_training_loop`, the `training_step` is run under a `contextmanager` created by `Accelerator.accumulate`. If we inspect the `contextmanager`, we notice that `Accelerator.accumulate` will return the `no_sync` context whenever `self.sync_gradients == False`. On inspection, `_do_sync` sets `self.sync_gradients == True` only at the end of a gradient accumulation batch. NOTE: `Trainer` sets `sync_with_dataloader = False` and this cannot be changed, so the first clause of `_do_sync` will never execute. Hence we propose to allow the user to force `_do_sync` to set `self.gradient_state._set_sync_gradients(True)`.
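For concreteness, a minimal sketch of the proposed `force` flag on `_do_sync` (a simplified paraphrase of the Accelerator method, not the final PR code):

```python
# Simplified sketch of the proposed force flag on _do_sync (not the final PR code).
# force=True marks every batch as a sync batch, which prevents accumulate() from
# entering no_sync downstream.
def _do_sync(self, force: bool = False):
    if self.gradient_state.sync_with_dataloader and self.gradient_state.end_of_dataloader:
        # "first clause": only reachable when sync_with_dataloader is True,
        # which Trainer hard-codes to False
        self.step = 0
        self.gradient_state._set_sync_gradients(True)
    else:
        self.step += 1
        self.gradient_state._set_sync_gradients(
            force or (self.step % self.gradient_state.num_steps) == 0
        )
```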
Motivation
Not syncing gradients can have adverse effects in distributed training. As has been warned in `torch`, the `no_sync` context manager for FSDP will incur additional memory requirements.

Gradient accumulation in FSDP often results in OOM on large models with a moderate number of GPUs. This occurs because `Trainer` by default will activate `no_sync` when using gradient accumulation, effectively disabling gradient synchronization to reduce communication across shards. However, this results in high memory usage because parameters and gradients are not resharded. We propose a solution that avoids OOM by allowing the user to enable synchronization of parameters and gradients on all (or some) of the data batches when using gradient accumulation.

Setting: in the table below, we see Mixtral (47B parameters) and CodeLlama (34B parameters) will OOM on 8 A100-80GB GPUs when using gradient accumulation. However, when we enable synchronization (i.e. disable `no_sync`), there is no noticeable increase in GPU memory consumption when using gradient accumulation.
Your contribution
We can help contribute PRs into `transformers` and `accelerate` to effect these changes. We propose to do the following in the `transformers` and `accelerate` packages.

Accelerate Repository:
- Extend `GradientAccumulationPlugin` with the per-batch sync option (surfaced in the comments above as `sync_each_batch`); see the sketch after this list.
- Pass `force` into `_do_sync`.
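A minimal sketch of how the resulting plugin-level configuration could look on the Accelerate side (illustrative; it mirrors the `sync_each_batch` flag referenced in the comments above rather than prescribing the final API):

```python
# Illustrative: configure gradient accumulation so that gradients sync
# (and FSDP reshards) on every batch, not only on the final accumulation step.
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=4,           # accumulate over 4 batches before each optimizer step
    sync_each_batch=True,  # force a gradient sync on every batch
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
```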
Transformers Repository:
- `TrainingArguments`: expose the new setting (the comments above pass it through `accelerator_config`'s `gradient_accumulation_kwargs`); see the sketch below.
- Update `create_accelerator_and_postprocess` to configure the `GradientAccumulationPlugin` accordingly.

Documentation
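On the `transformers` side, a minimal sketch of how a user could request this behaviour through `TrainingArguments` (mirroring the `accelerator_config` shown in the comments above; illustrative, not a specification):

```python
# Illustrative: request per-batch gradient sync during accumulation via
# TrainingArguments' accelerator_config, as shown earlier in this thread.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    accelerator_config={
        "gradient_accumulation_kwargs": {"sync_each_batch": True},
    },
)
```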