storing & logging gradient norm in trainer #27326
Conversation
cc @muellerzr
Thanks! This looks good to me. Can you rebase from main to hopefully deal with the failing tests?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Thank you for the work on this, @shijie-wu! It may seem like a little PR to some, but this would be a huge step to bring …
Gentle ping @shijie-wu :)
Found that:

```python
_grad_norm = self.accelerator.clip_grad_norm_(
    model.parameters(),
    args.max_grad_norm,
)
if self.accelerator.distributed_type == DistributedType.DEEPSPEED:
    grad_norm = model.get_global_grad_norm()
else:
    grad_norm = _grad_norm.item() if _grad_norm is not None else None
```
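For context, here is a small hedged sketch (not part of this PR) of how the reported value could be read back after training, assuming it lands in `TrainerState.log_history` under a `grad_norm` key, as the checkpoint discussion further down suggests; `get_logged_grad_norms` is a hypothetical helper:

```python
from transformers import Trainer


def get_logged_grad_norms(trainer: Trainer) -> list[float]:
    """Collect the "grad_norm" values recorded in the trainer's log history."""
    return [
        entry["grad_norm"]
        for entry in trainer.state.log_history
        if "grad_norm" in entry
    ]
```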
sorry for the delay! PTAL @muellerzr @mjbommar
Gentle ping @muellerzr @mjbommar :)
Thanks! Sorry for the delay!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @amyeroberts for final review :)
Thanks for adding!
Not sure if this was mentioned anywhere, but this PR breaks training checkpoint saving, because the `grad_norm` entries in `log_history` are tensors that `json.dumps` cannot serialize. My fix for this was to patch `save_to_json`:

```python
def save_to_json(self, json_path: str):
    """Save the content of this instance in JSON format inside `json_path`."""
    selfd = dataclasses.asdict(self)
    for d in selfd['log_history']:
        if 'grad_norm' in d:
            d['grad_norm'] = d['grad_norm'].item()
    json_string = json.dumps(selfd, indent=2, sort_keys=True) + "\n"
    with open(json_path, "w", encoding="utf-8") as f:
        f.write(json_string)
```

but this is probably not the best way to do this.
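For what it's worth, one alternative sketch would be to leave `log_history` untouched and let `json.dumps` unwrap stray tensors through its `default=` hook. This is only an illustration of the idea, not the fix that was merged, and `_json_default` is a hypothetical name:

```python
import dataclasses
import json

import torch


def _json_default(obj):
    # Fallback for objects json cannot serialize: unwrap single-element tensors
    # (such as a tensor-valued grad_norm) into plain Python numbers.
    if isinstance(obj, torch.Tensor) and obj.numel() == 1:
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def save_to_json(self, json_path: str):
    """Save the content of this instance in JSON format inside `json_path`."""
    json_string = (
        json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True, default=_json_default)
        + "\n"
    )
    with open(json_path, "w", encoding="utf-8") as f:
        f.write(json_string)
```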
@152334H it does convert `grad_norm` to a number before passing it on; see transformers/src/transformers/trainer.py, lines 2010 to 2016 at commit 831bc25. The same holds for DeepSpeed. What backend were you using?
DeepSpeed ZeRO-2. It seems likely that the type hint is not universally correct: the value returned by `scaled_global_norm` for ZeRO-2 is a scalar tensor, and it is subsequently assigned to `_global_grad_norm` without any `.item()`.
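If that diagnosis is right, a small defensive conversion would keep the tensor from leaking into the logs. The sketch below is only illustrative (`_norm_to_float` is a hypothetical helper, not library code), and the fix that actually shipped may look different:

```python
import torch


def _norm_to_float(grad_norm):
    """Return the gradient norm as a plain float (or None), whether the backend
    handed back a Python number or a 0-d tensor (as reported for ZeRO-2)."""
    if grad_norm is None:
        return None
    if isinstance(grad_norm, torch.Tensor):
        return grad_norm.item()
    return float(grad_norm)


# e.g. grad_norm = _norm_to_float(model.get_global_grad_norm()) in the
# DeepSpeed branch of the snippet quoted earlier in this thread.
```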
I'm facing the same issue with DeepSpeed stage 1, can you please fix this? I need to use v4.38.0 for a different fix.
Can you all try installing from the latest main? This PR may have fixed this as well: #29444
That fixed it for me! Thanks a lot |
Same error here:

```
 11%|████████████████████████▏ | 800/7050 [4:07:59<32:10:41, 18.53s/it]
Trainer is attempting to log a value of "2.204314947128296" of type <class 'torch.Tensor'> for key "train/grad_norm" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
```

Please tell me how to fix it.
What does this PR do?
Report gradient norm during training - Fixes #26143
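As a rough usage sketch (the argument values below are illustrative, not taken from this PR), the reported norm is expected to show up alongside the other metrics emitted every `logging_steps`:

```python
from transformers import TrainingArguments

# Hedged sketch: with this change, the clipped gradient's norm should be
# reported together with loss and learning rate at each logging step.
args = TrainingArguments(
    output_dir="out",
    logging_steps=10,    # how often training metrics (including grad_norm) are logged
    max_grad_norm=1.0,   # gradient clipping threshold used by the Trainer
    report_to="tensorboard",
)
```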
Who can review?
@muellerzr @pacman100