storing & logging gradient norm in trainer #27326
Conversation
cc @muellerzr
Thanks! This looks good to me. Can you rebase from main to hopefully deal with the failing tests?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Thank you for the work on this, @shijie-wu! It may seem like a little PR to some, but this would be a huge step to bring …
Gentle ping @shijie-wu :)
Found that:

```python
_grad_norm = self.accelerator.clip_grad_norm_(
    model.parameters(),
    args.max_grad_norm,
)
if self.accelerator.distributed_type == DistributedType.DEEPSPEED:
    grad_norm = model.get_global_grad_norm()
else:
    grad_norm = _grad_norm.item() if _grad_norm is not None else None
```
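For context, here is a small hedged sketch (not part of this PR) of how the reported value could be read back after training, assuming it lands in `TrainerState.log_history` under a `grad_norm` key, as the checkpoint discussion further down suggests; `get_logged_grad_norms` is a hypothetical helper:

```python
from transformers import Trainer


def get_logged_grad_norms(trainer: Trainer) -> list[float]:
    """Collect the "grad_norm" values recorded in the trainer's log history."""
    return [
        entry["grad_norm"]
        for entry in trainer.state.log_history
        if "grad_norm" in entry
    ]
```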
sorry for the delay! PTAL @muellerzr @mjbommar
Gentle ping @muellerzr @mjbommar :)
Thanks! Sorry for the delay!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @amyeroberts for final review :)
Thanks for adding!
Not sure if this was mentioned anywhere, but this PR breaks training checkpoint saving, because the `grad_norm` entries in `log_history` are tensors that `json.dumps` cannot serialize. My fix for this was to patch `save_to_json`:

```python
def save_to_json(self, json_path: str):
    """Save the content of this instance in JSON format inside `json_path`."""
    selfd = dataclasses.asdict(self)
    for d in selfd['log_history']:
        if 'grad_norm' in d:
            d['grad_norm'] = d['grad_norm'].item()
    json_string = json.dumps(selfd, indent=2, sort_keys=True) + "\n"
    with open(json_path, "w", encoding="utf-8") as f:
        f.write(json_string)
```

but this is probably not the best way to do this.
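For what it's worth, one alternative sketch would be to leave `log_history` untouched and let `json.dumps` unwrap stray tensors through its `default=` hook. This is only an illustration of the idea, not the fix that was merged, and `_json_default` is a hypothetical name:

```python
import dataclasses
import json

import torch


def _json_default(obj):
    # Fallback for objects json cannot serialize: unwrap single-element tensors
    # (such as a tensor-valued grad_norm) into plain Python numbers.
    if isinstance(obj, torch.Tensor) and obj.numel() == 1:
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def save_to_json(self, json_path: str):
    """Save the content of this instance in JSON format inside `json_path`."""
    json_string = (
        json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True, default=_json_default)
        + "\n"
    )
    with open(json_path, "w", encoding="utf-8") as f:
        f.write(json_string)
```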
@152334H it does convert `grad_norm` to a number before passing it on; see transformers/src/transformers/trainer.py, lines 2010 to 2016 at commit 831bc25. The same holds for DeepSpeed. What backend were you using?
DeepSpeed ZeRO-2. It seems likely that the type hint is not universally correct: the value returned by `scaled_global_norm` for ZeRO-2 is a scalar tensor, and it is subsequently assigned to `_global_grad_norm` without any `.item()`.
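If that diagnosis is right, a small defensive conversion would keep the tensor from leaking into the logs. The sketch below is only illustrative (`_norm_to_float` is a hypothetical helper, not library code), and the fix that actually shipped may look different:

```python
import torch


def _norm_to_float(grad_norm):
    """Return the gradient norm as a plain float (or None), whether the backend
    handed back a Python number or a 0-d tensor (as reported for ZeRO-2)."""
    if grad_norm is None:
        return None
    if isinstance(grad_norm, torch.Tensor):
        return grad_norm.item()
    return float(grad_norm)


# e.g. grad_norm = _norm_to_float(model.get_global_grad_norm()) in the
# DeepSpeed branch of the snippet quoted earlier in this thread.
```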
I'm facing the same issue with DeepSpeed stage 1, can you please fix this? I need to use v4.38.0 for a different fix.
Can you all try installing from the latest main? This PR may have fixed this as well: #29444
That fixed it for me! Thanks a lot |
Same error here:

```
 11%|████████████████████████▏ | 800/7050 [4:07:59<32:10:41, 18.53s/it]
Trainer is attempting to log a value of "2.204314947128296" of type <class 'torch.Tensor'> for key "train/grad_norm" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
```

Please tell me how to fix it.
What does this PR do?
Report gradient norm during training - Fixes #26143
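As a rough usage sketch (the argument values below are illustrative, not taken from this PR), the reported norm is expected to show up alongside the other metrics emitted every `logging_steps`:

```python
from transformers import TrainingArguments

# Hedged sketch: with this change, the clipped gradient's norm should be
# reported together with loss and learning rate at each logging step.
args = TrainingArguments(
    output_dir="out",
    logging_steps=10,    # how often training metrics (including grad_norm) are logged
    max_grad_norm=1.0,   # gradient clipping threshold used by the Trainer
    report_to="tensorboard",
)
```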
Who can review?
@muellerzr @pacman100