Fix galore lr display with schedulers #31710
Conversation
Failing tests seem unrelated to me: TF and hub issues.
Thanks, overall this makes sense. Can you add a test in trainer_utils for this by chance? https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer_utils.py
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@muellerzr Wouldn't it make more sense over here? (transformers/tests/trainer/test_trainer.py, line 1452 at 3345ae7)
I would add two tests:
Is that reasonable? I'm not sure when I'll have the time, though.
```python
# reach given learning rate peak and end with 0 lr
self.assertTrue(logs[num_warmup_steps - 2]["learning_rate"] == learning_rate)
self.assertTrue(logs[-1]["learning_rate"] == 0)

# increasing and decreasing pattern of lrs
increasing_lrs = [
    logs[i]["learning_rate"] < logs[i + 1]["learning_rate"]
    for i in range(len(logs))
    if i < num_warmup_steps - 2
]
decreasing_lrs = [
    logs[i]["learning_rate"] > logs[i + 1]["learning_rate"]
    for i in range(len(logs) - 1)
    if i >= num_warmup_steps - 2
]

self.assertTrue(all(increasing_lrs))
self.assertTrue(all(decreasing_lrs))

# warm up steps << total steps
self.assertTrue(len(decreasing_lrs) > len(increasing_lrs))
```
Just checking for the general patterns of the cosine scheduler. We could hardcode the exact values, but I don't think that's necessary. A standalone sketch of the expected lr pattern follows below.
Moved the tests into the general trainer tests, but they could also be moved elsewhere. I thought it was more appropriate over here.
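For readers unfamiliar with the shape of a cosine-with-warmup schedule, here is a small standalone sketch (toy optimizer, illustrative lr and step counts, not the values used in the actual test) showing the same increase-then-decrease pattern the assertions check, using `get_cosine_schedule_with_warmup` from transformers:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# toy parameter/optimizer; lr and step counts are illustrative only
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10, num_training_steps=100
)

lrs = []
for _ in range(100):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()

# linear warmup up to the configured peak lr, then cosine decay towards 0
assert max(lrs) == 1e-4
assert all(a < b for a, b in zip(lrs[:10], lrs[1:11]))   # warmup: increasing
assert all(a > b for a, b in zip(lrs[10:-1], lrs[11:]))  # decay: decreasing
```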
Good job with the test!
cc @amyeroberts for final review
Thanks for fixing!
Just a comment on the default LR
src/transformers/optimization.py (outdated)
```diff
@@ -519,7 +519,7 @@ def scheduler_hook(param):
         if param.requires_grad:
             param.register_post_accumulate_grad_hook(scheduler_hook)

-    return LayerWiseDummyScheduler()
+    return LayerWiseDummyScheduler(optimizer_dict=optimizer_dict, lr=optimizer.defaults.get("lr", 1e-3))
```
Where does the 1e-3 come from here?
It's like a double fallback; it shouldn't be necessary, since the dummy optimizer is guaranteed to have a value.
The 1e-3 itself comes from torch galore, as it's their specific default.
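To spell out the fallback chain (the numbers here are illustrative): `optimizer.defaults` is populated by PyTorch with whatever lr the optimizer was constructed with, so the second argument of `.get("lr", 1e-3)` is only reached if no lr default exists at all.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))

# stand-in for the dummy optimizer, which (per the discussion above)
# is always constructed with an lr, so defaults["lr"] is set
optimizer = torch.optim.AdamW([param], lr=2e-4)
print(optimizer.defaults.get("lr", 1e-3))  # 2e-4 -- the 1e-3 fallback is never used
```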
src/transformers/trainer_pt_utils.py (outdated)
```diff
         last_epoch = -1
         verbose = False
         super().__init__(optimizer, last_epoch, verbose)

     def get_lr(self):
-        return [group["lr"] for group in self.optimizer.param_groups]
+        # default value
+        lrs = [1e-3]
```
I think we should move the 1e-3 value out to a constant that both get_lr and get_scheduler reference, so that we only need to update it in one place.
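A sketch of what that suggestion could look like; the constant name and helper are hypothetical, purely to show the "single place to update" idea, and are not code from the PR:

```python
# hypothetical shared constant, referenced by both get_scheduler and get_lr
GALORE_DEFAULT_LR = 1e-3

def _fallback_lr(optimizer):
    # hypothetical helper: one place to change the fallback value
    return optimizer.defaults.get("lr", GALORE_DEFAULT_LR)
```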
The default value is in the dummy optimizer, so I'll just save the lrs on the initial creation of the dummy scheduler. This way we won't have the hardcoded value.
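A minimal sketch of that approach, assuming PyTorch ≥ 2.0 for the `LRScheduler` base class; the class name and signature are hypothetical approximations of the idea, not the merged code:

```python
from torch.optim.lr_scheduler import LRScheduler

class LayerWiseLrReportingScheduler(LRScheduler):
    """Sketch only: report the per-layer optimizers' lrs instead of a hardcoded default."""

    def __init__(self, optimizer_dict, optimizer):
        # capture the lrs from the per-parameter optimizers once, at creation time,
        # so get_lr never needs a hardcoded fallback like 1e-3
        self.default_lrs = [
            group["lr"]
            for opt in optimizer_dict.values()
            for group in opt.param_groups
        ]
        super().__init__(optimizer, last_epoch=-1)

    def get_lr(self):
        # values were saved at construction, nothing hardcoded here
        return self.default_lrs
```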
LGTM - thanks for iterating!
Rebasing on main should resolve any timeout issues on the CI runs
Force-pushed from 55f9d8f to 230adf6.
@amyeroberts One timeout didn't make it through. Is it just my luck? 😆
@vasqu Just bad luck - although we'll need to look into why these flaky failures are happening on our side. Thankfully some re-runs worked. Thanks for your patience!
What does this PR do?
See #31707 for a detailed rundown. Fixes #31707
Tl;dr: Galore still has issues displaying the correct lr, this time due to the lr scheduler.
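For reference, a rough sketch of the kind of configuration where the displayed lr was off (argument values are illustrative, and this assumes galore_torch is installed):

```python
from transformers import TrainingArguments

# layer-wise GaLore builds one optimizer per target parameter, plus a dummy
# optimizer/scheduler pair that the Trainer uses for bookkeeping and logging
args = TrainingArguments(
    output_dir="out",
    optim="galore_adamw_layerwise",
    optim_target_modules=["attn", "mlp"],
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    logging_steps=1,
)
# before this fix, the "learning_rate" entries in the Trainer's log history
# did not follow the schedule applied to the per-layer optimizers
```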
Who can review?
@muellerzr @SunMarc @amyeroberts @Minami-su