
Fix multiple lr scheduler (warmup scheduler) & Add adaptive_patience to the lr scheduler #3035

Closed
wants to merge 24 commits

Conversation


@sungmanc (Contributor) commented on Mar 5, 2024

Summary

This PR introduces:

  • Resolves the multiple-LR-scheduler issue by overriding optimizer_step() to perform the warmup scheduling, so the warmup logic is no longer driven by LinearWarmupScheduler (see the sketch right after this list).
  • Adds adaptive_patience to the LRScheduler.
  • Adds unit tests for the warmup logic.
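
A minimal sketch of this override, for illustration only; the attribute names warmup_steps, warmup_by_epoch, and init_lr follow the diff discussed in the review below, and the hook signature is the Lightning 2.x one — this is not the exact OTX code.

```python
# Illustrative sketch of warmup handled inside optimizer_step(); not the OTX implementation.
from lightning.pytorch import LightningModule


class WarmupModuleSketch(LightningModule):
    def __init__(self, warmup_steps: int = 10, warmup_by_epoch: bool = False, init_lr: float = 0.01):
        super().__init__()
        self.warmup_steps = warmup_steps
        self.warmup_by_epoch = warmup_by_epoch
        self.init_lr = init_lr

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
        # Run the regular optimizer step first.
        optimizer.step(closure=optimizer_closure)

        # While still inside the warmup window, linearly rescale every param group's LR.
        step = self.trainer.current_epoch if self.warmup_by_epoch else self.trainer.global_step
        if step < self.warmup_steps:
            scale = min(1.0, float(step + 1) / self.warmup_steps)
            for pg in optimizer.param_groups:
                pg["lr"] = scale * self.init_lr
```

Whether the LR rescale should happen before or after optimizer.step() is debated later in this review.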

TODOs:

  • Resolve the hard-coded index for the optimizer and scheduler (i.e. optimizers[0]).
  • For future work, we might use a callback to enable the warmup scheduler; the feasibility needs to be checked. If that approach works, we could remove warmup_steps and warmup_by_epochs from the base OTXLitModule.

NOTE:

  • I kept lr_scheduler_configs as a list type since I got an error when I changed it to a dictionary. I think this can be handled in the next phase since it is not critical for this PR.

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added e2e tests for validation.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions bot added the TEST (Any changes in tests) and OTX 2.0 labels on Mar 5, 2024
@sungmanc marked this pull request as ready for review on March 7, 2024 02:23
@harimkang (Contributor) left a comment:

The warmup settings seem to be different on models, is this by design? Could you please double check?

@@ -150,6 +157,7 @@ def ensure_list(item: Any) -> list:  # noqa: ANN401
            optimizer(params=self.parameters()) if callable(optimizer) else optimizer
            for optimizer in ensure_list(self.hparams.optimizer)
        ]
        self.init_lr = optimizers[0].param_groups[0]["lr"]
Contributor

Suggested change
-        self.init_lr = optimizers[0].param_groups[0]["lr"]
+        # Capture initial_lr
+        for optimizer in optimizers:
+            for param_group in optimizer.param_groups:
+                param_group.setdefault('initial_lr', param_group["lr"])

Contributor Author

Done, fa86c3b

Comment on lines 183 to 194
        def _scale_lr(start_point: int, end_point: int, init_lr: float) -> float:
            return min(1.0, float(start_point + 1) / end_point) * init_lr

        optimizer.step(closure=closure)

        if self.warmup_by_epoch and self.trainer.current_epoch < self.warmup_steps:
            for pg in optimizer.param_groups:
                pg["lr"] = _scale_lr(self.trainer.current_epoch, self.warmup_steps, self.init_lr)

        if not self.warmup_by_epoch and (self.trainer.global_step < self.warmup_steps):
            for pg in optimizer.param_groups:
                pg["lr"] = _scale_lr(self.trainer.global_step, self.warmup_steps, self.init_lr)
Contributor

Suggested change
-        def _scale_lr(start_point: int, end_point: int, init_lr: float) -> float:
-            return min(1.0, float(start_point + 1) / end_point) * init_lr
-
-        optimizer.step(closure=closure)
-
-        if self.warmup_by_epoch and self.trainer.current_epoch < self.warmup_steps:
-            for pg in optimizer.param_groups:
-                pg["lr"] = _scale_lr(self.trainer.current_epoch, self.warmup_steps, self.init_lr)
-
-        if not self.warmup_by_epoch and (self.trainer.global_step < self.warmup_steps):
-            for pg in optimizer.param_groups:
-                pg["lr"] = _scale_lr(self.trainer.global_step, self.warmup_steps, self.init_lr)
+        def _scale_lr(start_point: int, end_point: int, param_group) -> float:
+            return min(1.0, float(start_point + 1) / end_point) * param_group["initial_lr"]
+
+        if self.trainer.current_epoch < self.warmup_steps:
+            lr_step = self.trainer.current_epoch if self.warmup_by_epoch else self.trainer.global_step
+            for pg in optimizer.param_groups:
+                pg["lr"] = _scale_lr(lr_step, self.warmup_steps, pg)
+
+        optimizer.step(closure=closure)

@vinnamkim (Contributor) commented Mar 7, 2024:

Shouldn't the _scale_lr() call come before optimizer.step(closure=closure), so that the warmup LR takes effect with top priority?

To validate this behavior, please add integration tests for the following scenario:

With a cosine LR scheduler, warmup_steps=10, 5 iterations per epoch, and training for 10 epochs, validate that the LR curve is scheduled correctly.
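
A hedged sketch of such a test, simulating the schedule arithmetic with a plain torch optimizer rather than the real OTX training loop; the test name and structure are illustrative assumptions.

```python
# Illustrative only: simulates warmup + cosine decay over 50 steps and checks the LR curve shape.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR


def test_warmup_then_cosine_lr_curve():
    init_lr, warmup_steps, iters_per_epoch, epochs = 0.1, 10, 5, 10
    total_steps = iters_per_epoch * epochs

    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=init_lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)

    lrs = []
    for step in range(total_steps):
        optimizer.step()
        if step < warmup_steps:
            # Linear warmup overrides the scheduler, mirroring the optimizer_step() override.
            for pg in optimizer.param_groups:
                pg["lr"] = min(1.0, (step + 1) / warmup_steps) * init_lr
        else:
            scheduler.step()
        lrs.append(optimizer.param_groups[0]["lr"])

    # LR ramps up monotonically to init_lr during warmup, then decays monotonically.
    assert all(a <= b for a, b in zip(lrs[:warmup_steps], lrs[1:warmup_steps]))
    assert lrs[warmup_steps - 1] == init_lr
    assert all(a >= b for a, b in zip(lrs[warmup_steps:], lrs[warmup_steps + 1:]))
```

The real integration test would drive the OTX engine and read the logged learning rate instead, but the asserted curve shape would be the same.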

Contributor

In addition, while writing this comment, I noticed that the current implementation can fall into the following "what we don't want" case.
[image: LR curves illustrating "what we want" vs. "what we don't want"]
To achieve "what we want", would it be better to implement ReduceLROnPlateauWithWarmup and have it called for both step and epoch modes?
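
For illustration, a rough sketch of what such a ReduceLROnPlateauWithWarmup could look like; the constructor arguments and the per-step() warmup counting are assumptions, not a design agreed in this thread.

```python
# Illustrative sketch only; not the OTX implementation.
from torch.optim.lr_scheduler import ReduceLROnPlateau


class ReduceLROnPlateauWithWarmup(ReduceLROnPlateau):
    """Linear warmup for the first `warmup_steps` calls to step(), plateau-based reduction afterwards."""

    def __init__(self, optimizer, warmup_steps: int, **kwargs):
        super().__init__(optimizer, **kwargs)
        self.warmup_steps = warmup_steps
        self._warmup_count = 0
        # Remember the LR each param group should reach at the end of warmup.
        self._target_lrs = [pg["lr"] for pg in optimizer.param_groups]

    def step(self, metrics):
        if self._warmup_count < self.warmup_steps:
            self._warmup_count += 1
            scale = self._warmup_count / self.warmup_steps
            for pg, target in zip(self.optimizer.param_groups, self._target_lrs):
                pg["lr"] = scale * target
            return
        # After warmup, defer to the usual ReduceLROnPlateau behaviour.
        super().step(metrics)
```

Depending on whether it is stepped per iteration or per epoch, warmup_steps would be counted in that same unit.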

Contributor Author

Done, fa86c3b
Added a test: 65a7f91

@sungmanc (Contributor Author) commented Mar 7, 2024:

The warmup settings seem to be different on models, is this by design? Could you please double check?

I just followed the settings from OTX 1.x, and I don't think we should use the same value for all models since their characteristics can vary.

@harimkang (Contributor) replied:

The warmup settings seem to be different on models, is this by design? Could you please double check?

I just followed the settings from OTX 1.x, and I don't think we should use the same value for all models since their characteristics can vary.

Yes, if it's the same as the setting in 1.x, you can ignore it. :)

@sungmanc (Contributor Author) commented Mar 7, 2024:

I also manually checked that the warmup works well.
[image: logged learning-rate curve showing the warmup phase]

@sungmanc (Contributor Author) commented Mar 7, 2024:

Closing this PR; #3056 will handle the warmup scheduler issue instead.
