
Issues within the learning rate schedule and optimizer initialization #122

Closed
Adamusen opened this issue Nov 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Adamusen
Contributor

Describe the bug

There are three parameter groups defined in yolo/utils/model_utils.py/create_optimizer: one for biases, one for batch-norm weights, and one for the convolutional weights. In the stable implementation, the learning rate for all three groups is the same during training; they differ only in their weight_decay (only the conv weights have it). In the current implementation, however, the bias learning rate starts high during warm-up and then quickly converges towards zero right after the warm-up epochs, essentially freezing the bias values after a couple of epochs! This issue most likely stems from Line 79 in yolo/utils/model_utils.py/create_optimizer:
optimizer.max_lr = [0.1, 0, 0], where max_lr is initialized differently for the bias parameter group.
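
For context, here is a minimal sketch of how a per-group warm-up driven by such a max_lr list is typically wired up with PyTorch's LambdaLR (an illustration of the mechanism, not the repository's actual code; names are assumptions): each group ramps linearly from max_lr[i] to the base lr from train.yaml and then stays there.

    from torch.optim.lr_scheduler import LambdaLR

    def build_warmup_scheduler(optimizer, base_lr, max_lr, warmup_steps):
        # Group i warms up linearly from max_lr[i] to base_lr, then stays at
        # base_lr (multiplier 1.0). LambdaLR accepts one lambda per param group.
        def make_lambda(start_lr):
            def fn(step):
                if step >= warmup_steps:
                    return 1.0
                t = step / warmup_steps
                # interpolate start_lr -> base_lr, expressed as a multiplier of base_lr
                return (start_lr + t * (base_lr - start_lr)) / base_lr
            return fn
        return LambdaLR(optimizer, lr_lambda=[make_lambda(s) for s in max_lr])

With max_lr = [0.1, 0, 0] and a base lr of 0.01, the bias group would ramp down from 0.1 while the other two groups ramp up from 0, all three meeting at 0.01 after warm-up.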

An additional problem is that the momentum values for the optimizer are hardcoded to 0.8 in the following code snippet (Lines 55-57), instead of using the value found in train.yaml, which is 0.937:

        {"params": bias_params, "momentum": 0.8, "weight_decay": 0},
        {"params": conv_params, "momentum": 0.8},
        {"params": norm_params, "momentum": 0.8, "weight_decay": 0},

To Reproduce

Steps to reproduce the behavior:

  1. Train a network with tensorboard enabled.
  2. Observe the learning rate of each parameter group (e.g., with a logging helper like the sketch below).
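
A minimal logging sketch (assumed names, not the repository's training loop) that writes each parameter group's current learning rate to TensorBoard so the three curves can be compared:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="runs/lr_debug")  # hypothetical log directory

    def log_learning_rates(optimizer, global_step):
        # Group order follows create_optimizer's snippet above: bias, conv, norm.
        names = ["bias", "conv", "norm"]
        for name, group in zip(names, optimizer.param_groups):
            writer.add_scalar(f"LR/{name}", group["lr"], global_step)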

Expected behavior

The bias learning rate should not converge towards zero early in training.

Screenshots

Learning rate schedule for the three parameter groups in the current implementation:
[Screenshot: learning_rates]

Learning rate schedule for the same parameter groups in the "stable" implementation:
[Screenshot: lr-stable]

Proposed solution

  • Set Line 79 to optimizer.max_lr = [0, 0, 0] so that the learning rates are identical across groups.
  • Remove the fixed momentum values from Lines 55-57, so that the optimizer is initialized with the momentum value provided in train.yaml (see the sketch below).
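
A rough sketch of what the combined change could look like (assuming create_optimizer receives an optim_cfg dict carrying the lr / momentum / weight_decay values from train.yaml; the names and signature are illustrative, not the repository's actual code):

    from torch.optim import SGD

    def create_optimizer_sketch(bias_params, conv_params, norm_params, optim_cfg):
        # Per-group momentum overrides removed, so every group inherits the
        # optimizer-level momentum taken from train.yaml (e.g. 0.937).
        param_groups = [
            {"params": bias_params, "weight_decay": 0},
            {"params": conv_params},  # only the conv weights keep weight_decay
            {"params": norm_params, "weight_decay": 0},
        ]
        optimizer = SGD(
            param_groups,
            lr=optim_cfg["lr"],                  # e.g. 0.01
            momentum=optim_cfg["momentum"],      # e.g. 0.937
            weight_decay=optim_cfg["weight_decay"],
        )
        optimizer.max_lr = [0, 0, 0]  # first bullet: identical warm-up behaviour for all groups
        return optimizer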
@Adamusen added the bug ("Something isn't working") label on Nov 11, 2024
@Adamusen changed the title from "Isses within the learning rate schedule and optimizer initialization" to "Issues within the learning rate schedule and optimizer initialization" on Nov 11, 2024
@henrytsui000
Collaborator

Hi

Generally, the warm-up learning rate of the bias term is 0.1 and the others are 0; they all align to 0.01 after the warm-up epochs.
This is my learning rate curve in wandb; it seems to work correctly now. Can you tell me your configuration?

$ python yolo/lazy.py task=train  # the basic configuration
[Screenshot: wandb learning rate curves]

best regards,
HenryTsui

@Adamusen
Contributor Author

Adamusen commented Nov 11, 2024

Hey,

Yes, for the training in the first screenshot I changed the learning rate in train.yaml to "lr: 0.001" from 0.01 and "end_factor: 0.1" from 0.01. Everything else related to the learning rate schedule should be the same.

Edit: Tomorrow I will check in TensorBoard the exact value to which my bias learning rate converged (from the screenshot I can't tell if it's 0.001 or less at this point).

Edit 2: It looks like in the screenshot I had even set the learning rate to 0.0005. I'm sorry for the confusion; I was trying out different values to see if I could get better convergence. I will double-check all of this tomorrow.

@Adamusen
Contributor Author

Hi @henrytsui000 ,

I am sorry, my mistake. The bias learning rate indeed converged to the same value as the remaining two groups; it just looked close to zero compared to the 0.1 start:
[Screenshot: learning rate curves with the bias group converging to the shared value]
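
A minimal plotting sketch (illustrative only, assumed data structure): on a linear axis, a value of 0.01 next to a 0.1 warm-up start looks like zero, while a log scale makes the convergence to the shared value easy to see.

    import matplotlib.pyplot as plt

    def plot_lr_curves(steps, lr_history):
        # lr_history: e.g. {"bias": [...], "conv": [...], "norm": [...]}
        for name, values in lr_history.items():
            plt.plot(steps, values, label=name)
        plt.yscale("log")  # log scale keeps small post-warm-up values visible
        plt.xlabel("step")
        plt.ylabel("learning rate")
        plt.legend()
        plt.show()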

I am closing this issue, as it was a mistake on my part.
