-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tune] Sync experiment-checkpoints more often #30187
[Tune] Sync experiment-checkpoints more often #30187
Conversation
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! couple of comments
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
Other than fixing the CI failures, please also run the |
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@krfricke Release tests are green, test failures are same as on master. |
In order to improve checkpoint consistency, experiment syncing is now forced if a trial has checkpointed more than `num_to_keep` times since last sync. If `num_to_keep` is not set, nothing is changed. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Why are these changes needed?
In order to improve checkpoint consistency, experiment syncing is now forced if a trial has checkpointed more than
num_to_keep
times since last sync. Ifnum_to_keep
is not set, nothing is changed.Related issue number
Closes #30073
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.