-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tune] Fix validation error message and docstring of num_to_keep
in CheckpointConfig
#30782
Conversation
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, though I think it would make more sense to change the actual validation logic here and have a subclass without it for testing. This can get confusing as the error message is now diverged from the logic.
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
… `CheckpointConfig` (ray-project#30782) This PR makes the following fixes: 1. The `CheckpointConfig` docstring should accurately reflect the requirement that `num_to_keep >= 1`. 2. `keep_checkpoints_num` in the error should be `num_to_keep` instead, since the Tune checkpoint manager takes in a `CheckpointConfig` now. Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
… `CheckpointConfig` (ray-project#30782) This PR makes the following fixes: 1. The `CheckpointConfig` docstring should accurately reflect the requirement that `num_to_keep >= 1`. 2. `keep_checkpoints_num` in the error should be `num_to_keep` instead, since the Tune checkpoint manager takes in a `CheckpointConfig` now. Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Justin Yu justinvyu@berkeley.edu
Why are these changes needed?
Background
When reading the
CheckpointConfig
docstring, it seems likenum_to_keep=0
is a valid configuration, where no persistent checkpoints will be kept.However, when setting
num_to_keep=0
in theCheckpointConfig
through either an AIR Trainer or the Tuner, the following error will be thrown:Historically, the old Ray Train allowed
num_to_keep=0
. However, Tune requires at least one persistent checkpoint for fault tolerance, and the Ray Train in AIR now runs as a Tune experiment, sonum_to_keep >= 1
is needed in both cases.Summary
This PR makes the following fixes:
CheckpointConfig
docstring should accurately reflect the requirement thatnum_to_keep >= 1
.keep_checkpoints_num
in the error should benum_to_keep
instead, since the Tune checkpoint manager takes in aCheckpointConfig
now.Related issue number
Clarifies confusion from the documentation as referenced here: #30506
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.