-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Train] LightningTrainer always resumes from the latest AIR checkpoint during restoration. #35617
[Train] LightningTrainer always resumes from the latest AIR checkpoint during restoration. #35617
Conversation
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya Please make sure to also create the cherry pick PR yourself after this is merged! |
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…ng_trainer_restoration_ckpt_path
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, just 1 nit.
thanks.
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…ng_trainer_restoration_ckpt_path
@@ -541,7 +541,15 @@ def _lightning_train_loop_per_worker(config): | |||
|
|||
# Restore from a previously failed run | |||
checkpoint = session.get_checkpoint() | |||
if checkpoint and "ckpt_path" not in trainer_fit_params: | |||
if checkpoint: | |||
logger.info("Restoring LightningTrainer from a failed run. ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually is this always true? What happens if the user passes in an AIR checkpoint?
LightningTrainer(..., resume_from_checkpoint=some_checkpoint)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be a failed run. Updated the log message.
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…t during restoration. (ray-project#35617) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…t during restoration. (ray-project#35617) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…t during restoration. (ray-project#35617) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
If users provide a
ckpt_path
in LightningConfigBuilder.fit_param(), when resuming training from a failed run, LightningTrainer will always resume training fromckpt_path
again, instead of the latest AIR checkpoint. In that case, the training won't proceed after restored from failures.Code:
ray/python/ray/train/lightning/lightning_trainer.py
Lines 543 to 549 in 0fd06ad
We should always restore from the latest checkpoint from
session.get_checkpoint()
, otherwise, the training re-start from the beginning after restoration.Details in #35613.
Related issue number
Close #35613
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.