woshiyyya added the bug and triage labels on May 22, 2023
woshiyyya added the train and P0 labels and removed the triage label on May 22, 2023
What happened + What you expected to happen
If users provide a ckpt_path in LightningConfigBuilder.fit_params(), then when training resumes after a failed run, LightningTrainer always resumes from that ckpt_path again instead of from the latest AIR checkpoint.
Code: ray/python/ray/train/lightning/lightning_trainer.py, lines 543 to 549 in 0fd06ad
Instead, we should always restore from the latest AIR checkpoint whenever one is available from the session, regardless of whether the user provided a ckpt_path.
Versions / Dependencies
master
Reproduction script
Together with a FailureConfig
Issue Severity
High: It blocks me from completing my task.
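The expected behavior described under "What happened" can be sketched as a small precedence rule. This is only an illustration, not the actual LightningTrainer code: resolve_ckpt_path and both of its parameter names are hypothetical, and it assumes the latest AIR checkpoint from the session has already been materialized to a local path (or is None on a first launch).

```python
from typing import Optional


def resolve_ckpt_path(
    latest_air_ckpt_path: Optional[str],
    user_ckpt_path: Optional[str],
) -> Optional[str]:
    """Pick the checkpoint path to pass on to Lightning's fit call.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    # A checkpoint obtained from the AIR session reflects the most recent
    # training progress, so it wins when resuming after a failure.
    if latest_air_ckpt_path is not None:
        return latest_air_ckpt_path
    # No AIR checkpoint yet (first launch): honor the user's ckpt_path,
    # which may itself be None.
    return user_ckpt_path
```

With this ordering, a user-supplied ckpt_path only affects the very first launch; every retry triggered by the FailureConfig picks up from the latest AIR checkpoint instead.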