[Train] LightningTrainer fails to auto-restore if ckpt_path is provided in LightningConfigBuilder.fit_params() #35613

Closed
woshiyyya opened this issue May 22, 2023 · 0 comments · Fixed by #35617
Labels: bug, P0, train

woshiyyya (Member) commented May 22, 2023

What happened + What you expected to happen

If a user provides a ckpt_path in LightningConfigBuilder.fit_params(), then when training resumes after a failed run, LightningTrainer always restarts from that ckpt_path instead of from the latest AIR checkpoint.

Code:

checkpoint = session.get_checkpoint()
if checkpoint and "ckpt_path" not in trainer_fit_params:
    with checkpoint.as_directory() as ckpt_dir:
        trainer_fit_params["ckpt_path"] = f"{ckpt_dir}/{MODEL_KEY}"
        trainer.fit(lightning_module, **trainer_fit_params)
else:
    trainer.fit(lightning_module, **trainer_fit_params)

Instead, whenever the session returns an AIR checkpoint, we should restore from it, regardless of whether the user provided a ckpt_path.
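
To make the intended precedence concrete, here is a minimal sketch of the selection logic in plain Python, with no Ray or Lightning imports. The helper name `resolve_fit_params`, its parameters, and the `"model"` key used in the usage lines are hypothetical stand-ins for illustration. This is not the actual fix, only one way the rule "AIR checkpoint wins over user ckpt_path" might be expressed.

```python
import os
from typing import Any, Dict, Optional


def resolve_fit_params(
    trainer_fit_params: Dict[str, Any],
    air_ckpt_dir: Optional[str],
    model_key: str,
) -> Dict[str, Any]:
    """Return the kwargs to pass to trainer.fit() (hypothetical helper).

    air_ckpt_dir stands in for the directory of the checkpoint returned by
    the session, or None on a fresh run. model_key stands in for MODEL_KEY.
    """
    params = dict(trainer_fit_params)  # copy so the caller's dict is untouched
    if air_ckpt_dir is not None:
        # A restored run: the AIR checkpoint overrides any user ckpt_path.
        params["ckpt_path"] = os.path.join(air_ckpt_dir, model_key)
    return params


user_params = {"ckpt_path": "YOUR_CKPT_PATH"}

# Fresh run: no AIR checkpoint yet, so the user's ckpt_path is used.
first_run = resolve_fit_params(user_params, None, "model")

# Retry after a failure: the AIR checkpoint takes precedence.
resumed_run = resolve_fit_params(user_params, "/tmp/air_ckpt", "model")
```

With this rule, the user-supplied ckpt_path only seeds the first attempt, and every retry continues from the most recent saved state.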

Versions / Dependencies

master

Reproduction script

lightning_config = (
    LightningConfigBuilder()
    .module(
        ...
    )
    .trainer(
        ...
    )
    .fit_params(
        train_dataloaders=train_loader,
        val_dataloaders=val_loader,
        ckpt_path="YOUR_CKPT_PATH"
    )
    .build()
)

Combine this with a FailureConfig so that the run is retried after a failure:

run_config = RunConfig(
    ...,
    failure_config=FailureConfig(max_failures=3), 
)

Issue Severity

High: It blocks me from completing my task.

woshiyyya added the bug and triage labels on May 22, 2023
woshiyyya self-assigned this on May 22, 2023
woshiyyya added the train and P0 labels and removed the triage label on May 22, 2023