
[Air|Tune] CheckpointConfig enable pure memory ckpt #30506

Closed
Qinghao-Hu opened this issue Nov 19, 2022 · 2 comments

Labels
enhancement Request for new feature and/or capability P2 Important issue, but not time-critical tune Tune-related issues

Comments

@Qinghao-Hu
Contributor

Description

Although I set CheckpointConfig(num_to_keep=1) in the Tuner's RunConfig, the checkpoint storage is still far too large because I run a large number of trials. The checkpoints from the many unneeded trials exceed my storage quota.

I tried setting checkpoint_config=CheckpointConfig(num_to_keep=0), but got this error:

  File "/home/a/miniconda3/lib/python3.9/site-packages/ray/tune/execution/checkpoint_manager.py", line 39, in __init__
    raise RuntimeError(
RuntimeError: If checkpointing is enabled, Ray Tune requires `keep_checkpoints_num` to be None or a number greater than 0

However, the CheckpointConfig docstring for the num_to_keep argument says that if this is 0 then no checkpoints will be persisted to disk:

Args:
        num_to_keep: The number of checkpoints to keep
            on disk for this run. If a checkpoint is persisted to disk after
            there are already this many checkpoints, then an existing
            checkpoint will be deleted. If this is ``None`` then checkpoints
            will not be deleted. If this is ``0`` then no checkpoints will be
            persisted to disk.

So, is it possible to disable persistent checkpoint storage and keep checkpoints purely in memory?

Use case

Code

# Imports for the snippet below (the WandbLoggerCallback import path may
# differ depending on the Ray version).
from ray.air import CheckpointConfig, FailureConfig, RunConfig
from ray.air.callbacks.wandb import WandbLoggerCallback
from ray.tune import Tuner, TuneConfig

tuner = Tuner(
    trainer,
    param_space={"train_loop_config": config},
    tune_config=TuneConfig(
        ...
    ),
    run_config=RunConfig(
        name=experiment_name,
        local_dir="../ray_results",
        log_to_file=True,
        stop={"training_iteration": args.max_epoch, "val_acc": args.target_acc},
        checkpoint_config=CheckpointConfig(num_to_keep=0),
        callbacks=[WandbLoggerCallback(api_key_file="~/.wandb/api_key", project=f"{experiment_name}")],
        failure_config=FailureConfig(fail_fast=True, max_failures=0),
    ),
)
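
Since the snippet above depends on project-specific objects (trainer, config, args), here is a minimal, self-contained sketch that reproduces the same RuntimeError using only public ray.air / ray.tune APIs; the trainable and the reported "score" metric are illustrative stand-ins for my actual training loop.

from ray import air, tune
from ray.air import Checkpoint, session


def train_func(config):
    metrics = {"score": config["a"]}
    # Reporting a checkpoint enables checkpointing for the trial.
    session.report(metrics, checkpoint=Checkpoint.from_dict(metrics))


tuner = tune.Tuner(
    train_func,
    param_space={"a": tune.grid_search([0, 1])},
    run_config=air.RunConfig(
        # num_to_keep=0 is rejected with the RuntimeError shown above.
        checkpoint_config=air.CheckpointConfig(num_to_keep=0),
    ),
)
tuner.fit()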
@Qinghao-Hu Qinghao-Hu added the enhancement Request for new feature and/or capability label Nov 19, 2022
@justinvyu justinvyu added tune Tune-related issues P2 Important issue, but not time-critical air labels Nov 22, 2022
@justinvyu
Contributor

Hi @Tonyhao96,

The docstring should say that Tune requires num_to_keep to be greater than or equal to 1 here.

Here's a workaround that you could try for now: use a callback to clean up trial checkpoints after they've finished, based on a threshold on some metric. You could also extend this to use the trials list passed to the callback to see how a certain trial's performance ranks against the others (see the sketch after the example below). See ray.tune.callback for more hooks.

Let me know if this is sufficient to solve the storage quota problem!

from ray import air, tune
from ray.tune.callback import Callback
from ray.air import Checkpoint, session
from ray.air._internal.checkpoint_manager import _TrackedCheckpoint


class DeleteCallback(Callback):
    def on_trial_complete(
        self, iteration: int, trials, trial, **info,
    ):
        last_result = trial.last_result
        # Filter out which checkpoints to delete on trial completion
        if last_result["score"] < 2:
            try:
                checkpoint: _TrackedCheckpoint = trial.checkpoint
                checkpoint.delete()
            except Exception:
                print(f"Unable to delete checkpoint of {trial}")

if __name__ == "__main__":
    def train_func(config):
        metrics = {"score": config["a"]}
        session.report(metrics, checkpoint=Checkpoint.from_dict(metrics))

    tuner = tune.Tuner(
        train_func,
        param_space={"a": tune.grid_search(list(range(4)))},
        run_config=air.RunConfig(
            checkpoint_config=air.CheckpointConfig(num_to_keep=1),
            callbacks=[DeleteCallback()]
        ),
    )
    results = tuner.fit()

    # Make sure to check that checkpoints exist when working with experiment results.
    # These checkpoints were kept (high "score"):
    assert results[2].checkpoint and results[3].checkpoint
    # These checkpoints were deleted (low "score"):
    assert not results[0].checkpoint and not results[1].checkpoint
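
If you want to rank a trial against the others instead of using a fixed threshold, one way is to compare the finished trial's result with the results of all trials passed to the callback. The sketch below uses the same hooks as above; the metric name "score" and the top-k cutoff are illustrative choices, not part of any built-in API.

from ray.tune.callback import Callback


class KeepTopKCallback(Callback):
    """Sketch: keep checkpoints only for the k best trials seen so far."""

    def __init__(self, metric: str = "score", k: int = 2):
        self._metric = metric
        self._k = k

    def on_trial_complete(self, iteration: int, trials, trial, **info):
        # Collect the metric from every trial that has reported a result.
        scores = [
            t.last_result[self._metric]
            for t in trials
            if t.last_result and self._metric in t.last_result
        ]
        top_k = sorted(scores, reverse=True)[: self._k]
        # Delete this trial's checkpoint unless it ranks in the top k.
        if trial.last_result.get(self._metric) not in top_k:
            try:
                trial.checkpoint.delete()
            except Exception:
                print(f"Unable to delete checkpoint of {trial}")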

@justinvyu
Contributor

This is not currently on the roadmap. The error message has already been fixed, so I'm closing this for now. See the workaround above.
