
[Air|Tune] CheckpointConfig enable pure memory ckpt #30506

Closed
Qinghao-Hu opened this issue Nov 19, 2022 · 2 comments

Labels
enhancement Request for new feature and/or capability P2 Important issue, but not time-critical tune Tune-related issues

Comments

@Qinghao-Hu
Contributor

Description

Although I set CheckpointConfig(num_to_keep=1) in the Tuner's RunConfig, the checkpoint storage is still far too large because I run a large number of trials. The checkpoints from the many unneeded trials exceed my storage quota.

I tried setting checkpoint_config=CheckpointConfig(num_to_keep=0), but got this error:

  File "/home/a/miniconda3/lib/python3.9/site-packages/ray/tune/execution/checkpoint_manager.py", line 39, in __init__
    raise RuntimeError(
RuntimeError: If checkpointing is enabled, Ray Tune requires `keep_checkpoints_num` to be None or a number greater than 0

However, the CheckpointConfig docstring for the num_to_keep argument says that if this is 0 then no checkpoints will be persisted to disk:

Args:
        num_to_keep: The number of checkpoints to keep
            on disk for this run. If a checkpoint is persisted to disk after
            there are already this many checkpoints, then an existing
            checkpoint will be deleted. If this is ``None`` then checkpoints
            will not be deleted. If this is ``0`` then no checkpoints will be
            persisted to disk.

So, is it possible to disable persistent checkpoint storage and keep checkpoints purely in memory?

Use case

Code

# Imports for the snippet below (the WandbLoggerCallback import path may
# differ depending on the Ray version).
from ray.air import CheckpointConfig, FailureConfig, RunConfig
from ray.air.callbacks.wandb import WandbLoggerCallback
from ray.tune import Tuner, TuneConfig

tuner = Tuner(
    trainer,
    param_space={"train_loop_config": config},
    tune_config=TuneConfig(
        ...
    ),
    run_config=RunConfig(
        name=experiment_name,
        local_dir="../ray_results",
        log_to_file=True,
        stop={"training_iteration": args.max_epoch, "val_acc": args.target_acc},
        checkpoint_config=CheckpointConfig(num_to_keep=0),
        callbacks=[WandbLoggerCallback(api_key_file="~/.wandb/api_key", project=f"{experiment_name}")],
        failure_config=FailureConfig(fail_fast=True, max_failures=0),
    ),
)
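
Since the snippet above depends on project-specific objects (trainer, config, args), here is a minimal, self-contained sketch that reproduces the same RuntimeError using only public ray.air / ray.tune APIs; the trainable and the reported "score" metric are illustrative stand-ins for my actual training loop.

from ray import air, tune
from ray.air import Checkpoint, session


def train_func(config):
    metrics = {"score": config["a"]}
    # Reporting a checkpoint enables checkpointing for the trial.
    session.report(metrics, checkpoint=Checkpoint.from_dict(metrics))


tuner = tune.Tuner(
    train_func,
    param_space={"a": tune.grid_search([0, 1])},
    run_config=air.RunConfig(
        # num_to_keep=0 is rejected with the RuntimeError shown above.
        checkpoint_config=air.CheckpointConfig(num_to_keep=0),
    ),
)
tuner.fit()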
@Qinghao-Hu Qinghao-Hu added the enhancement Request for new feature and/or capability label Nov 19, 2022
@justinvyu justinvyu added tune Tune-related issues P2 Important issue, but not time-critical air labels Nov 22, 2022
@justinvyu
Contributor

Hi @Tonyhao96,

The docstring should say that Tune requires num_to_keep to be greater than or equal to 1 here.

Here's a workaround that you could try for now: use a callback to clean up trial checkpoints after they've finished, based on a threshold on some metric. You could also extend this to use the trials list passed to the callback to see how a certain trial's performance ranks against the others (see the sketch after the example below). See ray.tune.callback for more hooks.

Let me know if this is sufficient to solve the storage quota problem!

from ray import air, tune
from ray.tune.callback import Callback
from ray.air import Checkpoint, session
from ray.air._internal.checkpoint_manager import _TrackedCheckpoint


class DeleteCallback(Callback):
    def on_trial_complete(
        self, iteration: int, trials, trial, **info,
    ):
        last_result = trial.last_result
        # Filter out which checkpoints to delete on trial completion
        if last_result["score"] < 2:
            try:
                checkpoint: _TrackedCheckpoint = trial.checkpoint
                checkpoint.delete()
            except Exception:
                print(f"Unable to delete checkpoint of {trial}")

if __name__ == "__main__":
    def train_func(config):
        metrics = {"score": config["a"]}
        session.report(metrics, checkpoint=Checkpoint.from_dict(metrics))

    tuner = tune.Tuner(
        train_func,
        param_space={"a": tune.grid_search(list(range(4)))},
        run_config=air.RunConfig(
            checkpoint_config=air.CheckpointConfig(num_to_keep=1),
            callbacks=[DeleteCallback()]
        ),
    )
    results = tuner.fit()

    # Make sure to check that checkpoints exist when working with experiment results.
    # These checkpoints were kept (high "score"):
    assert results[2].checkpoint and results[3].checkpoint
    # These checkpoints were deleted (low "score"):
    assert not results[0].checkpoint and not results[1].checkpoint
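
If you want to rank a trial against the others instead of using a fixed threshold, one way is to compare the finished trial's result with the results of all trials passed to the callback. The sketch below uses the same hooks as above; the metric name "score" and the top-k cutoff are illustrative choices, not part of any built-in API.

from ray.tune.callback import Callback


class KeepTopKCallback(Callback):
    """Sketch: keep checkpoints only for the k best trials seen so far."""

    def __init__(self, metric: str = "score", k: int = 2):
        self._metric = metric
        self._k = k

    def on_trial_complete(self, iteration: int, trials, trial, **info):
        # Collect the metric from every trial that has reported a result.
        scores = [
            t.last_result[self._metric]
            for t in trials
            if t.last_result and self._metric in t.last_result
        ]
        top_k = sorted(scores, reverse=True)[: self._k]
        # Delete this trial's checkpoint unless it ranks in the top k.
        if trial.last_result.get(self._metric) not in top_k:
            try:
                trial.checkpoint.delete()
            except Exception:
                print(f"Unable to delete checkpoint of {trial}")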

@justinvyu
Contributor

This is not currently on the roadmap. The error message has already been fixed, so I'm closing this for now. See the workaround above.
