[AIR] ResourceChangingScheduler causes tuning to hang with a large trial number #30265

Closed
Yard1 opened this issue Nov 14, 2022 · 7 comments · Fixed by #30304
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), ray-team-created (Ray Team created), tune (Tune-related issues)

Comments

Yard1 (Member) commented Nov 14, 2022

    Besides, another error (which is not related to `enable_reproducibility`) occurs when I set a large trial number (e.g., 200): two trials are paused after one epoch and no trial continues:
(ResourceChangingScheduler) Using FIFO scheduling algorithm.
Resources requested: 27.0/64 CPUs, 3.0/4 GPUs, 0.0/60.17 GiB heap, 0.0/29.78 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 4e458_00003 with val_acc=0.5092 and parameters={'train_loop_config': {'lr': 0.0551, 'momentum': 0.602, 'batch_size': 512, 'gamma': 0.21, 'model': 'resnet18', 'dataset': 'cifar10', 'seed': 1, 'amp': False}}
Number of trials: 75/200 (2 PAUSED, 70 PENDING, 3 RUNNING)
+--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------+
| Trial name               | status   | loc                  |   train_loop_config/ba |   train_loop_config/ga |   train_loop_config/lr |   train_loop_config/mo |   iter |   total time (s) |    loss |   val_acc |   _timestamp |
|                          |          |                      |               tch_size |                    mma |                        |                 mentum |        |                  |         |           |              |
|--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------|
| TorchTrainer_4e458_00001 | RUNNING  | 10.100.73.27:4068812 |                    128 |                   0.36 |                 0.0004 |                  0.546 |        |                  |         |           |              |
| TorchTrainer_4e458_00002 | RUNNING  | 10.100.73.27:4068814 |                    128 |                   0.38 |                 0.0478 |                  0.967 |        |                  |         |           |              |
| TorchTrainer_4e458_00004 | RUNNING  | 10.100.73.27:4072302 |                    256 |                   0.39 |                 0.0137 |                  0.956 |        |                  |         |           |              |
| TorchTrainer_4e458_00000 | PAUSED   | 10.100.73.27:4068581 |                    256 |                   0.28 |                 0.0047 |                  0.859 |      1 |          12.605  | 1.37914 |    0.5059 |   1668417525 |
| TorchTrainer_4e458_00003 | PAUSED   | 10.100.73.27:4068816 |                    512 |                   0.21 |                 0.0551 |                  0.602 |      1 |          15.3606 | 1.41452 |    0.5092 |   1668417530 |
| TorchTrainer_4e458_00005 | PENDING  |                      |                    256 |                   0.87 |                 0.5708 |                  0.888 |        |                  |         |           |              |
| TorchTrainer_4e458_00006 | PENDING  |                      |                    256 |                   0.78 |                 0.0018 |                  0.845 |        |                  |         |           |              |
| TorchTrainer_4e458_00007 | PENDING  |                      |                    512 |                   0.79 |                 0.2073 |                  0.914 |        |                  |         |           |              |
| TorchTrainer_4e458_00008 | PENDING  |                      |                    256 |                   0.38 |                 0.0002 |                  0.71  |        |                  |         |           |              |
| TorchTrainer_4e458_00009 | PENDING  |                      |                    128 |                   0.71 |                 0.0006 |                  0.645 |        |                  |         |           |              |

An error is raised after a while.

2022-11-14 17:21:03,370 ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_4e458_00000, TorchTrainer_4e458_00001, TorchTrainer_4e458_00002, TorchTrainer_4e458_00003, TorchTrainer_4e458_00004, TorchTrainer_4e458_00005, TorchTrainer_4e458_00006, TorchTrainer_4e458_00007, TorchTrainer_4e458_00008, TorchTrainer_4e458_00009, TorchTrainer_4e458_00010, TorchTrainer_4e458_00011, TorchTrainer_4e458_00012, TorchTrainer_4e458_00013, TorchTrainer_4e458_00014, TorchTrainer_4e458_00015, TorchTrainer_4e458_00016, TorchTrainer_4e458_00017, TorchTrainer_4e458_00018, TorchTrainer_4e458_00019, TorchTrainer_4e458_00020, TorchTrainer_4e458_00021, TorchTrainer_4e458_00022, TorchTrainer_4e458_00023, TorchTrainer_4e458_00024, TorchTrainer_4e458_00025, TorchTrainer_4e458_00026, TorchTrainer_4e458_00027, TorchTrainer_4e458_00028, TorchTrainer_4e458_00029, TorchTrainer_4e458_00030, TorchTrainer_4e458_00031, TorchTrainer_4e458_00032, TorchTrainer_4e458_00033, TorchTrainer_4e458_00034, TorchTrainer_4e458_00035, TorchTrainer_4e458_00036, TorchTrainer_4e458_00037, TorchTrainer_4e458_00038, TorchTrainer_4e458_00039, TorchTrainer_4e458_00040, TorchTrainer_4e458_00041, TorchTrainer_4e458_00042, TorchTrainer_4e458_00043, TorchTrainer_4e458_00044, TorchTrainer_4e458_00045, TorchTrainer_4e458_00046, TorchTrainer_4e458_00047, TorchTrainer_4e458_00048, TorchTrainer_4e458_00049, TorchTrainer_4e458_00050, TorchTrainer_4e458_00051, TorchTrainer_4e458_00052, TorchTrainer_4e458_00053, TorchTrainer_4e458_00054, TorchTrainer_4e458_00055, TorchTrainer_4e458_00056, TorchTrainer_4e458_00057, TorchTrainer_4e458_00058, TorchTrainer_4e458_00059, TorchTrainer_4e458_00060, TorchTrainer_4e458_00061, TorchTrainer_4e458_00062, TorchTrainer_4e458_00063, TorchTrainer_4e458_00064, TorchTrainer_4e458_00065, TorchTrainer_4e458_00066, TorchTrainer_4e458_00067, TorchTrainer_4e458_00068, TorchTrainer_4e458_00069, TorchTrainer_4e458_00070, TorchTrainer_4e458_00071, TorchTrainer_4e458_00072, TorchTrainer_4e458_00073, TorchTrainer_4e458_00074]
2022-11-14 17:21:03,371 INFO tune.py:777 -- Total run time: 152.43 seconds (152.00 seconds for the tuning loop).
2022-11-14 17:21:03,371 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
Result(metrics={'loss': 1.4145198225975038, 'val_acc': 0.5092, '_timestamp': 1668417530, '_time_this_iter_s': 12.510530948638916, '_training_iteration': 1, 'should_checkpoint': True, 'done': False, 'trial_id': '4e458_00003', 'experiment_tag': '3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020'}, error=None, log_dir=PosixPath('/home/qhhu/workdir/HPO/hydro/ray_results/resnet18_cifar10_s200_e100_fifo_seed1_ela/TorchTrainer_4e458_00003_3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020_2022-11-14_17-18-33'))
2022-11-14 17:21:03,527 WARNING experiment_analysis.py:542 -- Couldn't read config from 70 paths

Originally posted by @Tonyhao96 in #30247 (comment)

@Yard1 self-assigned this on Nov 14, 2022
@Yard1 added the bug, tune, P2, and air labels on Nov 14, 2022
Yard1 (Member, Author) commented Nov 14, 2022

@Tonyhao96 I tried to quickly reproduce this with the AIR example you linked in the first issue, and it worked fine for me. Could you share a full script that reproduces the problem for you, along with your cluster setup?

Qinghao-Hu (Contributor) commented Nov 15, 2022

import os
import time
import argparse
from pathlib import Path
from filelock import FileLock
# import models.cifar as Cifar  # unused; local module not needed to reproduce the issue

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torchvision.models import resnet18


import ray
from ray import tune
from ray.train.torch import TorchTrainer, TorchCheckpoint
import ray.train.torch as ht
from ray.air import session
from ray.air.config import FailureConfig, RunConfig, ScalingConfig, CheckpointConfig
from ray.tune.schedulers import ASHAScheduler, FIFOScheduler
from ray.tune.schedulers.resource_changing_scheduler import (
    ResourceChangingScheduler,
    DistributeResources,
    DistributeResourcesToTopJob,
)
from ray.tune.tuner import Tuner
from ray.tune.tune_config import TuneConfig


SEARCH_SPACE = {
    "lr": tune.qloguniform(1e-4, 1, 1e-4),
    "momentum": tune.quniform(0.5, 0.999, 0.001),
    "batch_size": tune.choice([128, 256, 512]),
    "gamma": tune.quniform(0.01, 0.9, 0.01),
}


def get_datasets(dataset):
    """Data loader for Cifar10/100 & Imagenet"""
    if dataset == "cifar10":
        normalize = transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2470, 0.2435, 0.2616])
        with FileLock(Path("~/data/data.lock").expanduser()):
            train_dataset = datasets.CIFAR10(
                root="~/data",
                train=True,
                download=True,
                transform=transforms.Compose(
                    [transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), normalize]
                ),
            )
            val_dataset = datasets.CIFAR10(
                root="~/data", train=False, download=False, transform=transforms.Compose([transforms.ToTensor(), normalize])
            )
    return train_dataset, val_dataset


def train_epoch(dataloader, model, loss_fn, optimizer, fusion_num):
    size = len(dataloader.dataset) // session.get_world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        # loss.backward()
        ht.backward(loss)  # For AMP support
        optimizer.step()


def validate_epoch(dataloader, model, loss_fn, fusion_num):
    size = len(dataloader.dataset) // session.get_world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    return {"loss": test_loss, "val_acc": correct}


def train_func(config):
    ht.accelerate(amp=config["amp"])  # For AMP support
    ht.enable_reproducibility(seed=config["seed"])
    fusion_num = config.get("FUSION_N", -1)

    dataset = config.get("dataset")
    if dataset == "imagenet":
        model = torchvision.models.__dict__[config.get("model")]()
    elif dataset in ("cifar10", "cifar100"):
        model = resnet18()
    model = ht.prepare_model(model)

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=config.get("lr", 0.01),
        momentum=config.get("momentum", 0.9),
        weight_decay=config.get("weight_decay", 0.001),
    )
    optimizer = ht.prepare_optimizer(optimizer)

    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=config.get("gamma", 0.2))

    worker_batch_size = config["batch_size"] // session.get_world_size()
    train_set, val_set = get_datasets(dataset)
    train_loader = DataLoader(train_set, batch_size=worker_batch_size, num_workers=8, pin_memory=True, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=worker_batch_size, num_workers=8, pin_memory=True)
    train_loader = ht.prepare_data_loader(train_loader)
    val_loader = ht.prepare_data_loader(val_loader)

    # Create loss.
    criterion = nn.CrossEntropyLoss()

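    # Train for an effectively unbounded number of epochs; Tune's stop criteria
    # (training_iteration / val_acc in the RunConfig below) end the trial.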
    for _ in range(10000):
        train_epoch(train_loader, model, criterion, optimizer, fusion_num)
        result = validate_epoch(val_loader, model, criterion, fusion_num)
        lr_scheduler.step()

        session.report(result, checkpoint=TorchCheckpoint.from_model(model))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--address", default="auto", type=str, help="the address to use for Redis")
    parser.add_argument("--model", type=str, default="resnet18", help="Model to use")
    parser.add_argument("--dataset", type=str, default="cifar10", help="Dataset to use")
    parser.add_argument("--scheduler", default="fifo", choices=["asha", "fifo"], type=str, help="Scheduler Algorithm")
    parser.add_argument("--max-epoch", default=100, type=int, help="Max Epochs")
    parser.add_argument("--max-sample", default=200, type=int, help="Max Samples")
    parser.add_argument("--max-time", default=-1, type=int, help="Max Time (s), -1 for no limit")
    parser.add_argument("--target-acc", default=0.96, type=float, help="Target Validation Accuracy")
    parser.add_argument("--amp", default=False, type=bool, help="Whether enable AMP")
    parser.add_argument("--mps", default=1, type=int, help="Whether enable MPS for GPU Sharing")
    parser.add_argument("--seed", default=1, type=int, help="Fix Random Seed for Reproducing")
    parser.add_argument("--addition-str", default="", type=str, help="Additional String for experiment name")

    ##### ASHA Parameters
    parser.add_argument("--grace", default=3, type=int, help="grace_period")
    parser.add_argument("--reduction", default=3, type=int, help="reduction_factor")
    parser.add_argument("--brackets", default=1, type=int, help="brackets")

    args, _ = parser.parse_known_args()

    # ray.init(address=args.address)
    ray.init(address=None)
    config = SEARCH_SPACE | {
        "model": args.model,
        "dataset": args.dataset,
        "seed": args.seed,
        "amp": args.amp,
    }

    trainer = TorchTrainer(
        train_func,
        train_loop_config=config,
        scaling_config=ScalingConfig(
            num_workers=1,
            use_gpu=True,
            resources_per_worker={"CPU": 8 / args.mps, "GPU": 1 / args.mps},
            _max_cpu_fraction_per_node=0.9,
        ),
    )

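    # Wrap the base FIFO scheduler in a ResourceChangingScheduler so that idle
    # cluster resources are redistributed among running trials over time.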
    tune_scheduler = FIFOScheduler()
    tune_scheduler = ResourceChangingScheduler(
        base_scheduler=tune_scheduler,
        resources_allocation_function=DistributeResources(add_bundles=True),  # default
    )

    experiment_name = f"{args.model}_{args.dataset}_s{args.max_sample}_e{args.max_epoch}"

    tuner = Tuner(
        trainer,
        param_space={"train_loop_config": config},
        tune_config=TuneConfig(
            num_samples=args.max_sample,
            metric="val_acc",
            mode="max",
            scheduler=tune_scheduler,
            time_budget_s=args.max_time if args.max_time > 0 else None,
        ),
        run_config=RunConfig(
            name=experiment_name,
            local_dir="../ray_results",
            log_to_file=True,
            stop={"training_iteration": args.max_epoch, "val_acc": args.target_acc},
            checkpoint_config=CheckpointConfig(num_to_keep=1),
            # callbacks=[WandbLoggerCallback(api_key_file="~/.wandb/api_key", project=f"{experiment_name}")],
            failure_config=FailureConfig(fail_fast=True, max_failures=0),
        ),
    )

    results = tuner.fit()
    print(results.get_best_result(metric="val_acc", mode="max"))
    df = results.get_dataframe()
    df.to_csv(f"../ray_results/{experiment_name}.csv")

    time.sleep(5)
    os.system("ray stop --force")

Qinghao-Hu (Contributor) commented:
Thanks for your reply.

Above is an example script that reproduces this issue, and the screenshot below shows it running on my local server with 4x 3090 GPUs. I also tested it on a Slurm cluster with 8x A100 GPUs. After a while, no trial is actually running.

Environment: Ray 2.1, PyTorch 1.13, Python 3.9.

[Screenshot from 2022-11-15 10-49-12]

Yard1 (Member, Author) commented Nov 15, 2022

Hey, thanks! I can reproduce the behavior with the script you provided, and I believe I have identified the issue. Can you check whether removing the `_max_cpu_fraction_per_node` argument helps?
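For reference, that amounts to dropping the argument from the ScalingConfig in your script, along these lines (a minimal sketch, everything else unchanged):

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"CPU": 8 / args.mps, "GPU": 1 / args.mps},
        # _max_cpu_fraction_per_node=0.9,  # removed for this check
    ),
)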

Qinghao-Hu (Contributor) commented:
Thank you very much. Removing the `_max_cpu_fraction_per_node` argument works.

Qinghao-Hu (Contributor) commented:
I just did a quick check by changing max-epoch=10 and max-sample=6 on a 4-GPU server, and found another strange issue when finishing the final trial.

Number of trials: 6/6 (1 ERROR, 5 TERMINATED)
+--------------------------+------------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+----------+-----------+--------------+
| Trial name               | status     | loc                  |   train_loop_config/ba |   train_loop_config/ga |   train_loop_config/lr |   train_loop_config/mo |   iter |   total time (s) |     loss |   val_acc |   _timestamp |
|                          |            |                      |               tch_size |                    mma |                        |                 mentum |        |                  |          |           |              |
|--------------------------+------------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+----------+-----------+--------------|
| TorchTrainer_38651_00000 | TERMINATED | 10.100.77.179:343446 |                    256 |                   0.28 |                 0.0047 |                  0.859 |     10 |         100.938  | 0.594755 |    0.8013 |   1668568759 |
| TorchTrainer_38651_00001 | TERMINATED | 10.100.77.179:343499 |                    128 |                   0.36 |                 0.0004 |                  0.546 |     10 |         172.939  | 1.0751   |    0.6165 |   1668568834 |
| TorchTrainer_38651_00002 | TERMINATED | 10.100.77.179:343501 |                    128 |                   0.38 |                 0.0478 |                  0.967 |     10 |         166.395  | 0.900563 |    0.6938 |   1668568827 |
| TorchTrainer_38651_00003 | TERMINATED | 10.100.77.179:343503 |                    512 |                   0.21 |                 0.0551 |                  0.602 |     10 |          67.7346 | 0.641064 |    0.7892 |   1668568729 |
| TorchTrainer_38651_00004 | TERMINATED | 10.100.77.179:351670 |                    256 |                   0.39 |                 0.0137 |                  0.956 |     10 |          95.2324 | 0.612997 |    0.7935 |   1668568828 |
| TorchTrainer_38651_00005 | ERROR      | 10.100.77.179:354571 |                    256 |                   0.87 |                 0.5708 |                  0.888 |      7 |          70.6566 | 2.14731  |    0.4166 |   1668568833 |
+--------------------------+------------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+----------+-----------+--------------+
Failure # 1 (occurred at 2022-11-16_11-13-31)
ray::_Inner.train() (pid=323188, ip=10.100.77.179, repr=TorchTrainer)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=323486, ip=10.100.77.179, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f52e012cf10>)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 125, in train_func
    train_epoch(train_loader, model, criterion, optimizer, fusion_num)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 40, in train_epoch
    for batch, (X, y) in enumerate(dataloader):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 641, in __iter__
    self._prefetch_next_batch()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 636, in _prefetch_next_batch
    next_batch = next(self.dataloader_iter, None)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 244, in _worker_loop
    init_fn(worker_id)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 433, in wrapper
    worker_init_fn(worker_id)
TypeError: 'NoneType' object is not callable

Yard1 (Member, Author) commented Nov 16, 2022

This seems to be the same issue as in #30247, which should be fixed by #30266.

As a workaround, you can specify a dummy `worker_init_fn` inside the `DataLoader`.
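A minimal sketch of that workaround, based on the DataLoader calls in your script (the no-op helper name is just illustrative):

def noop_worker_init_fn(worker_id: int) -> None:
    # Passing any callable here avoids the "'NoneType' object is not callable"
    # error raised from Ray's worker_init_fn wrapper in the traceback above.
    pass

train_loader = DataLoader(
    train_set,
    batch_size=worker_batch_size,
    num_workers=8,
    pin_memory=True,
    shuffle=True,
    worker_init_fn=noop_worker_init_fn,
)
val_loader = DataLoader(
    val_set,
    batch_size=worker_batch_size,
    num_workers=8,
    pin_memory=True,
    worker_init_fn=noop_worker_init_fn,
)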

@richardliaw added the ray-team-created label on Dec 21, 2022