[AIR] ResourceChangingScheduler causes tuning to hang with a large trial number #30265

Closed
Yard1 opened this issue Nov 14, 2022 · 7 comments · Fixed by #30304
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), ray-team-created (Ray Team created), tune (Tune-related issues)

Comments

Yard1 (Member) commented Nov 14, 2022

    Besides, another error (which is not related to `enable_reproducibility`) occurs when I set a large trial number (e.g., 200): two trials are paused after one epoch and no trial continues:
(ResourceChangingScheduler) Using FIFO scheduling algorithm.
Resources requested: 27.0/64 CPUs, 3.0/4 GPUs, 0.0/60.17 GiB heap, 0.0/29.78 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 4e458_00003 with val_acc=0.5092 and parameters={'train_loop_config': {'lr': 0.0551, 'momentum': 0.602, 'batch_size': 512, 'gamma': 0.21, 'model': 'resnet18', 'dataset': 'cifar10', 'seed': 1, 'amp': False}}
Number of trials: 75/200 (2 PAUSED, 70 PENDING, 3 RUNNING)
+--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------+
| Trial name               | status   | loc                  |   train_loop_config/ba |   train_loop_config/ga |   train_loop_config/lr |   train_loop_config/mo |   iter |   total time (s) |    loss |   val_acc |   _timestamp |
|                          |          |                      |               tch_size |                    mma |                        |                 mentum |        |                  |         |           |              |
|--------------------------+----------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+-----------+--------------|
| TorchTrainer_4e458_00001 | RUNNING  | 10.100.73.27:4068812 |                    128 |                   0.36 |                 0.0004 |                  0.546 |        |                  |         |           |              |
| TorchTrainer_4e458_00002 | RUNNING  | 10.100.73.27:4068814 |                    128 |                   0.38 |                 0.0478 |                  0.967 |        |                  |         |           |              |
| TorchTrainer_4e458_00004 | RUNNING  | 10.100.73.27:4072302 |                    256 |                   0.39 |                 0.0137 |                  0.956 |        |                  |         |           |              |
| TorchTrainer_4e458_00000 | PAUSED   | 10.100.73.27:4068581 |                    256 |                   0.28 |                 0.0047 |                  0.859 |      1 |          12.605  | 1.37914 |    0.5059 |   1668417525 |
| TorchTrainer_4e458_00003 | PAUSED   | 10.100.73.27:4068816 |                    512 |                   0.21 |                 0.0551 |                  0.602 |      1 |          15.3606 | 1.41452 |    0.5092 |   1668417530 |
| TorchTrainer_4e458_00005 | PENDING  |                      |                    256 |                   0.87 |                 0.5708 |                  0.888 |        |                  |         |           |              |
| TorchTrainer_4e458_00006 | PENDING  |                      |                    256 |                   0.78 |                 0.0018 |                  0.845 |        |                  |         |           |              |
| TorchTrainer_4e458_00007 | PENDING  |                      |                    512 |                   0.79 |                 0.2073 |                  0.914 |        |                  |         |           |              |
| TorchTrainer_4e458_00008 | PENDING  |                      |                    256 |                   0.38 |                 0.0002 |                  0.71  |        |                  |         |           |              |
| TorchTrainer_4e458_00009 | PENDING  |                      |                    128 |                   0.71 |                 0.0006 |                  0.645 |        |                  |         |           |              |

An error is raised after a while.

2022-11-14 17:21:03,370 ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_4e458_00000, TorchTrainer_4e458_00001, TorchTrainer_4e458_00002, TorchTrainer_4e458_00003, TorchTrainer_4e458_00004, TorchTrainer_4e458_00005, TorchTrainer_4e458_00006, TorchTrainer_4e458_00007, TorchTrainer_4e458_00008, TorchTrainer_4e458_00009, TorchTrainer_4e458_00010, TorchTrainer_4e458_00011, TorchTrainer_4e458_00012, TorchTrainer_4e458_00013, TorchTrainer_4e458_00014, TorchTrainer_4e458_00015, TorchTrainer_4e458_00016, TorchTrainer_4e458_00017, TorchTrainer_4e458_00018, TorchTrainer_4e458_00019, TorchTrainer_4e458_00020, TorchTrainer_4e458_00021, TorchTrainer_4e458_00022, TorchTrainer_4e458_00023, TorchTrainer_4e458_00024, TorchTrainer_4e458_00025, TorchTrainer_4e458_00026, TorchTrainer_4e458_00027, TorchTrainer_4e458_00028, TorchTrainer_4e458_00029, TorchTrainer_4e458_00030, TorchTrainer_4e458_00031, TorchTrainer_4e458_00032, TorchTrainer_4e458_00033, TorchTrainer_4e458_00034, TorchTrainer_4e458_00035, TorchTrainer_4e458_00036, TorchTrainer_4e458_00037, TorchTrainer_4e458_00038, TorchTrainer_4e458_00039, TorchTrainer_4e458_00040, TorchTrainer_4e458_00041, TorchTrainer_4e458_00042, TorchTrainer_4e458_00043, TorchTrainer_4e458_00044, TorchTrainer_4e458_00045, TorchTrainer_4e458_00046, TorchTrainer_4e458_00047, TorchTrainer_4e458_00048, TorchTrainer_4e458_00049, TorchTrainer_4e458_00050, TorchTrainer_4e458_00051, TorchTrainer_4e458_00052, TorchTrainer_4e458_00053, TorchTrainer_4e458_00054, TorchTrainer_4e458_00055, TorchTrainer_4e458_00056, TorchTrainer_4e458_00057, TorchTrainer_4e458_00058, TorchTrainer_4e458_00059, TorchTrainer_4e458_00060, TorchTrainer_4e458_00061, TorchTrainer_4e458_00062, TorchTrainer_4e458_00063, TorchTrainer_4e458_00064, TorchTrainer_4e458_00065, TorchTrainer_4e458_00066, TorchTrainer_4e458_00067, TorchTrainer_4e458_00068, TorchTrainer_4e458_00069, TorchTrainer_4e458_00070, TorchTrainer_4e458_00071, TorchTrainer_4e458_00072, TorchTrainer_4e458_00073, TorchTrainer_4e458_00074]
2022-11-14 17:21:03,371 INFO tune.py:777 -- Total run time: 152.43 seconds (152.00 seconds for the tuning loop).
2022-11-14 17:21:03,371 WARNING tune.py:783 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`
Result(metrics={'loss': 1.4145198225975038, 'val_acc': 0.5092, '_timestamp': 1668417530, '_time_this_iter_s': 12.510530948638916, '_training_iteration': 1, 'should_checkpoint': True, 'done': False, 'trial_id': '4e458_00003', 'experiment_tag': '3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020'}, error=None, log_dir=PosixPath('/home/qhhu/workdir/HPO/hydro/ray_results/resnet18_cifar10_s200_e100_fifo_seed1_ela/TorchTrainer_4e458_00003_3_batch_size=512,gamma=0.2100,lr=0.0551,momentum=0.6020_2022-11-14_17-18-33'))
2022-11-14 17:21:03,527 WARNING experiment_analysis.py:542 -- Couldn't read config from 70 paths

Originally posted by @Tonyhao96 in #30247 (comment)

@Yard1 self-assigned this on Nov 14, 2022
@Yard1 added the bug, tune, P2, and air labels on Nov 14, 2022
Yard1 (Member, Author) commented Nov 14, 2022

@Tonyhao96 I tried to quickly reproduce this with the AIR example you linked in the first issue, and it worked fine for me. Could you share a full script that reproduces the problem for you, along with your cluster setup?

Qinghao-Hu (Contributor) commented Nov 15, 2022

import os
import time
import argparse
from pathlib import Path
from filelock import FileLock
# import models.cifar as Cifar  # unused; local module not needed to reproduce the issue

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torchvision.models import resnet18


import ray
from ray import tune
from ray.train.torch import TorchTrainer, TorchCheckpoint
import ray.train.torch as ht
from ray.air import session
from ray.air.config import FailureConfig, RunConfig, ScalingConfig, CheckpointConfig
from ray.tune.schedulers import ASHAScheduler, FIFOScheduler
from ray.tune.schedulers.resource_changing_scheduler import (
    ResourceChangingScheduler,
    DistributeResources,
    DistributeResourcesToTopJob,
)
from ray.tune.tuner import Tuner
from ray.tune.tune_config import TuneConfig


SEARCH_SPACE = {
    "lr": tune.qloguniform(1e-4, 1, 1e-4),
    "momentum": tune.quniform(0.5, 0.999, 0.001),
    "batch_size": tune.choice([128, 256, 512]),
    "gamma": tune.quniform(0.01, 0.9, 0.01),
}


def get_datasets(dataset):
    """Data loader for Cifar10/100 & Imagenet"""
    if dataset == "cifar10":
        normalize = transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2470, 0.2435, 0.2616])
        with FileLock(Path("~/data/data.lock").expanduser()):
            train_dataset = datasets.CIFAR10(
                root="~/data",
                train=True,
                download=True,
                transform=transforms.Compose(
                    [transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), normalize]
                ),
            )
            val_dataset = datasets.CIFAR10(
                root="~/data", train=False, download=False, transform=transforms.Compose([transforms.ToTensor(), normalize])
            )
    return train_dataset, val_dataset


def train_epoch(dataloader, model, loss_fn, optimizer, fusion_num):
    size = len(dataloader.dataset) // session.get_world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        # loss.backward()
        ht.backward(loss)  # For AMP support
        optimizer.step()


def validate_epoch(dataloader, model, loss_fn, fusion_num):
    size = len(dataloader.dataset) // session.get_world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    return {"loss": test_loss, "val_acc": correct}


def train_func(config):
    ht.accelerate(amp=config["amp"])  # For AMP support
    ht.enable_reproducibility(seed=config["seed"])
    fusion_num = config.get("FUSION_N", -1)

    dataset = config.get("dataset")
    if dataset == "imagenet":
        model = torchvision.models.__dict__[config.get("model")]()
    elif dataset in ("cifar10", "cifar100"):
        model = resnet18()
    model = ht.prepare_model(model)

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=config.get("lr", 0.01),
        momentum=config.get("momentum", 0.9),
        weight_decay=config.get("weight_decay", 0.001),
    )
    optimizer = ht.prepare_optimizer(optimizer)

    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=config.get("gamma", 0.2))

    worker_batch_size = config["batch_size"] // session.get_world_size()
    train_set, val_set = get_datasets(dataset)
    train_loader = DataLoader(train_set, batch_size=worker_batch_size, num_workers=8, pin_memory=True, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=worker_batch_size, num_workers=8, pin_memory=True)
    train_loader = ht.prepare_data_loader(train_loader)
    val_loader = ht.prepare_data_loader(val_loader)

    # Create loss.
    criterion = nn.CrossEntropyLoss()

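    # Train for an effectively unbounded number of epochs; Tune's stop criteria
    # (training_iteration / val_acc in the RunConfig below) end the trial.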
    for _ in range(10000):
        train_epoch(train_loader, model, criterion, optimizer, fusion_num)
        result = validate_epoch(val_loader, model, criterion, fusion_num)
        lr_scheduler.step()

        session.report(result, checkpoint=TorchCheckpoint.from_model(model))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--address", default="auto", type=str, help="the address to use for Redis")
    parser.add_argument("--model", type=str, default="resnet18", help="Model to use")
    parser.add_argument("--dataset", type=str, default="cifar10", help="Dataset to use")
    parser.add_argument("--scheduler", default="fifo", choices=["asha", "fifo"], type=str, help="Scheduler Algorithm")
    parser.add_argument("--max-epoch", default=100, type=int, help="Max Epochs")
    parser.add_argument("--max-sample", default=200, type=int, help="Max Samples")
    parser.add_argument("--max-time", default=-1, type=int, help="Max Time (s), -1 for no limit")
    parser.add_argument("--target-acc", default=0.96, type=float, help="Target Validation Accuracy")
    parser.add_argument("--amp", default=False, type=bool, help="Whether enable AMP")
    parser.add_argument("--mps", default=1, type=int, help="Whether enable MPS for GPU Sharing")
    parser.add_argument("--seed", default=1, type=int, help="Fix Random Seed for Reproducing")
    parser.add_argument("--addition-str", default="", type=str, help="Additional String for experiment name")

    ##### ASHA Parameters
    parser.add_argument("--grace", default=3, type=int, help="grace_period")
    parser.add_argument("--reduction", default=3, type=int, help="reduction_factor")
    parser.add_argument("--brackets", default=1, type=int, help="brackets")

    args, _ = parser.parse_known_args()

    # ray.init(address=args.address)
    ray.init(address=None)
    config = SEARCH_SPACE | {
        "model": args.model,
        "dataset": args.dataset,
        "seed": args.seed,
        "amp": args.amp,
    }

    trainer = TorchTrainer(
        train_func,
        train_loop_config=config,
        scaling_config=ScalingConfig(
            num_workers=1,
            use_gpu=True,
            resources_per_worker={"CPU": 8 / args.mps, "GPU": 1 / args.mps},
            _max_cpu_fraction_per_node=0.9,
        ),
    )

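    # Wrap the base FIFO scheduler in a ResourceChangingScheduler so that idle
    # cluster resources are redistributed among running trials over time.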
    tune_scheduler = FIFOScheduler()
    tune_scheduler = ResourceChangingScheduler(
        base_scheduler=tune_scheduler,
        resources_allocation_function=DistributeResources(add_bundles=True),  # default
    )

    experiment_name = f"{args.model}_{args.dataset}_s{args.max_sample}_e{args.max_epoch}"

    tuner = Tuner(
        trainer,
        param_space={"train_loop_config": config},
        tune_config=TuneConfig(
            num_samples=args.max_sample,
            metric="val_acc",
            mode="max",
            scheduler=tune_scheduler,
            time_budget_s=args.max_time if args.max_time > 0 else None,
        ),
        run_config=RunConfig(
            name=experiment_name,
            local_dir="../ray_results",
            log_to_file=True,
            stop={"training_iteration": args.max_epoch, "val_acc": args.target_acc},
            checkpoint_config=CheckpointConfig(num_to_keep=1),
            # callbacks=[WandbLoggerCallback(api_key_file="~/.wandb/api_key", project=f"{experiment_name}")],
            failure_config=FailureConfig(fail_fast=True, max_failures=0),
        ),
    )

    results = tuner.fit()
    print(results.get_best_result(metric="val_acc", mode="max"))
    df = results.get_dataframe()
    df.to_csv(f"../ray_results/{experiment_name}.csv")

    time.sleep(5)
    os.system("ray stop --force")

Qinghao-Hu (Contributor) commented:
Thanks for your reply.

Above is an example script that reproduces this issue, and the screenshot below shows it running on my local server with 4x 3090 GPUs. I also tested it on a Slurm cluster with 8x A100 GPUs. After a while, no trial is actually running.

Environment: Ray 2.1, PyTorch 1.13, Python 3.9.

[Screenshot from 2022-11-15 10-49-12]

Yard1 (Member, Author) commented Nov 15, 2022

Hey, thanks! I can reproduce the behavior with the script you provided, and I believe I have identified the issue. Can you check whether removing the `_max_cpu_fraction_per_node` argument helps?
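For reference, that amounts to dropping the argument from the ScalingConfig in your script, along these lines (a minimal sketch, everything else unchanged):

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"CPU": 8 / args.mps, "GPU": 1 / args.mps},
        # _max_cpu_fraction_per_node=0.9,  # removed for this check
    ),
)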

Qinghao-Hu (Contributor) commented:
Thank you very much. Removing the `_max_cpu_fraction_per_node` argument works.

Qinghao-Hu (Contributor) commented:
I just did a quick check by changing max-epoch=10 and max-sample=6 on a 4-GPU server, and found another strange issue when finishing the final trial.

Number of trials: 6/6 (1 ERROR, 5 TERMINATED)
+--------------------------+------------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+----------+-----------+--------------+
| Trial name               | status     | loc                  |   train_loop_config/ba |   train_loop_config/ga |   train_loop_config/lr |   train_loop_config/mo |   iter |   total time (s) |     loss |   val_acc |   _timestamp |
|                          |            |                      |               tch_size |                    mma |                        |                 mentum |        |                  |          |           |              |
|--------------------------+------------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+----------+-----------+--------------|
| TorchTrainer_38651_00000 | TERMINATED | 10.100.77.179:343446 |                    256 |                   0.28 |                 0.0047 |                  0.859 |     10 |         100.938  | 0.594755 |    0.8013 |   1668568759 |
| TorchTrainer_38651_00001 | TERMINATED | 10.100.77.179:343499 |                    128 |                   0.36 |                 0.0004 |                  0.546 |     10 |         172.939  | 1.0751   |    0.6165 |   1668568834 |
| TorchTrainer_38651_00002 | TERMINATED | 10.100.77.179:343501 |                    128 |                   0.38 |                 0.0478 |                  0.967 |     10 |         166.395  | 0.900563 |    0.6938 |   1668568827 |
| TorchTrainer_38651_00003 | TERMINATED | 10.100.77.179:343503 |                    512 |                   0.21 |                 0.0551 |                  0.602 |     10 |          67.7346 | 0.641064 |    0.7892 |   1668568729 |
| TorchTrainer_38651_00004 | TERMINATED | 10.100.77.179:351670 |                    256 |                   0.39 |                 0.0137 |                  0.956 |     10 |          95.2324 | 0.612997 |    0.7935 |   1668568828 |
| TorchTrainer_38651_00005 | ERROR      | 10.100.77.179:354571 |                    256 |                   0.87 |                 0.5708 |                  0.888 |      7 |          70.6566 | 2.14731  |    0.4166 |   1668568833 |
+--------------------------+------------+----------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+----------+-----------+--------------+
Failure # 1 (occurred at 2022-11-16_11-13-31)
ray::_Inner.train() (pid=323188, ip=10.100.77.179, repr=TorchTrainer)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=323486, ip=10.100.77.179, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f52e012cf10>)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 125, in train_func
    train_epoch(train_loader, model, criterion, optimizer, fusion_num)
  File "/home/qhhu/workdir/HPO/hydro/workloads/ray_func_tuner_forSH.py", line 40, in train_epoch
    for batch, (X, y) in enumerate(dataloader):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 641, in __iter__
    self._prefetch_next_batch()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 636, in _prefetch_next_batch
    next_batch = next(self.dataloader_iter, None)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 244, in _worker_loop
    init_fn(worker_id)
  File "/home/qhhu/miniconda3/lib/python3.9/site-packages/ray/train/torch/train_loop_utils.py", line 433, in wrapper
    worker_init_fn(worker_id)
TypeError: 'NoneType' object is not callable

Yard1 (Member, Author) commented Nov 16, 2022

This seems to be the same issue as in #30247, which should be fixed by #30266.

As a workaround, you can specify a dummy `worker_init_fn` inside the `DataLoader`.
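A minimal sketch of that workaround, based on the DataLoader calls in your script (the no-op helper name is just illustrative):

def noop_worker_init_fn(worker_id: int) -> None:
    # Passing any callable here avoids the "'NoneType' object is not callable"
    # error raised from Ray's worker_init_fn wrapper in the traceback above.
    pass

train_loader = DataLoader(
    train_set,
    batch_size=worker_batch_size,
    num_workers=8,
    pin_memory=True,
    shuffle=True,
    worker_init_fn=noop_worker_init_fn,
)
val_loader = DataLoader(
    val_set,
    batch_size=worker_batch_size,
    num_workers=8,
    pin_memory=True,
    worker_init_fn=noop_worker_init_fn,
)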

@richardliaw added the ray-team-created label on Dec 21, 2022