
[AIR] Added Ray Logging to MosaicTrainer #29620

Merged Oct 27, 2022 (82 commits; changes shown from 76 commits).
364d7f6
add initial trainer API files
ilee300a Oct 11, 2022
cfa593a
created a basic composer trainer wrapper and a test
ilee300a Oct 11, 2022
919aea4
Merge branch 'ray-project:master' into init_mosaic_trainer_api
ilee300a Oct 11, 2022
7f3b551
formatted
ilee300a Oct 11, 2022
0871fa4
removed unused config key
ilee300a Oct 11, 2022
f083e1b
Merge branch 'init_mosaic_trainer_api' of https://github.com/ilee300a…
ilee300a Oct 12, 2022
536a381
merged
ilee300a Oct 12, 2022
9d15416
Merge branch 'init_mosaic_trainer_api' of https://github.com/ilee300a…
ilee300a Oct 12, 2022
7b509da
addressed comments
ilee300a Oct 12, 2022
2712fda
addressed comments
ilee300a Oct 12, 2022
1851322
Merge branch 'ray-project:master' into init_mosaic_trainer_api
ilee300a Oct 13, 2022
ae1b9a6
change the data directory
ilee300a Oct 13, 2022
5d599e4
removed progress bar by default
ilee300a Oct 13, 2022
5da865d
move dataset preparation inside the trainer init function
ilee300a Oct 13, 2022
0e42dec
add mosaic trainer to the api ref
ilee300a Oct 13, 2022
104075f
add fit config
ilee300a Oct 13, 2022
161009e
remove datasets in the trainer worker loop
ilee300a Oct 13, 2022
72dc74f
added a test for fit config
ilee300a Oct 13, 2022
faa2e9b
added example MosaicTrainer script and a test
ilee300a Oct 13, 2022
810428c
added mosaic trainer test to the build script
ilee300a Oct 13, 2022
3d32969
removed unnecessary line
ilee300a Oct 13, 2022
b96f384
reformatted passing in config and use regex for error message testing
ilee300a Oct 13, 2022
1fe7989
Fixed typo
ilee300a Oct 13, 2022
d39b63d
remove composer loggers in training callback
ilee300a Oct 17, 2022
de77a64
added a test for logger removal
ilee300a Oct 17, 2022
b297926
addressed comments and removed fit_config
ilee300a Oct 17, 2022
00bd1f8
fixed a typo
ilee300a Oct 17, 2022
be4c311
added further description for trainier_init_config
ilee300a Oct 17, 2022
a93ddf9
fixed typo
ilee300a Oct 17, 2022
8b324af
fixed mosaic test_example
ilee300a Oct 17, 2022
65bf8bb
addressed comments
ilee300a Oct 18, 2022
2b57038
added composer to tune requirements
ilee300a Oct 18, 2022
5c96046
fixed library import
ilee300a Oct 18, 2022
7177134
updated torch and torchvision versions for composer compatibility
ilee300a Oct 18, 2022
3617626
removed mosaic from requirements and install it separately in test file
ilee300a Oct 18, 2022
1563395
added mosaicml library to the doc requirements
ilee300a Oct 18, 2022
9fb684e
add mosaicml in custom directives instead
ilee300a Oct 18, 2022
d27c20b
changed mock module name
ilee300a Oct 18, 2022
e1e92e4
added more composer libraries to custom directives
ilee300a Oct 18, 2022
8c64463
reformatted
ilee300a Oct 18, 2022
4fec09e
updated for tests
ilee300a Oct 19, 2022
652c12f
added imports to the docstring example
ilee300a Oct 19, 2022
31914cf
fixed indentation
ilee300a Oct 19, 2022
706d971
moved imports inside the functions
ilee300a Oct 20, 2022
797b4df
added doctest skip and moved MosaicTrainer import in tests
ilee300a Oct 20, 2022
715c37f
Merge branch 'ray-project:master' into init_mosaic_trainer_api
ilee300a Oct 20, 2022
175aeaa
added more doc test skips
ilee300a Oct 20, 2022
82305c4
move composer import
ilee300a Oct 20, 2022
883d2e5
moved mosaic test example imports
ilee300a Oct 21, 2022
8c3306c
added mosaicml to requirements_train
ilee300a Oct 21, 2022
69f60f9
Merge branch 'ray-project:master' into init_mosaic_trainer_api
ilee300a Oct 21, 2022
f168f6c
updated torch and torchvision versions
ilee300a Oct 21, 2022
9e3d119
Merge branch 'init_mosaic_trainer_api' of https://github.com/ilee300a…
ilee300a Oct 21, 2022
858973b
updated ml docker to match torch version
ilee300a Oct 21, 2022
4654631
update
amogkam Oct 21, 2022
48b48bb
update
amogkam Oct 21, 2022
20dc4f8
Update .buildkite/pipeline.ml.yml
amogkam Oct 21, 2022
e401eff
Update .buildkite/pipeline.ml.yml
amogkam Oct 22, 2022
eb0c97d
Update python/requirements/ml/requirements_dl.txt
amogkam Oct 24, 2022
928b529
add conftest
amogkam Oct 24, 2022
3529b77
added ray logging
ilee300a Oct 24, 2022
5e956cc
Merge branch 'init_mosaic_trainer_api' into ray_logging
ilee300a Oct 24, 2022
7ddee9b
remove loggers after trainer instantiation as well
ilee300a Oct 24, 2022
9d1a302
refactor
amogkam Oct 24, 2022
1a2c4b1
suppress console to log
ilee300a Oct 25, 2022
7fcb7b0
check the epoch and accuracy data for test example
ilee300a Oct 25, 2022
9b3d9e7
Merge branch 'init_mosaic_trainer_api' into ray_logging
ilee300a Oct 25, 2022
4cf8fa0
updated fixtures
ilee300a Oct 25, 2022
cbc1c6e
Merge branch 'master' into init_mosaic_trainer_api
ilee300a Oct 25, 2022
1fc42f4
Merge branch 'init_mosaic_trainer_api' into ray_logging
ilee300a Oct 25, 2022
acbcc2a
added custom directives for mosaic util library
ilee300a Oct 25, 2022
0c79025
updated removing composer loggers
ilee300a Oct 25, 2022
3d84dcb
updated -- add ray logger after trainer init without using loggers key
ilee300a Oct 25, 2022
2ce2b41
fixed typo
ilee300a Oct 27, 2022
1993c0f
updated ray logging interval
ilee300a Oct 27, 2022
3f15a5a
updated ray logging tests with suggested changes
ilee300a Oct 27, 2022
36393e9
made test dataset size scaled by predefined batch size
ilee300a Oct 27, 2022
09acf07
updated logging interval
ilee300a Oct 27, 2022
870b8c5
Update python/ray/train/mosaic/_mosaic_utils.py
ilee300a Oct 27, 2022
18392ff
updated logging -- flush data and added comments
ilee300a Oct 27, 2022
875dacc
Merge branch 'ray_logging' of https://github.com/ilee300a/ray into ra…
ilee300a Oct 27, 2022
ac07acc
Merge branch 'ray-project:master' into ray_logging
ilee300a Oct 27, 2022
2 changes: 2 additions & 0 deletions doc/source/custom_directives.py
@@ -114,6 +114,8 @@ def update_context(app, pagename, templatename, context, doctree):
"composer.trainer",
"composer.loggers",
"composer.loggers.logger_destination",
"composer.core",
"composer.core.state",
]


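The mock-module list above exists so the docs build can import Ray's Mosaic integration without installing Composer. A minimal sketch of the same idea using stock Sphinx (`autodoc_mock_imports` is standard Sphinx configuration; Ray's `custom_directives.py` maintains its own mock list, so this is illustrative, not Ray's actual setup):

```python
# Sketch: mocking heavy optional dependencies for a Sphinx docs build.
# The module names mirror the list added in this diff.
autodoc_mock_imports = [
    "composer.trainer",
    "composer.loggers",
    "composer.loggers.logger_destination",
    "composer.core",
    "composer.core.state",
]
```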
8 changes: 4 additions & 4 deletions python/ray/train/examples/mosaic_cifar10_example.py
@@ -19,7 +19,7 @@ def trainer_init_per_worker(config):
from composer.models.tasks import ComposerClassifier
import composer.optim

BATCH_SIZE = 32
BATCH_SIZE = 64
# prepare the model for distributed training and wrap with ComposerClassifier for
# Composer Trainer compatibility
model = torchvision.models.resnet18(num_classes=10)
@@ -37,13 +37,13 @@ def trainer_init_per_worker(config):
datasets.CIFAR10(
data_directory, train=True, download=True, transform=cifar10_transforms
),
list(range(64)),
list(range(BATCH_SIZE * 10)),
)
test_dataset = torch.utils.data.Subset(
datasets.CIFAR10(
data_directory, train=False, download=True, transform=cifar10_transforms
),
list(range(64)),
list(range(2048)),
[Review comment, Member]: Why is the batch size for training a constant while this is an inline number?

[Review comment, Contributor Author]: Updated so that it is BATCH_SIZE * 10, just like the train dataset.
)

batch_size_per_worker = BATCH_SIZE // session.get_world_size()
@@ -82,7 +82,7 @@ def train_mosaic_cifar10(num_workers=2, use_gpu=False):
from ray.train.mosaic import MosaicTrainer

trainer_init_config = {
"max_duration": "1ep",
"max_duration": "2ep",
"algorithms": [LabelSmoothing()],
"should_eval": False,
}
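The sizing change above can be checked with quick arithmetic (the world size of 2 matches the example's default `num_workers`; the numbers are illustrative):

```python
# Why the subsets are sized as BATCH_SIZE * 10: each epoch then consists of
# exactly 10 full global batches, split evenly across workers.
BATCH_SIZE = 64
num_train_samples = BATCH_SIZE * 10      # 640 samples in the train subset
world_size = 2                           # matches num_workers=2 in the example
batch_size_per_worker = BATCH_SIZE // world_size

assert num_train_samples % BATCH_SIZE == 0   # no ragged final batch
print(num_train_samples, batch_size_per_worker)  # 640 32
```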
56 changes: 56 additions & 0 deletions python/ray/train/mosaic/_mosaic_utils.py
@@ -0,0 +1,56 @@
from typing import Any, Dict, Optional, List
import torch

from composer.loggers import Logger
from composer.loggers.logger_destination import LoggerDestination
from composer.core.state import State

from ray.air import session


class RayLogger(LoggerDestination):
"""A logger to relay information logged by composer models to ray.

This logger allows utilizing all necessary logging and logged-data handling
provided by the Composer library. All logged information is saved in the data
dictionary every time new information is logged, but to reduce unnecessary
reporting, the most up-to-date logged information is reported as metrics at every
batch checkpoint and epoch checkpoint (see Composer's Event module for details).

Because Ray's metrics dataframe will not include new keys reported after the
very first report call, any logged information whose keys are not included in the
first batch checkpoint will not be retrievable after training. In other words, if
[Review comment, Contributor]: Are users expected to know these keys upfront? Looking at the Mosaic code, it seems that these keys are automatically added by Mosaic algorithms and callbacks, so I don't think users are aware of what these keys are in order to provide them here.

[Review comment, Contributor]: I think we should fix the underlying bug.
the log level is greater than `LogLevel.BATCH` for some data, it will not be
present in `Result.metrics_dataframe`. To preserve that information, the user
can provide keys to always include in the reported data via the `keys` argument
in the constructor. For `MosaicTrainer`, use `trainer_init_config['log_keys']`
to populate these keys.

Note that in the Event callback functions, we delete unused arguments, as is
the practice in Mosaic's Composer library.

Args:
keys: the key values that will be included in the reported metrics.
"""

def __init__(self, keys: List[str] = None) -> None:
self.data = {}
if keys:
for key in keys:
self.data[key] = None

def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
self.data.update(metrics.items())
for key, val in self.data.items():
if isinstance(val, torch.Tensor):
self.data[key] = val.item()

def epoch_checkpoint(self, state: State, logger: Logger) -> None:
del logger # unused
session.report(self.data)

def fit_end(self, state: State, logger: Logger) -> None:
# report at close in case the trainer stops in the middle of an epoch.
# this may be double counted with epoch checkpoint.
del logger # unused
session.report(self.data)
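The reporting behavior of RayLogger can be illustrated standalone. The sketch below stubs out the Ray session and the Composer types (`FakeSession` and `BufferingLogger` are hypothetical stand-ins, not part of either library) and shows that only the most recent value per key reaches each report:

```python
# Stand-in for ray.air.session: just records what gets reported.
class FakeSession:
    def __init__(self):
        self.reports = []

    def report(self, data):
        self.reports.append(dict(data))


# Mimics RayLogger's buffering: every log_metrics call updates one dict,
# which is reported wholesale at each epoch checkpoint.
class BufferingLogger:
    def __init__(self, session, keys=None):
        self.session = session
        self.data = {key: None for key in (keys or [])}

    def log_metrics(self, metrics, step=None):
        self.data.update(metrics)

    def epoch_checkpoint(self):
        self.session.report(self.data)


session = FakeSession()
logger = BufferingLogger(session, keys=["metrics/train/Accuracy"])
logger.log_metrics({"loss/train": 0.9})
logger.log_metrics({"loss/train": 0.7, "metrics/train/Accuracy": 0.42})
logger.epoch_checkpoint()
# One report, containing only the latest value for each key.
print(session.reports[0])
```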
[Review comment, Contributor]: We shouldn't report at both the batch level and the epoch level. Each call to session.report should be one iteration, so if we log at both, we will be double counting. For now, let's log only at every epoch. We can see in the future if we want to give users the ability to configure this.

[Review comment, Member]: I think we'll definitely have users who want to do it at either level. That was the case with HF, where we started with epochs only and had to add steps too.

[Review comment, Contributor]: Completely agree @Yard1. I'm thinking we can default to epoch for now and then add batch support in a follow-up.

[Review comment, Contributor Author]: This has been updated! We are reporting each epoch now, but we also report after the fit call, in case training ends before an epoch checkpoint call could be made. This adds an extra report call, in which an epoch checkpoint can be double counted. We could also make this last call only when there are extra batch runs after the last epoch run.

[Review comment, Contributor Author, Oct 27, 2022]: The change mentioned above has been applied.
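The guard the thread converges on can be sketched as follows. This is a hypothetical implementation of the idea, not the exact code merged in this PR: track whether any metrics arrived since the last epoch report, and only report in `fit_end` when they did.

```python
class FakeSession:
    """Stand-in for ray.air.session; records reported dicts."""
    def __init__(self):
        self.reports = []

    def report(self, data):
        self.reports.append(dict(data))


class EpochAwareLogger:
    """Reports per epoch; fit_end reports only if un-reported metrics remain."""
    def __init__(self, session):
        self.session = session
        self.data = {}
        self._dirty = False  # True when metrics arrived since the last report

    def log_metrics(self, metrics):
        self.data.update(metrics)
        self._dirty = True

    def epoch_checkpoint(self):
        self.session.report(self.data)
        self._dirty = False

    def fit_end(self):
        if self._dirty:  # skip when the last epoch checkpoint already reported
            self.session.report(self.data)


s = FakeSession()
log = EpochAwareLogger(s)
log.log_metrics({"loss": 0.5})
log.epoch_checkpoint()
log.fit_end()                      # nothing new since the epoch: no report
print(len(s.reports))              # 1

log.log_metrics({"loss": 0.4})     # training stopped mid-epoch
log.fit_end()
print(len(s.reports))              # 2
```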

12 changes: 10 additions & 2 deletions python/ray/train/mosaic/mosaic_trainer.py
@@ -9,6 +9,7 @@
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import DatasetConfig, RunConfig, ScalingConfig
from ray.train.mosaic._mosaic_utils import RayLogger
from ray.train.torch import TorchConfig, TorchTrainer
from ray.train.trainer import GenDataset
from ray.util import PublicAPI
@@ -207,16 +208,23 @@ def _mosaic_train_loop_per_worker(config):
os.environ["WORLD_SIZE"] = str(session.get_world_size())
os.environ["LOCAL_RANK"] = str(session.get_local_rank())

# Replace Composer's Loggers with RayLogger
ray_logger = RayLogger(keys=config.pop("log_keys", []))

# initialize Composer trainer
config["progress_bar"] = False
trainer: Trainer = trainer_init_per_worker(config)

# Remove Composer's Loggers
# Remove Composer's Loggers if there are any added in the trainer_init_per_worker
# this removes the logging part of the loggers
filtered_callbacks = list()
for callback in trainer.state.callbacks:
if not isinstance(callback, LoggerDestination):
filtered_callbacks.append(callback)
filtered_callbacks.append(ray_logger)
trainer.state.callbacks = filtered_callbacks

# this prevents data from being routed to all the Composer Loggers
trainer.logger.destinations = (ray_logger,)

# call the trainer
trainer.fit()
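The filtering step in `_mosaic_train_loop_per_worker` above drops every LoggerDestination the user's init function attached, keeps the remaining callbacks, and appends the Ray logger. A self-contained sketch with stand-in classes (the names mimic Composer's hierarchy but are defined locally for illustration):

```python
# Minimal stand-ins for Composer's callback hierarchy.
class Callback:
    pass

class LoggerDestination(Callback):    # loggers are a Callback subclass
    pass

class SpeedMonitor(Callback):         # a non-logging callback: should survive
    pass

class FileLogger(LoggerDestination):  # a logging callback: should be removed
    pass

class RayLoggerStub(LoggerDestination):
    pass

callbacks = [SpeedMonitor(), FileLogger()]
ray_logger = RayLoggerStub()

# Same shape as the trainer loop: keep non-logger callbacks, add our logger.
filtered_callbacks = [cb for cb in callbacks if not isinstance(cb, LoggerDestination)]
filtered_callbacks.append(ray_logger)

print([type(cb).__name__ for cb in filtered_callbacks])  # ['SpeedMonitor', 'RayLoggerStub']
```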
98 changes: 95 additions & 3 deletions python/ray/train/tests/test_mosaic_trainer.py
@@ -71,7 +71,7 @@ def trainer_init_per_worker(config):
weight_decay=2.0e-3,
)

if config.pop("eval", False):
if config.pop("should_eval", False):
config["eval_dataloader"] = evaluator

return composer.trainer.Trainer(
@@ -85,9 +85,17 @@ def trainer_init_per_worker(config):
def test_mosaic_cifar10(ray_start_4_cpus):
from ray.train.examples.mosaic_cifar10_example import train_mosaic_cifar10

_ = train_mosaic_cifar10()
result = train_mosaic_cifar10().metrics_dataframe

# TODO : add asserts once reporting has been integrated
# check the max epoch value
assert result["epoch"][result.index[-1]] == 1

# check train_iterations
assert result["_training_iteration"][result.index[-1]] == 3

# check metrics/train/Accuracy has increased
acc = list(result["metrics/train/Accuracy"])
assert acc[-1] > acc[0]


def test_init_errors(ray_start_4_cpus):
@@ -149,6 +157,10 @@ class DummyCallback(Callback):
def fit_start(self, state: State, logger: Logger) -> None:
raise ValueError("Composer Callback object exists.")

class DummyMonitorCallback(Callback):
def fit_start(self, state: State, logger: Logger) -> None:
logger.log_metrics({"dummy_callback": "test"})

# DummyLogger should not throw an error since it should be removed before `fit` call
trainer_init_config = {
"max_duration": "1ep",
@@ -175,6 +187,86 @@ def fit_start(self, state: State, logger: Logger) -> None:
trainer.fit()
assert e == "Composer Callback object exists."

trainer_init_config["callbacks"] = DummyMonitorCallback()
trainer = MosaicTrainer(
trainer_init_per_worker=trainer_init_per_worker,
trainer_init_config=trainer_init_config,
scaling_config=scaling_config,
)

result = trainer.fit()

assert "dummy_callback" in result.metrics
assert result.metrics["dummy_callback"] == "test"


def test_metrics_key(ray_start_4_cpus):
from ray.train.mosaic import MosaicTrainer

"""Tests if `log_keys` defined in `trianer_init_config` appears in result
metrics_dataframe.
"""
trainer_init_config = {
"max_duration": "1ep",
"should_eval": True,
"log_keys": ["metrics/my_evaluator/Accuracy"],
}

trainer = MosaicTrainer(
trainer_init_per_worker=trainer_init_per_worker,
trainer_init_config=trainer_init_config,
scaling_config=scaling_config,
)

result = trainer.fit()

# check if the passed in log key exists
assert "metrics/my_evaluator/Accuracy" in result.metrics_dataframe.columns


def test_monitor_callbacks(ray_start_4_cpus):
from ray.train.mosaic import MosaicTrainer

# Test Callbacks involving logging (SpeedMonitor, LRMonitor)
from composer.callbacks import SpeedMonitor, LRMonitor, GradMonitor

trainer_init_config = {
"max_duration": "1ep",
"should_eval": True,
}
trainer_init_config["log_keys"] = [
"grad_l2_norm/step",
]
trainer_init_config["callbacks"] = [
SpeedMonitor(window_size=3),
LRMonitor(),
GradMonitor(),
]

trainer = MosaicTrainer(
trainer_init_per_worker=trainer_init_per_worker,
trainer_init_config=trainer_init_config,
scaling_config=scaling_config,
)

result = trainer.fit()

assert len(result.metrics_dataframe) == 2

metrics_columns = result.metrics_dataframe.columns
columns_to_check = [
"wall_clock/train",
"wall_clock/val",
"wall_clock/total",
"lr-DecoupledSGDW/group0",
"grad_l2_norm/step",
]
for column in columns_to_check:
assert column in metrics_columns, column + " is not found"
assert result.metrics_dataframe[column].isnull().sum() == 0, (
column + " column has a null value"
)


[Review comment, Contributor @amogkam, Oct 26, 2022]: Can we make these newly added tests more robust?

  • Do the number of rows in the dataframe match what we expect?
  • We should add a dummy callback that reports to the logger, and then check that the values in the dataframe match what we expect.
  • Are there any other edge cases you can think of?

[Review comment, Contributor Author]: Tests have been updated to check

  • the number of rows in the dataframe
  • the value reported by a dummy callback
  • whether any null value exists for the reported Composer monitoring callbacks
if __name__ == "__main__":
import sys