Synthetic Datasets and Subset Sampling #110

Merged: 67 commits merged into dev from ravi/remove_dataloader_spec on Dec 3, 2021

Conversation


@ravi-mosaicml ravi-mosaicml commented Nov 30, 2021

This PR adds the following:

  1. Better synthetic dataset support. Each dataset's hparams can optionally have a `synthetic` field. If this field exists and is not None, then `dataset.initialize_object()` should return a synthetic dataset instead of the real one. The `synthetic` field can contain user-specified hparams for the synthetic dataset, such as the number of unique samples to create, the device, or the memory format.
  2. Subset datasets. Each dataset's hparams can optionally have a `num_total_batches` field. If this field exists and is not None, then the dataset should limit the total number of samples to this value times the batch size (which is passed in on `initialize_object`). This field allows for smoketesting (e.g. set `num_total_batches` to 1 to ensure that the path is accessible on disk). If used with a synthetic dataset, then the synthetic dataset will have this many batches. A sketch of how these two fields might fit together follows this list.
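
As a rough illustration, the two fields could be wired into a dataset hparams class roughly like this. This is a sketch only: the class name `MNISTDatasetHparams`, the field layout, and the torchvision fallback are assumptions for demonstration, not composer's actual hparams classes.

```python
# Illustrative sketch only: class/field names and the torchvision fallback are
# assumptions, not the actual composer hparams classes.
from dataclasses import dataclass
from typing import Optional

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


@dataclass
class MNISTDatasetHparams:
    datadir: str = "/datasets/mnist"
    synthetic: Optional[dict] = None          # e.g. {"num_unique_samples": 128}
    num_total_batches: Optional[int] = None   # e.g. 1 for a smoke test

    def initialize_object(self, batch_size: int, **dataloader_kwargs) -> DataLoader:
        if self.synthetic is not None:
            # The dataset knows its own shapes and number of classes, so tests
            # do not need to hard-code them.
            n = self.synthetic.get("num_unique_samples", 128)
            dataset = TensorDataset(torch.randn(n, 1, 28, 28),
                                    torch.randint(0, 10, (n,)))
        else:
            from torchvision.datasets import MNIST
            from torchvision.transforms import ToTensor
            dataset = MNIST(self.datadir, train=True, download=True, transform=ToTensor())

        if self.num_total_batches is not None:
            # Keep only num_total_batches * batch_size samples.
            limit = min(self.num_total_batches * batch_size, len(dataset))
            dataset = Subset(dataset, range(limit))

        return DataLoader(dataset, batch_size=batch_size, **dataloader_kwargs)
```

With this shape, setting `synthetic` together with `num_total_batches=1` yields a one-batch synthetic run, which is exactly the smoke-test case described in the motivation below.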

Motivation:

  • Testing: Since each dataset can describe how to generate a synthetic dataset, the test cases no longer have to know dataset- or model-specific parameters (e.g. that MNIST has 10 output classes). Instead, the dataset's `initialize_object` is responsible for setting this correctly when creating synthetic data.
  • Smoketesting: When smoketesting, it can be helpful to load one really small batch to ensure that the path on disk is correct (when using real data) or that the model can train (when using synthetic data).
  • Profiling runs: For profiling runs, we'll usually want to discard the first epoch so the data is loaded into the cache. However, we don't need to profile the entire dataset -- only a subset!

Closes #103, #61, and #49

TODO:

Closes #11

This PR helps clean up some of the tests and rank-zero callbacks, and will be used by future profiling work.
Removed deferred logging since rank is now known at the init event
Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp.

This change allows callbacks, algorithms, etc. to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality.

Also removed DDP from the state, as that is available globally.
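
For example, a rank-zero-only callback can now call the global DDP helpers directly instead of deferring logging. A minimal sketch follows; the helper names `ddp.barrier()` and `ddp.get_global_rank()` and the callback hook name are assumptions about `composer.utils.ddp` and the callback interface, not a verified API surface.

```python
# Sketch of a callback using the global DDP helpers directly; function and hook
# names here are assumptions for illustration, not a verified composer API.
from composer.core import Callback, Logger, State
from composer.utils import ddp


class RankZeroReporter(Callback):
    def epoch_end(self, state: State, logger: Logger) -> None:
        ddp.barrier()  # wait for all ranks before reporting
        if ddp.get_global_rank() == 0:
            print(f"Finished epoch {state.epoch} on rank 0")
```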
1. `dataset.initialize_object()` returns a dataloader (or a dataloader, split_fn, and preprocessing_fn tuple), rather than a dataset and dataloader init args. This is made possible by #65. (See the sketch after this list.)
2. Better synthetic dataset support. Each dataset's hparams can optionally have a `synthetic` field. If this field exists and is not None, then `dataset.initialize_object()` should return a synthetic dataset instead of the real one. The `synthetic` field can contain user-specified hparams for the synthetic dataset, such as the number of unique samples to create, the device, or the memory format.
3. Subset datasets. Each dataset's hparams can optionally have a `num_total_batches` field. If this field exists and is not None, then the dataset should limit the total number of samples to this value times the batch size (which is passed in on `initialize_object`). This field allows for smoketesting (e.g. set `num_total_batches` to 1 to ensure that the path is accessible on disk). If used with a synthetic dataset, then the synthetic dataset will have this many batches.
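
To make item 1 concrete, the two return shapes might look roughly like the sketch below. Function names and bodies here are illustrative, not composer's actual signatures.

```python
# Illustrative sketch of the two return shapes initialize_object() may take
# after this change; names and bodies are illustrative only.
from typing import Callable, List, Tuple

import torch
from torch.utils.data import DataLoader, TensorDataset


def _make_loader(batch_size: int) -> DataLoader:
    data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
    return DataLoader(data, batch_size=batch_size)


def initialize_object_simple(batch_size: int) -> DataLoader:
    # Simple case: hand the trainer a ready-to-use dataloader.
    return _make_loader(batch_size)


def split_fn(batch, num_microbatches: int) -> List:
    # Split an (x, y) batch into micro-batches for gradient accumulation.
    x, y = batch
    return list(zip(x.chunk(num_microbatches), y.chunk(num_microbatches)))


def device_transform_fn(batch):
    # Runs after the batch has been moved onto the device (e.g. normalization).
    x, y = batch
    return x / 255.0, y


def initialize_object_with_fns(batch_size: int) -> Tuple[DataLoader, Callable, Callable]:
    # Datasets needing custom behavior also return the two functions.
    return _make_loader(batch_size), split_fn, device_transform_fn
```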

TODO:
- [ ] Fix tests
- [ ] Update model yamls for new format
- [ ] Create issues for adding synthetic datagen for NLP models and brats
- [ ] Add tests for num_total_batches. There will probably be some issues with DDP and DistributedSampler.
Base automatically changed from ravi/ddp_global to dev November 30, 2021 18:12
1. Before, the `dataloader_spec`, `batch_size`, and `dataloader_hparams` were passed as arguments into the trainer. Now, the trainer is initialized with a dataloader (or a dataloader, split_fn, and preprocessing_fn tuple). This change makes the `DataloaderSpec` optional and hidden from the user for simple datasets that do not require custom preprocessing or split functions.

2. Removed `dataloader_to_device` and replaced it with explicit calls in the training loop to 1) move data onto the device and 2) execute the preprocessing fn, which is renamed to the device transformation fn. Removed the option to execute the device transformation fn in a CUDA stream, since it did not add any performance improvement. When using memory pinning, `batch_to_device` should be a no-op, since the dataloader would have already moved the data onto the GPU.
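
A minimal sketch of what the training loop now does explicitly is shown below; this is generic PyTorch under assumed names, not the actual trainer implementation.

```python
# Sketch of the intended behavior after removing dataloader_to_device:
# 1) move the batch onto the device, 2) apply the device transformation fn.
import torch


def move_to_device(batch, device: torch.device):
    # batch_to_device stand-in: recursively move tensors onto the target device.
    if isinstance(batch, torch.Tensor):
        return batch.to(device, non_blocking=True)
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(b, device) for b in batch)
    return batch


def train_epoch(model, dataloader, optimizer, loss_fn, device, device_transform_fn=None):
    model.train()
    for batch in dataloader:
        batch = move_to_device(batch, device)      # explicit call 1
        if device_transform_fn is not None:
            batch = device_transform_fn(batch)     # explicit call 2
        x, y = batch
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```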

TODO:
- [ ] Regression test on the resnet baseline to ensure no throughput or accuracy degradations
@ravi-mosaicml ravi-mosaicml changed the base branch from dev to ravi/dataloaders_in_trainer November 30, 2021 21:57
@ravi-mosaicml ravi-mosaicml changed the title Dataset and Dataloader Upgrades Synthetic Datasets and Subset Sampling Dec 1, 2021

@jbloxham jbloxham left a comment

I still don't love the num_total_batches param as opposed to total_dataset_size, but I can at least appreciate how this makes writing tests and profilers much simpler.

@@ -7,7 +7,7 @@
 import warnings
 from typing import Sequence

-from composer import Logger, State
+from composer.core import Logger, State
Contributor
did something change with core that required this?

Contributor Author

I was running into warnings on the datasets when building the docs, so I reorganized some of the imports in `__init__`, and then ran into circular import errors, which is why I changed it here. I can move it out into a separate PR if that would be better.

ravi-mosaicml and others added 3 commits December 1, 2021 17:55
Co-authored-by: Jamie Bloxham <jamie.a.bloxham@gmail.com>
Co-authored-by: Jamie Bloxham <jamie.a.bloxham@gmail.com>
@hanlint hanlint added the release label Dec 2, 2021

@hanlint hanlint left a comment


LGTM. In lieu of jenkins, can we kick off some smoke test runs with resnet50/ddp and gpt to make sure nothing has broken here?

Base automatically changed from ravi/dataloaders_in_trainer to dev December 3, 2021 01:12
@ravi-mosaicml ravi-mosaicml merged commit 7398ee3 into dev Dec 3, 2021
@ravi-mosaicml ravi-mosaicml deleted the ravi/remove_dataloader_spec branch December 3, 2021 01:26
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Successfully merging this pull request may close these issues.

- Synthetic Data Generation
- Enable small "smoke test" runs
- Support for subset sampler
4 participants