Synthetic Datasets and Subset Sampling #110

Merged: 67 commits merged into dev from ravi/remove_dataloader_spec on Dec 3, 2021

Conversation


@ravi-mosaicml ravi-mosaicml commented Nov 30, 2021

This PR adds the following:

  1. Better synthetic dataset support. Each dataset's hparams can optionally have a `synthetic` field. If this field exists and is not None, then `dataset.initialize_object()` should return a synthetic dataset instead of the real one. The `synthetic` field can contain user-specified hparams for the synthetic dataset, such as the number of unique samples to create, the device, or the memory format.
  2. Subset datasets. Each dataset's hparams can optionally have a `num_total_batches` field. If this field exists and is not None, then the dataset should limit the total number of samples to this value times the batch size (which is passed in on `initialize_object`). This field allows for smoketesting (e.g. set `num_total_batches` to 1 to ensure that the path is accessible on disk). If used with a synthetic dataset, then the synthetic dataset will have this many batches. A sketch of how these two fields might fit together follows this list.
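
As a rough illustration, the two fields could be wired into a dataset hparams class roughly like this. This is a sketch only: the class name `MNISTDatasetHparams`, the field layout, and the torchvision fallback are assumptions for demonstration, not composer's actual hparams classes.

```python
# Illustrative sketch only: class/field names and the torchvision fallback are
# assumptions, not the actual composer hparams classes.
from dataclasses import dataclass
from typing import Optional

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


@dataclass
class MNISTDatasetHparams:
    datadir: str = "/datasets/mnist"
    synthetic: Optional[dict] = None          # e.g. {"num_unique_samples": 128}
    num_total_batches: Optional[int] = None   # e.g. 1 for a smoke test

    def initialize_object(self, batch_size: int, **dataloader_kwargs) -> DataLoader:
        if self.synthetic is not None:
            # The dataset knows its own shapes and number of classes, so tests
            # do not need to hard-code them.
            n = self.synthetic.get("num_unique_samples", 128)
            dataset = TensorDataset(torch.randn(n, 1, 28, 28),
                                    torch.randint(0, 10, (n,)))
        else:
            from torchvision.datasets import MNIST
            from torchvision.transforms import ToTensor
            dataset = MNIST(self.datadir, train=True, download=True, transform=ToTensor())

        if self.num_total_batches is not None:
            # Keep only num_total_batches * batch_size samples.
            limit = min(self.num_total_batches * batch_size, len(dataset))
            dataset = Subset(dataset, range(limit))

        return DataLoader(dataset, batch_size=batch_size, **dataloader_kwargs)
```

With this shape, setting `synthetic` together with `num_total_batches=1` yields a one-batch synthetic run, which is exactly the smoke-test case described in the motivation below.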

Motivation:

  • Testing: Since each dataset can describe how to generate a synthetic dataset, the test cases no longer have to know dataset- or model-specific parameters (e.g. that MNIST has 10 output classes). Instead, the dataset's `initialize_object` is responsible for setting this correctly when creating synthetic data.
  • Smoketesting: When smoketesting, it can be helpful to load one really small batch to ensure that the path on disk is correct (when using real data) or that the model can train (when using synthetic data).
  • Profiling runs: For profiling runs, we'll usually want to discard the first epoch so the data is loaded into the cache. However, we don't need to profile the entire dataset -- only a subset!

Closes #103, #61, and #49

TODO:

Closes #11

This PR helps clean up some of the tests and rank-zero callbacks, and will be used by future profiling work.
Removed deferred logging since rank is now known at the init event
Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp.

This change allows callbacks, algorithms, etc. to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality.

Also removed DDP from the state, as that is available globally.
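
For example, a rank-zero-only callback can now call the global DDP helpers directly instead of deferring logging. A minimal sketch follows; the helper names `ddp.barrier()` and `ddp.get_global_rank()` and the callback hook name are assumptions about `composer.utils.ddp` and the callback interface, not a verified API surface.

```python
# Sketch of a callback using the global DDP helpers directly; function and hook
# names here are assumptions for illustration, not a verified composer API.
from composer.core import Callback, Logger, State
from composer.utils import ddp


class RankZeroReporter(Callback):
    def epoch_end(self, state: State, logger: Logger) -> None:
        ddp.barrier()  # wait for all ranks before reporting
        if ddp.get_global_rank() == 0:
            print(f"Finished epoch {state.epoch} on rank 0")
```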
1. `dataset.initialize_object()` returns a dataloader (or a dataloader, split_fn, and preprocessing_fn tuple), rather than a dataset and dataloader init args. This is made possible by #65. (See the sketch after this list.)
2. Better synthetic dataset support. Each dataset's hparams can optionally have a `synthetic` field. If this field exists and is not None, then `dataset.initialize_object()` should return a synthetic dataset instead of the real one. The `synthetic` field can contain user-specified hparams for the synthetic dataset, such as the number of unique samples to create, the device, or the memory format.
3. Subset datasets. Each dataset's hparams can optionally have a `num_total_batches` field. If this field exists and is not None, then the dataset should limit the total number of samples to this value times the batch size (which is passed in on `initialize_object`). This field allows for smoketesting (e.g. set `num_total_batches` to 1 to ensure that the path is accessible on disk). If used with a synthetic dataset, then the synthetic dataset will have this many batches.
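
To make item 1 concrete, the two return shapes might look roughly like the sketch below. Function names and bodies here are illustrative, not composer's actual signatures.

```python
# Illustrative sketch of the two return shapes initialize_object() may take
# after this change; names and bodies are illustrative only.
from typing import Callable, List, Tuple

import torch
from torch.utils.data import DataLoader, TensorDataset


def _make_loader(batch_size: int) -> DataLoader:
    data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
    return DataLoader(data, batch_size=batch_size)


def initialize_object_simple(batch_size: int) -> DataLoader:
    # Simple case: hand the trainer a ready-to-use dataloader.
    return _make_loader(batch_size)


def split_fn(batch, num_microbatches: int) -> List:
    # Split an (x, y) batch into micro-batches for gradient accumulation.
    x, y = batch
    return list(zip(x.chunk(num_microbatches), y.chunk(num_microbatches)))


def device_transform_fn(batch):
    # Runs after the batch has been moved onto the device (e.g. normalization).
    x, y = batch
    return x / 255.0, y


def initialize_object_with_fns(batch_size: int) -> Tuple[DataLoader, Callable, Callable]:
    # Datasets needing custom behavior also return the two functions.
    return _make_loader(batch_size), split_fn, device_transform_fn
```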

TODO:
- [ ] Fix tests
- [ ] Update model yamls for new format
- [ ] Create issues for adding synthetic datagen for NLP models and brats
- [ ] Add tests for num_total_batches. There will probably be some issues with DDP and DistributedSampler.
Base automatically changed from ravi/ddp_global to dev November 30, 2021 18:12
1. Before, the `dataloader_spec`, `batch_size`, and `dataloader_hparams` were passed as arguments into the trainer. Now, the trainer is initialized with a dataloader (or a dataloader, split_fn, and preprocessing_fn tuple). This change makes the `DataloaderSpec` optional and hidden from the user for simple datasets that do not require custom preprocessing or split functions.

2. Removed `dataloader_to_device` and replaced it with explicit calls in the training loop to 1) move data onto the device and 2) execute the preprocessing fn, which is renamed to the device transformation fn. Removed the option to execute the device transformation fn in a CUDA stream, since it did not add any performance improvement. When using memory pinning, `batch_to_device` should be a no-op, since the dataloader would have already moved the data onto the GPU.
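
A minimal sketch of what the training loop now does explicitly is shown below; this is generic PyTorch under assumed names, not the actual trainer implementation.

```python
# Sketch of the intended behavior after removing dataloader_to_device:
# 1) move the batch onto the device, 2) apply the device transformation fn.
import torch


def move_to_device(batch, device: torch.device):
    # batch_to_device stand-in: recursively move tensors onto the target device.
    if isinstance(batch, torch.Tensor):
        return batch.to(device, non_blocking=True)
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(b, device) for b in batch)
    return batch


def train_epoch(model, dataloader, optimizer, loss_fn, device, device_transform_fn=None):
    model.train()
    for batch in dataloader:
        batch = move_to_device(batch, device)      # explicit call 1
        if device_transform_fn is not None:
            batch = device_transform_fn(batch)     # explicit call 2
        x, y = batch
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```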

TODO:
- [ ] Regression test on the resnet baseline to ensure no throughput or accuracy degradations
@ravi-mosaicml ravi-mosaicml changed the base branch from dev to ravi/dataloaders_in_trainer November 30, 2021 21:57
@ravi-mosaicml ravi-mosaicml changed the title Dataset and Dataloader Upgrades Synthetic Datasets and Subset Sampling Dec 1, 2021

@jbloxham jbloxham left a comment

I still don't love the num_total_batches param as opposed to total_dataset_size, but I can at least appreciate how this makes writing tests and profilers much simpler.

@@ -7,7 +7,7 @@
 import warnings
 from typing import Sequence

-from composer import Logger, State
+from composer.core import Logger, State
Contributor
did something change with core that required this?

Contributor Author

I was running into warnings on the datasets when building the docs, so I reorganized some of the imports in `__init__`, and then ran into circular import errors, which is why I changed it here. I can move it out into a separate PR if that would be better.

ravi-mosaicml and others added 3 commits December 1, 2021 17:55
Co-authored-by: Jamie Bloxham <jamie.a.bloxham@gmail.com>
Co-authored-by: Jamie Bloxham <jamie.a.bloxham@gmail.com>
@hanlint hanlint added the release label Dec 2, 2021

@hanlint hanlint left a comment


LGTM. In lieu of jenkins, can we kick off some smoke test runs with resnet50/ddp and gpt to make sure nothing has broken here?

Base automatically changed from ravi/dataloaders_in_trainer to dev December 3, 2021 01:12
@ravi-mosaicml ravi-mosaicml merged commit 7398ee3 into dev Dec 3, 2021
@ravi-mosaicml ravi-mosaicml deleted the ravi/remove_dataloader_spec branch December 3, 2021 01:26
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Successfully merging this pull request may close these issues.

- Synthetic Data Generation
- Enable small "smoke test" runs
- Support for subset sampler
4 participants