Ravi/instrumentation point #140

ravi-mosaicml · 2021-12-08T22:52:04Z

Merge ravi's changes back into the base


…ize` (mosaicml#137)

1. Remove the `subset_num_batches` from the dataset hparams. Synthetic datasets should instead use the length of the real dataset as the size, or have a configurable size.
2. Add `train_subset_num_batches` and `eval_subset_num_batches` to the trainer hparams.
3. Add a check in the trainer that ensures that, if this field is set, then `DatasetHparams.shuffle is False`, or otherwise emit a warning that every epoch may be using a different subset of samples.
4. Renamed `total_batch_size` to `train_batch_size`. Updated hparams.
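The check described in item 3 could look roughly like the sketch below. This is not the trainer's actual code: `check_subset_num_batches` is a hypothetical helper, and the `DatasetHparams` dataclass here is a stand-in that carries only the `shuffle` field mentioned above.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetHparams:
    # Hypothetical stand-in; only the `shuffle` field comes from the description above.
    shuffle: bool = False


def check_subset_num_batches(
    subset_num_batches: Optional[int],
    dataset_hparams: DatasetHparams,
    field_name: str,
) -> None:
    """Warn when a batch subset is requested while shuffling is enabled."""
    if subset_num_batches is None:
        return  # field not set: the whole dataset is used, nothing to check
    if dataset_hparams.shuffle:
        warnings.warn(
            f"{field_name} is set but shuffle is True; "
            "every epoch may be using a different subset of samples.",
            stacklevel=2,
        )
```

Calling `check_subset_num_batches(10, DatasetHparams(shuffle=True), "train_subset_num_batches")` emits the warning; with `shuffle=False`, or with the field left as `None`, it stays silent.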


* Added `run_event` to callback

  Closes #11

  This PR helps clean up some of the tests, rank zero callbacks, and will be used by future profiling work.
* Removed callback helper methods
* Fixed tests
* Formatting
* Addressed PR feedback
* Fixed tests
* Formatting
* Fixed _run_event
* Formatting
* Removed ip
* Instrumentation WIP
* Stash
* Create dataloader on trainer __init__()

  #65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__. This change will help with dataloader profiling, as the dataloader is now immediately bound to the state.
* Stash
* Added JSON trace handler
* Formatting
* Fixed trace generation
* Prettified memory
* Fixed setup.py
* Changed setup.py
* testing
* Removed prepare
* Run Directory Uploader

  Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin.

  Closes #98. Depends on #85 and (for tests) #92.
* Supporting both styles for callbacks

  Removed deferred logging since rank is now known at the init event
* Minimizing Diff
* Fixed tests
* Added fasteners
* Fixed tests
* Formatting
* Lazy population of kwargs
* 1. Added object_name_prefix
  2. Tested on google cloud storage
  3. Added exponential backoff and retrying for transient errors
* Addressed PR feedback
* Remove the composer.trainer.ddp class

  Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after DDP was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algorithms, etc. to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality.

  Also removed DDP from the state, as that is available globally.
* Added in DDP barrier
* Fixed tests
* Update composer/utils/ddp.py
* Update composer/utils/ddp.py
* Switched tqdm to using callback hooks

  Added test case for TQDM
* Fixed pyright
* Fixed DDP barriers
* Increased timeout for run directory uploader
* Switched callback format for run directory uploader
* Replaced `atexit` with cleanup methods

  When running the trainer multiple times, such as in interactive environments, `atexit` does not fire. Instead, replaced it with `.close()` and `.post_close()` hooks on callbacks. `.close()` can be used to write and flush files. `.post_close()` can be used to back up the run directory and capture any changes that may have been made on `.close()`
* Uncommented code
* Running callbacks before algorithms for the INIT event in the engine
  * For the INIT event, run the callbacks first to initialize the loggers.
  * For other events, run the algorithms first, so the callbacks have the state after algorithms modify it.
* Fixed tests
* Addressed PR feedback
* Added in the scheduler
* Added instant events
* Fixes
* Fixed profile scheduling
* Added decorator option
* Formatting
* Added documentation for the profiler
* 1. Added test cases
  2. Fixed trace files to be proper json on successful training runs
* Profiler entry point
* Ravi/instrumentation point (#140)

  1. Using `os.getpid()` for process IDs to enable synchronization with the pytorch profiler
  2. Switched to using object format instead of array format for the traces
  3. Added in extra metadata such as global rank and timestamps for clock syncing
* Writing metadata to a separate file
* Fixed tests
* Removed the perf counter
* Recording IO stats
* Log global rank in each torch profiler file
* Merging process traces (#144)
* Refactor the system profiler and dataloader profiler into callbacks

  Configuring the pytorch profiler based off of the mosaic profiler hparams
* 1. Updated the merge script to merge pytorch trace files
  2. Renamed the `MosaicProfiler` to `Profiler`
* Increased timeout
* Formatting
* Fixed the `run_mosaic_profiler`
* Added detailed option
* Added sort index
* Setting `pid` to global rank and `tid` to `os.getpid()`

  The pytorch profiler uses `os.getpid()` for the thread id. Updating the training loop profiler to be consistent so the events will interleave.

  Updated the merge script to replace the PID with the global rank. This ensures that GPU streams will show up under the correct rank, since pytorch by default uses the local GPU rank as the PID. This change also ensures that traces will merge properly across nodes where PIDs could conflict.
* Simplifying diff
* Put the backwards thread second
* Thread sorting in trace
* Fix
* Fixes
* Fixed tests
* Fixed the profiler
* Fixes

Co-authored-by: Jamie Bloxham <jamie.a.bloxham@gmail.com>
Co-authored-by: Bandish Shah <bandish@mosaicml.com>
Co-authored-by: anisehsani <92882465+anisehsani@users.noreply.github.com>
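To illustrate the PID remapping described above ("replace the PID with the global rank"), here is a minimal sketch. It is not the repository's merge script: `merge_traces` is a hypothetical function, the input is assumed to be already-parsed lists of trace events keyed by global rank, and clock syncing between nodes is ignored.

```python
from typing import Any, Dict, List


def merge_traces(traces_by_rank: Dict[int, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merge per-rank trace events, replacing each event's pid with the global rank."""
    merged: List[Dict[str, Any]] = []
    for rank in sorted(traces_by_rank):
        for event in traces_by_rank[rank]:
            remapped = dict(event)   # copy so the input events are not mutated
            remapped["pid"] = rank   # OS PIDs can collide across nodes; ranks cannot
            merged.append(remapped)
    # Stable sort by timestamp so events from different ranks interleave in time order.
    merged.sort(key=lambda e: e.get("ts", 0))
    return merged
```

Because every event ends up keyed by its global rank, two processes on different nodes that happen to share an OS PID still appear as separate rows in a trace viewer.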


ravi-mosaicml and others added 21 commits December 3, 2021 10:03

Remove argparse from setup.py (mosaicml#131)

79e15a4

Argparse is included in the stdlib for python 3.2+, so no need to install from pip. It generates a warning when using colab that you need to restart the runtime.

Fixed pickling of torch.memory_format objects (mosaicml#132)

383b62a

Torch.memory_format cannot be attached to datasets that are passed into dataloaders, as they are not pickleable. Instead, storing the enum until it is ready to be used

Fixed issue mosaicml#135; rename total_batch_size to `train_batch_s…

2992505

…ize` Implemented issue mosaicml#135. Also renamed `total_batch_size` to `train_batch_size`. Updated hparams.

Fixed synthetic samplers

b6adc05

Fixed dataset reg test

f7f6350

timeouts

3ad316e

timeouts

28b4fd8

Fixed trainer fit

fc39b80

Set train shuffle to false

6d97f26

Formatting

3aa8d68

isort

301e100

Merge branch 'ravi/i135' into ravi/instrumentation_point

ea1811c

Added profiler helper script

5a26389

Merge branch 'dev' into ravi/instrumentation_point

25f0451

Align instant events with start events

638cda8

Formatting, fixed instant events, thread names

32c5a33

Implement MosaicMLLoggerBackend (mosaicml#81)

2f56ecd

MosaicMLLoggerBackend

Using os.getpid() for thread IDs to enable synchronization with the …

8c0823c

…pytorch profiler

1. Switched to using object format instead of array format

283a2f2

2. Added in extra metadata such as global rank and timestamps for clock syncing
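For readers unfamiliar with the two layouts, the sketch below shows the difference, assuming the traces follow the Chrome Trace Event format: the array format is a bare JSON list of events, while the object format wraps that list under `traceEvents` and leaves room for extra top-level metadata. The metadata key names used here (`global_rank`, `clock_sync_timestamp_us`) and the function `to_object_format` are illustrative assumptions, not the names this PR uses.

```python
import json
import time
from typing import Any, Dict, List


def to_object_format(events: List[Dict[str, Any]], global_rank: int) -> Dict[str, Any]:
    """Wrap a list of trace events in the JSON object format with extra metadata."""
    return {
        "traceEvents": events,  # the array format would be this list on its own
        # Hypothetical metadata keys, chosen for illustration only.
        "global_rank": global_rank,
        "clock_sync_timestamp_us": time.time_ns() // 1000,  # wall clock, microseconds
    }


# Example: one "begin" event wrapped and serialized.
example_events = [{"name": "epoch", "ph": "B", "ts": 0, "pid": 0, "tid": 1}]
example_json = json.dumps(to_object_format(example_events, global_rank=0))
```

A wall-clock timestamp recorded alongside the events gives a merge step a reference point for aligning traces from different processes.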

Merge branch 'dev' into ravi/instrumentation_point

bc660fc

ravi-mosaicml merged commit 08c961b into mosaicml:ravi/instrumentation_point Dec 8, 2021

ravi-mosaicml deleted the ravi/instrumentation_point branch December 8, 2021 22:52
