
Run Directory Uploader #101

Merged: 58 commits merged into dev from ravi/libcloud on Dec 3, 2021
Conversation

@ravi-mosaicml (Contributor) commented Nov 22, 2021

Run Directory Uploader

Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Did not use S3, as Azure Blob Storage is not S3-compatible.

Closes #98. Depends on #85 and (for tests) #92.
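For readers unfamiliar with the LibCloud plugin, a minimal sketch of what a single-file upload looks like with Apache Libcloud; the provider, credentials, container, and paths below are hypothetical placeholders, not values from this PR:

```python
# Minimal Apache Libcloud upload sketch; all names below are hypothetical.
from libcloud.storage.providers import get_driver
from libcloud.storage.types import Provider

driver_cls = get_driver(Provider.GOOGLE_STORAGE)  # or Provider.AZURE_BLOBS, Provider.S3, ...
driver = driver_cls("my-key", "my-secret")        # hypothetical credentials
container = driver.get_container(container_name="my-run-directory-bucket")

# Upload one file from the local run directory to the object store.
driver.upload_object(
    file_path="runs/my_run/checkpoints/ep1.pt",   # hypothetical local path
    container=container,
    object_name="my_run/checkpoints/ep1.pt",
)
```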
@ravi-mosaicml requested review from moinnadeem, Averylamp and a team November 22, 2021 23:38
@ravi-mosaicml changed the base branch from dev to ravi/run_event November 23, 2021 00:51
@abhi-mosaic (Contributor) commented:

QQ: is the intention here that we should always be writing our checkpoints to our personal object store? And not using something like WandB artifacts (https://wandb.ai/wandb/common-ml-errors/reports/How-to-Save-and-Load-Models-in-PyTorch--VmlldzozMjg0MTE#save-as-artifacts)

@ravi-mosaicml mentioned this pull request Nov 23, 2021
@ravi-mosaicml (Contributor, Author) replied:

> QQ: is the intention here that we should always be writing our checkpoints to our personal object store? And not using something like WandB artifacts (https://wandb.ai/wandb/common-ml-errors/reports/How-to-Save-and-Load-Models-in-PyTorch--VmlldzozMjg0MTE#save-as-artifacts)

Yes, ideally we would use wandb internally, but we have been running into issues with 4xx errors returned by their client / API (e.g. #90). Also, external customers may prefer to use a blob store of their own.

* For the INIT event, run the callbacks first to initialize the loggers.
* For other events, run the algorithms first, so the callbacks see the state after the algorithms modify it.
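A minimal sketch of this ordering (the engine internals here are illustrative assumptions, not the actual composer implementation):

```python
# Illustrative engine event dispatch; not the actual composer Engine.
def run_event(event, state, algorithms, callbacks):
    if event == "INIT":
        # Callbacks (e.g. loggers) must be initialized before algorithms
        # run, so algorithms can log during INIT.
        for cb in callbacks:
            cb.run_event(event, state)
        for alg in algorithms:
            alg.apply(event, state)
    else:
        # Algorithms mutate the state first, so callbacks observe the
        # post-modification state.
        for alg in algorithms:
            alg.apply(event, state)
        for cb in callbacks:
            cb.run_event(event, state)
```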
Base automatically changed from ravi/remove_atexit to dev December 1, 2021 03:02
@ajaysaini725 (Contributor) left a comment:


Just a few notes but otherwise looks good!

@hanlint added the release label Dec 2, 2021
@Averylamp (Contributor) left a comment:


LGTM, just make sure it passes tests before merging. I'm also concerned about the performance hit, so it would be nice to know what it is.

@ravi-mosaicml merged commit 71347a6 into dev Dec 3, 2021
@ravi-mosaicml deleted the ravi/libcloud branch December 3, 2021 00:03
hanlint pushed a commit that referenced this pull request Jan 19, 2022
* Added `run_event` to callback

Closes #11

This PR helps clean up some of the tests and rank-zero callbacks, and will be used by future profiling work.

* Removed callback helper methods

* Fixed tests

* Formatting

* Addressed PR feedback

* Fixed tests

* Formatting

* Fixed _run_event

* Formatting

* Removed ip

* Instrumentation WIP

* Stash

* Create dataloader on trainer __init__()

#65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__.

This change will help with dataloader profiling, as the dataloader is now immediately bound to the state.

* Stash

* Added JSON trace handler

* Formatting

* Fixed trace generation

* Prettified memory

* Fixed setup.py

* Changed setup.py

* testing

* Removed prepare

* Run Directory Uploader

Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin.

Closes #98. Depends on #85 and (for tests) #92.

* Supporting both styles for callbacks
Removed deferred logging since rank is now known at the init event

* Minimizing Diff

* Fixed tests

* Added fasteners

* Fixed tests

* Formatting

* Lazy population of kwargs

* 1. Added object_name_prefix
2. Tested on Google Cloud Storage
3. Added exponential backoff and retrying for transient errors
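A minimal sketch of the backoff-and-retry pattern this refers to (the helper name, error handling, and limits are assumptions for illustration):

```python
import time

# Illustrative retry helper; max_attempts and base_delay are arbitrary.
def upload_with_retries(upload_fn, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return upload_fn()
        except Exception:  # in practice, catch only transient/network errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...
```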

* Addressed PR feedback

* Remove the composer.trainer.ddp class

Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp.

This change allows callbacks, algorithms, etc. to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality.

Also removed DDP from the state, as that is available globally.
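A minimal sketch of module-level DDP helpers along these lines (the function names are illustrative, not the actual composer.utils.ddp API):

```python
import torch.distributed as dist

# Illustrative module-level helpers; not the actual composer.utils.ddp API.
def get_global_rank() -> int:
    return dist.get_rank() if dist.is_initialized() else 0

def barrier() -> None:
    if dist.is_initialized():
        dist.barrier()
```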

* Added in DDP barrier

* Fixed tests

* Update composer/utils/ddp.py

* Update composer/utils/ddp.py

* Switched tqdm to using callback hooks
Added test case for TQDM

* Fixed pyright

* Fixed DDP barriers

* Increased timeout for run directory uploader

* Switched callback format for run directory uploader

* Replaced `atexit` with cleanup methods

When running the trainer multiple times, such as in interactive environments, `atexit` does not fire. Instead, replaced it with `.close()` and `.post_close()` hooks on callbacks.

`.close()` can be used to write and flush files. `.post_close()` can be used to back up the run directory and capture any changes made during `.close()`.
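A minimal sketch of that cleanup ordering (the surrounding trainer code is an assumption for illustration; only the hook names come from this PR):

```python
# Illustrative trainer-side cleanup; only .close() / .post_close() are real.
def close_callbacks(callbacks):
    for cb in callbacks:
        cb.close()        # e.g. flush and close log files
    for cb in callbacks:
        cb.post_close()   # e.g. upload the run directory, including any
                          # files written during the .close() hooks above
```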

* Uncommented code

* Running callbacks before algorithms for the INIT event in the engine

* For the INIT event, run the callbacks first to initialize the loggers.
* For other events, run the algorithms first, so the callbacks see the state after the algorithms modify it.

* Fixed tests

* Addressed PR feedback

* Added in the scheduler

* Added instant events

* Fixes

* Fixed profile scheduling

* Added decorator option

* Formatting

* Added documentation for the profiler

* 1. Added test cases
2. Fixed trace files to be proper json on successful training runs

* Profiler entry point

* Ravi/instrumentation point (#140)

1. Using `os.getpid()` for process IDs to enable synchronization with the pytorch profiler
2. Switched to using object format instead of array format for the traces
3. Added in extra metadata such as global rank and timestamps for clock syncing
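A minimal sketch of the object-format trace described here (Chrome trace-event style; the metadata keys are illustrative assumptions):

```python
import json
import os

# Object format: a top-level dict with a "traceEvents" list, which leaves
# room for extra top-level metadata; the array format is the bare event list.
trace = {
    "traceEvents": [
        {"name": "epoch", "ph": "B", "pid": os.getpid(), "tid": os.getpid(), "ts": 0},
    ],
    # Illustrative metadata keys for clock syncing across ranks.
    "global_rank": 0,
    "clock_sync_timestamp_us": 0,
}

with open("rank0.trace.json", "w") as f:
    json.dump(trace, f)
```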

* Writing metadata to a separate file

* Fixed tests

* Removed the perf counter

* Recording IO stats

* Log global rank in each torch profiler file

* Merging process traces (#144)

* Refactor the system profiler and dataloader profiler into callbacks
Configuring the pytorch profiler based on the mosaic profiler hparams

* 1. Updated the merge script to merge pytorch trace files
2. Renamed the `MosaicProfiler` to `Profiler`

* Increased timeout

* Formatting

* Fixed the `run_mosaic_profiler`

* Added detailed option

* Added sort index

* Setting `pid` to global rank and `tid` to `os.getpid()`

The pytorch profiler uses `os.getpid()` for the thread ID. Updated the training loop profiler to be consistent so the events will interleave.

Updated the merge script to replace the PID with the global rank. This ensures that GPU streams will show up under the correct rank, since pytorch by default uses the local GPU rank as the PID. This change also ensures that traces will merge properly across nodes where PIDs could conflict.
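A minimal sketch of that PID-rewriting step in a merge script (the file layout and function name are assumptions for illustration):

```python
import json

# Illustrative PID rewrite; assumes object-format traces as sketched above.
def rewrite_pids(trace_path: str, global_rank: int, out_path: str) -> None:
    with open(trace_path) as f:
        trace = json.load(f)
    for event in trace.get("traceEvents", []):
        event["pid"] = global_rank  # group events by rank, not by OS PID
    with open(out_path, "w") as f:
        json.dump(trace, f)
```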

* Simplifying diff

* Put the backwards thread second

* Thread sorting in trace

* Fix

* Fixes

* Fixed tests

* Fixed the profiler

* Fixes

Co-authored-by: Jamie Bloxham <jamie.a.bloxham@gmail.com>
Co-authored-by: Bandish Shah <bandish@mosaicml.com>
Co-authored-by: anisehsani <92882465+anisehsani@users.noreply.github.com>
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Remove the composer.trainer.ddp class (mosaicml#105)

Before mosaicml#65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp.

This change allows callbacks, algorithms, etc... to use DDP (such as barriers and reductions) as needed. mosaicml#97 and mosaicml#101 depend on this functionality.

Also removed DDP from the state, as that is available globally.
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Run Directory Uploader

Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Did not use S3, as Azure Blob Storage is not S3-compatible.

Closes mosaicml#98.
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Development

Successfully merging this pull request may close these issues:

Blob Store Uploading for the Run Directory
6 participants