
Multiple calls to Trainer.fit() #948

Merged: 123 commits into mosaicml:dev on May 10, 2022

Conversation

@ravi-mosaicml (Contributor) commented Apr 22, 2022

This PR allows training arguments to be specified when calling `Trainer.fit()` instead of when constructing the trainer.

  1. Refactored `Trainer.__init__` to move out shared helper code, so it can be re-used when parameters are passed to `fit()`.
  2. Modified the signature of `fit()` to take training parameters. Parameters not specified on `Trainer.__init__` must instead be specified on `fit()`.
  3. Rearranged the arguments of the `Trainer` and `TrainerHparams` classes to group them by functionality, and re-ordered the docstrings to match the argument order.

Closes #138.
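As an illustrative sketch of the new call pattern (the parameter names `train_dataloader` and `duration`, and the `model` and `train_dataloader` objects, are assumptions for illustration, not the exhaustive signature):

```python
from composer import Trainer

# Before: all training arguments had to be passed to the constructor.
# After this PR: the trainer can be constructed with only the model...
trainer = Trainer(model=model)  # `model` is an assumed ComposerModel instance

# ...and the training parameters supplied per-call on fit().
trainer.fit(
    train_dataloader=train_dataloader,  # assumed torch DataLoader
    duration="2ep",                     # train for two epochs
)
```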

…ps_per_epoch`.

1. Made `state.dataloader` optional, since it will not be provided on `__init__` as part of mosaicml#40.
2. Bound the active dataloader to the state on `Event.FIT_START`, switching to each evaluation dataloader before `Event.EVAL_START` and restoring the previous (training) dataloader after `Event.EVAL_END`.
3. Moved `Event.EVAL_START` and `Event.EVAL_END` to run for each evaluator, instead of once for all evaluators. With mosaicml#40, `eval()` will take in a dataloader, which will require `Event.EVAL_START` and `Event.EVAL_END` to fire for each call. This change also permits algorithms to modify each evaluation dataloader.
4. Moved the scaling of the LR schedulers to `Trainer.fit()`, before `Event.FIT_START` fires. Schedulers will be passed in on `Trainer.fit()` as part of mosaicml#40.
5. Removed `steps_per_epoch` from the state. Instead, algorithms and callbacks can read `len(state.dataloader)` directly. While this change makes schedulers inaccurate when using `train_subset_num_batches`, that flag should only be used for performance measurements, so it is not necessary for SSR to behave correctly in performance runs. Added a warning for the `train_subset_num_batches` field.

Implements the first part of mosaicml#40.
Closes mosaicml#363.
It can be useful for algorithms and callbacks to know which dataloader is active, so the `dataloader_label` was added to the state. Removed `evaluators` from the state, as nothing uses it anymore.
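To illustrate, a minimal sketch of a hypothetical callback that reads these attributes (the method and attribute names follow the description above; treat the details as assumptions):

```python
from composer import Callback, Logger, State

class DataloaderInspector(Callback):
    """Hypothetical callback illustrating the new state attributes."""

    def fit_start(self, state: State, logger: Logger) -> None:
        # By Event.FIT_START, the active (training) dataloader is bound to the state.
        print(f"active dataloader: {state.dataloader_label}")
        # steps_per_epoch was removed from state; read the length directly instead.
        print(f"batches per epoch: {len(state.dataloader)}")

    def eval_start(self, state: State, logger: Logger) -> None:
        # Event.EVAL_START now fires once per evaluator, with its dataloader active.
        print(f"evaluating on: {state.dataloader_label}")
```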
…()`.

Where appropriate, it is preferable to keep variables on the state object rather than as trainer members. Previously, `state.schedulers` was empty after `__init__()` but before `fit()`; now, `state.schedulers` contains the compiled Composer schedulers or the original PyTorch schedulers.

Restored optimizers on `Event.INIT`; rewriting algorithms to not depend on optimizers at init would be a bigger undertaking.
* Removed `precision_context` from state
* Switched `train_subset_num_batches` and `eval_subset_num_batches` to use `-1` as the default value instead of `None`.
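A sketch of the new scheduler behavior (the scheduler class and constructor arguments shown are assumptions for illustration; `model` and `optimizer` are assumed objects):

```python
from composer import Trainer
from composer.optim import CosineAnnealingScheduler

trainer = Trainer(
    model=model,                            # assumed ComposerModel
    optimizers=optimizer,                   # assumed torch optimizer
    schedulers=CosineAnnealingScheduler(),
    max_duration="10ep",
)

# Previously state.schedulers was empty until fit(); now the compiled
# scheduler(s) are available on the state right after construction.
assert len(trainer.state.schedulers) == 1
```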
@ravi-mosaicml ravi-mosaicml requested a review from hanlint May 4, 2022 18:46
@hanlint (Contributor) left a comment

LGTM! As a big PR, might be good to also have @eracah or @A-Jacobson take a quick pass on the significant portions (e.g. trainer.py).

@eracah (Contributor) commented May 6, 2022

> LGTM! As a big PR, might be good to also have @eracah or @A-Jacobson take a quick pass on the significant portions (e.g. trainer.py).

Ok I'll take a look today.

@eracah (Contributor) commented May 6, 2022

Sorry, this PR is a bit more work than I realized to really look at thoroughly (especially for someone who is still learning the innards of Trainer). I probably won't finish giving it a look through until sometime Monday. Sorry about that. No worries if you want to merge it in before then.

@ravi-mosaicml (Contributor, Author) replied:

> Sorry, this PR is a bit more work than I realized to really look at thoroughly (especially for someone who is still learning the innards of Trainer). I probably won't finish giving it a look through until sometime Monday. Sorry about that. No worries if you want to merge it in before then.

No rush on merging it in; can wait till Monday!

ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this pull request May 9, 2022
- Added `eval_timestamp` and `predict_timestamp` as attributes on the State for tracking evaluation and prediction progress. This is useful for callbacks and loggers that need to know where we are in the current evaluation or prediction dataloader (for example, logging metrics where the X axis is the evaluation batch number and the Y axis is some per-batch metric, such as accuracy); a usage sketch follows the notes below.
- Added the new attributes to the state, instead of hot-swapping `state.timestamp`, since it is still useful to know the training batch number during evaluation (e.g., to track how evaluation improves as training progresses).
- Added tests to ensure that the timestamps are properly set.

TODO:
- [ ] Merge mosaicml#948
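The usage sketch referenced above (a hypothetical callback; the attribute access pattern is an assumption based on the description, and the logging is just a `print`):

```python
from composer import Callback, Logger, State

class EvalProgressLogger(Callback):
    """Hypothetical callback tracking per-batch evaluation progress."""

    def eval_batch_end(self, state: State, logger: Logger) -> None:
        # eval_timestamp tracks where we are in the current eval dataloader...
        eval_batch = state.eval_timestamp.batch.value
        # ...while state.timestamp still reports training progress, so eval
        # metrics can be plotted against the training batch at which eval ran.
        train_batch = state.timestamp.batch.value
        print(f"eval batch {eval_batch} (train batch {train_batch})")
```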
@eracah (Contributor) left a comment

Looks pretty good, but I wasn't really able to look at everything closely since this PR is so massive. Maybe would have been better to split it up into some smaller PR's, especially the refactoring into functions parts. I mostly looked at trainer.py. Might be worth getting one more set of eyes on this one because I was a bit overwhelmed with this one.

@ravi-mosaicml (Contributor, Author) replied:

> Looks pretty good, but I wasn't really able to look at everything closely since this PR is so massive. Maybe would have been better to split it up into some smaller PR's, especially the refactoring into functions parts. I mostly looked at trainer.py. Might be worth getting one more set of eyes on this one because I was a bit overwhelmed with this one.

Thanks for taking a look -- I know it's quite large. Since @hanlint already looked and other PRs are starting to stack up on this one (#966, #1020, additional logging refactoring), going to merge this in. Will fix any issues as they come up during regression testing.

@ravi-mosaicml ravi-mosaicml enabled auto-merge (squash) May 9, 2022 23:52
@ravi-mosaicml ravi-mosaicml merged commit b1e89b4 into mosaicml:dev May 10, 2022
@ravi-mosaicml ravi-mosaicml deleted the trainer_fit_eval_signature branch May 10, 2022 00:07
ravi-mosaicml added a commit that referenced this pull request May 11, 2022
- Added an `ExperimentHparams` class. This class describes how to run a training job that may have multiple calls to `Trainer.fit` and/or `Trainer.eval`. Specifically, `ExperimentHparams.initialize_object()` returns a `(Trainer, List[FitKwargs], List[EvalKwargs])` tuple, which the user's entrypoint can then consume.
  This class does not automatically train the model, nor does it include an entrypoint.
- Added typing definitions for `FitKwargs` and `EvalKwargs`, along with test cases to ensure they stay in sync with the Trainer signature.
- Fixed a bug introduced in #948, which removed the setting of `State.train_dataloader`; added back the lines that correctly set the train dataloader.
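A rough sketch of an entrypoint consuming the returned tuple (the import path, `create()` loader call, and YAML filename are hypothetical):

```python
from composer.trainer import ExperimentHparams  # assumed import path

# Hypothetical entrypoint: initialize_object() returns the trainer plus the
# kwargs for each planned fit/eval call; the entrypoint decides when to run them.
hparams = ExperimentHparams.create("experiment.yaml")  # hypothetical loader call
trainer, fit_kwargs_list, eval_kwargs_list = hparams.initialize_object()

for fit_kwargs in fit_kwargs_list:
    trainer.fit(**fit_kwargs)    # each entry is a FitKwargs typed dict

for eval_kwargs in eval_kwargs_list:
    trainer.eval(**eval_kwargs)  # each entry is an EvalKwargs typed dict
```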
ravi-mosaicml added a commit that referenced this pull request Jun 2, 2022
…rs, and utils. (#1089)

This PR enables the `pydocstyle` pre-commit hook for most of the codebase. Docstrings were brought up to compliance so they would pass the pydocstyle check.

* The main changes involved ensuring summary docstring lines were indeed just one line, that they ended with periods, and that there were no blank lines between the end of the docstring and the start of the function.
* Added docstrings for missing arguments that were identified.
* (Coming along for the ride) Removed calls to `trainer.engine.close`, as `close()` is now invoked automatically in `__del__` as part of #948.
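As an illustration of the docstring rules being enforced (a toy function, not code from the repo): the summary sits on one line and ends with a period, arguments are documented, and no blank line separates the docstring from the function body.

```python
def scale_lr(base_lr: float, ssr: float) -> float:
    """Scale a base learning rate by the scale schedule ratio.

    Args:
        base_lr (float): The unscaled learning rate.
        ssr (float): The scale schedule ratio.

    Returns:
        float: The scaled learning rate.
    """
    return base_lr * ssr
```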