
Multiple calls to Trainer.fit() #948

Merged: 123 commits into mosaicml:dev on May 10, 2022

Conversation

@ravi-mosaicml (Contributor) commented Apr 22, 2022

This PR allows training arguments to be specified when calling `Trainer.fit()` instead of when constructing the trainer.

  1. Refactored `Trainer.__init__` to move out shared helper code, so it can be re-used when parameters are passed to `fit()`.
  2. Modified the signature of `fit()` to take training parameters. Parameters not specified on `Trainer.__init__` must instead be specified on `fit()`.
  3. Rearranged the arguments of the `Trainer` and `TrainerHparams` classes to group them by functionality, and re-ordered the docstrings to match the argument order.

Closes #138.
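As an illustrative sketch of the new call pattern (the parameter names `train_dataloader` and `duration`, and the `model` and `train_dataloader` objects, are assumptions for illustration, not the exhaustive signature):

```python
from composer import Trainer

# Before: all training arguments had to be passed to the constructor.
# After this PR: the trainer can be constructed with only the model...
trainer = Trainer(model=model)  # `model` is an assumed ComposerModel instance

# ...and the training parameters supplied per-call on fit().
trainer.fit(
    train_dataloader=train_dataloader,  # assumed torch DataLoader
    duration="2ep",                     # train for two epochs
)
```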

…ps_per_epoch`.

1. Made `state.dataloader` optional, since it will not be provided on `__init__` as part of mosaicml#40.
2. Bound the active dataloader to the state on `Event.FIT_START`, switching to each evaluation dataloader before `Event.EVAL_START` and restoring the previous (training) dataloader after `Event.EVAL_END`.
3. Moved `Event.EVAL_START` and `Event.EVAL_END` to run for each evaluator, instead of once for all evaluators. With mosaicml#40, `eval()` will take in a dataloader, which will require `Event.EVAL_START` and `Event.EVAL_END` to fire for each call. This change also permits algorithms to modify each evaluation dataloader.
4. Moved the scaling of the LR schedulers to `Trainer.fit()`, before `Event.FIT_START` fires. Schedulers will be passed in on `Trainer.fit()` as part of mosaicml#40.
5. Removed `steps_per_epoch` from the state. Instead, algorithms and callbacks can read `len(state.dataloader)` directly. While this change makes schedulers inaccurate when using `train_subset_num_batches`, that flag should only be used for performance measurements, so it is not necessary for SSR to behave correctly in performance runs. Added a warning for the `train_subset_num_batches` field.

Implements the first part of mosaicml#40.
Closes mosaicml#363.
It can be useful for algorithms and callbacks to know which dataloader is active, so the `dataloader_label` was added to the state. Removed `evaluators` from the state, as nothing uses it anymore.
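To illustrate, a minimal sketch of a hypothetical callback that reads these attributes (the method and attribute names follow the description above; treat the details as assumptions):

```python
from composer import Callback, Logger, State

class DataloaderInspector(Callback):
    """Hypothetical callback illustrating the new state attributes."""

    def fit_start(self, state: State, logger: Logger) -> None:
        # By Event.FIT_START, the active (training) dataloader is bound to the state.
        print(f"active dataloader: {state.dataloader_label}")
        # steps_per_epoch was removed from state; read the length directly instead.
        print(f"batches per epoch: {len(state.dataloader)}")

    def eval_start(self, state: State, logger: Logger) -> None:
        # Event.EVAL_START now fires once per evaluator, with its dataloader active.
        print(f"evaluating on: {state.dataloader_label}")
```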
…()`.

Where appropriate, it is preferable to keep variables on the state object rather than as trainer members. Previously, `state.schedulers` was empty after `__init__()` but before `fit()`; now, `state.schedulers` contains the compiled Composer schedulers or the original PyTorch schedulers.

Restored optimizers on `Event.INIT`; rewriting algorithms to not depend on optimizers at init would be a bigger undertaking.
* Removed `precision_context` from state
* Switched `train_subset_num_batches` and `eval_subset_num_batches` to use `-1` as the default value instead of `None`.
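A sketch of the new scheduler behavior (the scheduler class and constructor arguments shown are assumptions for illustration; `model` and `optimizer` are assumed objects):

```python
from composer import Trainer
from composer.optim import CosineAnnealingScheduler

trainer = Trainer(
    model=model,                            # assumed ComposerModel
    optimizers=optimizer,                   # assumed torch optimizer
    schedulers=CosineAnnealingScheduler(),
    max_duration="10ep",
)

# Previously state.schedulers was empty until fit(); now the compiled
# scheduler(s) are available on the state right after construction.
assert len(trainer.state.schedulers) == 1
```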
@ravi-mosaicml ravi-mosaicml requested a review from hanlint May 4, 2022 18:46
@hanlint (Contributor) left a comment

LGTM! As a big PR, might be good to also have @eracah or @A-Jacobson take a quick pass on the significant portions (e.g. trainer.py).

@eracah (Contributor) commented May 6, 2022

> LGTM! As a big PR, might be good to also have @eracah or @A-Jacobson take a quick pass on the significant portions (e.g. trainer.py).

Ok I'll take a look today.

@eracah (Contributor) commented May 6, 2022

Sorry, this PR is a bit more work than I realized to really look at thoroughly (especially for someone who is still learning the innards of Trainer). I probably won't finish giving it a look through until sometime Monday. Sorry about that. No worries if you want to merge it in before then.

@ravi-mosaicml (Contributor, Author) replied:

> Sorry, this PR is a bit more work than I realized to really look at thoroughly (especially for someone who is still learning the innards of Trainer). I probably won't finish giving it a look through until sometime Monday. Sorry about that. No worries if you want to merge it in before then.

No rush on merging it in; can wait till Monday!

ravi-mosaicml added a commit to ravi-mosaicml/ravi-composer that referenced this pull request May 9, 2022
- Added `eval_timestamp` and `predict_timestamp` as attributes on the State for tracking evaluation and prediction progress. This is useful for callbacks and loggers that need to know where we are in the current evaluation or prediction dataloader (for example, logging metrics where the X axis is the evaluation batch number and the Y axis is some per-batch metric, such as accuracy); a usage sketch follows the notes below.
- Added the new attributes to the state, instead of hot-swapping `state.timestamp`, since it is still useful to know the training batch number during evaluation (e.g., to track how evaluation improves as training progresses).
- Added tests to ensure that the timestamps are properly set.

TODO:
- [ ] Merge mosaicml#948
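The usage sketch referenced above (a hypothetical callback; the attribute access pattern is an assumption based on the description, and the logging is just a `print`):

```python
from composer import Callback, Logger, State

class EvalProgressLogger(Callback):
    """Hypothetical callback tracking per-batch evaluation progress."""

    def eval_batch_end(self, state: State, logger: Logger) -> None:
        # eval_timestamp tracks where we are in the current eval dataloader...
        eval_batch = state.eval_timestamp.batch.value
        # ...while state.timestamp still reports training progress, so eval
        # metrics can be plotted against the training batch at which eval ran.
        train_batch = state.timestamp.batch.value
        print(f"eval batch {eval_batch} (train batch {train_batch})")
```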
@eracah (Contributor) left a comment

Looks pretty good, but I wasn't really able to look at everything closely since this PR is so massive. Maybe would have been better to split it up into some smaller PR's, especially the refactoring into functions parts. I mostly looked at trainer.py. Might be worth getting one more set of eyes on this one because I was a bit overwhelmed with this one.

@ravi-mosaicml (Contributor, Author) replied:

> Looks pretty good, but I wasn't really able to look at everything closely since this PR is so massive. Maybe would have been better to split it up into some smaller PR's, especially the refactoring into functions parts. I mostly looked at trainer.py. Might be worth getting one more set of eyes on this one because I was a bit overwhelmed with this one.

Thanks for taking a look -- I know it's quite large. Since @hanlint already looked and other PRs are starting to stack up on this one (#966, #1020, additional logging refactoring), going to merge this in. Will fix any issues as they come up during regression testing.

@ravi-mosaicml ravi-mosaicml enabled auto-merge (squash) May 9, 2022 23:52
@ravi-mosaicml ravi-mosaicml merged commit b1e89b4 into mosaicml:dev May 10, 2022
@ravi-mosaicml ravi-mosaicml deleted the trainer_fit_eval_signature branch May 10, 2022 00:07
ravi-mosaicml added a commit that referenced this pull request May 11, 2022
- Added an `ExperimentHparams` class. This class describes how to run a training job that may have multiple calls to `Trainer.fit` and/or `Trainer.eval`. Specifically, `ExperimentHparams.initialize_object()` returns a `(Trainer, List[FitKwargs], List[EvalKwargs])` tuple, which the user's entrypoint can then consume.
  This class does not automatically train the model, nor does it include an entrypoint.
- Added typing definitions for `FitKwargs` and `EvalKwargs`, along with test cases to ensure they stay in sync with the Trainer signature.
- Fixed a bug introduced in #948, which removed the setting of `State.train_dataloader`; added back the lines that correctly set the train dataloader.
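A rough sketch of an entrypoint consuming the returned tuple (the import path, `create()` loader call, and YAML filename are hypothetical):

```python
from composer.trainer import ExperimentHparams  # assumed import path

# Hypothetical entrypoint: initialize_object() returns the trainer plus the
# kwargs for each planned fit/eval call; the entrypoint decides when to run them.
hparams = ExperimentHparams.create("experiment.yaml")  # hypothetical loader call
trainer, fit_kwargs_list, eval_kwargs_list = hparams.initialize_object()

for fit_kwargs in fit_kwargs_list:
    trainer.fit(**fit_kwargs)    # each entry is a FitKwargs typed dict

for eval_kwargs in eval_kwargs_list:
    trainer.eval(**eval_kwargs)  # each entry is an EvalKwargs typed dict
```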
ravi-mosaicml added a commit that referenced this pull request Jun 2, 2022
…rs, and utils. (#1089)

This PR enables the `pydocstyle` pre-commit hook for most of the codebase. Docstrings were brought up to compliance so they would pass the pydocstyle check.

* The main changes involved ensuring summary docstring lines were indeed just one line, that they ended with periods, and that there were no blank lines between the end of the docstring and the start of the function.
* Added docstrings for missing arguments that were identified.
* (Coming along for the ride) Removed calls to `trainer.engine.close`, as `close()` is now invoked automatically in `__del__` as part of #948.
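As an illustration of the docstring rules being enforced (a toy function, not code from the repo): the summary sits on one line and ends with a period, arguments are documented, and no blank line separates the docstring from the function body.

```python
def scale_lr(base_lr: float, ssr: float) -> float:
    """Scale a base learning rate by the scale schedule ratio.

    Args:
        base_lr (float): The unscaled learning rate.
        ssr (float): The scale schedule ratio.

    Returns:
        float: The scaled learning rate.
    """
    return base_lr * ssr
```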