Multiple calls to Trainer.fit() #948
Conversation
…ps_per_epoch`.
1. Made `state.dataloader` optional, since it will not be provided on `__init__` as part of mosaicml#40.
2. Bound the active dataloader to the state on `Event.FIT_START`, switched to each evaluation dataloader before `Event.EVAL_START`, and restored the previous (training) dataloader after `Event.EVAL_END`.
3. Moved `Event.EVAL_START` and `Event.EVAL_END` to run for each evaluator, instead of once for all evaluators. With mosaicml#40, `eval()` will take in a dataloader, which will then require `Event.EVAL_START` and `Event.EVAL_END` to fire per evaluator. This change also permits algorithms to modify (each) evaluation dataloader.
4. Moved scaling of the LR schedulers to `Trainer.fit()`, before `Event.FIT_START` fires. Schedulers will be passed in on `Trainer.fit()` as part of mosaicml#40.
5. Removed `steps_per_epoch` from the state. Instead, algorithms and callbacks can read `len(state.dataloader)` directly. While this change makes schedulers inaccurate when using `train_subset_num_batches`, that flag should only be used for performance measurements, so it is not necessary that SSR behaves correctly for performance runs. Added a warning for the `train_subset_num_batches` field.

Implements the first part of mosaicml#40. Closes mosaicml#363.
It can be useful for algorithms and callbacks to know which dataloader is active, so the `dataloader_label` was added to the state. Removed `evaluators` from the state, as nothing uses it anymore.
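For illustration only (this is not code from the PR), a callback could consume these state fields roughly as in the sketch below; the class name and printed labels are assumptions.

```python
# Hypothetical callback sketch: reads the active dataloader and its label
# from the state instead of the removed `steps_per_epoch` field.
from composer.core import Callback, State
from composer.loggers import Logger

class DataloaderInspector(Callback):
    def epoch_start(self, state: State, logger: Logger) -> None:
        # len(state.dataloader) replaces the old state.steps_per_epoch.
        num_batches = len(state.dataloader)
        # state.dataloader_label identifies which dataloader is active
        # (the training dataloader here, or an evaluator's during eval).
        print(f"dataloader '{state.dataloader_label}': {num_batches} batches")
```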
…er into ravi/optional_dataloader
…()`. It is preferable to keep variables on the state object rather than as trainer members, where appropriate. Before, `state.schedulers` was empty after `__init__()` but before `fit()`. Now, `state.schedulers` contains the compiled Composer schedulers or the original PyTorch schedulers (see the sketch below). Restored optimizers on `Event.INIT`; it would be a bigger issue to rewrite algorithms to not depend on optimizers at init.
* Removed `precision_context` from the state.
* Switched `train_subset_num_batches` and `eval_subset_num_batches` to use `-1` as the default value instead of `None`.
…er into ravi/optional_dataloader
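A minimal sketch of how these changes surface to users, assuming the constructor arguments shown here; the model and optimizer names are placeholders, not code from this PR.

```python
# Hedged sketch: after construction, state.schedulers should already hold
# the compiled Composer schedulers (or the original PyTorch schedulers).
from composer import Trainer
from composer.optim import CosineAnnealingScheduler

trainer = Trainer(
    model=my_composer_model,            # placeholder ComposerModel
    optimizers=my_optimizer,            # placeholder torch optimizer
    schedulers=CosineAnnealingScheduler(),
    max_duration="10ep",
    train_subset_num_batches=-1,        # -1 (new default) instead of None
)
assert len(trainer.state.schedulers) > 0  # populated before fit() is called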
LGTM! As a big PR, it might be good to also have @eracah or @A-Jacobson take a quick pass on the significant portions (e.g. `trainer.py`).

OK, I'll take a look today.
Sorry, this PR is a bit more work than I realized to really look at thoroughly (especially for someone who is still learning the innards of Trainer). I probably won't finish looking through it until sometime Monday. Sorry about that. No worries if you want to merge it in before then.

No rush on merging it in; it can wait till Monday!
- Added `eval_timestamp` and `predict_timestamp` as attributes on the State for tracking evaluation and prediction progress. This is useful for callbacks and loggers to track where we are in the current evaluation or prediction dataloader (for example, logging metrics where the X axis is the evaluation batch number and the Y axis is some metric, like accuracy, for that batch).
- Added the new attributes to the state, instead of hot-swapping `state.timestamp`, since it is still useful to know the training batch number during evaluation (e.g. to track how evaluation improves as training progresses).
- Added tests to ensure that the timestamp is properly set.

TODO:
- [ ] Merge mosaicml#948
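For illustration, a hedged sketch of a callback that uses the new attributes; the hook name follows Composer's event-based callback convention, and the class name is made up.

```python
# Hypothetical callback sketch: tracks evaluation progress via the new
# eval_timestamp while the training timestamp remains available.
from composer.core import Callback, State
from composer.loggers import Logger

class EvalProgress(Callback):
    def eval_batch_end(self, state: State, logger: Logger) -> None:
        eval_batch = state.eval_timestamp.batch.value   # X axis for metrics
        train_batch = state.timestamp.batch.value       # training progress
        print(f"eval batch {eval_batch} at train batch {train_batch}")
```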
Looks pretty good, but I wasn't really able to look at everything closely since this PR is so massive. It might have been better to split it up into some smaller PRs, especially the refactoring-into-functions parts. I mostly looked at `trainer.py`. Might be worth getting one more set of eyes on this one, because I was a bit overwhelmed by it.

Thanks for taking a look -- I know it's quite large. Since @hanlint already looked and other PRs are starting to stack up on this one (#966, #1020, additional logging refactoring), going to merge this in. Will fix any issues as they come up during regression testing.
- Added an `ExperimentHparams` class. This class describes how to run a training job that may have multiple calls to `Trainer.fit` and/or `Trainer.eval`. Specifically, `ExperimentHparams.initialize_object()` returns a `(Trainer, List[FitKwargs], List[EvalKwargs])` tuple, which the user's entrypoint can then consume. This class does not automatically train the model, nor does it include an entrypoint.
- Added typing definitions for `FitKwargs` and `EvalKwargs`, along with test cases to ensure they stay in sync with the Trainer signature.
- Fixed a bug introduced in #948, which removed the setting of `State.train_dataloader`. Added back the lines to correctly set the train dataloader.
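A hedged sketch of an entrypoint consuming `ExperimentHparams`, based only on the description above; the import path and loop structure are assumptions.

```python
# Hypothetical entrypoint: the hparams object builds the trainer and the
# per-call kwargs, and the entrypoint decides how to run them.
from composer.trainer import ExperimentHparams  # assumed import path

def main(experiment: ExperimentHparams) -> None:
    trainer, fit_kwargs_list, eval_kwargs_list = experiment.initialize_object()
    for fit_kwargs in fit_kwargs_list:   # multiple calls to Trainer.fit()
        trainer.fit(**fit_kwargs)
    for eval_kwargs in eval_kwargs_list:
        trainer.eval(**eval_kwargs)
```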
…rs, and utils. (#1089) This PR enables the `pydocstyle` pre-commit hook for most of the codebase. Docstrings were brought up to compliance so they would pass the pydocstyle check.
* The main changes involved ensuring summary docstring lines were indeed just one line, that they ended with periods, and that there were no blank lines between the end of the docstring and the start of the function.
* Added docstrings for missing arguments that were identified.
* (Coming along for the ride): Removed calls to `trainer.engine.close`, as `close()` is now invoked automatically in `__del__` as part of #948.
This PR allows arguments to be specified when calling `Trainer.fit()` instead of when constructing the trainer.
- Refactored `Trainer.__init__` to move out shared helper code, so it can be re-used when passing in parameters on `fit()`.
- Updated `fit()` to take training parameters. If not specified on `Trainer.__init__`, parameters must be specified on `fit()`.
- Reorganized the arguments in the `Trainer` and `TrainerHparams` classes to group by functionality. Re-ordered the docstrings to be in the same order as the arguments.

Closes #138.
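A minimal sketch of the resulting workflow, using placeholder model and dataloaders (not taken from the PR itself).

```python
# Hedged sketch: training arguments supplied on fit() rather than __init__,
# with fit() called more than once on the same Trainer.
from composer import Trainer

trainer = Trainer(model=my_composer_model)  # placeholder ComposerModel;
                                            # no dataloader or duration yet

# First call: training parameters are passed to fit() instead of __init__.
trainer.fit(
    train_dataloader=pretrain_dataloader,   # placeholder DataLoader
    duration="1ep",
)

# Second call: continue training on the same Trainer, e.g. with new data.
trainer.fit(
    train_dataloader=finetune_dataloader,   # placeholder DataLoader
    duration="1ep",
)
```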