[Eval-Only]: Optional timing and dataloader attributes on state; removed evaluators from the state. #832

Merged
ravi-mosaicml merged 29 commits into dev from ravi/optional_dataloader on Apr 22, 2022

Conversation

@ravi-mosaicml (Contributor) commented Mar 25, 2022

  1. Made `state.dataloader` and `state.max_duration` optional, since they may not be provided on `__init__` as part of the eval_only flag (#40).
  2. Bound `dataloader_len` to the state, so algorithms know how many batches to expect per epoch.
  3. Added `dataloader_label` to the state, so algorithms know which dataloader is currently running (see the callback sketch below this description).
  4. Bound the active dataloader to the state on `Event.FIT_START`, and switched the dataloader to each evaluation dataloader before `Event.EVAL_START`. The previous (training) dataloader is restored after `Event.EVAL_END`.
  5. Moved `Event.EVAL_START` and `Event.EVAL_END` to run for each evaluator, instead of once for all evaluators. With the eval_only flag (#40), `eval()` will take in a dataloader, which would then call `Event.EVAL_START` and `Event.EVAL_END` for each dataloader. This change also permits algorithms that wish to modify (each) evaluation dataloader.
  6. Moved scaling of the LR schedulers into `Trainer.fit()`, before `Event.FIT_START` fires. Schedulers will be passed in on `Trainer.fit()` as part of #40.
  7. Removed the evaluators from the state. Addresses part of Multiple Evaluator Improvements (#329). The user need not set `state.evaluators`; instead, the trainer keeps track of the evaluators.
  8. Removed `precision_context` from the state. Instead, added a helper static function to `precision.py`.

Implements the first part of #40.
Closes #363.
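To make items 1–3 concrete, here is a minimal callback sketch showing how the new state attributes could be read. The callback name and print statements are hypothetical, and import paths may vary by Composer version; only the `state` attribute names come from this PR.

```python
from composer.core import Callback, State
from composer.loggers import Logger


class DataloaderInspector(Callback):
    """Hypothetical callback that reads the dataloader attributes bound to the state."""

    def fit_start(self, state: State, logger: Logger) -> None:
        # state.dataloader and state.max_duration are now Optional, so guard for None.
        if state.dataloader is None or state.max_duration is None:
            return
        print(f"training dataloader bound; max_duration={state.max_duration}")

    def eval_start(self, state: State, logger: Logger) -> None:
        # During evaluation, state.dataloader is the active evaluation dataloader,
        # state.dataloader_label identifies it, and state.dataloader_len is the
        # number of batches to expect (None for unsized iterables).
        print(f"evaluating '{state.dataloader_label}' ({state.dataloader_len} batches)")
```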

…ps_per_epoch`.

1. Made `state.dataloader` optional, since it will not be provided on `__init__` as part of #40.
2. Bound the active dataloader to the state on `Event.FIT_START`, and switched the dataloader to each evaluation dataloader before `Event.EVAL_START`. The previous (training) dataloader is restored after `Event.EVAL_END`.
3. Moved `Event.EVAL_START` and `Event.EVAL_END` to run for each evaluator, instead of once for all evaluators (sketched below this commit message). With #40, `eval()` will take in a dataloader, which would then require `Event.EVAL_START` and `Event.EVAL_END`. This change also permits algorithms that wish to modify (each) evaluation dataloader.
4. Moved scaling of the LR schedulers into `Trainer.fit()`, before `Event.FIT_START` fires. Schedulers will be passed in on `Trainer.fit()` as part of #40.
5. Removed `steps_per_epoch` from the state. Instead, algorithms and callbacks can read `len(state.dataloader)` directly. While this change makes schedulers inaccurate when using `train_subset_num_batches`, that flag should only be used for performance measurements, so it is not necessary that SSR behaves correctly for performance runs. Added a warning for the `train_subset_num_batches` field.

Implements the first part of #40.
Closes #363.
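The per-evaluator event flow in item 3 can be pictured roughly as below. This is pseudocode only: the helper names (`set_dataloader`, `run_event`) and the loop structure are illustrative, not the trainer's actual implementation.

```python
def run_evaluators(state, evaluators, engine):
    """Illustrative per-evaluator eval loop: each evaluator gets its own dataloader
    bound to the state and its own EVAL_START / EVAL_END events."""
    previous_dataloader = state.dataloader
    previous_label = state.dataloader_label
    for evaluator in evaluators:
        # Switch the active dataloader to this evaluator's dataloader.
        state.set_dataloader(evaluator.dataloader, label=evaluator.label)
        engine.run_event("EVAL_START")  # fired once per evaluator, not once overall
        for batch in state.dataloader:
            ...  # forward pass and metric updates
        engine.run_event("EVAL_END")
    # Restore the previous (training) dataloader afterwards.
    state.set_dataloader(previous_dataloader, label=previous_label)
```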
@ravi-mosaicml ravi-mosaicml requested a review from jbloxham March 25, 2022 23:52
@ravi-mosaicml ravi-mosaicml changed the title [Eval-Only]: Made the state.dataloader optional; removed `state.ste… [Eval-Only]: Optional state.dataloader; removed state.steps_per_epoch. Mar 25, 2022
@hanlint (Contributor) commented Mar 26, 2022

I'm slightly concerned about item 5 here -- `subset_num_batches` is also sometimes used to ensure a model overfits on a small amount of data.

@ravi-mosaicml (Contributor, Author)

> I'm slightly concerned about item 5 here -- `subset_num_batches` is also sometimes used to ensure a model overfits on a small amount of data.

Ah ok, makes sense. I will refactor the schedulers to use `train_subset_num_batches` if it is not `None`, else `len(state.dataloader)`.
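A minimal sketch of that fallback (a hypothetical helper, not the trainer's actual scheduler-compilation code; a later commit in this PR switched the sentinel from `None` to `-1`):

```python
def steps_per_epoch(state, train_subset_num_batches=None):
    """Illustrative fallback: prefer the explicit subset size when given,
    otherwise fall back to the dataloader length, so scheduler scaling (SSR)
    stays correct when training on a subset of the data."""
    if train_subset_num_batches is not None:
        return train_subset_num_batches
    return len(state.dataloader)
```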

@ravi-mosaicml (Contributor, Author)

Upon further thought, I think it would make sense to leave `subset_num_batches` as part of the state, so an algorithm (e.g. the future benchmarker) can end an epoch early to do dynamic profiling.

@ravi-mosaicml ravi-mosaicml changed the title [Eval-Only]: Optional state.dataloader; removed state.steps_per_epoch. [Eval-Only]: Made state.dataloader optional Apr 12, 2022
It can be useful for algorithms and callbacks to know which dataloader is active, so the `dataloader_label` was added to the state. Removed `evaluators` from the state, as nothing uses them anymore.
@ravi-mosaicml ravi-mosaicml changed the title [Eval-Only]: Made state.dataloader optional [Eval-Only]: Set state.dataloader to the active dataloader; remove evaluators from State; run EVAL_START and EVAL_END for each evaluator Apr 12, 2022
@ravi-mosaicml ravi-mosaicml changed the title [Eval-Only]: Set state.dataloader to the active dataloader; remove evaluators from State; run EVAL_START and EVAL_END for each evaluator [Eval-Only]: Optional timing and dataloader attributes on state; removed evaluators from the state. Apr 13, 2022
@ajaysaini725 (Contributor) left a comment


Just a few comments; overall it looks good!

@ravi-mosaicml ravi-mosaicml force-pushed the ravi/optional_dataloader branch from 5b23840 to b5b6192 on April 14, 2022 19:55
…()`.

It is preferable to keep variables on the state object rather than as trainer members, where appropriate. Before, `state.schedulers` was empty after `__init__()` but before `fit()`; now, `state.schedulers` contains the compiled Composer schedulers or the original PyTorch schedulers.

Restored optimizers on `Event.INIT`; it would be a bigger undertaking to rewrite algorithms so they do not depend on optimizers at init.
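As a rough illustration of what this means for code that inspects the state at `Event.INIT` (hypothetical callback, and import paths may differ by Composer version):

```python
from composer.core import Callback, State
from composer.loggers import Logger


class InitInspector(Callback):
    """Hypothetical callback relying on schedulers/optimizers being on the state early."""

    def init(self, state: State, logger: Logger) -> None:
        # With this change, state.schedulers is populated right after __init__()
        # (compiled Composer schedulers or the original PyTorch schedulers), and
        # optimizers are restored on Event.INIT, so both can be inspected here.
        print(f"{len(state.schedulers)} scheduler(s), {len(state.optimizers)} optimizer(s)")
```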
@ajaysaini725 (Contributor) left a comment


Just a small note; otherwise LGTM 👍

* Removed `precision_context` from state (see the sketch below)
* Switched `train_subset_num_batches` and `eval_subset_num_batches` to use `-1` as the default value instead of `None`.
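For the `precision_context` removal (item 8 in the PR description), a standalone helper in `precision.py` could look roughly like the sketch below. The function name and behavior here are illustrative; Composer's actual helper may differ.

```python
import contextlib

import torch


@contextlib.contextmanager
def get_precision_context(precision: str):
    """Illustrative helper: yield an autocast context for AMP, otherwise a no-op,
    so callers no longer need to go through state.precision_context."""
    if precision == "amp":
        with torch.cuda.amp.autocast():
            yield
    else:
        yield
```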
@ravi-mosaicml ravi-mosaicml merged commit 0e4955e into dev Apr 22, 2022
@ravi-mosaicml ravi-mosaicml deleted the ravi/optional_dataloader branch April 22, 2022 16:04

Successfully merging this pull request may close these issues.

Replace state.train_dataloader and state.eval_dataloader with just state.dataloader