AutoYAHP Part 3: Refactor the models, datasets, and trainer_hparams #1072

Conversation


@ravi-mosaicml (Contributor) commented on May 20, 2022:

This PR cleans up the model hparams, dataset hparams, and trainer hparams for AutoYAHP. It does not depend on AutoYAHP itself (a future PR will remove the underlying hparam classes).

- Fixed invalid type annotations in the hparam classes (YAHP 1.1 has stricter type checking).
- Refactored the trainer hparams and the Trainer to use the same argument names. Specifically, this meant renaming `optimizer` to `optimizers` and `deepspeed` to `deepspeed_config` in the trainer hparams, and `load_strict` to `load_strict_model_weights` in the Trainer. Updated the YAMLs as appropriate.
- Removed the device and precision settings from the YAMLs, since the trainer now uses smart defaults (AMP + GPU if CUDA is available, otherwise FP32 + CPU); a sketch of this selection logic follows this list. This is necessary because AutoYAHP will parse the YAMLs directly into `DeviceGPU()` / `DeviceCPU()` instances, and `DeviceGPU()` throws an error if CUDA is not available (previously, YAHP parsed into `DeviceGPUHparams()`, which could then be replaced with `DeviceCPUHparams()`). While patching the YAML directly would also have worked, it seemed generally better to make the YAMLs more portable.
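
For illustration, here is a minimal sketch of the smart-default selection described above. It assumes PyTorch is installed; the helper name is hypothetical and is not part of Composer.

```python
# Minimal sketch (hypothetical helper, not Composer's implementation) of the
# smart defaults described above: GPU + AMP when CUDA is available,
# otherwise CPU + FP32.
import torch


def default_device_and_precision() -> tuple:
    """Return (device, precision) string shorthands based on CUDA availability."""
    if torch.cuda.is_available():
        return 'gpu', 'amp'
    return 'cpu', 'fp32'


device, precision = default_device_and_precision()
print(f'Using device={device}, precision={precision}')
```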

- Treated all Python warnings as errors in tests. Existing tests that were throwing warnings were either fixed (yay!) or marked with `@pytest.mark.filterwarnings`; an illustration of this pattern follows this list.
- In BlurPool, now throwing an error if both `replace_maxpools` and `replace_convs` are False, since that combination makes the algorithm a no-op.
- Made the default-optimizer warning in the trainer less verbose.
- Converted the BERT YAML to specify duration in terms of samples, fixing an issue where the warmup period, combined with the max duration, was a no-op.
- Moved `TestMetricSetter` to `tests/common` and renamed it `MetricSetterCallback`. Tests should not import from other test files (`tests/common` is OK).
- Removed `TestWandBLogger`, since the trainer tests do the same thing.
- Fixed progressive resizing.
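
Below is a hedged illustration of the warnings-as-errors pattern mentioned above; the test name and filter string are made up, and Composer's actual pytest configuration may differ.

```python
# Hedged illustration: pytest can promote all warnings to errors via the ini
# option `filterwarnings = error`, and individual tests can opt back out with
# the marker shown here. The test and the filter string are made up.
import warnings

import pytest


@pytest.mark.filterwarnings('ignore:.*deprecated.*:DeprecationWarning')
def test_legacy_code_path_still_works():
    # With `filterwarnings = error` configured and without the marker above,
    # this warning would fail the test.
    warnings.warn('this API is deprecated', DeprecationWarning)
```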
This PR also refactors the algorithms and tests as will be required by AutoYAHP; again, it does not depend on AutoYAHP itself (a future PR will remove the underlying hparam classes).

- Refactored the algorithm tests to not depend on hparams.
- Reformatted the factorize and selective backprop docstrings so they are parsed correctly by auto-yahp.
- Refactored `algorithm_settings.py` to not depend on hparams and to return a list of `pytest.param` objects for `pytest.mark.parametrize`. This makes it more reusable, since it now includes the markers required for each algorithm.
- Moved `TestTrainerAlgorithms` into `tests/algorithms/test_algorithms_train.py`, since it tests the individual algorithms, not the trainer, and thus should live in `tests/algorithms`.
- Added helper methods for scanning a module to discover subclass implementations, checking that the registry contains an entry, and testing that a class is constructable from YAML; a hedged sketch of the discovery helper follows this list.
- Silenced deepspeed deprecation warnings.
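
As a rough idea of what such a discovery helper could look like, here is a hedged sketch; the function names and the registry shape are illustrative and are not Composer's actual test utilities.

```python
# Hedged sketch of a subclass-discovery helper; names and registry shape are
# illustrative, not Composer's actual test utilities.
import importlib
import inspect
import pkgutil
from typing import Dict, List, Type


def find_subclasses(package_name: str, base_class: Type) -> List[Type]:
    """Walk a package and collect every concrete subclass of ``base_class``."""
    package = importlib.import_module(package_name)
    subclasses: List[Type] = []
    for module_info in pkgutil.walk_packages(package.__path__, prefix=package_name + '.'):
        module = importlib.import_module(module_info.name)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base_class) and obj is not base_class and not inspect.isabstract(obj):
                subclasses.append(obj)
    return subclasses


def assert_registry_is_complete(registry: Dict[str, Type], package_name: str, base_class: Type) -> None:
    """Fail if a discovered subclass has no corresponding registry entry."""
    registered = set(registry.values())
    missing = [cls for cls in find_subclasses(package_name, base_class) if cls not in registered]
    assert not missing, f'Classes missing from the registry: {missing}'
```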
@A-Jacobson (Contributor) commented:

> Thoughts on how to go forward?
>
> 1. Undo the change from #948 (Multiple calls to `Trainer.fit()`) and always use CPU + FP32, in both the functional API and the hparams path, and require our YAMLs to make these explicit. (Not a fan of this option, as it would require an extra parameter for the 99% case where a GPU is available and should be used.)
> 2. Keep the devices and precisions specified in the YAMLs, but leave the smart defaults in the trainer (basically, revert those changes in this PR). This would make the YAMLs less portable, but I can see that making sense if we want to view them more as "benchmarks" rather than "examples".
> 3. Auto-default the device, but always use FP32 precision unless otherwise specified (matching what PTL does).
> 4. Leave this PR as-is.

Definitely not 2; a few of us have already run afoul of this while trying to debug runs on CPU. I don't think we can go wrong with 1, 3, or 4 as long as it's clear what is happening. My personal order of preference is 3 -> 4 -> 1.

Yes, AMP gives the best performance, but it can cause some training runs to overflow. If I'm training a model and getting NaNs, it would be important for me to at least be aware that AMP is being used so I could try toggling it off.


@A-Jacobson (Contributor) left a comment:


Left a few comments asking why the hparams classes are organized the way they are, plus my opinion on the device defaults.

Overall LGTM, especially since many of these hparams classes will be leaving us soon (rest in peace; you will not be missed).

@ravi-mosaicml (Contributor, Author) replied:

> Definitely not 2; a few of us have already run afoul of this while trying to debug runs on CPU. I don't think we can go wrong with 1, 3, or 4 as long as it's clear what is happening. My personal order of preference is 3 -> 4 -> 1.

Discussed offline; we decided to go with option 4 (leave as-is), since option 3 would require the YAMLs to specify `precision: amp`, which would defeat the portability goal.

@ravi-mosaicml changed the title from "AutoYAHP Part 3: Refactor the models, datasets, and trainer_hparams for AutoYAHP" to "AutoYAHP Part 3: Refactor the models, datasets, and trainer_hparams" on May 26, 2022.
@ravi-mosaicml merged commit f1b1f22 into mosaicml:dev on May 26, 2022.
@ravi-mosaicml deleted the refactor_model_dataset_trainer_for_autoyahp branch on May 31, 2022.