[Checkpointing - PR4] Refactored the CheckpointLoader into a load_checkpoint function #693

Merged: 40 commits into dev on Mar 11, 2022

Conversation

@ravi-mosaicml (Contributor) commented on Mar 8, 2022

Since checkpoint loading happens in `Trainer.__init__` (except for the restoration of the RNG state), there is no need for a checkpoint loader class.

  1. The `CheckpointLoader` class is replaced with a `load_checkpoint` function, and all of its private members are converted into private, module-level helper functions (see the sketch after this list).
  2. The file-downloading portions of the checkpoint loader are refactored into their own standalone utility, `composer.utils.file_retriever`, with its own test cases.
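
A minimal sketch of the refactor pattern, with illustrative names and signatures rather than Composer's exact API: the stateless class becomes a plain function, and its former private methods become private, module-level helpers.

```python
from typing import Any, Dict

import torch


def load_checkpoint(path: str, state: Any) -> None:
    """Load the checkpoint at ``path`` into ``state`` (illustrative signature)."""
    local_path = _retrieve_file(path)
    state_dict = _deserialize(local_path)
    state.load_state_dict(state_dict)


def _retrieve_file(path: str) -> str:
    # The real PR delegates remote downloads to composer.utils.file_retriever;
    # this sketch assumes ``path`` is already a local file.
    return path


def _deserialize(local_path: str) -> Dict[str, Any]:
    # Assumes the checkpoint is a torch-serialized dict.
    return torch.load(local_path, map_location="cpu")
```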

This PR is the first in a series for cleaning up the checkpoint API. One of the prerequisites is storing the seed on the state.
Here, only the rank zero seed is stored on the state, since only the rank zero state is persisted in a checkpoint. The trainer uses a distributed reduction to share the seed across ranks, so the same seed is restored when resuming from a checkpoint, even if a seed was not originally specified (see the sketch below).
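
A sketch of sharing rank zero's seed across ranks, written against `torch.distributed` directly; `sync_seed_from_rank_zero` is a hypothetical helper, and Composer wraps collectives in its own dist utilities.

```python
import torch
import torch.distributed as dist


def sync_seed_from_rank_zero(seed: int) -> int:
    """Return rank zero's seed on every rank (hypothetical helper).

    On a single process (no distributed run), the seed is returned unchanged.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return seed
    # Broadcast rank zero's value so every rank restores the same seed.
    # (With the NCCL backend the tensor would need to live on the GPU.)
    seed_tensor = torch.tensor(seed, dtype=torch.int64)
    dist.broadcast(seed_tensor, src=0)
    return int(seed_tensor.item())
```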

This PR ignores the `seed` parameter passed into the trainer when resuming from a checkpoint. For the time being, if a new seed is desired, the `seed` attribute must be removed from the checkpoint state dict. #497 will introduce a cleaner API for this (edge) use case.
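
Until #497 lands, the workaround is to strip the seed from the checkpoint before resuming. A sketch, assuming the checkpoint is a torch-serialized dict; the nesting of the keys ("state" containing "seed") is an assumption for illustration.

```python
import torch


def strip_seed(checkpoint_path: str, output_path: str) -> None:
    """Drop the persisted seed so a new one can take effect on resume.

    The nesting ("state" -> "seed") is an assumed layout; inspect the
    checkpoint to find the actual structure.
    """
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    ckpt.get("state", {}).pop("seed", None)
    torch.save(ckpt, output_path)
```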
1. RNG serialization and deserialization are moved from `composer.trainer._checkpoint` to `composer.utils.reproducibility`. This change is needed to refactor the checkpoint saver into a public module.
2. Helper methods are moved from `composer.trainer._deepspeed` to `composer.core.state` to determine whether the model is a DeepSpeed model.
3. A similar helper is added for `is_model_ddp` (see the sketch after this list).
4. State dict serialization and deserialization are refactored to support serialization of `@property`s. Leading underscores are no longer stored in the checkpoint, as they are a state implementation detail and not something that should be persisted through the checkpoint.
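
A sketch of two of the patterns above, simplified rather than the exact Composer implementations: an `is_model_ddp`-style check, and storing private attributes under their public names by stripping leading underscores (`serialize_fields` is a hypothetical helper).

```python
from typing import Any, Dict, Sequence

import torch
from torch.nn.parallel import DistributedDataParallel


def is_model_ddp(model: torch.nn.Module) -> bool:
    # True when the model has been wrapped for distributed data parallel.
    return isinstance(model, DistributedDataParallel)


def serialize_fields(obj: Any, fields: Sequence[str]) -> Dict[str, Any]:
    """Store each attribute under its public name (hypothetical helper).

    A private attribute such as ``_seed`` is saved as ``seed`` so the
    checkpoint does not leak implementation details; loading it back can
    go through a ``@property`` of the same public name.
    """
    return {name.lstrip("_"): getattr(obj, name) for name in fields}
```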
@ravi-mosaicml removed the request for review from jbloxham on March 9, 2022 22:55
@ajaysaini725 (Contributor) left a comment


LGTM

Base automatically changed from ravi/i414_p1.2 to dev March 11, 2022 02:49
@ravi-mosaicml merged commit 2a0b253 into dev on Mar 11, 2022
@ravi-mosaicml deleted the ravi/i414_p1.3 branch on March 11, 2022 04:16