[ckpt-rewr] Get Optim State Dict Util API #3299

Merged
merged 19 commits into from
May 31, 2024

@eracah eracah commented May 16, 2024

What does this PR do?

Adds an API for extracting optimizer state dict from a model and optimizer object.

State dict generation is a necessary step before both saving and loading a checkpoint.
Currently in Composer it is coupled to State, which makes it hard to read, extend, and test, and hard for users to harness for custom workflows. This PR adds a standalone function, decoupled from State, that generates the optimizer state_dict. A standalone function is easier to test because there is no need to construct a dummy State. It also makes it easier to save each state dict as a separate file, and advanced users can call the function directly from a custom script or callback.

This state dict generation function supports:

  • generating sharded or full state dicts
  • generating state dicts at different precisions
  • specifying keys to include
  • specifying keys to exclude

These options will all be useful for save and load. Because both save and load require state dict generation, the options need to be available at state dict generation as well.
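To make the include/exclude options concrete, here is a minimal sketch of key filtering on an optimizer-style state dict. It is not the code from this PR: the helper name `filter_optim_state`, its parameters, and the use of glob-style matching are all assumptions made for illustration, and plain Python values stand in for tensors.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Optional, Sequence


def filter_optim_state(
    param_state: Dict[str, Dict[str, Any]],
    include_keys: Optional[Sequence[str]] = None,
    ignore_keys: Optional[Sequence[str]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: keep or drop per-parameter state by glob pattern.

    The matching rule (glob patterns on parameter names) is an assumption,
    not necessarily what the PR implements.
    """
    kept: Dict[str, Dict[str, Any]] = {}
    for name, state in param_state.items():
        # If include patterns are given, the name must match at least one.
        if include_keys is not None and not any(fnmatchcase(name, p) for p in include_keys):
            continue
        # If ignore patterns are given, the name must match none of them.
        if ignore_keys is not None and any(fnmatchcase(name, p) for p in ignore_keys):
            continue
        kept[name] = state
    return kept


# Toy per-parameter optimizer state; lists stand in for tensors.
optim_state = {
    'encoder.weight': {'step': 10, 'exp_avg': [0.1, 0.2]},
    'encoder.bias': {'step': 10, 'exp_avg': [0.0]},
    'head.weight': {'step': 10, 'exp_avg': [0.3]},
}

only_encoder = filter_optim_state(optim_state, include_keys=['encoder.*'])
no_bias = filter_optim_state(optim_state, ignore_keys=['*.bias'])
```

Because the helper takes a plain dict and returns a plain dict, it can be exercised in a unit test without building any larger training object, which is the testability argument made in the description above.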

GRT-2903

@eracah eracah marked this pull request as draft May 16, 2024 22:26
@eracah eracah marked this pull request as ready for review May 24, 2024 18:52
@eracah eracah requested review from bigning and mvpatel2000 May 24, 2024 18:53

@mvpatel2000 mvpatel2000 left a comment

@eracah is this a refactor or is it adding anything new anywhere? makes it a bit easier to review if I know what parts I need to carefully read through. It seems mostly copy paste but as helper fn?

eracah commented May 29, 2024

> @eracah is this a refactor or is it adding anything new anywhere? makes it a bit easier to review if I know what parts I need to carefully read through. It seems mostly copy paste but as helper fn?

a lot of it is a refactor, but it adds ignore, include, precision, and explicit cpu_offload control


@mvpatel2000 mvpatel2000 left a comment

LGTM!

@eracah eracah enabled auto-merge (squash) May 31, 2024 00:31
@eracah eracah merged commit 8b4c684 into mosaicml:dev May 31, 2024
15 checks passed
3 participants