[Train] Disable gathering the full state dict in `RayFSDPStrategy` for lightning>2.1 #44569
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
# Lightning < 2.1 lacks FSDP state_dict_type support.
# (PR: https://github.com/Lightning-AI/pytorch-lightning/pull/17623).
# We need this patch logic to enable FSDP checkpointing between 2.0 and 2.1.
Should we just force users to upgrade to versions outside of this range? It's a bit confusing for the behavior to be hardcoded to full state dict ckpt based on the library version.
Without this fix, the checkpoint's state dict in Lightning 2.0.x will be empty.
After offline discussion, we will not raise an error, since the hardcoded gathering logic doesn't contradict Lightning's behavior. Instead, we add a notice in the `RayFSDPStrategy` docstring recommending that users upgrade beyond 2.1 if they want to use FSDP.
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Thanks! some small nits
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…odify_ray_fsdp_strategy
Why are these changes needed?
Lightning 2.0.x does not natively support FSDP `state_dict_type`. Therefore, we added default state dict gathering logic (#34967) to enable FSDP checkpointing. After 2.1, Lightning supports FSDP `state_dict_type` natively, so we no longer need this patch logic. This PR restricts the patch to Lightning versions 2.0 through 2.1, so users can rely on Lightning's native FSDP integration in versions beyond 2.1.
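For readers who want to see the shape of such a version gate, here is a minimal sketch. It is not the code in this PR: the helper names `parse_major_minor` and `needs_full_state_dict_patch` are hypothetical, and the half-open range `2.0 <= version < 2.1` is an assumption about where the boundary falls. It only illustrates applying patch logic to a bounded range of library versions.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.0.9'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_full_state_dict_patch(lightning_version: str) -> bool:
    """Hypothetical gate: apply the gathering patch only for 2.0.x.

    Assumption: the range is half-open, so 2.1 and later fall through
    to the library's native behavior.
    """
    return (2, 0) <= parse_major_minor(lightning_version) < (2, 1)


if __name__ == "__main__":
    for candidate in ("1.9.5", "2.0.9", "2.1.0", "2.2.1"):
        print(candidate, needs_full_state_dict_patch(candidate))
```

Comparing `(major, minor)` tuples avoids the pitfalls of string comparison (for example, `"2.10" < "2.9"` as strings). A production check would more likely use a full version parser, since this sketch ignores pre-release suffixes.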
Related issue number
Closes #44501
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.