[FSDP] issues around early init_weight #490

Closed
min-xu-ai opened this issue Mar 8, 2021 · 4 comments

@min-xu-ai
Contributor

cc @prigoyal @myleott

Documenting several issues around early init_weight.

By "early" I mean that, in vissl's case, we do the following:

# building trunk
trunk = FSDP(...)
set_seed(distributed_rank)
trunk.init_weight()

# build model
model = FSDP(sequential(trunk, head))

Two issues observed so far:

  1. When init_weight is called, I used summon_full_params(), which calls _lazy_init. However, when the root FSDP instance is later called during forward, it calls _lazy_init again and tries to set the is_root flag on its children FSDP instances. The assertion there fires because the children's is_root flags are already set (see the sketch after this list).
  2. vissl seems to call init_weight differently on each rank. Does this mean each worker starts with different weights? This is not standard DDP behavior. Priya, can you confirm? If so, FSDP might not be able to support that exact behavior.
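
For concreteness, here is a minimal sketch of the ordering in (1) that trips the assertion. It assumes torch.distributed is already initialized with a CUDA device per rank; the small nn.Linear modules are stand-ins for the real trunk and head:

import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

trunk = FSDP(nn.Linear(8, 8).cuda())
with trunk.summon_full_params():    # runs trunk's _lazy_init and sets its is_root flag
    pass                            # weight init would happen here

model = FSDP(nn.Sequential(trunk, nn.Linear(8, 8).cuda()))
out = model(torch.randn(2, 8).cuda())   # the root's _lazy_init now tries to set is_root
                                        # on trunk and hits the assertion described above
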
@min-xu-ai min-xu-ai self-assigned this Mar 8, 2021
@min-xu-ai
Contributor Author

Another thing: if the model is on CPU, we can't use summon_full_params() since all_gather needs dense CUDA tensors. So the early build of the model must be on the GPU too.
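
A sketch of what that implies: wrap and summon on the GPU from the start, so the all_gather inside summon_full_params() sees dense CUDA tensors. build_trunk() is a stand-in; set_seed, distributed_rank and init_weight follow the first snippet in this issue:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

torch.cuda.set_device(distributed_rank)   # assumed per-rank device setup
trunk = FSDP(build_trunk().cuda())        # build the trunk on the GPU, not on CPU
set_seed(distributed_rank)
with trunk.summon_full_params():          # all_gather now sees dense CUDA tensors
    trunk.init_weight()                   # vissl-style init, as in the first snippet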

@myleott
Contributor

myleott commented Mar 8, 2021

Just to clarify (I think you're saying the same thing), but the first time _lazy_init is called, it must be called by the actual root/outermost FSDP instance. One solution here seems to be swapping the init_weight and build_model steps, something like:
model = FSDP(sequential(trunk, head))
set_seed(distributed_rank)
with trunk.summon_full_params():
    trunk.init_weight()
> vissl seems to init_weight at each rank differently. This means each worker starts with different weights? This is not standard DDP.

This is fine. With FSDP there are no redundant params across workers, so the DDP constraint that every worker has the same weights no longer applies. In fact, the weights must be distinct across workers (a small illustration follows).
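
A quick way to see this (illustrative only; assumes the default flatten_parameters behavior and a process group with more than one rank): after wrapping, parameters() on each rank only returns that rank's shard of the flattened weights, so different workers necessarily hold different values.

import torch
import torch.nn as nn
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

fsdp = FSDP(nn.Linear(1024, 1024).cuda())
for p in fsdp.parameters():
    # Each rank stores roughly total_numel / world_size elements, and the
    # contents of its shard differ from every other rank's shard.
    print(dist.get_rank(), p.numel(), p.shape)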

> Another thing: if the model is on CPU, we can't use summon since all_gather needs CUDA and dense tensors. So early build of the model must be on the GPU too.

Interesting, can you add a test for this? This is the same issue as #444. We can solve both by adding a compute_device argument to __init__. Then even if the model resides on CPU, we can internally allocate tensors on the right device in _lazy_init (a hypothetical sketch follows).
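
A hypothetical sketch of how the proposed compute_device argument could look from the caller's side; this argument is only a suggestion at this point, so the name and behavior below are assumptions, with build_trunk() standing in for the real model builder:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

cpu_trunk = build_trunk()                 # parameters still live on CPU
trunk = FSDP(cpu_trunk, compute_device=torch.device("cuda", distributed_rank))
with trunk.summon_full_params():          # internal tensors allocated on compute_device
    trunk.init_weight()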

@min-xu-ai
Contributor Author

To summarize things in this issue (for my own memory):

  1. Currently vissl doesn't hit issue (1) above with summon_full_params(), since we can call init_weight before sub-wrapping, as long as all ranks use the same seed (see the sketch after this list).
    TODO1: it would be nice to either assert when summon_full_params() is called first on an inner FSDP instance, or to support that pattern in some way (e.g., summon_full_params() should not mutate the is_root flags).
  2. The second issue, summon_full_params() on CPU models, is related to #444 ([FSDP] improve robustness to mismatch between torch.cuda.current_device and model's device); the fix should probably be done together with #444.
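
For reference, a sketch of the ordering described in (1), which avoids summoning an inner FSDP instance entirely. Trunk, Head and shared_seed are stand-ins; the key assumption is that every rank uses the same seed:

import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

torch.manual_seed(shared_seed)    # same value on every rank (assumed)
trunk = Trunk()                   # plain nn.Module, no FSDP involved yet
trunk.init_weight()
model = FSDP(nn.Sequential(trunk, Head()).cuda())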

@myleott
Contributor

myleott commented Mar 20, 2021

This was causing segfaults for RoBERTa + FSDP in fairseq, so I fixed it here: #543
