[FSDP] issues around early init_weight #490

Closed
min-xu-ai opened this issue Mar 8, 2021 · 4 comments

@min-xu-ai
Contributor

cc @prigoyal @myleott

Documenting several issues around early init_weight.

By "early" I mean that, in vissl's case, we do the following:

# building trunk
trunk = FSDP(...)
set_seed(distributed_rank)
trunk.init_weight()

# build model
model = FSDP(sequential(trunk, head))

Two issues observed so far:

  1. When init_weight is called, I used summon_full_params(), which calls _lazy_init. However, when the root FSDP instance is later called during forward, it calls _lazy_init again and tries to set the is_root flag on its children FSDP instances. The assertion there fires because the children's is_root flags are already set (see the sketch after this list).
  2. vissl seems to call init_weight differently on each rank. Does this mean each worker starts with different weights? This is not standard DDP behavior. Priya, can you confirm? If so, FSDP might not be able to support that exact behavior.
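
For concreteness, here is a minimal sketch of the ordering in (1) that trips the assertion. It assumes torch.distributed is already initialized with a CUDA device per rank; the small nn.Linear modules are stand-ins for the real trunk and head:

import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

trunk = FSDP(nn.Linear(8, 8).cuda())
with trunk.summon_full_params():    # runs trunk's _lazy_init and sets its is_root flag
    pass                            # weight init would happen here

model = FSDP(nn.Sequential(trunk, nn.Linear(8, 8).cuda()))
out = model(torch.randn(2, 8).cuda())   # the root's _lazy_init now tries to set is_root
                                        # on trunk and hits the assertion described above
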
@min-xu-ai min-xu-ai self-assigned this Mar 8, 2021
@min-xu-ai
Contributor Author

Another thing: if the model is on CPU, we can't use summon_full_params() since all_gather needs dense CUDA tensors. So the early build of the model must be on the GPU too.
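
A sketch of what that implies: wrap and summon on the GPU from the start, so the all_gather inside summon_full_params() sees dense CUDA tensors. build_trunk() is a stand-in; set_seed, distributed_rank and init_weight follow the first snippet in this issue:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

torch.cuda.set_device(distributed_rank)   # assumed per-rank device setup
trunk = FSDP(build_trunk().cuda())        # build the trunk on the GPU, not on CPU
set_seed(distributed_rank)
with trunk.summon_full_params():          # all_gather now sees dense CUDA tensors
    trunk.init_weight()                   # vissl-style init, as in the first snippet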

@myleott
Contributor

myleott commented Mar 8, 2021

Just to clarify (I think you're saying the same thing), but the first time _lazy_init is called, it must be called by the actual root/outermost FSDP instance. One solution here seems to be swapping the init_weight and build_model steps, something like:
model = FSDP(sequential(trunk, head))
set_seed(distributed_rank)
with trunk.summon_full_params():
    trunk.init_weight()
> vissl seems to init_weight at each rank differently. This means each worker starts with different weights? This is not standard DDP.

This is fine. With FSDP there are no redundant params across workers, so the DDP constraint that every worker has the same weights no longer applies. In fact, the weights must be distinct across workers (a small illustration follows).
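
A quick way to see this (illustrative only; assumes the default flatten_parameters behavior and a process group with more than one rank): after wrapping, parameters() on each rank only returns that rank's shard of the flattened weights, so different workers necessarily hold different values.

import torch
import torch.nn as nn
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

fsdp = FSDP(nn.Linear(1024, 1024).cuda())
for p in fsdp.parameters():
    # Each rank stores roughly total_numel / world_size elements, and the
    # contents of its shard differ from every other rank's shard.
    print(dist.get_rank(), p.numel(), p.shape)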

> Another thing: if the model is on CPU, we can't use summon since all_gather needs CUDA and dense tensors. So early build of the model must be on the GPU too.

Interesting, can you add a test for this? This is the same issue as #444. We can solve both by adding a compute_device argument to __init__. Then even if the model resides on CPU, we can internally allocate tensors on the right device in _lazy_init (a hypothetical sketch follows).
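
A hypothetical sketch of how the proposed compute_device argument could look from the caller's side; this argument is only a suggestion at this point, so the name and behavior below are assumptions, with build_trunk() standing in for the real model builder:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

cpu_trunk = build_trunk()                 # parameters still live on CPU
trunk = FSDP(cpu_trunk, compute_device=torch.device("cuda", distributed_rank))
with trunk.summon_full_params():          # internal tensors allocated on compute_device
    trunk.init_weight()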

@min-xu-ai
Contributor Author

To summarize things in this issue (for my own memory):

  1. Currently vissl doesn't hit issue (1) above with summon_full_params(), since we can call init_weight before sub-wrapping, as long as all ranks use the same seed (see the sketch after this list).
    TODO1: it would be nice to either assert when summon_full_params() is called first on an inner FSDP instance, or to support that pattern in some way (e.g., summon_full_params() should not mutate the is_root flags).
  2. The second issue, summon_full_params() on CPU models, is related to #444 ([FSDP] improve robustness to mismatch between torch.cuda.current_device and model's device); the fix should probably be done together with #444.
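
For reference, a sketch of the ordering described in (1), which avoids summoning an inner FSDP instance entirely. Trunk, Head and shared_seed are stand-ins; the key assumption is that every rank uses the same seed:

import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

torch.manual_seed(shared_seed)    # same value on every rank (assumed)
trunk = Trunk()                   # plain nn.Module, no FSDP involved yet
trunk.init_weight()
model = FSDP(nn.Sequential(trunk, Head()).cuda())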

@myleott
Contributor

myleott commented Mar 20, 2021

This was causing segfaults for RoBERTa + FSDP in fairseq, so I fixed it here: #543
