[FSDP] issues around early init_weight #490
Another thing: if the model is on CPU, we can't use summon_full_params, since all_gather requires CUDA and dense tensors. So the early build of the model must happen on the GPU as well.
```python
model = FSDP(sequential(trunk, head))
set_seed(distributed_rank)
with trunk.summon_full_params():
    trunk.init_weight()
```
This is fine. With FSDP there are no redundant params across workers, so the DDP constraint that every worker has the same weights is no longer applicable. In fact, the weights must be distinct across workers.
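A minimal sketch of why `set_seed(distributed_rank)` is used above: seeding each worker with its own rank makes the random draws (and thus the initialized weights) differ across ranks, which is exactly what FSDP wants since each rank owns a distinct shard. This is plain Python with hypothetical rank values, standing in for the real per-rank weight init:

```python
import random

def init_weight(rank, n=4):
    # Seed with the distributed rank so each worker draws a distinct
    # random stream -- the analogue of set_seed(distributed_rank) above.
    random.seed(rank)
    return [random.random() for _ in range(n)]

w0 = init_weight(0)  # "weights" initialized on rank 0
w1 = init_weight(1)  # "weights" initialized on rank 1
assert w0 != w1                  # shards must differ across workers
assert w0 == init_weight(0)      # but each rank's init is deterministic
```

This is the opposite of the DDP convention, where every rank would seed identically so replicated parameters start out equal.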
Interesting, can you add a test for this? This is the same issue as #444. We can solve both by adding a …
To summarize things in this issue (for my own memory):
This was causing segfaults for RoBERTa + FSDP in fairseq, so I fixed it here: #543
cc @prigoyal @myleott
Documenting several issues around early init_weight.
By "early," I mean that in vissl's case we do the following:
Two issues observed so far: