[fix][FSDP] fix weight init when using apply() (fixes #490 and #444) #543
Conversation
Force-pushed from 46f422f to a31d5dd.
Doesn't work with SyncBN, will fix.
LGTM, nice tests!
Super nice. Some nonblocking comments.
The test failures might be related to the `cast_buffer` change?
It's very weird, tests only seem to fail on 1.6 but pass on 1.7.1 and 1.8 (and on my local machine)... will dig a bit.
Summary: Before this PR (facebookresearch/fairscale#543) was merged, we used to need the extra `cuda()` calls. Now, they are not needed. Unfortunately, this doesn't solve the long model init time issue we have. An FSDP model init still takes >20 mins for me, which is really bad for the regnet128 conv layer crash problem I am debugging. The following debugging output shows that most of the delay is in FSDP wrapping, some in BN wrapping, and some in the layer wrapping.

```
INFO 2021-04-14 12:18:35,883 regnet_2.py: 159: block created
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:18:35,884 regnet_2.py: 161: cpu
INFO 2021-04-14 12:19:07,388 regnet_2.py: 163: block bn wrapped
INFO 2021-04-14 12:19:18,388 regnet_2.py: 166: block wrapped
```

In any case, this PR is pretty safe and should go in so that we don't need to do an extra `cuda()` call before wrapping.

Pull Request resolved: fairinternal/ssl_scaling#75

Reviewed By: prigoyal

Differential Revision: D27776285

Pulled By: min-xu-ai

fbshipit-source-id: 3e43c6fe750fd6ee35933400b03a069d62040d8a
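For illustration, a minimal sketch of the wrapping pattern this change enables. The block and its dimensions are made up, and it assumes a CUDA device is available and `torch.distributed.init_process_group` has already been called:

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

# Hypothetical block; shapes are arbitrary and only for illustration.
block = nn.Sequential(nn.Linear(128, 128), nn.BatchNorm1d(128))

# Previously an explicit move to GPU was needed before wrapping:
#     fsdp_block = FSDP(block.cuda())
# With this change, wrapping a module whose params are still on CPU
# works directly; the compute device is inferred at wrap time.
fsdp_block = FSDP(block)
```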
This PR makes two changes:

- Adds a `compute_device` so that we can use `summon_full_params` immediately after `FSDP.__init__`, even if the params are still on CPU (this also fixes "[FSDP] improve robustness to mismatch between torch.cuda.current_device and model's device" #444)
- Overrides `FSDP.apply` so that it calls `summon_full_params` first. This makes it possible to do weight inits via `model.apply(custom_weight_init_fn)` without segfaulting, and should give identical results to not using FSDP (see the sketch after this list).

This also required reworking `_all_buffers_to` to no longer use `apply`, since it is called from within `_lazy_init` and created some circular logic: `apply -> summon_full_params -> _lazy_init -> _all_buffers_to -> apply`.
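A minimal sketch of the weight-init pattern this enables, assuming an initialized process group and a CUDA device; `custom_weight_init_fn` is the illustrative name from the description above, not a library API:

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

def custom_weight_init_fn(module: nn.Module) -> None:
    # Hypothetical init fn; any standard torch.nn.init scheme works here.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = FSDP(nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4)))

# FSDP.apply now gathers the full, unflattened params first (via
# summon_full_params), so the init fn sees the same tensors it would
# see on an unwrapped model instead of the flattened shards.
model.apply(custom_weight_init_fn)
```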