[Train] Add FullyShardedDataParallel support to TorchTrainer #28096

Merged 12 commits from the fsdp branch into ray-project:master on Sep 7, 2022

Conversation

markrogersjr (Contributor) commented Aug 25, 2022

Why are these changes needed?

As of version 1.11, PyTorch supports automatically sharding large models via FullyShardedDataParallel (FSDP). This change adds FSDP support to Ray Train's TorchTrainer so users can take advantage of this feature.
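
For illustration, a minimal sketch of how FSDP could be enabled from a training function once this lands, assuming the parallel_strategy argument discussed in the review below; the merged argument names may differ, and it requires torch>=1.11 and GPU workers.

import torch
import torch.nn as nn

from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_loop_per_worker():
    device = get_device()
    # Shard the model's parameters across workers (FSDP) instead of
    # replicating them on every worker (DDP).
    model = prepare_model(nn.Linear(32, 4), parallel_strategy="fsdp")
    # With FSDP, build the optimizer from the wrapped model's parameters.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(10):
        batch = torch.randn(8, 32, device=device)
        loss = model(batch).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()

TorchTrainer initializes the torch.distributed process group on each worker, so the sketch does not set one up explicitly.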

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor) commented Aug 25, 2022

Wow thanks for the contribution @markrogersjr! Will take a closer look later!

@@ -412,7 +412,7 @@ install_dependencies() {
 1.5) TORCHVISION_VERSION=0.6.0;;
 *) TORCHVISION_VERSION=0.5.0;;
 esac
-pip install --use-deprecated=legacy-resolver --upgrade torch=="${TORCH_VERSION-1.9.0}" torchvision=="${TORCHVISION_VERSION}"
+pip install --use-deprecated=legacy-resolver --upgrade torch=="${TORCH_VERSION-1.11.0}" torchvision=="${TORCHVISION_VERSION}"
Review comment (Contributor):

Hmm, is this change necessary? This block is only here for legacy reasons for the Ray Serve tests; the Ray Train tests don't go through this conditional.

@@ -51,6 +52,8 @@ def prepare_model(
     move_to_device: bool = True,
     wrap_ddp: bool = True,
     ddp_kwargs: Optional[Dict[str, Any]] = None,
+    wrap_fsdp: bool = False,
+    fsdp_kwargs: Optional[Dict[str, Any]] = None,
Review comment (Contributor):

Rather than a separate wrap_fsdp arg, can we change the API to have a single parallel_strategy arg that can be set to "ddp", "fsdp", or None? This will allow for more extensibility if we add more distributed strategies later.

Then we can also consolidate ddp_kwargs and fsdp_kwargs.
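
For concreteness, a minimal sketch of what that consolidated API could look like; the argument names (parallel_strategy, parallel_strategy_kwargs) are illustrative rather than the final merged signature, and existing arguments such as move_to_device are omitted for brevity.

from typing import Any, Dict, Optional

import torch
from torch.nn.parallel import DistributedDataParallel


def prepare_model(
    model: torch.nn.Module,
    parallel_strategy: Optional[str] = "ddp",
    parallel_strategy_kwargs: Optional[Dict[str, Any]] = None,
) -> torch.nn.Module:
    # Dispatch on a single strategy string instead of separate wrap_ddp /
    # wrap_fsdp flags. Assumes torch.distributed is already initialized.
    kwargs = parallel_strategy_kwargs or {}
    if parallel_strategy is None:
        return model
    if parallel_strategy == "ddp":
        return DistributedDataParallel(model, **kwargs)
    if parallel_strategy == "fsdp":
        # Imported lazily so older torch versions without FSDP can still import this module.
        from torch.distributed.fsdp import FullyShardedDataParallel
        return FullyShardedDataParallel(model, **kwargs)
    raise ValueError(f"Unsupported parallel_strategy: {parallel_strategy!r}")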

@@ -7,6 +7,7 @@
 import torch
 import torchvision
 from torch.nn.parallel import DistributedDataParallel
+from torch.distributed.fsdp import FullyShardedDataParallel
Review comment (Contributor):

This import will fail if the user's PyTorch version is less than 1.11, right?

To make sure that users can still use Ray Train with older versions of torch, can we do something like the following:

import torch
from distutils.version import LooseVersion

# FSDP is only available in torch 1.11+, so fall back to None on older versions.
if LooseVersion(torch.__version__) < LooseVersion("1.11.0"):
    FullyShardedDataParallel = None
else:
    from torch.distributed.fsdp import FullyShardedDataParallel

Then in prepare_model, if FullyShardedDataParallel is None but the user enables it, we can raise an error saying that the user needs to upgrade their torch version to enable FSDP.
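
A sketch of that guard, for illustration (the helper name and error wording are hypothetical, not the merged code):

from typing import Optional


def _raise_if_fsdp_unavailable(parallel_strategy: Optional[str]) -> None:
    # The conditional import above binds FullyShardedDataParallel to None on
    # torch < 1.11, so requesting FSDP there fails early with a clear message.
    if parallel_strategy == "fsdp" and FullyShardedDataParallel is None:
        raise ImportError(
            "FullyShardedDataParallel requires torch>=1.11.0. "
            "Please upgrade PyTorch to use FSDP with Ray Train."
        )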

@@ -5,7 +5,7 @@ tblib

 # If you make changes to the torch versions, please also make the corresponding changes to `requirements_dl.txt`!
 -f https://download.pytorch.org/whl/torch_stable.html
-torch==1.9.0+cu111
+torch==1.11.0+cu111
Review comment (Contributor):

Unfortunately, our CI dependencies are a bit of a mess right now, so upgrading them across the board won't work out of the box.

Instead, can we do the following:

- label: ":tv: :steam_locomotive: Train GPU tests (PyTorch 1.11) "
  conditions: ["RAY_CI_TRAIN_AFFECTED"]
  commands:
    - cleanup() { if [ "${BUILDKITE_PULL_REQUEST}" = "false" ]; then ./ci/build/upload_build_info.sh; fi }; trap cleanup EXIT
    - PYTHON=3.7 TRAIN_TESTING=1 ./ci/env/install-dependencies.sh
    # Because Python version changed, we need to re-install Ray here
    - rm -rf ./python/ray/thirdparty_files; rm -rf ./python/ray/pickle5_files; ./ci/ci.sh build
    - pip install -U torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
    - ./ci/env/env_info.sh
    - bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=gpu,gpu_only,torch_1_11,-ray_air python/ray/train/...

markrogersjr force-pushed the fsdp branch 8 times, most recently from 6568bd5 to 95c8e4b, on August 31, 2022 19:40
markrogersjr requested a review from a team as a code owner on August 31, 2022 19:44
markrogersjr force-pushed the fsdp branch 4 times, most recently from 8b9a85a to a7b7301, on September 1, 2022 05:53
markrogersjr (Contributor, Author) commented

@amogkam Thanks for the feedback; I tried to incorporate your suggestions as closely as possible. Several of the tests are still not passing, and tweaking the PyTorch 1.11 test configuration hasn't helped. Any suggestions?

amogkam (Contributor) left a review comment

Thanks @markrogersjr! Left some comments; I think these should work.

Review threads on .buildkite/pipeline.gpu.large.yml (resolved)
amogkam (Contributor) commented Sep 2, 2022

@markrogersjr can you also merge in latest master?

Signed-off-by: Mark Rogers <m@inmimo.me>
…gs in prepare_model

Signed-off-by: Mark Rogers <m@inmimo.me>
Signed-off-by: Mark Rogers <m@inmimo.me>
Signed-off-by: Mark Rogers <m@inmimo.me>
Signed-off-by: Mark Rogers <m@inmimo.me>
Signed-off-by: Mark Rogers <m@inmimo.me>
Signed-off-by: Mark Rogers <m@inmimo.me>
markrogersjr force-pushed the fsdp branch 3 times, most recently from 7984709 to d1b3ce6, on September 2, 2022 05:59
Signed-off-by: Mark Rogers <m@inmimo.me>
markrogersjr (Contributor, Author) commented

@amogkam I managed to get most tests to pass; it looks like the rest could be failing due to flakiness. Please let me know how I can help from here!

amogkam changed the title from "add fsdp support for torch trainer" to "[Train] Add FullyShardedDataParallel support to TorchTrainer" on Sep 2, 2022
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
amogkam (Contributor) commented Sep 2, 2022

Thanks @markrogersjr, this looks great to me! I just pushed some additional changes, primarily for backwards compatibility. But I will make sure to get this merged in!

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
amogkam (Contributor) left a review comment

Excellent, thanks @markrogersjr!

markrogersjr (Contributor, Author) commented

@amogkam thank you as well, nice work!

amogkam merged commit be92ab6 into ray-project:master on Sep 7, 2022
markrogersjr deleted the fsdp branch on September 7, 2022 14:43
ilee300a pushed a commit to ilee300a/ray that referenced this pull request Sep 12, 2022
…ject#28096)

As of version 1.11, PyTorch supports automatically sharding large models via FullyShardedDataParallel. This change is necessary to take advantage of this new feature.

Signed-off-by: Mark Rogers <m@inmimo.me>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>