
[AIR] Avoid checkpoint conversion, move encoding logic to checkpoints #28794

Merged
merged 71 commits into ray-project:master from Yard1:train_avoid_checkpoint_conversion
Oct 27, 2022

Conversation

Yard1
Member

@Yard1 Yard1 commented Sep 26, 2022

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

Why are these changes needed?

This PR avoids always converting to a dictionary when reporting a checkpoint in Train, and uses Checkpoints instead of dicts to transfer data. This is a low-hanging-fruit change for better consistency and performance with non-dict checkpoints. To facilitate that, the data encoding logic in Ray Train has been modified: encoding and decoding are now done in the checkpoint classes. I believe this is the cleanest solution, as it is both generic and inherently tied to the checkpoint itself - however, it has the downside of requiring users to use the correct checkpoint classes for Torch and Horovod. To maintain backwards compatibility, the checkpoint class is automatically changed in session.py if a Torch checkpoint is required (that class has extra encoding and decoding logic to deal with serialization issues). Warnings are printed where necessary.
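For illustration, a minimal sketch of the new shape (hypothetical stand-in code, not the PR's actual implementation - the class layout and method names like _encode/_decode are assumed):

import io

class Checkpoint:
    """Toy base checkpoint carrying an in-memory dict."""

    def __init__(self, data_dict=None):
        self._data_dict = data_dict or {}

    @classmethod
    def from_dict(cls, data: dict) -> "Checkpoint":
        return cls(data_dict=data)

    def to_dict(self) -> dict:
        return dict(self._data_dict)

class TorchCheckpoint(Checkpoint):
    """The checkpoint itself owns encode/decode, so Train no longer
    needs Backend-level conversion hooks."""

    def _encode(self) -> bytes:
        # torch.save handles tensors/models that plain pickle can choke on.
        import torch

        buffer = io.BytesIO()
        torch.save(self._data_dict, buffer)
        return buffer.getvalue()

    @classmethod
    def _decode(cls, blob: bytes) -> "TorchCheckpoint":
        import torch

        return cls.from_dict(torch.load(io.BytesIO(blob)))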

The old way of encoding/decoding checkpoints, with the logic being defined in Backend classes, is soft-deprecated but will still be used (while printing out a deprecation warning).
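As a rough sketch of what such a soft deprecation can look like (only Backend.decode_data is named in this thread; the shim below is an assumed illustration, not the PR's code):

import warnings

class Backend:
    """Minimal stub of Ray Train's Backend with the legacy decode hook."""

    def decode_data(self, data_dict: dict) -> dict:
        return data_dict

def get_decode_fn(backend: Backend):
    # Use the legacy Backend-level hook only if a subclass overrode it,
    # and warn that the logic now belongs to the Checkpoint classes.
    if type(backend).decode_data is not Backend.decode_data:
        warnings.warn(
            "Backend.decode_data is deprecated; encode/decode logic now "
            "lives on the Checkpoint classes.",
            DeprecationWarning,
        )
        return backend.decode_data
    return lambda data_dict: data_dict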

The only breaking change is that passing torch tensors/data in train.report/session.report is no longer allowed. Considering that with the switch to the session API we expect users to return just metrics rather than models/tensors, and that the train.report API is soon to be hard deprecated, I do not think this is an issue worth making a special case for. Happy to make it fully backwards compatible, though!
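Concretely, a training function would report metrics plus a framework checkpoint rather than raw tensors - a sketch with placeholder model and metric values (TorchCheckpoint.from_state_dict and session.report are the APIs used elsewhere in this PR):

import torch.nn as nn
from ray.air import session
from ray.train.torch import TorchCheckpoint

def train_func(config):
    model = nn.Linear(4, 1)  # placeholder model
    # ... training loop elided ...
    # Metrics stay plain; tensors travel inside the checkpoint.
    session.report(
        {"loss": 0.123},
        checkpoint=TorchCheckpoint.from_state_dict(model.state_dict()),
    )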

Finally, as a side effect, HuggingFaceTrainer will now return a HuggingFaceCheckpoint instead of a base Checkpoint (cc @bveeramani).

This PR changes some of the tests to use the correct framework checkpoints - a follow-up PR will change examples and documentation (so that we, e.g., use TorchCheckpoint with TorchTrainer training UDFs).

Release tests: https://buildkite.com/ray-project/release-tests-pr/builds/19121

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@Yard1 Yard1 force-pushed the train_avoid_checkpoint_conversion branch from 3fbf793 to 8acd735 Compare September 27, 2022 17:46
@xwjiang2010
Contributor

Add some descriptions?

@Yard1
Member Author

Yard1 commented Sep 28, 2022

Test failures are unrelated.

@Yard1 Yard1 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 28, 2022
@Yard1 Yard1 removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 29, 2022
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@Yard1 Yard1 changed the title [AIR] Avoid checkpoint conversion [AIR] Avoid checkpoint conversion, move encoding logic to checkpoints Sep 29, 2022
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Contributor

@krfricke krfricke left a comment

LGTM. Can we run a small set of release tests to confirm that this works?

Contributor

@amogkam amogkam left a comment

Thanks @Yard1

How much of the code in this PR can be cleaned up if we sequence the ray.train.* hard deprecations before this? I think we should do that deprecation before this PR since we are no longer getting this in for 2.1.

I'm wary about adding more tech debt in order to be backwards compatible with tech debt that we want to deprecate anyways. Then we can simplify this PR even more.

@@ -149,12 +149,11 @@ def train_func(config):
         model, optimizer, train_sampler, train_loader, epoch, log_interval, use_cuda
     )
     if save_model_as_dict:
-        checkpoint_dict = dict(model=model.state_dict())
+        checkpoint = TorchCheckpoint.from_state_dict(model.state_dict())

Contributor

Nice!

# "serialization and deserialization. The checkpoint "
# "type will be changed automatically. "
# "This behavior may change in the future."
# )
Contributor

Let's just remove this for now or move to session.report? We want the warning to be printed in session.report anyways (we are going to be hard deprecating train.report).

Member Author

I'd put it in session.report but then the logic to check whether the checkpoint is good or bad would need to be moved there too, and that would make it nasty to work with.


def __getstate__(self) -> dict:
    if self._data_dict:
        state = self.__dict__.copy()

Contributor

are these copies necessary?

Member Author

we don't want to modify the underlying state when serializing (this is a shallow copy so it should be cheap)


def __setstate__(self, state: dict):
    if "_data_dict" in state:
        state = state.copy()

Contributor

ditto here: do we need to do this copy?

Member Author

As above
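To spell out the copy rationale with a self-contained toy (attribute names mirror the snippets above; the encode/decode step is invented for illustration):

import pickle

class DictCheckpoint:
    def __init__(self, data_dict: dict):
        self._data_dict = data_dict

    def __getstate__(self) -> dict:
        if self._data_dict:
            # Shallow copy: we can swap in an encoded payload without
            # mutating the live object being pickled.
            state = self.__dict__.copy()
            state["_data_dict"] = {"_encoded": True, **self._data_dict}
            return state
        return self.__dict__

    def __setstate__(self, state: dict):
        if "_data_dict" in state:
            # Copy again so decoding doesn't mutate the caller's state dict.
            state = state.copy()
            decoded = dict(state["_data_dict"])
            decoded.pop("_encoded", None)
            state["_data_dict"] = decoded
        self.__dict__.update(state)

# A pickle round trip leaves the original object untouched.
ckpt = DictCheckpoint({"weights": [1, 2, 3]})
restored = pickle.loads(pickle.dumps(ckpt))
assert ckpt._data_dict == {"weights": [1, 2, 3]}
assert restored._data_dict == {"weights": [1, 2, 3]}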

@Yard1
Member Author

Yard1 commented Oct 17, 2022

Yeah I am fine with delaying this until we deprecate more things!

Yard1 added 2 commits October 25, 2022 20:19
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@Yard1 Yard1 requested a review from amogkam October 25, 2022 20:50
@Yard1
Member Author

Yard1 commented Oct 25, 2022

@amogkam updated to account for deprecations (let's hold off on merging until we deduplicate the examples, though)

Contributor

@amogkam amogkam left a comment

Thanks @Yard1, looks good overall!

-# Decode checkpoint.
-checkpoint_data = decode_checkpoint_fn(checkpoint_data)
+# TODO(ml-team): Remove once we remove Backend.decode_data
+checkpoint_data = decode_checkpoint_fn(checkpoint_data).to_dict()

Contributor

Btw, it seems that the semantics for scoring are not consistent across Tune and our different Trainers.

Currently we do the following:

  1. For DL trainers, we check the checkpoint dict itself for score_attribute.
  2. For non-DL trainers (which go through TuneReportCallback) and for basic Tune, we check the metrics associated with the checkpoint for score_attribute, not the checkpoint itself.

With the new API, we should push toward using approach 2 for everything in follow-ups.
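A tiny sketch of approach 2 - rank by the metrics attached to a checkpoint, never by the checkpoint contents (all names here are illustrative, not Ray internals):

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class TrackedCheckpoint:
    data: Any  # opaque payload; scoring never looks inside it
    metrics: Dict[str, float] = field(default_factory=dict)

def best_checkpoint(checkpoints, score_attribute: str, mode: str = "max"):
    key = lambda c: c.metrics[score_attribute]
    return max(checkpoints, key=key) if mode == "max" else min(checkpoints, key=key)

ckpts = [
    TrackedCheckpoint(data=b"...", metrics={"accuracy": 0.71}),
    TrackedCheckpoint(data=b"...", metrics={"accuracy": 0.85}),
]
assert best_checkpoint(ckpts, "accuracy").metrics["accuracy"] == 0.85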

Member Author

Definitely, let me make a note of that

Yard1 added 3 commits October 26, 2022 21:03
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
@amogkam amogkam merged commit 4c20503 into ray-project:master Oct 27, 2022
@Yard1 Yard1 deleted the train_avoid_checkpoint_conversion branch October 27, 2022 18:24
matthewdeng added a commit to matthewdeng/ray that referenced this pull request Oct 27, 2022
amogkam pushed a commit that referenced this pull request Oct 27, 2022
…ckpoints (#28794)" (#29784)

This added dependencies on TensorFlow and Torch to the HorovodConfig. If either of these is not installed (e.g. if the user is using Horovod with Torch and does not have TensorFlow installed), they will run into a `ModuleNotFoundError`.

https://github.com/ray-project/ray/blob/6b9a56d28e1029741feaa864257d75824fe36622/python/ray/train/horovod/config.py#L16-L17

Reverting this for now.
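The usual remedy, sketched below under assumed structure (the real fix landed in the re-land; see python/ray/train/horovod/config.py), is to import optional frameworks lazily so a missing one no longer raises at import time:

import importlib

def _try_import(module_name: str):
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

def detect_frameworks():
    # Only installed frameworks are returned; a user running Horovod
    # with Torch but without TensorFlow no longer hits ModuleNotFoundError.
    available = {}
    for name in ("torch", "tensorflow"):
        mod = _try_import(name)
        if mod is not None:
            available[name] = mod
    return available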
Yard1 added a commit to Yard1/ray that referenced this pull request Oct 27, 2022
amogkam pushed a commit that referenced this pull request Oct 28, 2022
#28794 fixed to avoid the issue discovered in #29784 (comment)

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022