[tune] Release test for durable multifile checkpoints #34860

krfricke · 2023-04-28T09:33:38Z

Why are these changes needed?

We are currently only testing single-file checkpoints. However, there have been performance regressions with multi-file checkpoints due to unthreaded uploads in pyarrow. These have since been resolved, but we should collect metrics to catch future regressions.
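To illustrate why unthreaded uploads hurt multi-file checkpoints in particular, here is a minimal sketch of uploading a checkpoint directory one file per worker thread. It is not pyarrow's or Ray's implementation: `shutil.copy2` to a local directory stands in for a per-file cloud upload, and `upload_files` and `make_multifile_checkpoint` are hypothetical helper names.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_multifile_checkpoint(root: Path, num_files: int, size_bytes: int) -> None:
    """Write num_files dummy files to mimic a multi-file checkpoint."""
    root.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        (root / f"part_{i:03d}.bin").write_bytes(b"\0" * size_bytes)


def upload_files(src_dir: Path, dst_dir: Path, max_workers: int = 8) -> int:
    """Copy every file under src_dir to dst_dir using a thread pool.

    Assumption: shutil.copy2 stands in for one upload request per file.
    With max_workers=1 this degrades to the sequential (unthreaded) case.
    """
    files = [p for p in src_dir.rglob("*") if p.is_file()]

    def upload_one(path: Path) -> None:
        target = dst_dir / path.relative_to(src_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so worker exceptions are raised here.
        list(pool.map(upload_one, files))
    return len(files)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "checkpoint"
        dst = Path(tmp) / "remote"
        make_multifile_checkpoint(src, num_files=16, size_bytes=1024)
        print(upload_files(src, dst), "files uploaded")
```

With real remote storage, per-file latency dominates for many small files, so the number of files (not only the total size) determines how much the sequential case costs.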

Compared against a version where the improvements have been reverted (#34861), we observe a significant improvement in runtime:

2023-04-28 06:52:38,151	INFO tune.py:1011 -- Total run time: 362.95 seconds (337.86 seconds for the tuning loop).

vs.

2023-04-28 06:54:57,166	INFO tune.py:1011 -- Total run time: 472.55 seconds (436.54 seconds for the tuning loop).
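The size of the difference can be worked out from the two log lines. The sketch below parses them and computes the relative reduction. It assumes the shorter run (362.95 s) is the one with the improvements in place, which the description implies but does not state; `parse_run_times` and `percent_reduction` are hypothetical helper names.

```python
import re
from typing import Tuple

# Matches the "Total run time" summary line printed by tune.py.
LOG_PATTERN = re.compile(
    r"Total run time: (?P<total>[\d.]+) seconds "
    r"\((?P<loop>[\d.]+) seconds for the tuning loop\)"
)


def parse_run_times(line: str) -> Tuple[float, float]:
    """Return (total_seconds, tuning_loop_seconds) from a summary log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError(f"Not a run time summary line: {line!r}")
    return float(match["total"]), float(match["loop"])


def percent_reduction(fast: float, slow: float) -> float:
    """Relative runtime reduction of `fast` compared to `slow`, in percent."""
    return 100.0 * (slow - fast) / slow


# Assumption: the first line is the run with the improvements,
# the second is the run with them reverted.
improved = "2023-04-28 06:52:38,151 INFO tune.py:1011 -- Total run time: 362.95 seconds (337.86 seconds for the tuning loop)."
reverted = "2023-04-28 06:54:57,166 INFO tune.py:1011 -- Total run time: 472.55 seconds (436.54 seconds for the tuning loop)."

fast_total, fast_loop = parse_run_times(improved)
slow_total, slow_loop = parse_run_times(reverted)

print(f"total:       {percent_reduction(fast_total, slow_total):.1f}% shorter")
print(f"tuning loop: {percent_reduction(fast_loop, slow_loop):.1f}% shorter")
```

Under that assumption, total run time drops by roughly 23% (109.60 s) and the tuning loop by roughly 23% as well (98.68 s).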

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>

justinvyu

Looks good to me! I agree that we should report some metric like % time syncing as part of the release test results in the future.
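As a rough sketch of what a "% time syncing" metric could look like, the function below divides the summed duration of sync operations by the total run time. It assumes the release test can record a duration per sync; how those durations would be collected is not specified in this PR, and `sync_time_percent` is a hypothetical name.

```python
from typing import Iterable


def sync_time_percent(sync_durations_s: Iterable[float], total_runtime_s: float) -> float:
    """Share of the total run time spent syncing checkpoints, in percent.

    Assumption: sync_durations_s holds one wall-clock duration per sync
    operation, and syncs do not overlap each other.
    """
    if total_runtime_s <= 0:
        raise ValueError("total_runtime_s must be positive")
    return 100.0 * sum(sync_durations_s) / total_runtime_s


# Hypothetical durations, for illustration only.
print(f"{sync_time_percent([10.0, 15.0], 100.0):.1f}% of run time spent syncing")
```

Reporting a ratio rather than absolute seconds would keep the metric comparable across runs whose total duration differs.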

justinvyu · 2023-04-30T18:17:43Z

python/ray/tune/utils/release_test_util.py

-        class AwsDurableTrainable(TestDurableTrainable):
-            AWS_ACCESS_KEY_ID = aws_key_id
-            AWS_SECRET_ACCESS_KEY = aws_secret
-            AWS_SESSION_TOKEN = aws_session


Is this no longer needed because the previous PR propagates the env vars through the runtime env?

krfricke · 2023-05-01

That's one thing. The other is that the cluster infrastructure now correctly sets up credentials on all worker nodes.
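For context, propagating credentials through the runtime env might look like the sketch below, which collects the three AWS variables from the driver's environment into a `runtime_env` dict. This is an illustration, not the code from the previous PR; `build_runtime_env` is a hypothetical helper, and the `ray.init` call is left as a comment so the snippet runs without Ray installed.

```python
import os
from typing import Dict, Mapping

# The same three variables the removed AwsDurableTrainable class carried.
AWS_CREDENTIAL_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def build_runtime_env(environ: Mapping[str, str] = os.environ) -> Dict[str, Dict[str, str]]:
    """Build a runtime_env dict that forwards any AWS credentials that are set."""
    env_vars = {name: environ[name] for name in AWS_CREDENTIAL_VARS if name in environ}
    return {"env_vars": env_vars}


# Usage (requires Ray; not executed here):
#   import ray
#   ray.init(runtime_env=build_runtime_env())
```

Variables that are unset on the driver are skipped rather than forwarded as empty strings, so workers fall back to whatever credentials the cluster infrastructure already provides.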

release/tune_tests/scalability_tests/workloads/test_durable_multifile_checkpoints.py

[tune] Release test for durable multifile checkpoints

1fc60d2

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke mentioned this pull request Apr 28, 2023

[do-not-merge] [tune] Comparison benchmark for multifile checkpoints #34861

Closed

8 tasks

Kai Fricke added 2 commits April 28, 2023 12:58

num_to_keep=2

5b9e9f1

Signed-off-by: Kai Fricke <kai@anyscale.com>

More files

60fb102

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke marked this pull request as ready for review April 28, 2023 15:18

krfricke requested a review from justinvyu April 28, 2023 15:18

krfricke assigned justinvyu Apr 28, 2023

justinvyu approved these changes Apr 30, 2023

Wording

3e71563

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke merged commit b5e5bd7 into ray-project:master May 1, 2023

krfricke deleted the tune/scalability-multi-checkpoint branch May 1, 2023 07:14
