Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Release test for durable multifile checkpoints #34860

Merged

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Apr 28, 2023

Why are these changes needed?

We are currently only testing single-file checkpoints. However, there have been performance regressions with multi-file checkpoints due to unthreaded uploads in pyarrow. These have since been resolved, but we should collect metrics to catch future regressions.

When comparing against a version where the improvements have been reverted, we observe significant improvements in runtime:

2023-04-28 06:52:38,151	INFO tune.py:1011 -- Total run time: 362.95 seconds (337.86 seconds for the tuning loop).

vs.

2023-04-28 06:54:57,166	INFO tune.py:1011 -- Total run time: 472.55 seconds (436.54 seconds for the tuning loop).

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke added 2 commits April 28, 2023 12:58
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke marked this pull request as ready for review April 28, 2023 15:18
@krfricke krfricke requested a review from justinvyu April 28, 2023 15:18
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I agree that we should report some metric like % time syncing as part of the release test results in the future.

Comment on lines -145 to -148
class AwsDurableTrainable(TestDurableTrainable):
AWS_ACCESS_KEY_ID = aws_key_id
AWS_SECRET_ACCESS_KEY = aws_secret
AWS_SESSION_TOKEN = aws_session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff is not needed anymore due to propagating the env vars in the runtime env from the previous PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one thing - the other is that the cluster infrastructure now correctly sets up credentials on all worker nodes

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit b5e5bd7 into ray-project:master May 1, 2023
@krfricke krfricke deleted the tune/scalability-multi-checkpoint branch May 1, 2023 07:14
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
)

We are currently only testing single-file checkpoints. However, there have been performance regressions with multi-file checkpoints due to unthreaded uploads in pyarrow. These have since been resolved, but we should collect metrics to catch future regressions.

When comparing against a [version where the improvements have been reverted](ray-project#34861), we observe significant improvements in runtime:

```
2023-04-28 06:52:38,151	INFO tune.py:1011 -- Total run time: 362.95 seconds (337.86 seconds for the tuning loop).
```

vs.

```
2023-04-28 06:54:57,166	INFO tune.py:1011 -- Total run time: 472.55 seconds (436.54 seconds for the tuning loop).
```

Signed-off-by: Kai Fricke <kai@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants