[Data] Enable execution plan optimizer for supported Ray Data APIs #36294

Merged
merged 19 commits into from
Jun 22, 2023

@scottjlee scottjlee commented Jun 10, 2023

Why are these changes needed?

This PR expands the execution plan optimizer to cover all existing Ray Data APIs, with the major exception of DatasetPipeline. When a DatasetPipeline is used, the ExecutionPlan includes a flag that skips the new optimizer path and falls back to the legacy plan optimizer.

In addition, this PR implements several small patches to fully enable the new execution plan optimizer on existing Ray Data APIs, such as data lineage serialization.
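
The fallback described above can be sketched as a small decision function. This is a minimal illustration and not the code in this PR: `ContextSketch` and `choose_planner` are hypothetical names, and the only detail taken from the conversation is that both an `optimizer_enabled` setting and a "generated from a pipeline" flag influence which path is taken.

```python
from dataclasses import dataclass


@dataclass
class ContextSketch:
    # Hypothetical stand-in for the `optimizer_enabled` setting on the data context.
    optimizer_enabled: bool = True


def choose_planner(ctx: ContextSketch, generated_from_pipeline: bool) -> str:
    """Return the planning path a plan would take in this sketch."""
    # Assumption: the new optimizer runs only when it is enabled AND the
    # plan did not originate from a DatasetPipeline.
    if ctx.optimizer_enabled and not generated_from_pipeline:
        return "new_optimizer"
    # Every other combination falls back to the legacy plan optimizer.
    return "legacy"
```

Under these assumptions, a plan created from a DatasetPipeline always takes the legacy path, whatever the optimizer setting is.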

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy

Scott Lee added 13 commits June 9, 2023 17:27
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee changed the title from "[Data] Remove skip for missing logical plan in Dataset plan optimizer" to "[Data] Enable execution plan optimizer for supported Ray Data APIs" Jun 19, 2023
@scottjlee scottjlee marked this pull request as ready for review June 20, 2023 03:40
@scottjlee
Contributor Author

Failing CI and release tests look unrelated to this PR to me.

@scottjlee scottjlee added the tests-ok (the tagger certifies test failures are unrelated and assumes personal liability), data (Ray Data-related issues), and Ray-2.6 labels Jun 20, 2023
@@ -136,6 +136,8 @@ def __init__(
# determined by the config at the time it was created.
self._context = copy.deepcopy(DataContext.get_current())

self._skip_optimizer_pipeline = not self._context.optimizer_enabled
Contributor

Nit: what about naming this variable "_generated_from_pipeline"? That would be easier to reason about, because it only indicates whether this plan is generated from a pipeline and has nothing to do with the optimizer here.

Contributor Author

Good point, updated.

Scott Lee added 2 commits June 21, 2023 15:04
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Contributor

@c21 c21 left a comment

LG

# Whether the corresponding dataset is generated from a pipeline.
# Currently, when this is True, this skips the new execution plan optimizer.
# TODO(scottjlee): remove this once we remove DatasetPipeline.
self._generated_from_pipeline = not self._context.optimizer_enabled
Contributor

This should be true regardless of the optimizer flag, because _get_execution_dag already checks the optimizer flag.
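
One way to read this comment is that the flag should record only where the plan came from, while the optimizer setting is consulted in a single place later. The sketch below contrasts the two approaches under that reading. `PlanSketch`, `uses_new_optimizer`, and `coupled_flag` are hypothetical names for illustration, not the ExecutionPlan code or the fix that was merged.

```python
def coupled_flag(optimizer_enabled: bool) -> bool:
    # The assignment under review: the "generated from pipeline" flag is
    # derived from the optimizer setting, so it reports True for any plan
    # created while the optimizer is disabled, pipeline or not.
    return not optimizer_enabled


class PlanSketch:
    """Hypothetical, simplified stand-in for an execution plan."""

    def __init__(self, optimizer_enabled: bool, generated_from_pipeline: bool = False):
        self.optimizer_enabled = optimizer_enabled
        # Decoupled: describes only the plan's origin.
        self.generated_from_pipeline = generated_from_pipeline

    def uses_new_optimizer(self) -> bool:
        # The optimizer setting is checked here, once, alongside the origin flag.
        return self.optimizer_enabled and not self.generated_from_pipeline
```

In this sketch, `coupled_flag(False)` reports a pipeline origin for an ordinary dataset, whereas `PlanSketch(optimizer_enabled=False)` keeps the origin flag False and still avoids the new optimizer because of the separate check.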

Scott Lee added 4 commits June 22, 2023 10:32
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
fix
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@raulchen raulchen merged commit b6636bf into ray-project:master Jun 22, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ay-project#36294)

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
3 participants