[Data] Enable execution plan optimizer for supported Ray Data APIs #36294

scottjlee · 2023-06-10T00:28:26Z

Why are these changes needed?

This PR further expands support for general use of the execution plan optimizer, including all currently existing Ray Data APIs, with the major exception of DatasetPipeline. In the case where a DatasetPipeline is used, the ExecutionPlan will include a flag to skip the new optimizer path and fall back to the legacy plan optimizer.

In addition, this PR implements several small patches to fully enable the new execution plan optimizer on existing Ray Data APIs, such as data lineage serialization.
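
The fallback described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the Ray implementation: `ToyContext`, `ToyPlan`, `skip_new_optimizer`, and `choose_optimizer_path` are hypothetical names invented for this example. Only the idea of a per-plan flag that forces the legacy path comes from the description.

```python
from dataclasses import dataclass


@dataclass
class ToyContext:
    # Hypothetical stand-in for the data context; only one setting is modeled.
    optimizer_enabled: bool = True


@dataclass
class ToyPlan:
    # Hypothetical per-plan marker: True when the plan was built for a pipeline
    # and must therefore avoid the new optimizer path.
    skip_new_optimizer: bool = False


def choose_optimizer_path(plan: ToyPlan, ctx: ToyContext) -> str:
    """Return which optimizer path a plan would take in this toy model."""
    # The new path is taken only when the optimizer is enabled globally
    # and the plan has not opted out.
    if ctx.optimizer_enabled and not plan.skip_new_optimizer:
        return "new"
    # Every other combination falls back to the legacy plan optimizer.
    return "legacy"


if __name__ == "__main__":
    ctx = ToyContext(optimizer_enabled=True)
    print(choose_optimizer_path(ToyPlan(), ctx))
    print(choose_optimizer_path(ToyPlan(skip_new_optimizer=True), ctx))
```

Keeping the opt-out on the plan rather than in the global context means an ordinary dataset and a pipeline-derived dataset can coexist in one process and still take different paths.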

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee · 2023-06-20T03:40:56Z

Failing CI and release tests look unrelated to this PR to me.

raulchen · 2023-06-21T21:34:38Z

python/ray/data/_internal/plan.py

@@ -136,6 +136,8 @@ def __init__(
        # determined by the config at the time it was created.
        self._context = copy.deepcopy(DataContext.get_current())

+        self._skip_optimizer_pipeline = not self._context.optimizer_enabled


Nit: what about naming this variable "_generated_from_pipeline"? That would be easier to reason about, because it only indicates whether this plan was generated from a pipeline and has nothing to do with the optimizer here.

Good point, updated.


c21

LG

raulchen · 2023-06-22T17:28:26Z

python/ray/data/_internal/plan.py

+        # Whether the corresponding dataset is generated from a pipeline.
+        # Currently, when this is True, this skips the new execution plan optimizer.
+        # TODO(scottjlee): remove this once we remove DatasetPipeline.
+        self._generated_from_pipeline = not self._context.optimizer_enabled


This should be True regardless of the optimizer flag, because _get_execution_dag already checks the optimizer flag.
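
One reading of this comment can be checked with a small truth table. The sketch below is an assumption-laden illustration, not code from the PR: `takes_new_path`, `flag_from_context`, and `flag_from_origin` are hypothetical helpers, and the guard in `takes_new_path` is only a guess at the shape of the check that `_get_execution_dag` performs. It compares deriving the flag from `optimizer_enabled` (as in the diff above) with setting it purely from where the plan came from.

```python
from itertools import product


def takes_new_path(optimizer_enabled: bool, generated_from_pipeline: bool) -> bool:
    # Assumed shape of the guard: the optimizer flag is checked here,
    # so the per-plan flag does not need to encode it again.
    return optimizer_enabled and not generated_from_pipeline


def flag_from_context(optimizer_enabled: bool, from_pipeline: bool) -> bool:
    # Mirrors the diff under review: the flag ignores the plan's origin.
    return not optimizer_enabled


def flag_from_origin(optimizer_enabled: bool, from_pipeline: bool) -> bool:
    # Alternative suggested by the review: the flag records only the origin.
    return from_pipeline


def compare():
    """List (optimizer_enabled, from_pipeline, new_path_a, new_path_b) rows."""
    rows = []
    for optimizer_enabled, from_pipeline in product([True, False], repeat=2):
        a = takes_new_path(
            optimizer_enabled, flag_from_context(optimizer_enabled, from_pipeline)
        )
        b = takes_new_path(
            optimizer_enabled, flag_from_origin(optimizer_enabled, from_pipeline)
        )
        rows.append((optimizer_enabled, from_pipeline, a, b))
    return rows


if __name__ == "__main__":
    for row in compare():
        print(row)
```

In this model the two variants disagree in exactly one case: the optimizer is enabled and the plan comes from a pipeline. Deriving the flag from the context sends that plan down the new path, while recording only the origin keeps it on the legacy path.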


…ay-project#36294) This PR further expands support for general use of the execution plan optimizer, including all currently existing Ray Data APIs, with the major exception of `DatasetPipeline`. In the case where a DatasetPipeline is used, the `ExecutionPlan` will include a flag to skip the new optimizer path and fall back to the legacy plan optimizer. In addition, this PR implements several small patches to fully enable the new execution plan optimizer on existing Ray Data APIs, such as data lineage serialization. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Scott Lee added 13 commits June 9, 2023 17:27

remove op skip for missing logical plan

a701422

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

241d9d6

Signed-off-by: Scott Lee <sjl@anyscale.com>

include logical plan in dataset lineage construction

c18340d

Signed-off-by: Scott Lee <sjl@anyscale.com>

add skip optimizer param for pipeline case

5d81e81

Signed-off-by: Scott Lee <sjl@anyscale.com>

label tests

5ac402c

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

3b3c72b

Signed-off-by: Scott Lee <sjl@anyscale.com>

more test labels

ce4dc5f

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

811005e

Signed-off-by: Scott Lee <sjl@anyscale.com>

workaround for datasetpipeline

6a87c34

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

d6873fa

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

f38b9f1

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

fdd0370

Signed-off-by: Scott Lee <sjl@anyscale.com>

clean up

32ca08d

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee changed the title ~~[Data] Remove skip for missing logical plan in Dataset plan optimizer~~ [Data] Enable execution plan optimizer for supported Ray Data APIs Jun 19, 2023

scottjlee marked this pull request as ready for review June 20, 2023 03:40

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani and raulchen as code owners June 20, 2023 03:40

scottjlee assigned raulchen Jun 20, 2023

scottjlee added the tests-ok (the tagger certifies test failures are unrelated and assumes personal liability), data (Ray Data-related issues), and Ray-2.6 labels Jun 20, 2023

scottjlee assigned c21 Jun 21, 2023

raulchen approved these changes Jun 21, 2023


Scott Lee added 2 commits June 21, 2023 15:04

Merge branch 'master' into enable-optimizer-all

accdef6

Signed-off-by: Scott Lee <sjl@anyscale.com>

address comments

69c54e2

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 approved these changes Jun 21, 2023


raulchen reviewed Jun 22, 2023


Scott Lee added 4 commits June 22, 2023 10:32

update generated_from_pipeline

9033754

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

ad55c68

Signed-off-by: Scott Lee <sjl@anyscale.com>

fix

cd46a25

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into enable-optimizer-all

5c28201

Signed-off-by: Scott Lee <sjl@anyscale.com>

raulchen merged commit b6636bf into ray-project:master Jun 22, 2023

akshay-anyscale mentioned this pull request Jul 21, 2023

Add service deployment instructions to stable diffusion template #37645

Closed

