[Datasets] Enable lazy execution by default #31286
Conversation
Nice. Shall we also update the documentation in the same change?
@ericl - I am thinking of doing it in a separate PR for easier doc review (assuming there are more code changes needed to fix unit tests). But I can also do it in the same PR if people prefer.
Just tried this out, seems to work well! A few thoughts:

1. Currently, any execution will "cache" a snapshot of the final stage of blocks. Should we change this behavior to only "cache" on a call to `fully_executed()`, or add an explicit `cache()` action?
2. The str-form of Datasets will include "num_rows=?" and "schema=Unknown schema" a lot now. Can we change this to `num_rows=<Pending execution>` and `schema=<Pending execution>` for clarity? Better yet, we could improve the str-form to show the pending stages that will be executed.

These changes could go into separate PRs.

Edit: filed #31417 for (2)
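To make (2) concrete, a quick illustration of the str-form in question (a sketch only; the exact repr text varies by Ray version, and the `<Pending execution>` form is the proposal, not current behavior):

```python
import ray

ds = ray.data.range(1000).map(lambda x: x + 1)  # lazy: no stages have run yet
print(ds)
# Currently prints something like: Dataset(num_blocks=..., num_rows=?, schema=Unknown schema)
# Proposed:                        Dataset(num_blocks=..., num_rows=<Pending execution>, ...)

ds = ds.fully_executed()  # executes the plan and "caches" the final stage of blocks
```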
@c21 We should also audit all benchmarks that involve Datasets, to ensure that setup operations that were previously executed eagerly are still executed eagerly, so we're not accidentally including e.g. reading or setup transformations when we're trying to benchmark a single downstream operation.
@clarkzinzow - thanks, will go over all nightly tests.
Discussed offline with @ericl: we'll postpone this given the impact is low. Will do it in a separate PR.
All CI tests passed. The failed nightly tests will be addressed in #31460.
```diff
@@ -147,7 +147,8 @@ def test_automatic_enable_gpu_from_num_gpus_per_worker(shutdown_only):
     with pytest.raises(
         ValueError, match="DummyPredictor does not support GPU prediction"
     ):
-        _ = batch_predictor.predict(test_dataset, num_gpus_per_worker=1)
+        ds = batch_predictor.predict(test_dataset, num_gpus_per_worker=1)
```
This is the major behavior change to call out: `batch_predictor.predict()` becomes lazy now and does not force execution. Users need to call `batch_predictor.predict().fully_executed()` to force the prediction to actually run.

The pro is that we can now chain multiple predictors for free: `batch_predictor2.predict(batch_predictor1.predict())`.

The con is that this will probably be a surprising behavior change for current users. If the benefit of chaining is not significant, we can change to force execution inside `batch_predictor.predict()`.
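A minimal sketch of the two usage patterns described above (names follow this thread; the predictors and `test_dataset` are assumed to already exist):

```python
# Lazy semantics: predict() only builds the execution plan.
preds = batch_predictor.predict(test_dataset)  # no tasks run yet
preds = preds.fully_executed()                 # forces the prediction to run

# Chaining predictors "for free" under lazy execution (hypothetical pipeline):
final = batch_predictor2.predict(batch_predictor1.predict(test_dataset))
final = final.fully_executed()
```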
I'm strongly in favor of having the default behavior be calling `fully_executed` inside `predict`, at least for now. Laziness for the chained case is something we can handle separately.
Let's force execution inside the batch predictor? Chaining is not really a use case here, and you can always fall back to using the Data API directly.
Cool, let me make the change to force execution inside the predictor.
Updated to force execution inside batch predictor.
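For reference, a rough sketch of what forcing execution inside the predictor could look like (the method body and the `_predict_batch` helper are assumptions for illustration, not the actual implementation):

```python
# Sketch under assumed names: predict() eagerly executes the lazy Dataset
# it builds, so users keep the familiar eager semantics.
def predict(self, data, **kwargs):
    predictions = data.map_batches(self._predict_batch, **kwargs)  # builds a lazy plan
    return predictions.fully_executed()  # force execution before returning
```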
Mostly LGTM; the big thing we need to resolve IMO is the semi-lazy read behavior. Semi-lazy reading (eagerly reading the first block/file) has been confusing for users and results in redundant reading when there's stage fusion or conversion to a pipeline, so I think we should try to get rid of it when switching to lazy execution by default, if possible. It would be much better to have a fully lazy mode and an eager mode, where semi-lazy execution is an execution optimization enabled only when we determine that we need to compute just a subset of blocks; these semantics should translate pretty well to our new execution planner and streaming execution model (e.g. limit pushdown, metadata peeking, streaming consumption of narrow op chains).

Instead of always computing the first block, I think we should make the read fully lazy, and if the user calls `ds.schema()` right after reading, only then trigger reading of the first block (and only if the schema isn't already available from e.g. file metadata).

It should be pretty straightforward to move this `lazy_block_list.compute_first_block()`/progressive computation logic to the `ExecutionPlan`, since we can have `ExecutionPlan.schema()` (and other ops that only require the first block) trigger semi-lazy execution with e.g. a certain flag set: if that flag is set AND the plan consists only of a read stage AND the input blocks are a lazy block list, we trigger minimal computation of that read stage. A sketch follows below.
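A rough sketch of that proposed `ExecutionPlan.schema()` logic (attribute and helper names such as `_is_read_only_plan` and `get_first_block_schema` are assumptions for illustration):

```python
# Illustrative sketch, not the actual Ray implementation.
def schema(self, fetch_if_missing: bool = False):
    if self._schema is not None:
        return self._schema  # already known, e.g. from Parquet file metadata
    if (
        fetch_if_missing
        and self._is_read_only_plan()                   # plan is just a read stage
        and isinstance(self._in_blocks, LazyBlockList)  # inputs not yet computed
    ):
        # Minimal computation: run only the first read task, enough to
        # infer the schema without materializing the whole dataset.
        self._in_blocks.compute_first_block()
        self._schema = self._in_blocks.get_first_block_schema()
    return self._schema
```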
```diff
@@ -580,7 +580,7 @@ def train_func(config):
         read_dataset(data_path)
     )

-    num_columns = len(train_dataset.schema().names)
+    num_columns = len(train_dataset.schema(fetch_if_missing=True).names)
```
Should we change the `fetch_if_missing` default value to `True` in `ds.schema()`? I.e. should we transparently trigger execution when fetching metadata on a lazy dataset?
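For concreteness, an illustrative example of the difference (the CSV path is an assumption; the behavior is as described in this thread):

```python
import ray

ds = ray.data.read_csv("example://iris.csv")  # lazy: blocks not read yet

print(ds.schema())                       # may be None / unknown on a lazy dataset
print(ds.schema(fetch_if_missing=True))  # triggers just enough execution to infer it
```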
> should we transparently trigger execution when fetching metadata on a lazy dataset?

Yes, agreed. Do we want to do it in a separate PR? #31286 (comment)
Yep doing that in a separate PR sounds good!
python/ray/data/read_api.py (Outdated)

```diff
-    block_list.compute_first_block()
+    block_list.ensure_metadata_for_first_block()
```
IMO we should get rid of this default behavior of always computing the first block and ensuring its metadata, and instead make the dataset fully lazy by default. We can still progressively launch read tasks to e.g. fetch the schema, see the first few rows, or stream iteration directly on the read; we'd just move that logic into `ExecutionPlan.schema()`.
Hmm, that seems fine as long as we still resolve the metadata for Parquet and so on. This would presumably only apply to JSON/CSV. Maybe we can also add an optimized schema resolver for these file types that peeks only at the header of the file.
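As a rough illustration of such a header-peeking resolver for CSV (a sketch, not the proposed implementation; the `block_size` value is arbitrary):

```python
import pyarrow as pa
import pyarrow.csv as pacsv

def peek_csv_schema(path: str) -> pa.Schema:
    # Open a streaming reader over a small prefix of the file and return
    # the inferred schema without materializing a full block of rows.
    read_options = pacsv.ReadOptions(block_size=4096)
    reader = pacsv.open_csv(path, read_options=read_options)
    return reader.schema
```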
Yes agreed, we should improve the schema resolver in the long term, and I 100% agree with getting rid of always computing the first block. Do we want to do it in a separate PR? #31286 (comment)
This PR fixes an issue found in #31286: previously we always eagerly cleared the non-lazy input blocks (`plan._in_blocks`) when executing the plan. This is not safe, as the input blocks might be used by downstream operations later. Signed-off-by: Cheng Su <scnju13@gmail.com> Co-authored-by: Clark Zinzow <clark@anyscale.com>
FWIW, getting rid of the first-block reading would also help with integrating fully streaming execution (right now it actually breaks streaming). How about we do that as a separate follow-up PR, though? This PR is already a pretty extensive change, and we should generally avoid mixing complex changes.
SGTM. @clarkzinzow - WDYT? BTW, I will make the `batch_predictor.predict` change in this PR.
@ericl @c21 As long as it's done as a P0 follow-up PR that we're sure will get in before the next release, that sounds good to me! I would normally say that we shouldn't enable lazy execution by default in master until we have the fully lazy semantics, but since we're actively iterating on the execution model and we have a good bit of time before the release, we can be pragmatic here. In hindsight, it probably would have been better to do the following sequence of PRs:
LGTM! Are we going to update documentation and/or improve the repr in this PR?
@clarkzinzow - yeah, agreed. The TODOs for this PR (1. remove the behavior of eagerly computing the first block on read, 2. improve the str/repr of Dataset, 3. update documentation) are definitely P0, and I will start on them immediately next week.
(ray-project#31460) This is a follow-up from ray-project#31286 (comment): here we audit all data nightly tests to make sure they still work with lazy execution enabled by default. Signed-off-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Andrea Pisoni <andreapiso@gmail.com>
Signed-off-by: Cheng Su scnju13@gmail.com
Why are these changes needed?
This PR is to enable lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:

* Change the `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove the `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset`, computing only the first input block.
* Add `ds.fully_executed()` calls to the unit tests that require them, to make sure they pass.

A usage sketch of the new default behavior follows at the end of this description.

TODO:
- [x] Fix all unit tests
- [x] #31459
- [x] #31460
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] #31417
- [ ] Update documentation

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
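To close, a brief sketch of the user-visible semantics after this change (the input path and transform are assumptions for illustration):

```python
import ray

# Hypothetical input path; under this PR, reads return a lazy Dataset.
ds = ray.data.read_parquet("s3://bucket/path")
ds = ds.map_batches(lambda batch: batch)  # identity transform; still lazy
ds = ds.fully_executed()                  # forces execution of the recorded plan
```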