
[Datasets] Defer first block computation when reading a Datasource with schema information in metadata #34251

Merged
merged 9 commits into ray-project:master on Apr 18, 2023

Conversation

scottjlee
Contributor

Why are these changes needed?

See #33943

Related issue number

Closes #33943
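
For illustration (a hedged sketch, not part of the original PR description; the path below is hypothetical): after this change, a schema query on a datasource whose metadata already carries a schema, such as Parquet, no longer forces block computation.

```python
import ray

# Hypothetical example path; any Parquet source with schema metadata works.
ds = ray.data.read_parquet("example://iris.parquet")

# Served from Parquet metadata after this PR; no block is computed.
print(ds.schema())

# Blocks are actually read only once an operation consumes data.
rows = ds.take(5)
```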

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 5 commits April 10, 2023 16:58
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee marked this pull request as ready for review April 11, 2023 05:01

  # Forces a data read.
  values = [[s["one"], s["two"]] for s in ds.take_all()]
- check_num_computed(ds, 2, 2)
+ check_num_computed(ds, 2, 0)
Contributor

Why do we have a difference between bulk and streaming?

Contributor Author


The comment here explains:

# When the streaming executor is on, _num_computed() is affected only
# by ds.schema(), which will still partially read the blocks, but it is
# not affected by operations like take(), since those are executed via
# the streaming executor.
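
For context, a minimal sketch of what a helper like `check_num_computed` could look like, assuming a `(dataset, bulk_expected, streaming_expected)` signature; the actual helper in Ray's test suite may differ:

```python
from ray.data.context import DatasetContext

def check_num_computed(ds, expected, streaming_expected) -> None:
    # Sketch only: under the streaming executor, just ds.schema() performs
    # a partial read, so a separate expected count applies there.
    ctx = DatasetContext.get_current()
    if ctx.use_streaming_executor:
        assert ds._plan.execute()._num_computed() == streaming_expected
    else:
        assert ds._plan.execute()._num_computed() == expected
```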

@@ -475,11 +479,13 @@ def test_parquet_read_partitioned(ray_start_regular_shared, fs, data_path):
[3, "f"],
[3, "g"],
]
check_num_computed(ds, 2, 0)
Contributor

ditto


  # Test column selection.
  ds = ray.data.read_parquet(data_path, columns=["one"], filesystem=fs)
  values = [s["one"] for s in ds.take()]
  assert sorted(values) == [1, 1, 1, 3, 3, 3]
+ check_num_computed(ds, 2, 0)
Contributor

ditto


  # Forces a data read.
  values = [[s["one"], s["two"]] for s in ds.take()]
- check_num_computed(ds, 2, 2)
+ check_num_computed(ds, 2, 0)
Contributor

ditto

Scott Lee added 2 commits April 17, 2023 09:35
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from c21 April 17, 2023 20:49
@c21
Contributor

c21 commented Apr 18, 2023

Merging to master.

@c21 c21 merged commit c538e69 into ray-project:master Apr 18, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
[Datasets] Defer first block computation when reading a Datasource with schema information in metadata (ray-project#34251)

In the current implementation of [ExecutionPlan._get_unified_blocks_schema](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/plan.py#L418), we force execution to compute the first block when given a `LazyBlockList`. However, when creating a Dataset from a datasource that has schema information available before reading (e.g. Parquet), this unnecessarily forces execution, since we already check for metadata in the subsequent [ensure_metadata_for_first_block](https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/lazy_block_list.py#L379). Therefore, we can remove `blocks.compute_first_block()`.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: elliottower <elliot@elliottower.com>
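
A hedged sketch of the deferral described above, with simplified names and control flow (the real `ExecutionPlan._get_unified_blocks_schema` has additional branches; `blocks` stands in for a `LazyBlockList`):

```python
def _get_unified_blocks_schema(blocks, fetch_if_missing: bool = False):
    # Removed by this PR: an unconditional blocks.compute_first_block(),
    # which forced execution even when the datasource (e.g. Parquet)
    # already carried a schema in its metadata.
    if fetch_if_missing:
        # Fetches metadata for the first block only if it is missing.
        blocks.ensure_metadata_for_first_block()

    # Resolve the schema from block metadata without reading any data.
    for metadata in blocks.get_metadata(fetch_if_missing=False):
        if metadata.schema is not None:
            return metadata.schema
    return None
```

The key point is that `ensure_metadata_for_first_block` already covers the case the eager `compute_first_block()` call was handling, so the upfront read can be dropped without losing schema information.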
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
[Datasets] Defer first block computation when reading a Datasource with schema information in metadata (ray-project#34251)

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Successfully merging this pull request may close these issues:

[Dataset] Remove pre-read computation of first block when getting schema