[Data][3/N] Enable optimizer: fix stats and RandomizeBlocks #35952

raulchen · 2023-05-31T20:56:41Z

Why are these changes needed?

This is the last PR for enabling the optimizer by default. It contains the following changes:

- Fix DatasetStats-related issues, including:
  - Map op names not including function names.
  - Read op names not including data source names.
  - RandomizeBlocks's name.
  - generate_randomize_blocks_fn returning empty stats, which caused it to be skipped in the summary.
- Drop support for fusing 2 actor-based map ops.
  - In the old backend, we fuse 2 actors only if they are the same class and have the same constructor args. This is not useful in practice. Test changes about num_cpus are related to this.
- Fix the issue that ReorderRandomizeBlocksRule modifies the operator's input_dependencies in place. This is a bug because the operator instance might be shared by multiple Datasets.
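To illustrate why mutating input_dependencies in place is a problem when an operator instance is shared, here is a minimal sketch. The `Op` class and both `reorder_*` helpers are hypothetical stand-ins written for this example; they are not Ray's actual operator classes or the actual fix in this PR.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Hypothetical logical operator; not Ray's real class."""
    name: str
    input_dependencies: List["Op"] = field(default_factory=list)


def reorder_in_place(op: Op, new_input: Op) -> Op:
    # Buggy pattern: rewires the shared instance, so every plan
    # holding a reference to `op` silently sees the new input.
    op.input_dependencies[0] = new_input
    return op


def reorder_with_copy(op: Op, new_input: Op) -> Op:
    # Safer pattern: shallow-copy the operator and give the copy
    # its own dependency list, leaving the original untouched.
    new_op = copy.copy(op)
    new_op.input_dependencies = [new_input] + op.input_dependencies[1:]
    return new_op


read = Op("Read")
other = Op("Other")

# An operator instance referenced by more than one dataset plan.
shared = Op("Map", [read])
rewritten = reorder_with_copy(shared, other)

# The in-place variant corrupts the shared instance.
shared_buggy = Op("Map", [read])
reorder_in_place(shared_buggy, other)
```

With the copying variant, `shared` still points at `read` while `rewritten` points at `other`; with the in-place variant, `shared_buggy` itself now points at `other`.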

Related issue number

Closes #32596

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>

scottjlee · 2023-05-31T22:30:40Z

python/ray/data/_internal/logical/operators/map_operator.py

+            # callable object.
+            return fn.__class__.__name__
+    except AttributeError as e:
+        logging.error("Failed to get name of UDF %s: %s", fn, e)


should we use DatasetLogger here for consistency? https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/dataset_logger.py
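For readers without the full diff, the fragment above appears to belong to a helper that derives a display name for a user-defined function (UDF). A rough sketch of that idea follows. The function name `get_udf_name`, the `"<unknown>"` fallback, and the use of the standard `logging` module are assumptions made for this example; the actual code in map_operator.py may differ, and the reviewer's suggestion is to use DatasetLogger instead.

```python
import inspect
import logging

logger = logging.getLogger(__name__)


def get_udf_name(fn) -> str:
    """Return a human-readable name for a UDF (illustrative sketch)."""
    try:
        if inspect.isclass(fn) or inspect.isroutine(fn):
            # Classes, plain functions, lambdas, methods, and builtins.
            return fn.__name__
        # Callable object: fall back to the name of its class.
        return fn.__class__.__name__
    except AttributeError as e:
        logger.error("Failed to get name of UDF %s: %s", fn, e)
        # Assumed fallback; the original diff does not show what is returned.
        return "<unknown>"


def double(x):
    return x * 2


class Scale:
    def __call__(self, x):
        return x * 3


names = [get_udf_name(double), get_udf_name(Scale), get_udf_name(Scale())]
```

Including such names in op names is what lets the stats summary distinguish one map stage from another.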

scottjlee · 2023-06-01T01:45:18Z

python/ray/data/tests/test_map.py

@@ -374,7 +374,7 @@ def test_map_batches_basic(ray_start_regular_shared, tmp_path, restore_data_cont

 def test_map_batches_extra_args(shutdown_only, tmp_path):
    ray.shutdown()
-    ray.init(num_cpus=2)
+    ray.init(num_cpus=3)


is this needed as a result of code changes in this PR?

This is because we are no longer fusing actors.
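The fusion rule being dropped is described in the PR description: the old backend fused two actor-based map ops only when they used the same class with the same constructor args. A sketch of that condition is below. `ActorMapStage` and `could_fuse_old_backend` are hypothetical names invented for this example and do not correspond to Ray internals.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class ActorMapStage:
    """Hypothetical description of an actor-based map stage."""
    udf_class: type
    ctor_args: Tuple[Any, ...] = ()


def could_fuse_old_backend(up: ActorMapStage, down: ActorMapStage) -> bool:
    # Old rule per the PR description: same class and same constructor args.
    return up.udf_class is down.udf_class and up.ctor_args == down.ctor_args


class Tokenize:
    def __init__(self, lang):
        self.lang = lang


class Embed:
    def __init__(self, lang):
        self.lang = lang


same = could_fuse_old_backend(
    ActorMapStage(Tokenize, ("en",)), ActorMapStage(Tokenize, ("en",))
)
diff_args = could_fuse_old_backend(
    ActorMapStage(Tokenize, ("en",)), ActorMapStage(Tokenize, ("de",))
)
diff_class = could_fuse_old_backend(
    ActorMapStage(Tokenize, ("en",)), ActorMapStage(Embed, ("en",))
)
```

Once this rule is gone, consecutive actor-based stages are never fused, so each stage presumably needs its own actors and resources, which would be consistent with the test raising num_cpus.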

patch all (6e15214)

raulchen requested review from ericl, scv119, c21, amogkam, scottjlee and bveeramani as code owners May 31, 2023 20:56

raulchen changed the title ~~[Data][3/N] Enable optimizer:~~ [Data][3/N] Enable optimizer: fix stats and RandomizeBlocks May 31, 2023

raulchen added 4 commits May 31, 2023 14:37
- lint (8788354)
- lint (f6d51fc)
- comment (4b7c2d4)
- fix (ab2be03)

raulchen assigned ericl and scottjlee May 31, 2023

ericl approved these changes May 31, 2023

raulchen added 3 commits May 31, 2023 16:39
- fix test_stats.py (79261cd)
- fix (0267f9f)
- fix (e1f05de)

scottjlee approved these changes Jun 1, 2023

logger (610afa1)

raulchen merged commit 609b8e6 into ray-project:master Jun 1, 2023

raulchen deleted the enable-optimizer-3 branch June 1, 2023 04:37
