[data] Use default scheduling strategy to fix failing dataset_shuffle_push_based_sort_1tb test #36722

stephanie-wang · 2023-06-22T19:59:21Z

Why are these changes needed?

Fixes a bug introduced in #36290 where SPREAD scheduling was getting used in many Datasets tasks. This led to poor locality, which we can see in the sort test failure described in #36449.

This PR removes the scheduling strategy arg, so Datasets tasks now use Ray's default scheduling strategy unless an operator implementation explicitly overrides it.
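The intended behavior can be sketched in plain Python. This is a minimal model, not the actual Ray Data code: `build_remote_args` is a hypothetical helper, and the only assumption carried over from the description is that no scheduling strategy is set unless an operator supplies one.

```python
from typing import Any, Dict, Optional


def build_remote_args(
    operator_overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the args a task would be submitted with.

    No "scheduling_strategy" key is added here, so the scheduler's own
    default applies unless an operator passes an explicit override.
    """
    args: Dict[str, Any] = {}
    if operator_overrides:
        args.update(operator_overrides)
    return args


# No override: the task carries no scheduling strategy of its own.
default_args = build_remote_args()

# Explicit override by an operator implementation.
spread_args = build_remote_args({"scheduling_strategy": "SPREAD"})
```

With this shape, SPREAD only appears on tasks whose operator asked for it, which is the property the fix is meant to restore.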

Unfortunately there is not a good way to add a smaller regression test right now; long-term we should collect the metrics from Ray core about what ray.remote args were passed to tasks, and check that these are correct.
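One possible shape for such a check, as a rough sketch only: `RemoteArgsRecorder` and `tasks_with_strategy` are hypothetical names, and the recorder is a stand-in for whatever metrics Ray core might eventually expose. It runs functions locally and only records the args they were submitted with.

```python
from typing import Any, Callable, Dict, List


class RemoteArgsRecorder:
    """Hypothetical test double that records the args of each submitted task."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def submit(
        self, fn: Callable[..., Any], remote_args: Dict[str, Any], *args: Any
    ) -> Any:
        # Copy the args so later mutation by the caller cannot change the record.
        self.calls.append(dict(remote_args))
        return fn(*args)


def tasks_with_strategy(recorder: RemoteArgsRecorder, strategy: str) -> int:
    """Count recorded tasks that explicitly requested the given strategy."""
    return sum(
        1 for call in recorder.calls if call.get("scheduling_strategy") == strategy
    )


recorder = RemoteArgsRecorder()
recorder.submit(sorted, {}, [3, 1, 2])
recorder.submit(len, {"scheduling_strategy": "SPREAD"}, [1, 2])
spread_count = tasks_with_strategy(recorder, "SPREAD")
```

A regression test built on this idea would assert that only the operators expected to override the strategy show up in the SPREAD count.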

Related issue number

Closes #36449.

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

stephanie-wang · 2023-06-22T20:01:13Z

Running the test here.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

can-anyscale · 2023-06-28T18:00:31Z

FYI, the branch cut is Friday, so it would be easier if we can merge it by then. Thanks!


As #39871 indicated, the [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used. This PR summarizes the scheduling strategy used by Ray Data as follows:

1. By default, the scheduling strategy is set to the Default Hybrid Policy ([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](#36722)).
2. The read operation overrides the scheduling strategy to the Spread Policy if the file is not located locally; otherwise, it is scheduled to the current node ([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)).
3. The map operation overrides the scheduling strategy to the Spread Policy if the total argument size is <50MB ([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](#36290)).

Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
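The three rules in this summary can be modeled as a small decision function. This is an illustrative sketch rather than Ray Data's implementation: the function name, the operation labels, the returned strings, and the reading of "50MB" as 50 * 1024 * 1024 bytes are all assumptions.

```python
# Assumed interpretation of the "<50MB" threshold from the summary.
MAP_ARG_SIZE_THRESHOLD_BYTES = 50 * 1024 * 1024


def pick_scheduling_strategy(
    op_kind: str,
    *,
    reads_local_files: bool = False,
    total_arg_bytes: int = 0,
) -> str:
    """Hypothetical model of the summarized rules; returns a descriptive label."""
    if op_kind == "read":
        # Rule 2: local files stay on the current node, everything else spreads.
        return "CURRENT_NODE" if reads_local_files else "SPREAD"
    if op_kind == "map" and total_arg_bytes < MAP_ARG_SIZE_THRESHOLD_BYTES:
        # Rule 3: map tasks with small total arguments are spread.
        return "SPREAD"
    # Rule 1: everything else keeps the default (hybrid) policy.
    return "DEFAULT"


small_map = pick_scheduling_strategy("map", total_arg_bytes=10 * 1024 * 1024)
large_map = pick_scheduling_strategy("map", total_arg_bytes=100 * 1024 * 1024)
remote_read = pick_scheduling_strategy("read")
local_read = pick_scheduling_strategy("read", reads_local_files=True)
```

Under these assumptions, map tasks with large arguments fall back to the default policy, which favors locality.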


stephanie-wang assigned ericl and scottjlee Jun 22, 2023

ericl approved these changes Jun 22, 2023


scottjlee approved these changes Jun 22, 2023


stephanie-wang force-pushed the dataset-sort-test branch from feaf4ae to c0944ca June 23, 2023 15:30

stephanie-wang requested review from scv119, c21, amogkam, bveeramani and raulchen as code owners June 26, 2023 18:49

stephanie-wang added the @author-action-required and do-not-merge labels Jun 26, 2023

stephanie-wang force-pushed the dataset-sort-test branch from e2ebeca to a20b673 June 26, 2023 22:46

stephanie-wang requested review from richardliaw, gjoliver, krfricke, xwjiang2010, matthewdeng, Yard1, maxpumperla and a team as code owners June 26, 2023 22:46

Let's try this again

6d1badd

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang force-pushed the dataset-sort-test branch from a20b673 to 6d1badd June 26, 2023 23:18

stephanie-wang added 3 commits June 26, 2023 16:29

Merge remote-tracking branch 'upstream/master' into dataset-sort-test

df91d3c

Merge remote-tracking branch 'upstream/master' into dataset-sort-test

52546d0

Fix

f085c6c

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang removed the do-not-merge label Jun 28, 2023

stephanie-wang merged commit 617db01 into ray-project:master Jun 28, 2023

stephanie-wang deleted the dataset-sort-test branch June 28, 2023 23:11

stephanie-wang mentioned this pull request Jul 5, 2023

[data] Unit-test physical execution of Datasets #37106

Closed

scottjlee mentioned this pull request Jul 13, 2023

[Data] Mark dataset_shuffle_sort_1tb release test as stable #37401

Merged

8 tasks

akshay-anyscale mentioned this pull request Jul 21, 2023

Add service deployment instructions to stable diffusion template #37645

Closed

8 tasks

Yicheng-Lu-llll mentioned this pull request Sep 28, 2023

Clarify default scheduling strategy used by Ray Data #39929

Merged

8 tasks
