Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Use default scheduling strategy to fix failing dataset_shuffle_push_based_sort_1tb test #36722

Merged
merged 4 commits into from
Jun 28, 2023

Conversation

stephanie-wang
Copy link
Contributor

@stephanie-wang stephanie-wang commented Jun 22, 2023

Why are these changes needed?

Fixes a bug introduced in #36290 where SPREAD scheduling was getting used in many Datasets tasks. This led to poor locality, which we can see in the sort test failure described in #36449.

This PR removes the scheduling strategy arg, so the behavior is now to use Ray's default scheduling strategy by default, unless the scheduling strategy is explicitly overridden by an operator implementation.

Unfortunately there is not a good way to add a smaller regression test right now; long-term we should collect the metrics from Ray core about what ray.remote args were passed to tasks, and check that these are correct.

Related issue number

Closes #36449.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@stephanie-wang
Copy link
Contributor Author

Running the test here.

@stephanie-wang stephanie-wang added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. do-not-merge Do not merge this PR! labels Jun 26, 2023
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@can-anyscale
Copy link
Collaborator

FYI, branch cut is Friday so it would be easier if we can merge it by then. Thankkks

@stephanie-wang stephanie-wang removed the do-not-merge Do not merge this PR! label Jun 28, 2023
@stephanie-wang stephanie-wang merged commit 617db01 into ray-project:master Jun 28, 2023
@stephanie-wang stephanie-wang deleted the dataset-sort-test branch June 28, 2023 23:11
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…_push_based_sort_1tb test (ray-project#36722)

Fixes a bug introduced in ray-project#36290 where SPREAD scheduling was getting used in many Datasets tasks. This led to poor locality, which we can see in the sort test failure described in ray-project#36449.

This PR removes the scheduling strategy arg, so the behavior is now to use Ray's default scheduling strategy by default, unless the scheduling strategy is explicitly overridden by an operator implementation.

Unfortunately there is not a good way to add a smaller regression test right now; long-term we should collect the metrics from Ray core about what ray.remote args were passed to tasks, and check that these are correct.
Related issue number

Closes ray-project#36449.

---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
raulchen pushed a commit that referenced this pull request Oct 2, 2023
As #39871 indicated, The [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used.  This PR summarized the scheduling strategy used by Ray Data as follows:

1. By default, the scheduling strategy is set to Default Hybrid Policy([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](#36722)).
2. Read operation overrides the scheduling strategy to Spread Policy if the file is not located locally; otherwise, it is scheduled to the current node([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)).
3. Map operation overrides the scheduling strategy to Spread Policy if total argument size <50MB([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](#36290)).

Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819

---------

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
As ray-project#39871 indicated, The [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used.  This PR summarized the scheduling strategy used by Ray Data as follows:

1. By default, the scheduling strategy is set to Default Hybrid Policy([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](ray-project#36722)).
2. Read operation overrides the scheduling strategy to Spread Policy if the file is not located locally; otherwise, it is scheduled to the current node([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)).
3. Map operation overrides the scheduling strategy to Spread Policy if total argument size <50MB([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](ray-project#36290)).

Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819

---------

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Release test dataset_shuffle_push_based_sort_1tb.aws failed
4 participants