Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Clarify default scheduling strategy used by Ray Data (ray-project#39929)
As ray-project#39871 indicated, The [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used. This PR summarized the scheduling strategy used by Ray Data as follows: 1. By default, the scheduling strategy is set to Default Hybrid Policy([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](ray-project#36722)). 2. Read operation overrides the scheduling strategy to Spread Policy if the file is not located locally; otherwise, it is scheduled to the current node([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)). 3. Map operation overrides the scheduling strategy to Spread Policy if total argument size <50MB([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](ray-project#36290)). Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819 --------- Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com> Signed-off-by: Victor <vctr.y.m@example.com>
- Loading branch information