Skip to content

Commit

Permalink
Clarify default scheduling strategy used by Ray Data (ray-project#39929)
Browse files Browse the repository at this point in the history
As ray-project#39871 indicated, The [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used.  This PR summarized the scheduling strategy used by Ray Data as follows:

1. By default, the scheduling strategy is set to Default Hybrid Policy([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](ray-project#36722)).
2. Read operation overrides the scheduling strategy to Spread Policy if the file is not located locally; otherwise, it is scheduled to the current node([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)).
3. Map operation overrides the scheduling strategy to Spread Policy if total argument size <50MB([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](ray-project#36290)).

Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819

---------

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
  • Loading branch information
Yicheng-Lu-llll authored and Victor committed Oct 11, 2023
1 parent 41594f7 commit 3fedb12
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion doc/source/data/data-internals.rst
Original file line number Diff line number Diff line change
Expand Up @@ -77,10 +77,13 @@ For an in-depth guide on shuffle performance, see :ref:`Performance Tips and Tun
Scheduling
==========

Ray Data uses Ray Core for execution, and is subject to the same scheduling considerations as normal Ray Tasks and Actors. Ray Data uses the following custom scheduling settings by default for improved performance:
Ray Data uses Ray Core for execution. Below is a summary of the :ref:`scheduling strategy <ray-scheduling-strategies>` for Ray Data:

* The ``SPREAD`` scheduling strategy ensures that data blocks and map tasks are evenly balanced across the cluster.
* Dataset tasks ignore placement groups by default, see :ref:`Ray Data and Placement Groups <datasets_pg>`.
* Map operations use the ``SPREAD`` scheduling strategy if the total argument size is less than 50 MB; otherwise, they use the ``DEFAULT`` scheduling strategy.
* Read operations use the ``SPREAD`` scheduling strategy.
* All other operations, such as split, sort, and shuffle, use the ``DEFAULT`` scheduling strategy.

.. _datasets_pg:

Expand Down

0 comments on commit 3fedb12

Please sign in to comment.