Clarify default scheduling strategy used by Ray Data (ray-project#39929)

As ray-project#39871 indicated, the [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used.

This PR summarizes the scheduling strategy used by Ray Data as follows:

1. By default, the scheduling strategy is set to the Default Hybrid Policy ([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](ray-project#36722)).
2. Read operations override the scheduling strategy to the Spread Policy if the file is not located locally; otherwise, the read is scheduled on the current node ([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)).
3. Map operations override the scheduling strategy to the Spread Policy if the total argument size is less than 50 MB ([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](ray-project#36290)).

Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819

---------

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
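
For illustration only, the three rules in the summary can be sketched as a small decision function. This is not Ray code: `pick_scheduling_strategy`, its parameters, and the `"CURRENT_NODE"` label are hypothetical names, and the threshold assumes "50 MB" means 50 × 1024 × 1024 bytes.

```python
# Assumption: "50 MB" in the summary is read as 50 * 1024 * 1024 bytes.
LARGE_ARGS_THRESHOLD_BYTES = 50 * 1024 * 1024


def pick_scheduling_strategy(op_kind, total_arg_size_bytes=0, file_is_local=False):
    """Hypothetical helper mirroring the three rules in the PR summary.

    Returns a plain string label; it does not call into Ray.
    """
    if op_kind == "read":
        # Rule 2: reads are spread unless the file is local, in which case
        # the read is pinned to the current node.
        return "CURRENT_NODE" if file_is_local else "SPREAD"
    if op_kind == "map":
        # Rule 3: maps are spread only when their arguments are small.
        if total_arg_size_bytes < LARGE_ARGS_THRESHOLD_BYTES:
            return "SPREAD"
        return "DEFAULT"
    # Rule 1: everything else keeps the default hybrid policy.
    return "DEFAULT"
```

For example, a map operation with 1 MB of arguments maps to `"SPREAD"`, while one with 64 MB of arguments, or any sort or shuffle, falls back to `"DEFAULT"`.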
vymao · Oct 11, 2023 · 3fedb12
1 parent 41594f7
Showing 1 changed file with 4 additions and 1 deletion.
diff --git a/doc/source/data/data-internals.rst b/doc/source/data/data-internals.rst
@@ -77,10 +77,13 @@ For an in-depth guide on shuffle performance, see :ref:`Performance Tips and Tun
 Scheduling
 ==========
 
-Ray Data uses Ray Core for execution, and is subject to the same scheduling considerations as normal Ray Tasks and Actors. Ray Data uses the following custom scheduling settings by default for improved performance:
+Ray Data uses Ray Core for execution. Below is a summary of the :ref:`scheduling strategy <ray-scheduling-strategies>` for Ray Data:
 
 * The ``SPREAD`` scheduling strategy ensures that data blocks and map tasks are evenly balanced across the cluster.
 * Dataset tasks ignore placement groups by default, see :ref:`Ray Data and Placement Groups <datasets_pg>`.
+* Map operations use the ``SPREAD`` scheduling strategy if the total argument size is less than 50 MB; otherwise, they use the ``DEFAULT`` scheduling strategy.
+* Read operations use the ``SPREAD`` scheduling strategy.
+* All other operations, such as split, sort, and shuffle, use the ``DEFAULT`` scheduling strategy.
 
 .. _datasets_pg: