From 3fedb1222a565356eed3a5122bbde4d750a6c5a4 Mon Sep 17 00:00:00 2001
From: Yicheng-Lu-llll <51814063+Yicheng-Lu-llll@users.noreply.github.com>
Date: Mon, 2 Oct 2023 18:32:41 -0500
Subject: [PATCH] Clarify default scheduling strategy used by Ray Data (#39929)

As https://github.com/ray-project/ray/issues/39871 indicated, The [current Data Internals page](https://docs.ray.io/en/latest/data/data-internals.html#scheduling) has a section on Scheduling, which confusingly states that both SPREAD and DEFAULT are the default scheduling strategies used.  This PR summarized the scheduling strategy used by Ray Data as follows:

1. By default, the scheduling strategy is set to Default Hybrid Policy([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/remote_fn.py#L26), [related PR](https://github.com/ray-project/ray/pull/36722)).
2. Read operation overrides the scheduling strategy to Spread Policy if the file is not located locally; otherwise, it is scheduled to the current node([code](https://github.com/Yicheng-Lu-llll/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/read_api.py#L338)).
3. Map operation overrides the scheduling strategy to Spread Policy if total argument size <50MB([code](https://github.com/ray-project/ray/blob/9c143f63233d5cbde8a6943db31b91fb3b05f017/python/ray/data/_internal/execution/operators/map_operator.py#L213), [related PR](https://github.com/ray-project/ray/pull/36290)).

Slack discussion: https://ray-distributed.slack.com/archives/C02PHB3SQHH/p1695756535614819

---------

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Victor <vctr.y.m@example.com>
---
 doc/source/data/data-internals.rst | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/source/data/data-internals.rst b/doc/source/data/data-internals.rst
index 119fc0c1769c..00f02239017f 100644
--- a/doc/source/data/data-internals.rst
+++ b/doc/source/data/data-internals.rst
@@ -77,10 +77,13 @@ For an in-depth guide on shuffle performance, see :ref:`Performance Tips and Tun
 Scheduling
 ==========
 
-Ray Data uses Ray Core for execution, and is subject to the same scheduling considerations as normal Ray Tasks and Actors. Ray Data uses the following custom scheduling settings by default for improved performance:
+Ray Data uses Ray Core for execution. Below is a summary of the :ref:`scheduling strategy <ray-scheduling-strategies>` for Ray Data:
 
 * The ``SPREAD`` scheduling strategy ensures that data blocks and map tasks are evenly balanced across the cluster.
 * Dataset tasks ignore placement groups by default, see :ref:`Ray Data and Placement Groups <datasets_pg>`.
+* Map operations use the ``SPREAD`` scheduling strategy if the total argument size is less than 50 MB; otherwise, they use the ``DEFAULT`` scheduling strategy.
+* Read operations use the ``SPREAD`` scheduling strategy.
+* All other operations, such as split, sort, and shuffle, use the ``DEFAULT`` scheduling strategy.
 
 .. _datasets_pg: