[Core] Introduce fail_on_unavailable option for hard NodeAffinitySchedulingStrategy #36718

jjyao · 2023-06-22T19:47:37Z

Why are these changes needed?

Add an experimental fail_on_unavailable option for the hard NodeAffinitySchedulingStrategy to try out application-level scheduling.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…trategy Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

rkooo567

Are we planning to promote this API to public at some point? Feel like the combination of soft / fail / spill is a bit confusing, and there may be a better way to structure the API?

rkooo567 · 2023-06-22T22:41:54Z

python/ray/tests/test_scheduling_2.py

+        )
+    ).remote()
+
+    with pytest.raises(ray.exceptions.ActorUnschedulableError):


Should we have a better error message (and a test) in this case? I think it'd be great if the exception contained a message saying that the task couldn't be scheduled and that _fail_on_unavailable is set to true.

Yea, I think we should if we make it public. For now, I think it's fine to not have an error message: the option is private, I will only use it in Serve, and I don't need to know the error message there.

src/ray/raylet/scheduling/policy/scheduling_options.h

rkooo567 · 2023-06-22T22:51:08Z

python/ray/tests/test_scheduling_2.py

+    a1 = Actor.remote()
+    target_node_id = ray.get(a1.get_node_id.remote())
+
+    a2 = Actor.options(


Are all the combinations actually tested? IIUC, the behavior is:

- spill: True, fail: True -> makes no sense (maybe raise an exception?)
- spill: True, fail: False -> spill to another node if one is available
- spill: False, fail: True -> fail if the node is not available
- spill: False, fail: False -> not scheduled until the node is available

Can you make sure all these scenarios are tested?

Currently, invalid combinations trigger a check failure, since these are private options and not used by users. Once we make them public, we need to throw proper exceptions. All the valid combinations are tested.
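For readers following the thread, the combinations discussed above can be sketched as a small decision table. This is only an illustrative model in plain Python, not Ray's scheduler code: the names `Outcome` and `resolve_hard_affinity` are hypothetical, the invalid combination raises `ValueError` here (the thread says Ray currently check-fails instead), and the behavior when spilling is requested but no other node is available is an assumption, since the thread does not specify it.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcomes for a hard node-affinity request."""
    RUN_ON_TARGET = "run on the target node"
    SPILL = "spill to another node"
    FAIL = "fail as unschedulable"
    WAIT = "stay pending until the target node is available"


def resolve_hard_affinity(spill: bool, fail: bool,
                          target_available: bool,
                          other_available: bool = False) -> Outcome:
    """Model the spill/fail combinations described in the review thread.

    This is a sketch of the described semantics, not the actual
    raylet scheduling policy.
    """
    if spill and fail:
        # The thread calls this combination meaningless. Ray currently
        # check-fails on it; raising ValueError here is an assumption.
        raise ValueError("spill and fail cannot both be set")
    if target_available:
        return Outcome.RUN_ON_TARGET
    if spill:
        # Assumption: with no other node available, the request waits.
        return Outcome.SPILL if other_available else Outcome.WAIT
    if fail:
        return Outcome.FAIL
    return Outcome.WAIT


if __name__ == "__main__":
    # Print the outcome for each valid combination when the target
    # node is unavailable and some other node is available.
    for spill, fail in [(True, False), (False, True), (False, False)]:
        outcome = resolve_hard_affinity(spill, fail,
                                        target_available=False,
                                        other_available=True)
        print(f"spill={spill}, fail={fail}: {outcome.value}")
```

Writing the table as a pure function makes it easy to enumerate every combination in a test, which is what the review comment asks for.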

jjyao · 2023-06-23T04:04:27Z

Are we planning to promote this API to public at some point? Feel like the combination of soft / fail / spill is a bit confusing, and there may be a better way to structure the API?

Yea, once we decide to make them public, we will definitely find a better way to structure the API. It's tracked here: #34283.

For fail_on_unavailable, it's experimental for serve scheduling support and we may remove it in the future.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

…dulingStrategy (ray-project#36718) Add an experimental fail_on_unavailable option to try out application level scheduling Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Introduce fail_on_unavailable option for hard NodeAffinitySchedulingS…

026663f

…trategy Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao requested review from wuisawesome, ericl, AmeerHajAli, robertnishihara, pcmoritz, raulchen and a team as code owners June 22, 2023 19:47

jjyao assigned scv119 and rkooo567 Jun 22, 2023

rkooo567 reviewed Jun 22, 2023


rkooo567 added the @author-action-required label Jun 22, 2023

jjyao removed the @author-action-required label Jun 23, 2023

up

c4576a3

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao requested a review from rkooo567 June 23, 2023 04:05

rkooo567 approved these changes Jun 23, 2023


jjyao merged commit df42883 into ray-project:master Jun 23, 2023

jjyao deleted the jjyao/fail branch June 23, 2023 18:32

akshay-anyscale mentioned this pull request Jul 21, 2023

Add service deployment instructions to stable diffusion template #37645

Closed

8 tasks
