Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Introduce fail_on_unavailable option for hard NodeAffinitySchedulingStrategy #36718

Merged
merged 2 commits into from
Jun 23, 2023

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Jun 22, 2023

Why are these changes needed?

Add an experimental fail_on_unavailable option to try out application level scheduling

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…trategy

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning to promote this API to public at some point? Feel like the combination of soft / fail / spill is a bit confusing, and there may be a better way to structure the API?

)
).remote()

with pytest.raises(ray.exceptions.ActorUnschedulableError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a better error message (and a test) in this case? I think it'd be great the exception contains a message like the task couldn't be scheduled, and _fail_on_unavailable is set to true?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think we should if we make it public. For now, I think it's fine to not have an error message since it's private and I will just use it in serve and I don't need to know the error message.

src/ray/raylet/scheduling/policy/scheduling_options.h Outdated Show resolved Hide resolved
a1 = Actor.remote()
target_node_id = ray.get(a1.get_node_id.remote())

a2 = Actor.options(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is all the combination tested actually? IIUC, the behavior is


spill: True fail: True
 -> makes no sense (maybe raise an exception?)

spill:True fail:False

-> spill to other node if other node is available

spill: False fail:True
-> fail if the node is not available

spill:False fail:False
not scheduled until the node is available

can you make sure all these scenarios are tested?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently invalid combinations will check failure since these are private options now and not used by users. Once we make them public, we need to throw proper exceptions. All the valid combinations are tested.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 22, 2023
@jjyao
Copy link
Collaborator Author

jjyao commented Jun 23, 2023

Are we planning to promote this API to public at some point? Feel like the combination of soft / fail / spill is a bit confusing, and there may be a better way to structure the API?

Yea, once we decide to make them public, we will definitely find a better way to structure the API. It's tracked here: #34283.

For fail_on_unavailable, it's experimental for serve scheduling support and we may remove it in the future.

@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 23, 2023
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from rkooo567 June 23, 2023 04:05
@jjyao jjyao merged commit df42883 into ray-project:master Jun 23, 2023
@jjyao jjyao deleted the jjyao/fail branch June 23, 2023 18:32
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…dulingStrategy (ray-project#36718)

Add an experimental fail_on_unavailable option to try out application level scheduling

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants