[core] add option for raylet to inform whether a task should be retried #31230

Merged (3 commits) on Jan 3, 2023

@clarng clarng commented Dec 20, 2022

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>

Why are these changes needed?

Add plumbing for the raylet to report whether the task on a worker that died should be retried. This change is a no-op for now; a follow-up PR will use it once we implement the group-by-owner-id policy, which will report task failure on deadlock.

  • Change the OOM killer to expose a bit indicating whether to retry the task. A future policy that can detect deadlock will set this bit to false. For now, the current policy sets it to true, so the task can use its retries.
  • If the task manager sees this bit set to false, it fails the task immediately, ignoring any available retries.
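
A rough sketch of the retry bit described above, written in Python for brevity rather than in the C++ the raylet uses. All names here (`WorkerDeathInfo`, `should_retry`, `current_oom_policy`, `TaskManager.on_worker_death`) are hypothetical stand-ins for illustration and are not the identifiers introduced by this PR.

```python
from dataclasses import dataclass


@dataclass
class WorkerDeathInfo:
    """Hypothetical report sent when the OOM killer kills a worker."""

    task_id: str
    should_retry: bool  # the "bit" exposed by the OOM killer policy


def current_oom_policy(task_id: str) -> WorkerDeathInfo:
    # Per the PR description, the current policy always sets the bit to true.
    # A future deadlock-detecting policy would return should_retry=False.
    return WorkerDeathInfo(task_id=task_id, should_retry=True)


class TaskManager:
    """Hypothetical stand-in for the task manager's retry bookkeeping."""

    def __init__(self, max_retries: int) -> None:
        self.retries_left = max_retries

    def on_worker_death(self, info: WorkerDeathInfo) -> str:
        if not info.should_retry:
            # Fail immediately and leave the remaining retries unused.
            return "failed"
        if self.retries_left > 0:
            self.retries_left -= 1
            return "retried"
        return "failed"
```

With the current policy the behavior is unchanged, which matches the no-op claim: the task keeps consuming retries as before. Only a report with `should_retry=False` short-circuits to failure.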

Related issue number

#30900

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
@clarng clarng marked this pull request as ready for review December 20, 2022 18:06
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>

clarng commented Dec 21, 2022

tests look ok

gentle ping: @rkooo567 @stephanie-wang @scv119

@clarng clarng added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Dec 21, 2022
@rkooo567

^I will review this tmrw!

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
@scv119 scv119 merged commit 0c3d32d into ray-project:master Jan 3, 2023
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…ed (#31230)
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…ed (ray-project#31230)
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>