Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][state] Task state fault tolerance - GCS marks children task failed if parent task fails. [1/x] #31524

Merged
merged 11 commits into from
Jan 14, 2023

Conversation

rickyyx
Copy link
Contributor

@rickyyx rickyyx commented Jan 7, 2023

Signed-off-by: rickyyx rickyx@anyscale.com

Why are these changes needed?

We currently don't mark children tasks as failed if the parent fails. This PR adds a synchronous routine MarkTaskTreeFailed for each newly reported task event. And marks all children tasks in the task tree, with the reported task as the root, as failed, if:

  1. The reported task itself is a task failure.
  2. The parent of the reported task has already failed.

In the next PRs:

  • When a job failed, all tasks and their children with the failed job id will be marked as failed.
  • Detached actor's tasks should not be failed if their parents fail.
  • We will add a column to TaskState for task exit details, e.g.exit_detail = Task failed because of parent task XXX failed at timestamp YYY.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx rickyyx changed the title [core][state] Task state fault tolerance - GCS marks children task failed if parent task fails. [1/2] [core][state] Task state fault tolerance - GCS marks children task failed if parent task fails. [1/x] Jan 9, 2023
@rickyyx rickyyx marked this pull request as ready for review January 9, 2023 17:53
@ericl
Copy link
Contributor

ericl commented Jan 9, 2023

@rkooo567 , could you do the main review?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 9, 2023
@rkooo567
Copy link
Contributor

@rickyyx lmk when it is ready to review again (plz remove the label!)

Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 10, 2023
@ericl ericl removed their assignment Jan 11, 2023
Signed-off-by: rickyyx <rickyx@anyscale.com>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to have another round of review (haven't reviewed child/parent update part and unit tests)!

python/ray/tests/test_task_events.py Outdated Show resolved Hide resolved
src/ray/gcs/pb_util.h Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.h Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.h Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.h Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_task_manager.cc Outdated Show resolved Hide resolved
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2023
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>
@rickyyx
Copy link
Contributor Author

rickyyx commented Jan 13, 2023

Running release tests: https://buildkite.com/ray-project/release-tests-pr/builds/25388

Also see unresolved comments for follow-ups.

@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 13, 2023
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 13, 2023
@rickyyx rickyyx added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 13, 2023
@rickyyx
Copy link
Contributor Author

rickyyx commented Jan 13, 2023

No regression from runs: https://buildkite.com/ray-project/release-tests-pr/builds/25408 , many_tasks is failing in master.
Other CI tests are not relevant or already fixed in master.
cc @rkooo567

@rkooo567 rkooo567 merged commit cdf5947 into ray-project:master Jan 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants