[Core] Stop the GRPC server before Shut down the Object Store #48572

Merged
merged 5 commits into from
Nov 7, 2024

Conversation

MengjinYan
Collaborator

Why are these changes needed?

In a recent investigation, we saw a Broken pipe error when a node was being preempted and going through the shutdown process.

This issue is due to a race condition:

  • The node is being preempted and is being gracefully shut down.
  • At the same time, remote hosts receive the notification that the node's state has changed and send FreeObjects requests to the node.
  • When handling a FreeObjects request, the Raylet calls Delete on the local object store.
  • However, by the time Delete is called, the object store has already been stopped, which causes the Broken pipe error.

The fix is to shut down the gRPC server before stopping the object store. This ensures that by the time the object store is stopped, there are no ongoing or future gRPC requests.
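The ordering problem can be illustrated with a small, self-contained Python model. This is only a sketch and not the Raylet code: `ObjectStore`, `RpcServer`, `handle_free_objects`, and `run_shutdown` are hypothetical names invented for illustration, and the real gRPC server also drains in-flight requests, which this toy does not model.

```python
class ObjectStore:
    """Toy stand-in for the local object store (hypothetical, not Ray's class)."""

    def __init__(self):
        self.running = True
        self.objects = {"obj1", "obj2"}

    def delete(self, object_id):
        if not self.running:
            # Models the failure described above: calling into a stopped store.
            raise BrokenPipeError("object store already stopped")
        self.objects.discard(object_id)

    def stop(self):
        self.running = False


class RpcServer:
    """Toy stand-in for the server that receives FreeObjects requests."""

    def __init__(self, store):
        self.store = store
        self.accepting = True

    def handle_free_objects(self, object_ids):
        if not self.accepting:
            # Server is already shut down: the request never reaches the store.
            return False
        for object_id in object_ids:
            self.store.delete(object_id)
        return True

    def shutdown(self):
        self.accepting = False


def run_shutdown(stop_server_first):
    """Run a two-step shutdown with a late FreeObjects request in between."""
    store = ObjectStore()
    server = RpcServer(store)
    if stop_server_first:
        steps = [server.shutdown, store.stop]  # ordering proposed in this PR
    else:
        steps = [store.stop, server.shutdown]  # ordering that exposes the race

    steps[0]()
    # A remote host reacts to the node state change mid-shutdown.
    try:
        handled = server.handle_free_objects(["obj1"])
        outcome = "handled" if handled else "rejected"
    except BrokenPipeError:
        outcome = "broken pipe"
    steps[1]()
    return outcome


if __name__ == "__main__":
    print("store stopped first: ", run_shutdown(stop_server_first=False))
    print("server stopped first:", run_shutdown(stop_server_first=True))
```

In this model, stopping the store first lets the late request reach a stopped store, while stopping the server first means the request is rejected before it can touch the store.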

Related issue number

Close #48568

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…e conditions

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Nov 6, 2024
@MengjinYan MengjinYan marked this pull request as ready for review November 7, 2024 17:39
@MengjinYan MengjinYan requested a review from rynewang November 7, 2024 17:39
@rynewang rynewang enabled auto-merge (squash) November 7, 2024 18:10
@rynewang rynewang merged commit cb5c29e into master Nov 7, 2024
6 checks passed
@rynewang rynewang deleted the issue-48568 branch November 7, 2024 20:30
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
Successfully merging this pull request may close these issues.

[Core] Broken Pipe Seen in Raylet When a Node is Gracefully Shutdown due to Preemption