Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] add early return when there are still in progress http health check #36461

Merged

Conversation

GeneDer
Copy link
Contributor

@GeneDer GeneDer commented Jun 15, 2023

Why are these changes needed?

Currently there is a missing early return when there is a in progress http health check. Due to the default health check period (10s) shorter than the health check timeout (30s), we are always recreating new health check objects and will never fall into the timeout case when the health check hangs. TDD seeing this is the case in the test and seeing adding the missing early return fixes the issue.

Related issue number

Closes #36458

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

… health check before trying to recreating a new one

issue: ray-project#36458

Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer GeneDer requested a review from edoakes June 15, 2023 16:27
assert proxy_state._consecutive_health_check_failures == 3


@patch("ray.serve._private.http_state.DEFAULT_HEALTH_CHECK_TIMEOUT_S", 1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name of previous test case and added new test case testing the same exact behavior. The only difference is the new one has a longer DEFAULT_HEALTH_CHECK_TIMEOUT_S than PROXY_HEALTH_CHECK_PERIOD_S in the setup

@@ -191,6 +191,7 @@ def update(self):
f"{self._node_id} after {DEFAULT_HEALTH_CHECK_TIMEOUT_S}s"
)
self.try_update_status(HTTPProxyStatus.UNHEALTHY)
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to put this in an explicit else block and please add a comment about why this early return is necessary

…or the return

Signed-off-by: Gene Su <e870252314@gmail.com>
Copy link
Contributor Author

@GeneDer GeneDer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also did a manual test with a very long sleep in the check_health method and seeing the proxy being taken down and restarted

INFO 2023-06-15 11:08:39,167 controller 36188 application_state.py:203 - Deploy task for app 'default' ran successfully.
WARNING 2023-06-15 11:09:16,783 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:09:46,878 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:10:16,981 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:10:47,022 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:10:47,022 controller 36188 http_state.py:106 - HTTP proxy SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f failed the health check 3 times in a row, marking it unhealthy.
INFO 2023-06-15 11:10:47,134 controller 36188 http_state.py:384 - HTTP proxy on node 'cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f' UNHEALTHY. Shut down the unhealthy proxy and restart a new one.
INFO 2023-06-15 11:10:47,240 controller 36188 http_state.py:357 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f' on node 'cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f' listening on '127.0.0.1:8000'

python/ray/serve/tests/test_http_state.py Outdated Show resolved Hide resolved
python/ray/serve/tests/test_http_state.py Outdated Show resolved Hide resolved
GeneDer and others added 2 commits June 15, 2023 12:02
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
@edoakes edoakes merged commit d3250ec into ray-project:master Jun 15, 2023
@GeneDer GeneDer deleted the fix-http-proxy-timeout-not-met-edge-case branch June 15, 2023 22:11
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
… check (ray-project#36461)

Currently there is a missing early return when there is a in progress http health check. Due to the default health check period (10s) shorter than the health check timeout (30s), we are always recreating new health check objects and will never fall into the timeout case when the health check hangs. TDD seeing this is the case in the test and seeing adding the missing early return fixes the issue.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Serve] http proxy timeout are not getting triggered
3 participants