-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] add early return when there are still in progress http health check #36461
[serve] add early return when there are still in progress http health check #36461
Conversation
… health check before trying to recreating a new one issue: ray-project#36458 Signed-off-by: Gene Su <e870252314@gmail.com>
assert proxy_state._consecutive_health_check_failures == 3 | ||
|
||
|
||
@patch("ray.serve._private.http_state.DEFAULT_HEALTH_CHECK_TIMEOUT_S", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the name of previous test case and added new test case testing the same exact behavior. The only difference is the new one has a longer DEFAULT_HEALTH_CHECK_TIMEOUT_S
than PROXY_HEALTH_CHECK_PERIOD_S
in the setup
@@ -191,6 +191,7 @@ def update(self): | |||
f"{self._node_id} after {DEFAULT_HEALTH_CHECK_TIMEOUT_S}s" | |||
) | |||
self.try_update_status(HTTPProxyStatus.UNHEALTHY) | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer to put this in an explicit else
block and please add a comment about why this early return is necessary
…or the return Signed-off-by: Gene Su <e870252314@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also did a manual test with a very long sleep in the check_health
method and seeing the proxy being taken down and restarted
INFO 2023-06-15 11:08:39,167 controller 36188 application_state.py:203 - Deploy task for app 'default' ran successfully.
WARNING 2023-06-15 11:09:16,783 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:09:46,878 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:10:16,981 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:10:47,022 controller 36188 http_state.py:189 - Didn't receive health check response for HTTP proxy cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f after 30s
WARNING 2023-06-15 11:10:47,022 controller 36188 http_state.py:106 - HTTP proxy SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f failed the health check 3 times in a row, marking it unhealthy.
INFO 2023-06-15 11:10:47,134 controller 36188 http_state.py:384 - HTTP proxy on node 'cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f' UNHEALTHY. Shut down the unhealthy proxy and restart a new one.
INFO 2023-06-15 11:10:47,240 controller 36188 http_state.py:357 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f' on node 'cff6a8b153af4714b63a272b8f98e7a1cab1f13c4f4c016e9aac402f' listening on '127.0.0.1:8000'
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
… check (ray-project#36461) Currently there is a missing early return when there is a in progress http health check. Due to the default health check period (10s) shorter than the health check timeout (30s), we are always recreating new health check objects and will never fall into the timeout case when the health check hangs. TDD seeing this is the case in the test and seeing adding the missing early return fixes the issue. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Currently there is a missing early return when there is a in progress http health check. Due to the default health check period (10s) shorter than the health check timeout (30s), we are always recreating new health check objects and will never fall into the timeout case when the health check hangs. TDD seeing this is the case in the test and seeing adding the missing early return fixes the issue.
Related issue number
Closes #36458
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.