Skip to content

Commit

Permalink
[serve] Fix Router race condition (#46864)
Browse files Browse the repository at this point in the history
Fix race condition in Router when `update_deployment_config` is called
but `_metrics_manager` instance is not ready.

Which resulted in error message:
AttributeError: 'Router' object has no attribute '_metrics_manager'

## Why are these changes needed?

While using Ray Serve, we occasionally encountered this error, which
caused Ray cluster failed to start.
It turns out the ordering of initialization inside `Route.__init__`
might cause this race condition in distributed environments.

```(ProxyActor pid=368) Exception in callback <function LongPollClient._process_update.<locals>.chained at 0x7f6b11644a60>
(ProxyActor pid=368) handle: <Handle LongPollClient._process_update.<locals>.chained>
(ProxyActor pid=368) Traceback (most recent call last):
(ProxyActor pid=368)   File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
(ProxyActor pid=368)   File "/usr/local/lib/python3.10/site-packages/ray/serve/_private/long_poll.py", line 171, in chained
(ProxyActor pid=368)     callback(arg)
(ProxyActor pid=368)   File "/usr/local/lib/python3.10/site-packages/ray/serve/_private/router.py", line 416, in update_deployment_config
(ProxyActor pid=368)     self._metrics_manager.update_deployment_config(
(ProxyActor pid=368) AttributeError: 'Router' object has no attribute '_metrics_manager'
```

## Related issue number

N/A

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Integration test: this fixed the issue in our environment

---------

Signed-off-by: tungh2 <105205092+tungh2@users.noreply.github.com>
  • Loading branch information
tungh2 authored Aug 7, 2024
1 parent bf32fa4 commit 4d0ca91
Showing 1 changed file with 18 additions and 14 deletions.
32 changes: 18 additions & 14 deletions python/ray/serve/_private/router.py
Original file line number Diff line number Diff line change
Expand Up @@ -371,21 +371,10 @@ def __init__(
# by the controller. That means it is not available until we get the first
# update. This includes an optional autoscaling config.
self.deployment_config: Optional[DeploymentConfig] = None
self.long_poll_client = LongPollClient(
controller_handle,
{
(
LongPollNamespace.RUNNING_REPLICAS,
deployment_id,
): self.update_running_replicas,
(
LongPollNamespace.DEPLOYMENT_CONFIG,
deployment_id,
): self.update_deployment_config,
},
call_in_event_loop=self._event_loop,
)

# Initializing `self._metrics_manager` before `self.long_poll_client` is
# necessary to avoid race condition where `self.update_deployment_config()`
# might be called before `self._metrics_manager` instance is created.
self._metrics_manager = RouterMetricsManager(
deployment_id,
handle_id,
Expand Down Expand Up @@ -415,6 +404,21 @@ def __init__(
),
)

self.long_poll_client = LongPollClient(
controller_handle,
{
(
LongPollNamespace.RUNNING_REPLICAS,
deployment_id,
): self.update_running_replicas,
(
LongPollNamespace.DEPLOYMENT_CONFIG,
deployment_id,
): self.update_deployment_config,
},
call_in_event_loop=self._event_loop,
)

def update_running_replicas(self, running_replicas: List[RunningReplicaInfo]):
self._replica_scheduler.update_running_replicas(running_replicas)
self._metrics_manager.update_running_replicas(running_replicas)
Expand Down

0 comments on commit 4d0ca91

Please sign in to comment.