What happened + What you expected to happen
Foundations
We are using Ray to serve Torch models at large scale on AWS and GCP. Due to large variations in traffic and cold-start constraints, we discovered a failure state where the HttpProxyActor on the head node freezes when its queue grows too large, and timeouts result in clawbacks. This completely cripples the server for tens of minutes.
Solution
The documentation suggests that we run HttpProxyActors on all worker nodes and use an external load balancer. We've implemented this with an AWS Application Load Balancer, adding the workers to the TargetGroup on initialization. A minimal sketch of the setup follows.
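For reference, roughly what this looks like (the target group ARN and instance ID below are placeholders, not our real values):

```python
# Sketch only: assumes Ray 2.2-era APIs; ARN and instance ID are placeholders.
import boto3
import ray
from ray import serve

ray.init(address="auto")

# Run an HttpProxyActor on every node instead of only on the head node.
serve.start(
    detached=True,
    http_options={"host": "0.0.0.0", "port": 8000, "location": "EveryNode"},
)

# On worker startup, register the instance with the ALB target group so the
# external load balancer routes requests to the node's local proxy.
elbv2 = boto3.client("elbv2")
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:...",  # placeholder ARN
    Targets=[{"Id": "i-0123456789abcdef0", "Port": 8000}],  # placeholder instance ID
)
```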
Issue/Bug
The external load balancer has no direct communication with the HTTP servers other than health checks. This is fine for scaling up, but once we start sending requests to a worker node, the node is never scaled down, even when the Serve replicas are removed from it. If we manually stop sending requests, the node is scaled down appropriately (a sketch of automating that mitigation is below).
Perhaps fixed by this: #36652 (?)
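A hypothetical way to automate that manual mitigation (untested, placeholder names) would be to deregister a drained node from the target group once its replicas are removed:

```python
# Hypothetical mitigation sketch, not something we currently run: deregister a
# drained worker from the target group so the ALB stops routing to it and the
# autoscaler can reclaim the node. instance_id and the ARN are placeholders.
import boto3

def drain_node(instance_id: str, target_group_arn: str, port: int = 8000) -> None:
    elbv2 = boto3.client("elbv2")
    # Begins connection draining; the ALB sends no new requests to this target.
    elbv2.deregister_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id, "Port": port}],
    )
```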
Is this the ideal way to scale Serve clusters? The documentation on HTTP scaling is quite minimal.
Versions / Dependencies
Ubuntu 20.04.6 LTS (Focal Fossa)
Python 3.8.10
Ray 2.2.0
Reproduction script
Recreation requires quite a complex multi-node setup with continuous requests sent to each node. I hope I have conveyed the issue clearly enough to recreate, but a rough sketch of the load pattern follows.
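Rough sketch of the per-node load generator (the endpoint address and request rate are placeholders, assuming each worker's proxy listens on port 8000):

```python
# Rough load-generator sketch: run one copy per worker node, pointed at that
# node's HTTP proxy. The address and rate below are placeholders.
import time
import requests

ENDPOINT = "http://<worker-node-ip>:8000/"  # placeholder address

while True:
    try:
        requests.get(ENDPOINT, timeout=10)
    except requests.RequestException:
        pass  # keep traffic flowing even through timeouts/errors
    time.sleep(0.05)  # ~20 requests per second per sender
```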
Issue Severity
High: It blocks me from completing my task.
kyle-v6x added the bug and triage labels on Jun 29, 2023
@kyle-v6x Thanks for submitting the issue. Yes, other customers have faced the same issue, and we prioritized the fix, CR #36652. This should be released with the upcoming Ray 2.6.0 🙂