What happened + What you expected to happen
Foundations
We are using Ray to serve Torch models at large scale on AWS and GCP. Due to large variations in traffic and cold-start constraints, we discovered a failure state where the HttpProxyActor on the head node freezes when its queue grows too large, and timeouts result in clawbacks. This completely cripples the server for tens of minutes.
Solution
The documentation suggests that we run HttpProxyActors on all worker nodes and use an external load balancer. We've implemented this with an AWS Application Load Balancer, adding the workers to the TargetGroup on initialization. A minimal sketch of the setup follows.
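For reference, roughly what this looks like (the target group ARN and instance ID below are placeholders, not our real values):

```python
# Sketch only: assumes Ray 2.2-era APIs; ARN and instance ID are placeholders.
import boto3
import ray
from ray import serve

ray.init(address="auto")

# Run an HttpProxyActor on every node instead of only on the head node.
serve.start(
    detached=True,
    http_options={"host": "0.0.0.0", "port": 8000, "location": "EveryNode"},
)

# On worker startup, register the instance with the ALB target group so the
# external load balancer routes requests to the node's local proxy.
elbv2 = boto3.client("elbv2")
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:...",  # placeholder ARN
    Targets=[{"Id": "i-0123456789abcdef0", "Port": 8000}],  # placeholder instance ID
)
```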
Issue/Bug
The external load balancer has no direct communication with the HTTP servers other than health checks. This is fine for scaling up, but once we start sending requests to a worker node, the node is never scaled down, even when the Serve replicas are removed from it. If we manually stop sending requests, the node is scaled down appropriately (a sketch of automating that mitigation is below).
Perhaps fixed by this: #36652 (?)
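A hypothetical way to automate that manual mitigation (untested, placeholder names) would be to deregister a drained node from the target group once its replicas are removed:

```python
# Hypothetical mitigation sketch, not something we currently run: deregister a
# drained worker from the target group so the ALB stops routing to it and the
# autoscaler can reclaim the node. instance_id and the ARN are placeholders.
import boto3

def drain_node(instance_id: str, target_group_arn: str, port: int = 8000) -> None:
    elbv2 = boto3.client("elbv2")
    # Begins connection draining; the ALB sends no new requests to this target.
    elbv2.deregister_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id, "Port": port}],
    )
```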
Is this the ideal way to scale Serve clusters? The documentation on HTTP scaling is quite minimal.
Versions / Dependencies
Ubuntu 20.04.6 LTS (Focal Fossa)
Python 3.8.10
Ray 2.2.0
Reproduction script
Recreation requires quite a complex multi-node setup with continuous requests sent to each node. I hope I have conveyed the issue clearly enough to recreate, but a rough sketch of the load pattern follows.
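Rough sketch of the per-node load generator (the endpoint address and request rate are placeholders, assuming each worker's proxy listens on port 8000):

```python
# Rough load-generator sketch: run one copy per worker node, pointed at that
# node's HTTP proxy. The address and rate below are placeholders.
import time
import requests

ENDPOINT = "http://<worker-node-ip>:8000/"  # placeholder address

while True:
    try:
        requests.get(ENDPOINT, timeout=10)
    except requests.RequestException:
        pass  # keep traffic flowing even through timeouts/errors
    time.sleep(0.05)  # ~20 requests per second per sender
```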
Issue Severity
High: It blocks me from completing my task.
kyle-v6x added the bug and triage labels on Jun 29, 2023
@kyle-v6x Thanks for submitting the issue. Yes, other customers have faced the same issue, and we prioritized the fix, CR #36652. This should be released with the upcoming Ray 2.6.0 🙂