[Serve] HttpProxyActor prevents downscaling when in use with external load balancer #36944

Closed
kyle-v6x opened this issue Jun 29, 2023 · 1 comment
Labels
bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component)

kyle-v6x commented Jun 29, 2023

What happened + What you expected to happen

Foundations

We are using Ray to serve Torch models at large scale on AWS and GCP. Due to large variations in traffic and cold-start constraints, we discovered a failure state in which the HttpProxyActor on the head node freezes when its queue becomes too large, and timeouts result in clawbacks. This completely cripples the server for tens of minutes.

Solution

The documentation suggests running HttpProxyActors on all worker nodes and using an external load balancer. We've implemented this using an AWS ApplicationLoadBalancer, adding the workers to the TargetGroup on initialization.
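A minimal sketch of what such a setup might look like, not the reporter's actual code. It assumes Serve's `location: "EveryNode"` HTTP option, the default Serve port 8000, and an `ip`-type target group. `build_targets` and `TARGET_GROUP_ARN` are hypothetical names. The Ray and boto3 calls appear only as comments so the snippet runs without either library installed.

```python
from typing import Dict, List, Union

SERVE_HTTP_PORT = 8000  # Serve's default HTTP port; assumed unchanged here

# Options for serve.start() so that every node runs an HTTP proxy.
HTTP_OPTIONS = {"host": "0.0.0.0", "port": SERVE_HTTP_PORT, "location": "EveryNode"}


def build_targets(
    worker_ips: List[str], port: int = SERVE_HTTP_PORT
) -> List[Dict[str, Union[str, int]]]:
    """Build the Targets payload for elbv2 register_targets (ip-type group assumed)."""
    # Deduplicate and sort so repeated registrations are deterministic.
    return [{"Id": ip, "Port": port} for ip in sorted(set(worker_ips))]


# On the cluster (requires ray and boto3, not executed here):
#   serve.start(detached=True, http_options=HTTP_OPTIONS)
#   boto3.client("elbv2").register_targets(
#       TargetGroupArn=TARGET_GROUP_ARN,  # hypothetical placeholder
#       Targets=build_targets(worker_ips),
#   )
```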

Issue/Bug

The external load balancer has no direct communication from the HTTP servers other than health checks. This is fine for scaling up, but once we start sending requests to a worker node, the node is never scaled down, even when the Serve replicas are removed from it. If we manually stop sending requests, the node is scaled down appropriately.
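The manual step described above (stop routing to a node so it can scale down) could be automated roughly as follows. This is only a sketch under stated assumptions: the caller already knows which node IPs are registered with the load balancer and which currently host Serve replicas, and `deregister` is a hypothetical callback wrapping the load balancer's deregistration call. None of these names come from Ray.

```python
from typing import Callable, Iterable, List


def idle_targets(
    registered_ips: Iterable[str],
    ips_with_replicas: Iterable[str],
    keep: Iterable[str] = (),
) -> List[str]:
    """Return registered IPs that host no Serve replica and are not pinned."""
    busy = set(ips_with_replicas) | set(keep)
    return sorted(ip for ip in set(registered_ips) if ip not in busy)


def drain_idle(
    registered_ips: Iterable[str],
    ips_with_replicas: Iterable[str],
    deregister: Callable[[List[str]], None],
    keep: Iterable[str] = (),
) -> List[str]:
    """Deregister idle nodes so they stop receiving traffic and can be scaled down."""
    idle = idle_targets(registered_ips, ips_with_replicas, keep)
    if idle:
        # Hypothetical callback, e.g. a thin wrapper around elbv2 deregister_targets.
        deregister(idle)
    return idle
```

Passing the head node's IP in `keep` leaves at least one target registered even when no worker hosts a replica.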

Perhaps fixed by this: #36652 (?)

Is this the ideal way to scale Serve clusters? The documentation on HTTP scaling is quite minimal.

Versions / Dependencies

Ubuntu 20.04.6 LTS (Focal Fossa)
Python 3.8.10
Ray 2.2.0

Reproduction script

Reproducing this requires a fairly complex multi-node setup with continuous requests sent to each node. I hope I have described the issue clearly enough for it to be reproduced.

Issue Severity

High: It blocks me from completing my task.

@kyle-v6x kyle-v6x added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 29, 2023
Contributor

GeneDer commented Jun 29, 2023

@kyle-v6x Thanks for submitting the issue. Yes, other customers have faced the same issue, and we prioritized the fix in CR #36652. This should be released with the upcoming Ray 2.6.0 🙂

@GeneDer GeneDer closed this as completed Jun 29, 2023