Buffer overload requests in Activator #1846
Some lively discussion on Slack: https://knative.slack.com/archives/C94SPR60H/p1534184535000428
History (shared with knative-dev@googlegroups.com): https://docs.google.com/document/d/1n9EXkt4P0qoKJpONKcHdgNlFHPlOpIRAhAhe-kCXs1U/edit#heading=h.bbz6ggsnfgoc
Scaling working group discussion on this topic (shared with knative-dev@googlegroups.com): https://docs.google.com/document/d/1FoLJqbDJM8_tw7CON-CJZsO2mlF8Ia1cWzCjWX8HDAI/edit#heading=h.rvrncnfixoc7
Been thinking through your proposal again (under the shower, that helped!). I believe we can make this work! I'd change the drawing up a bit, so that work from the KQueue is retried to the primary path. That leaves dealing with that path to the pods only. The "only" question remaining is: how do pods signal free resources to the KQueue? Maybe a combination of an exponentially backed-off retry + a metrics pipeline from the queue-proxies to the KQueue can solve this without tight coupling? The retry can start pretty tight to recover very short-running functions quickly, but after a second or two the metrics pipeline can signal that things have been worked on and it's worth another try. This also completely leaves the implementation of buffering up to the KQueue. Whether it buffers in-memory or to an actual queue is only its own concern. Detecting that the primary path is not viable can possibly even be done in Ingress, to prevent an overload of retries just returning 503s. Envoy for instance has the notion of circuit-breaking, which allows it to define how many outstanding requests to send upstream. So in a completely overloaded scenario, Ingress could tell that all Pods have exhausted their concurrency limits locally (not perfectly, because the ingress pods are distributed as well, but probably good enough to not hammer the system with constant retries). It can also tell once there is actual capacity again (when the circuit breaker lifts) without having to do a downstream retry.
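To make the retry-plus-metrics idea concrete, here is a minimal Go sketch (hypothetical names, not the actual Activator code): the KQueue keeps retrying a buffered request against the primary path with an exponential backoff, and an optional capacity signal from the metrics pipeline can cut the wait short.

```go
package kqueue

import (
	"context"
	"net/http"
	"time"
)

// retryToPrimaryPath re-sends a buffered request to the primary (mesh) path
// until it is accepted. capacityHint stands in for whatever signal the
// queue-proxy metrics pipeline would deliver when a pod frees up; a tick on
// it triggers an immediate retry instead of waiting out the full backoff.
// Assumes an idempotent request whose body can be replayed.
func retryToPrimaryPath(ctx context.Context, req *http.Request, capacityHint <-chan struct{}) (*http.Response, error) {
	backoff := 50 * time.Millisecond
	const maxBackoff = 2 * time.Second

	for {
		resp, err := http.DefaultClient.Do(req)
		if err == nil && resp.StatusCode != http.StatusServiceUnavailable {
			return resp, nil // the primary path accepted the request
		}
		if resp != nil {
			resp.Body.Close() // a 503 means all pods were still busy
		}

		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-capacityHint: // metrics say a pod has free capacity
		case <-time.After(backoff):
		}
		if backoff < maxBackoff {
			backoff *= 2 // start tight, loosen after the first second or so
		}
	}
}
```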
Yeah, that's kind of the point of having them reach out to the KQueue. To tell it when they have capacity. Maybe they can still do that and the KQueue can forward the request directly to the pod? Does that work with the Istio mesh?
They are synchronous, stateful requests, so I think it has to do it in memory, no? And it stays in the serving path while the request is being handled since it's a proxy. All this together, I wonder if we should call it the KLB instead of the KQueue. It does some queueing, but the other half of its value is perfect load balancing with full knowledge of the capacity of each pod.
Pods are already reporting their load directly to the Autoscaler, which essentially implies their capacity. They could just as well report capacity explicitly. Perhaps we should have pods report load and capacity to an intermediate endpoint which implements the custom metrics API. Then the autoscaler and activator can both consume these metrics to find pods to route requests to and to calculate the desired scale.
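A rough sketch of what such an intermediate stats endpoint could look like, assuming a simple push model from the queue-proxies; the type and field names are illustrative, not the existing Knative stat types:

```go
// Package statsink sketches the "intermediate metrics endpoint" idea:
// queue-proxies POST both their observed load and their remaining capacity,
// and the autoscaler and activator read from the same aggregate.
package statsink

import (
	"encoding/json"
	"net/http"
	"sync"
)

// Stat is one report from a pod's queue-proxy.
type Stat struct {
	PodName            string  `json:"podName"`
	AverageConcurrency float64 `json:"averageConcurrency"` // observed load
	FreeCapacity       int     `json:"freeCapacity"`       // concurrency limit minus in-flight requests
}

// Sink keeps the latest report per pod.
type Sink struct {
	mu    sync.RWMutex
	stats map[string]Stat
}

func NewSink() *Sink { return &Sink{stats: map[string]Stat{}} }

// Record is the HTTP endpoint queue-proxies push to.
func (s *Sink) Record(w http.ResponseWriter, r *http.Request) {
	var st Stat
	if err := json.NewDecoder(r.Body).Decode(&st); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	s.mu.Lock()
	s.stats[st.PodName] = st
	s.mu.Unlock()
	w.WriteHeader(http.StatusAccepted)
}

// PodsWithCapacity is what the activator would consult before forwarding,
// and what the autoscaler could sum to compute the desired scale.
func (s *Sink) PodsWithCapacity() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var free []string
	for name, st := range s.stats {
		if st.FreeCapacity > 0 {
			free = append(free, name)
		}
	}
	return free
}
```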
@josephburnett What became of the idea of having the service mesh provide scaling metrics? Is that a non-starter or simply deferred?
Some concerns I've heard so far:
I've updated my design to push requests directly to pods, which eliminates the additional complexity in the queue-proxy, and the authentication issues. However, it requires that the Activator (KQueue or KLB) be able to connect directly to a Pod's IP address. @tcnghia is working on making this an option, and I believe that Istio is also planning to add that capability without losing the mesh. Key features of the design above:
@glyn, this is deferred. We can consider other metrics pipelines, including using Istio's metrics, as long as it's fast enough for scaling and routing.
@josephburnett good next step 👍. On exchanging pull for push and directing the request to the pod directly: as I understand the picture, the KQueue would be informed about pod metrics (like free capacity?) via the pod metrics in your picture. With the way Envoy might work (see circuit breakers above), wouldn't that notion of "free" race with the primary path thinking that the pod is free as well (because an HTTP request got finished on that pod), especially if (like today) the metrics are locally aggregated before being sent out? What I'm alluding to: how does the system guarantee that the KQueue will find a free pod when it forwards a request to it? If it cannot guarantee that, would it make sense to have the KQueue retry the request onto the primary path, since the routers can potentially at least have a rough idea of the current concurrency on a pod (of course only locally per router, but better than blind guessing)? That'd leave the entire concurrency/find-a-good-pod decision up to the routing mesh, and the KQueue would "just" be a backpressure mechanism to wait for new capacity to arrive. I think that would also solve the "does this work with Istio" question.
@markusthoemmes I think that I missed your point earlier about always using the Service route. Here is an updated diagram that does that. A few points to note:
This does eliminate the "does this work with Istio" question. Thanks @markusthoemmes! Open question:
Once KQueue forwards stats to KPA, the KPA will solely be responsible for scaling and "activation", so the KQueue will just be a queue. Since the KQueue would just forward to the underlying Service, it would not be doing any load-balancing. So I prefer the name "KQueue" over "KLB".
How will this feature affect adding support for gRPC? @dprotaso any thoughts from your work on gRPC?
(Assuming "Service" refers to a Kubernetes Service that load balances requests across a revision's pods.) Be mindful that an HTTP/2 & gRPC connection implies a single long-lived connection that contains multiple streams. This implies the KQueue could have many long-lived connections that could at some point exhaust its ability to receive & create any new connections. Thus the KQueue takes on characteristics similar to the Ingress. It's still a load balancer even when you don't want it to be. I believe @cppforlife brought this up as a concern in the WG call when talking about web sockets. Another item, which is somewhat related, is that the k8s Service is an L4 load balancer, so it won't spread the load of several gRPC calls across Pods. Thus ideally, I'd want to make a Revision's k8s Service headless. This pushes the load balancing onto the caller. I know Envoy supports this - so I'm assuming Istio does as well. This would then become a concern for the KQueue.
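For reference, a sketch of the headless-Service idea using the Kubernetes API types (the name, labels and port here are assumptions, not the actual resources Knative creates): with ClusterIP set to "None", DNS resolves to the individual pod IPs and the caller does the per-request balancing instead of the L4 kube-proxy.

```go
package revisionservice

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// headlessService builds a headless Service selecting a revision's pods.
func headlessService(revisionName, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      revisionName + "-service", // hypothetical naming convention
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// "None" makes the Service headless: DNS returns the pod IPs,
			// so the caller (e.g. the KQueue's Envoy sidecar) balances
			// individual requests/streams across pods.
			ClusterIP: corev1.ClusterIPNone,
			Selector:  map[string]string{"serving.knative.dev/revision": revisionName},
			Ports: []corev1.ServicePort{{
				Name:       "http",
				Port:       80,
				TargetPort: intstr.FromInt(8080), // assumed queue-proxy port
			}},
		},
	}
}
```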
I think (@josephburnett please keep me honest here) that "Service" in this case is actually a Knative service and that the route should actually resemble the "usual" through-the-mesh routing.
I think the latest revision addresses that concern in that the KQueue will route the requests back through the mesh. It defers balancing completely to the mesh's layer. The only concern of the KQueue would be to keep connections to clients open and retry them on the main (through-the-mesh) path once it thinks that's a good idea, i.e. when it gets a signal to do so through metrics.
@markusthoemmes yes. We just forward to the Knative Service by using the DNS name. The KQueue's sidecar will do the load balancing. @tcnghia to check me.
@glyn this feature doesn't change our situation regarding gRPC. That is, it doesn't make it harder. But we still need to solve that problem.
@dprotaso yes, this has come up a couple of times. My thinking on HTTP/2 and gRPC has been that our proxies (Activator/KQueue and queue-proxy pod) will need to be stream-aware in order to: 1) report "concurrency" or "qps", which is defined as the total count of streams, 2) limit concurrency/qps, and 3) limit connection lifetime (send GOAWAY) in the case of the Activator/KQueue so it can shunt traffic away from the cold-start path. But we should really open a separate issue to discuss our streaming strategy in depth. @tcnghia or @dprotaso, would you like to spearhead that?
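A sketch of what "stream-aware" could mean in Go terms (illustrative only, not the Activator or queue-proxy implementation): each HTTP/2 stream or gRPC call arrives as its own request, so counting in-flight requests counts streams, and a graceful shutdown gives the GOAWAY behaviour mentioned in (3).

```go
package streamlimit

import (
	"context"
	"net/http"
	"sync/atomic"
)

// Limiter caps and reports concurrency measured as in-flight streams: in Go's
// net/http every HTTP/2 stream (and hence every gRPC call) is delivered as
// its own request, so this counts streams rather than TCP connections.
type Limiter struct {
	next     http.Handler
	max      int64
	inFlight int64
}

func NewLimiter(next http.Handler, max int64) *Limiter {
	return &Limiter{next: next, max: max}
}

func (l *Limiter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if n := atomic.AddInt64(&l.inFlight, 1); n > l.max {
		atomic.AddInt64(&l.inFlight, -1)
		http.Error(w, "concurrency limit exceeded", http.StatusServiceUnavailable)
		return
	}
	defer atomic.AddInt64(&l.inFlight, -1)
	l.next.ServeHTTP(w, r)
}

// Concurrency is the number that would be reported to the autoscaler: the
// total count of open streams.
func (l *Limiter) Concurrency() int64 { return atomic.LoadInt64(&l.inFlight) }

// Drain covers the "send GOAWAY" part: gracefully shutting the server down
// makes Go's HTTP/2 implementation send GOAWAY on open connections, so
// clients reconnect elsewhere (e.g. back onto the primary path).
func Drain(ctx context.Context, srv *http.Server) error {
	return srv.Shutdown(ctx)
}
```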
@josephburnett I think you mean Kubernetes Service. Since you can use Configuration & Routes without a Knative Service. |
OMFG #1397
Just wait for a KQueue issue ;) - https://en.wikipedia.org/wiki/Kqueue
Here is how we can play around with some of these different ideas:
This is what our problem looks like when we land a crapload of concurrent connections on a single-threaded revision scaled to zero. The first pod's queues fill up almost immediately and it mostly throws 503s (errors above). But when we make a deep queue (hardcoded to 100 in this case), the first pod to show up gets all the load and takes forever to work through its backlog. New requests are distributed evenly, but the activator has already forwarded all the pending requests to one pod. Test scenario was:
What we want is no errors, but no pod to get swamped.
@josephburnett As discussed in the WG meeting, here are some recent measurements. I've been running a similar test using hey, playing around with some parameters (container execution time and the number of parallel requests), and I was able to reproduce the problem with the latest master and remediate it with #2653. In my test I sent batches of 300 concurrent requests (instead of 100) to highlight the 503s. Using the latest master:
Using the throttling PR:
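For context, a rough sketch of the shape of that throttling (not the actual code in #2653): the activator only lets podCount × containerConcurrency requests through at once and parks the rest until capacity shows up, instead of letting a burst fan out onto the first pod and fail with 503s.

```go
package throttler

import (
	"context"
	"sync"
)

// Throttler caps the in-flight requests for one revision at
// readyPods * containerConcurrency and blocks the rest until a slot frees
// up or more pods become ready.
type Throttler struct {
	mu                   sync.Mutex
	cond                 *sync.Cond
	containerConcurrency int
	capacity             int // readyPods * containerConcurrency
	inFlight             int
}

func New(containerConcurrency int) *Throttler {
	t := &Throttler{containerConcurrency: containerConcurrency}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// UpdateEndpoints would be driven by an endpoints informer: more ready pods
// means more requests may pass at once.
func (t *Throttler) UpdateEndpoints(readyPods int) {
	t.mu.Lock()
	t.capacity = readyPods * t.containerConcurrency
	t.mu.Unlock()
	t.cond.Broadcast()
}

// Try waits for capacity, runs fn, then releases the slot. A real
// implementation would also bound the number of waiters (returning 503 on
// overflow) and honour ctx cancellation while blocked in Wait.
func (t *Throttler) Try(ctx context.Context, fn func()) error {
	t.mu.Lock()
	for t.inFlight >= t.capacity {
		if err := ctx.Err(); err != nil {
			t.mu.Unlock()
			return err
		}
		t.cond.Wait()
	}
	t.inFlight++
	t.mu.Unlock()

	defer func() {
		t.mu.Lock()
		t.inFlight--
		t.mu.Unlock()
		t.cond.Signal()
	}()
	fn()
	return nil
}
```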
With Revision-managed activation (#1997) closed and the SKS API available to the KPA, this is now plausible to do via the activator. Anyone interested in pursuing this in 0.7?
@mattmoor I'd very much be, if I'm able to finish my pluggable autoscaler work in time for that. Leaving this unassigned until I know I'll have the cycles.
/assign
@vagababov Trying to understand what work is left here. Just tests? |
Affirmative.
/close
@vagababov: Closing this issue.
Problem
Knative Serving provides concurrency controls to limit request concurrency for each pod (currently `concurrencyModel`, soon to be `maxConcurrency`). Each pod also has a small queue for pending requests. However, with less-than-perfect load balancing, autoscaler lag and application startup latency, it is possible for the per-pod queues to overflow, in which case requests must be rejected with a 503.
This problem is most severe when scaling from 0 under heavy load, because the first pod takes a few seconds to come online and is immediately overwhelmed with requests forwarded from the Activator. Some optimizations in the metrics pipeline can mitigate this issue, such as forwarding request metrics from the Activator straight to the Autoscaler, but they don't solve the general problem of what to do with overflowing request queues.
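As a toy model of that failure mode (not the actual queue-proxy code): each pod admits at most its concurrency limit, parks a handful of pending requests, and rejects anything beyond that with the 503 described above. Under a burst at scale-from-zero, the single available pod exhausts both limits almost immediately.

```go
package podqueue

import "net/http"

// QueueProxy models the per-pod behaviour: a small pending queue plus a
// concurrency limit; overflow turns into 503s.
type QueueProxy struct {
	next    http.Handler
	pending chan struct{} // small per-pod queue, e.g. cap 10
	active  chan struct{} // per-pod concurrency limit, e.g. cap 1
}

func NewQueueProxy(next http.Handler, queueSize, concurrency int) *QueueProxy {
	return &QueueProxy{
		next:    next,
		pending: make(chan struct{}, queueSize),
		active:  make(chan struct{}, concurrency),
	}
}

func (q *QueueProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	select {
	case q.pending <- struct{}{}: // there is room in the per-pod queue
	default:
		// Queue overflow: this is the 503 the issue is about.
		http.Error(w, "pending request queue full", http.StatusServiceUnavailable)
		return
	}
	q.active <- struct{}{} // wait for a concurrency slot
	<-q.pending            // leave the pending queue
	defer func() { <-q.active }()
	q.next.ServeHTTP(w, r)
}
```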
We could configure Istio to retry on 503 errors, essentially queueing overflowing requests in Istio retries, but that will result in a lot of extra network traffic and sub-optimal load balancing. This came up in the Scaling Working Group meeting and @markusthoemmes had the idea to overflow to a centralized queue. This proposal is a continuation of that idea.
Proposal
Rename the Activator as KBuffer (dropped because the rename was too complex -- kbuffer rename breaks updating #2509)
Non-Goals
Diagram
Original Proposal (rejected)
The idea to overflow to a queue and gracefully degrade from a push model to a pull model is a very compelling one. And we already have a centralized queue in the request path--the Activator.
I propose that we change the role of the Activator to that of a queue (Knative Queue or KQueue). This would require the following changes:
Diagram