
Buffer overload requests in Activator #1846

Closed · josephburnett opened this issue Aug 13, 2018 · 27 comments
Labels: area/autoscale, area/networking, kind/feature (Well-understood/specified features, ready for coding.), P1

Comments

@josephburnett (Contributor) commented Aug 13, 2018

Problem

Knative Serving provides concurrency controls to limit request concurrency for each pod (currently concurrencyModel, soon to be maxConcurrency). Each pod also has a small queue for pending requests. However, with less-than-perfect load balancing, autoscaler lag and application startup latency, it is possible for the per-pod queues to overflow, in which case requests must be rejected with a 503.
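For illustration only, here is a minimal sketch (not the actual queue-proxy code) of the per-pod behaviour described above: at most maxConcurrency requests run at once, a small pending queue absorbs short bursts, and anything beyond that is rejected with a 503. The port numbers and queue depth are made up for the example.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// breaker admits at most maxConcurrency in-flight requests and queues at most
// queueDepth additional pending requests; anything beyond that is rejected.
type breaker struct {
	pending chan struct{} // tokens for queued + in-flight requests
	active  chan struct{} // tokens for in-flight requests only
}

func newBreaker(queueDepth, maxConcurrency int) *breaker {
	return &breaker{
		pending: make(chan struct{}, queueDepth+maxConcurrency),
		active:  make(chan struct{}, maxConcurrency),
	}
}

// Maybe runs f if capacity is available (possibly after waiting in the
// pending queue) and reports whether it ran.
func (b *breaker) Maybe(f func()) bool {
	select {
	case b.pending <- struct{}{}: // enqueue locally
	default:
		return false // pending queue is full -> overload
	}
	defer func() { <-b.pending }()

	b.active <- struct{}{} // wait for a free concurrency slot
	defer func() { <-b.active }()

	f()
	return true
}

func main() {
	user, _ := url.Parse("http://127.0.0.1:8080") // user container, hypothetical address
	proxy := httputil.NewSingleHostReverseProxy(user)
	b := newBreaker(10, 1) // small pending queue, single-threaded app

	http.ListenAndServe(":8012", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !b.Maybe(func() { proxy.ServeHTTP(w, r) }) {
			http.Error(w, "overload", http.StatusServiceUnavailable)
		}
	}))
}
```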

This problem is most severe when scaling from 0 under heavy load because the first pod takes a few seconds to come online and is immediately overwhelmed with requests forwarded from the Activator. Some optimizations in the metrics pipeline can mitigate this issue, such as forwarding request metrics from the Activator straight to the Autoscaler. But it doesn't solve the problem of what to do with overflowing request queues in general.

We could configure Istio to retry on 503 errors, essentially queueing overflowing requests in Istio retries, but that will result in a lot of extra network traffic and sub-optimal load balancing. This came up in the Scaling Working Group meeting and @markusthoemmes had the idea to overflow to a centralized queue. This proposal is a continuation of that idea.

Proposal

  1. Rename the Activator to KBuffer (dropped because the rename was too complex -- kbuffer rename breaks updating #2509)
  2. Do activation exclusively in the KPA (Autoscaler) based on metrics (Expose auto-scaling metrics from the activator. #1623)
  3. Push capacity metrics from the KPA to the Activator so it can throttle proxied requests based on capacity
  4. Include a NO_QUEUE header in proxied requests from the Activator so the queue-proxy will not enqueue requests locally (the intent is to avoid hotspots during 0->1 scaling; see the sketch after this list)
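A rough sketch of what item 4 could look like on both sides. The header name K-No-Queue and the function names are invented for this sketch; nothing here is the actual implementation.

```go
package sketch

import "net/http"

// Hypothetical header the Activator/KBuffer would set on requests it has
// already buffered, telling the queue-proxy not to queue them again.
const noQueueHeader = "K-No-Queue"

// In the Activator/KBuffer: mark proxied requests so they bypass the
// per-pod queue and fail fast if the pod has no free capacity.
func markNoQueue(r *http.Request) {
	r.Header.Set(noQueueHeader, "true")
}

// In the queue-proxy: requests marked by the Activator are only admitted if
// a concurrency slot is immediately free; otherwise they are bounced back
// (503) so the Activator can re-buffer and redistribute them. The caller
// releases the slot with <-slots once the request completes.
func admit(r *http.Request, slots chan struct{}) bool {
	if r.Header.Get(noQueueHeader) != "" {
		select {
		case slots <- struct{}{}:
			return true
		default:
			return false // no local queuing for Activator traffic
		}
	}
	slots <- struct{}{} // normal traffic may wait for a slot
	return true
}
```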

Non-Goals

  1. Automatic failover from the Istio Mesh (Service) to the Activator. This is the next logical step which would make 0->1 scaling a degenerate case and avoid route reprogramming.
  2. Moving Activator semantics up into Istio Ingress (Envoy). This would be a long-term goal to have only one network path and use Istio / Envoy hooks for activation and queuing.

Diagram

[diagram image]

Original Proposal (rejected)

The idea to overflow to a queue and gracefully degrade from a push model to a pull model is a very compelling one. And we already have a centralized queue in the request path--the Activator.

I propose that we change the role of the Activator to that of a queue (Knative Queue or KQueue). This would require the following changes:

  1. Modify the queue-proxy to also pull requests from the KQueue (formerly Activator). This would require some kind of wrapping / multiplexing since the actual work is an incoming HTTP request (see the sketch after this list).
  2. Leave activation of the Revision / KPA to the Autoscaler. This will be possible anyway once we start forwarding metrics directly there (as linked under problem statement). This leaves the KQueue to be only a queue.
  3. Configure Istio to "fail over" to the KQueue (formerly Activator) when the direct Service path is unavailable. This can be in an idle state or when the service is overwhelmed. @grantr already suggested using this failover mechanism for putting the Activator in the request path, and it's nice because it doesn't require route reconfiguration at idle.
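Purely to make item 1 concrete (this proposal was rejected, so none of this exists), a pull loop on the queue-proxy side might have looked roughly like this. The /pull and /done endpoints, the X-Request-Id header and the serialization format are all invented for the sketch.

```go
package sketch

import (
	"bufio"
	"bytes"
	"io"
	"net/http"
)

// pullLoop: whenever a local concurrency slot is free, long-poll the KQueue
// for one buffered request, replay it against the user container, and hand
// the response back so the KQueue can complete the original client connection.
func pullLoop(kqueue, userContainer string, slots chan struct{}) {
	for {
		slots <- struct{}{} // wait for local capacity

		go func() {
			defer func() { <-slots }()

			// Long-poll for one serialized request (e.g. as dumped by
			// httputil.DumpRequest on the KQueue side).
			resp, err := http.Get(kqueue + "/pull")
			if err != nil || resp.StatusCode != http.StatusOK {
				return
			}
			defer resp.Body.Close()

			req, err := http.ReadRequest(bufio.NewReader(resp.Body))
			if err != nil {
				return
			}
			id := resp.Header.Get("X-Request-Id")

			// Replay the request against the local user container.
			req.RequestURI = ""
			req.URL.Scheme = "http"
			req.URL.Host = userContainer
			out, err := http.DefaultClient.Do(req)
			if err != nil {
				return
			}
			defer out.Body.Close()

			// Return the response body to the KQueue.
			body, _ := io.ReadAll(out.Body)
			http.Post(kqueue+"/done?id="+id, "application/octet-stream", bytes.NewReader(body))
		}()
	}
}
```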

Diagram

[diagram image]

@knative-prow-robot added the kind/feature label Aug 13, 2018
@josephburnett (Contributor Author) commented Aug 13, 2018

@josephburnett (Contributor Author)

@markusthoemmes (Contributor) commented Aug 15, 2018

Been thinking through your proposal again (under the shower, that helped!). I believe we can make this work!

I'd change the drawing up a bit, so that work from the KQueue is retried onto the primary path. That way the pods only ever have to deal with that one path. The "only" question remaining is: How do pods signal free resources to the KQueue? Maybe a combination of an exponentially backed-off retry plus a metrics pipeline from the queue-proxies to the KQueue can solve this without tight coupling? The retry can start pretty tight to recover very short-running functions quickly, but after a second or two the metrics pipeline can signal that things have been worked off and it's worth a second try.
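A minimal sketch of that retry idea, assuming a hypothetical capacityFreed channel fed by the queue-proxy metrics pipeline; the backoff constants are arbitrary.

```go
package sketch

import (
	"context"
	"time"
)

// retryToPrimaryPath retries fn (forwarding one buffered request onto the
// primary path) with exponential backoff. The backoff starts tight so very
// short-running functions are recovered quickly; a signal on capacityFreed
// (hypothetically fed by queue-proxy stats) short-circuits the wait as soon
// as capacity is known to exist.
func retryToPrimaryPath(ctx context.Context, fn func() error, capacityFreed <-chan struct{}) error {
	backoff := 10 * time.Millisecond
	const maxBackoff = 2 * time.Second

	for {
		if err := fn(); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
		case <-capacityFreed: // metrics say a pod freed up; retry immediately
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```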

This also completely leaves the implementation of buffering up to the KQueue. It's only its own concern whether it does buffering in-memory or to an actual queue.

Detecting that the primary path is not viable could possibly even be done in the Ingress, to prevent an overload of retries just returning 503s. Envoy, for instance, has the notion of circuit breaking, which allows it to define how many outstanding requests to send upstream. So in a completely overloaded scenario, the Ingress could tell that all Pods have exhausted their concurrency limits locally (not perfectly, because the ingress pods are distributed as well, but probably good enough not to hammer the system with constant retries). It can also tell once there is actual capacity again (when the circuit breaker lifts) without having to do a downstream retry.

@josephburnett (Contributor Author)

How do pods signal free resources to the KQueue?

Yeah, that's kind of the point of having them reach out to the KQueue. To tell it when they have capacity. Maybe they can still do that and the KQueue can forward the request directly to the pod? Does that work with the Istio mesh?

It's only its own concern whether it does buffering in-memory or to an actual queue.

They are synchronous, stateful requests, so I think it has to do it in memory, no? And it stays in the serving path while the request is being handled since it's a proxy.

All this together, I wonder if we should call it the KLB instead of the KQueue. It does some queueing, but the other half of its value is perfect load balancing with full knowledge of the capacity of each pod.

@josephburnett (Contributor Author)

How do pods signal free resources to the KQueue?

Pods are already reporting their load directly to the Autoscaler, which essentially implies their capacity. They could just as well report capacity explicitly.

Perhaps we should have pods reporting load and capacity to an intermediate endpoint which implements the custom metrics api. Then the autoscaler and activator can both consume these metrics to find pods to route requests to and to calculate the desired scale.
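As an illustration of what such a report could carry (field names and the endpoint are invented here, not the actual Knative stat message):

```go
package sketch

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// Stat is an illustrative per-pod report carrying both observed load and
// remaining capacity, so the autoscaler can compute desired scale and the
// Activator/KQueue can decide where (and whether) to forward requests.
type Stat struct {
	PodName            string    `json:"podName"`
	Time               time.Time `json:"time"`
	ConcurrentRequests float64   `json:"concurrentRequests"` // average over the window
	FreeCapacity       int       `json:"freeCapacity"`       // maxConcurrency minus in-flight
}

// report pushes one stat to a hypothetical intermediate metrics endpoint
// that both the autoscaler and the Activator/KQueue could consume.
func report(endpoint string, s Stat) error {
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	_, err = http.Post(endpoint, "application/json", bytes.NewReader(body))
	return err
}
```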

@glyn (Contributor) commented Aug 16, 2018

@josephburnett What became of the idea of having the service mesh provide scaling metrics? Is that a non-starter or simply deferred?

@josephburnett (Contributor Author)

Some concerns I've heard so far:

  1. complexity of having two request paths
  2. authentication of pull-based requests
  3. how specifically to handle fail-over

I've updated my design to push requests directly to pods, which eliminates the additional complexity in the queue-proxy, as well as the authentication issues. However, it requires that the Activator (KQueue or KLB) be able to connect directly to a Pod's IP address. @tcnghia is working on making this an option, and I believe that Istio is also planning to add that capability without losing the mesh.

[updated design diagram image]

Key features of the design above:

  1. pod metrics which are currently pushed to the KPA will be forwarded to the KQueue / KLB (name undecided) which will maintain only the last stat for the purpose of load balancing.
  2. KQueue / KLB will also push queue stats back to the autoscaler for incorporating queue length into scaling decisions. @markusthoemmes is working on this in Expose auto-scaling metrics from the activator. #1623.

What became of the idea of having the service mesh provide scaling metrics? Is that a non-starter or simply deferred?

@glyn, this is deferred. We can consider other metrics pipelines, including using Istio's metrics, as long as it's fast enough for scaling and routing.

@markusthoemmes (Contributor)

@josephburnett good next step 👍.

On exchanging pull for push and directing the request to the pod directly: as I understand the picture, the KQueue would be informed about pod metrics (like free capacity?) via the pod metrics in your picture. With the way Envoy might work (see circuit breakers above), wouldn't that notion of "free" race with the primary path thinking that the pod is free as well (because an HTTP request got finished on that pod), especially if (like today) the metrics are locally aggregated before being sent out?

What I'm alluding to: how does the system guarantee that the KQueue will find a free pod when it forwards a request to it? If it cannot guarantee that, would it make sense to actually have the KQueue retry the request onto the primary path, since the routers can potentially at least have a rough idea of the current concurrency on a pod (of course only locally per router, but better than blind guessing)? That'd leave the entire concurrency / find-a-good-pod decision up to the routing mesh, and the KQueue would "just" be a backpressure mechanism to wait for new capacity to arrive. I think that would also solve the "does this work with Istio" question.

@josephburnett (Contributor Author)

@markusthoemmes I think that I missed your point earlier about always using the Service route. Here is an updated diagram that does that.

A few points to note:

  1. The KPA forwards capacity stats to the KQueue so it can know when to forward requests to the service and how many (see the sketch after this list).
  2. The KQueue forwards queue length stats to the KPA so it can know how much to scale up.
  3. The KQueue injects a NO_QUEUE header so the queue-proxy will not enqueue the request locally. This is necessary to redistribute requests (e.g. when scaling from 0 under load).
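A sketch of the throttling half of points 1 and 3, with all names invented for illustration (the K-No-Queue header is the same hypothetical header as in the proposal sketch above):

```go
package sketch

import (
	"net/http"
	"sync"
)

// throttler buffers requests in the Activator/KQueue and only releases them
// toward the Service when the KPA-reported capacity says the revision can
// absorb them.
type throttler struct {
	mu   sync.Mutex
	cond *sync.Cond
	free int // free slots across the revision, as last reported by the KPA
}

func newThrottler() *throttler {
	t := &throttler{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// UpdateCapacity is called whenever a capacity stat arrives from the KPA.
func (t *throttler) UpdateCapacity(free int) {
	t.mu.Lock()
	t.free = free
	t.mu.Unlock()
	t.cond.Broadcast()
}

// Forward buffers the request until a slot is believed to be free, then
// proxies it to the Service with the hypothetical no-queue header set.
func (t *throttler) Forward(w http.ResponseWriter, r *http.Request, proxy http.Handler) {
	t.mu.Lock()
	for t.free <= 0 {
		t.cond.Wait() // request sits buffered in the KQueue
	}
	t.free-- // optimistically consume one slot until the next stat arrives
	t.mu.Unlock()

	r.Header.Set("K-No-Queue", "true")
	proxy.ServeHTTP(w, r)
}
```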

This does eliminate the "does this work with Istio" question. Thanks @markusthoemmes!

Open question:

  1. How specifically do we fail over to the KQueue when revision capacity is exhausted? Once we have a reliable way of doing this, we can reduce the local pod queue to zero or near-zero. We need the local queue to survive traffic bursts when not scaled to zero (KQueue not in request path).

[updated diagram image]

@josephburnett (Contributor Author)

Once KQueue forwards stats to KPA, the KPA will solely be responsible for scaling and "activation", so the KQueue will just be a queue.

Since the KQueue would just forward to the underlying Service, it would not be doing any load-balancing. So I prefer the name "KQueue" over "KLB".

@glyn (Contributor) commented Aug 21, 2018

How will this feature affect adding support for gRPC? @dprotaso any thoughts from your work on gRPC?

@dprotaso (Member)

(assuming "Service" refers to a Kubernetes Service that load balances requests across a revision's pods.)

Be mindful that an HTTP/2 & gRPC connection implies a single long-lived connection that contains multiple streams. This implies the KQueue could have many long-lived connections that could at some point exhaust its ability to receive & create any new connections. Thus the KQueue takes on similar characteristics to the Ingress. It's still a load balancer even when you don't want it to be. I believe @cppforlife brought this up as a concern in the WG call when talking about web sockets.

Another item, somewhat related, is that the k8s Service is an L4 load balancer, so it won't spread the load of several gRPC calls across Pods. Thus, ideally, I'd want to make a Revision's k8s Service headless. This pushes the load balancing onto the caller. I know Envoy supports this - so I'm assuming Istio does as well. This would then become a concern for the KQueue.
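For illustration, if the Revision's k8s Service were headless, a Go gRPC caller could do the per-call balancing itself via the dns resolver and the round_robin policy; the service name and port below are placeholders.

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialRevision connects to a hypothetical headless Service for a revision.
// The dns resolver returns the individual pod IPs, and the round_robin
// policy spreads RPCs across them instead of pinning all streams to the
// single connection a ClusterIP Service would give us.
func dialRevision() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///my-revision.default.svc.cluster.local:80",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```

Whether the KQueue (or its sidecar) would have to take on that caller role is exactly the concern raised here.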

@markusthoemmes (Contributor) commented Aug 21, 2018

(assuming "Service" refers to a Kubernetes Service that load balances requests across a revision's pods.)

I think (@josephburnett please keep me honest here) that "Service" in this case is actually a Knative service and that route should actually resemble the "usual" through-the-mesh routing.

It's still a load balancer even when you don't want it to be.

I think the latest revision addresses that concern in that the KQueue will route the requests back through the mesh. It defers balancing completely to the mesh layer. The only concern of the KQueue would be to keep connections to clients open and retry them on the main (through-the-mesh) path once it thinks that's a good idea, i.e. when it got a signal to do so through metrics.

@josephburnett (Contributor Author) commented Aug 21, 2018

I think (@josephburnett please keep me honest here) that "Service" in this case is actually a Knative service and that route should actually resemble the "usual" through-the-mesh routing.

@markusthoemmes yes. We just forward to the Knative Service by using the DNS name. The KQueue's sidecar will do the load balancing. @tcnghia to check me.

How will this feature affect adding support for gRPC?

@glyn this feature doesn't change our situation regarding gRPC. That is, it doesn't make it harder. But we still need to solve that problem.

Be mindful that a HTTP/2 & GRPC connection implies a single long-lived connection that contains multiple streams.

@dprotaso yes, this has come up a couple times. My thinking on HTTP2 and gRPC has been that our proxies (Activator/KQueue and queue-proxy pod) will need to be stream aware in order to: 1) report "concurrency" or "qps" which is defined as total count of streams, 2) limit concurrency/qps and 3) limit connection lifetime (send GOAWAY) in the case of the Activator/KQueue so it can shunt traffic away from the cold-start path.
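For what "stream aware" could mean concretely: in Go's net/http, each HTTP/2 stream (and therefore each gRPC call) is served as its own request even though the streams share one connection, so a proxy can count and limit streams the same way it counts HTTP/1.1 requests. A minimal sketch (limiting connection lifetime / sending GOAWAY is not shown):

```go
package sketch

import (
	"net/http"
	"sync/atomic"
)

// streamCounter reports "concurrency" as the number of in-flight streams.
// With HTTP/2 and gRPC, each stream arrives here as a separate request even
// though they share one underlying connection, so the same counter covers
// HTTP/1.1 requests, h2 streams, and gRPC calls.
type streamCounter struct {
	inFlight int64
	next     http.Handler
}

func (c *streamCounter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&c.inFlight, 1)
	defer atomic.AddInt64(&c.inFlight, -1)
	c.next.ServeHTTP(w, r)
}

// Concurrency is what would be reported to the autoscaler and used to
// enforce per-pod limits.
func (c *streamCounter) Concurrency() int64 {
	return atomic.LoadInt64(&c.inFlight)
}
```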

But we should really open a separate issue to discuss our streaming strategy in depth. @tcnghia or @dprotaso, would you like to spearhead that?

@dprotaso (Member)

@markusthoemmes yes. We just forward to the Knative Service by using the DNS name. The KQueue's sidecar will do the load balancing. @tcnghia to check me.

@josephburnett I think you mean Kubernetes Service. Since you can use Configuration & Routes without a Knative Service.

@josephburnett (Contributor Author)

OMFG #1397

@dprotaso (Member) commented Aug 21, 2018

Just wait for a KQueue issue ;) - https://en.wikipedia.org/wiki/Kqueue

@josephburnett changed the title from "KQueue to replace Activator" to "KBuffer to replace Activator" on Aug 22, 2018
@josephburnett (Contributor Author) commented Aug 28, 2018

Here is how we can play around with some of these different ideas:

  1. patch Emit pod scoped metrics from the autoscaler #1967 or wait for it to land (pod scope metrics)
  2. turn on enable-pod-scope-metrics in config/config-autoscaler.yaml
  3. run an interesting load test: https://github.com/knative/docs/tree/master/serving/samples/autoscale-go
  4. look at stacked pod metrics: https://github.com/knative/serving/blob/master/docs/telemetry.md#metrics-troubleshooting

This is what our problem looks like when we land a crap load of concurrent connections on a single-threaded revision scaled to zero:

Errors due to overload:
[graph image]

The first pod's queues fill up almost immediately and it mostly throws 503s (errors above):
[graph image]

But when we make a deep queue (hardcoded to 100 in this case) the first pod to show up gets all the load and takes forever to work through its backlog:

No errors:
[graph image]

But one pod gets swamped:
[graph image]

New requests are distributed evenly, but the activator already forwarded all the pending requests to one pod:
[graph image]

Test scenario was:

go run ../docs/serving/samples/autoscale-go/test/test.go -qps 1000 -concurrency 100 -sleep 100 -duration '30m'

What we want is no errors, but no pod to get swamped.

@josephburnett changed the title from "KBuffer to replace Activator" to "Buffer all overload requests in Activator" on Nov 26, 2018
@josephburnett changed the title from "Buffer all overload requests in Activator" to "Buffer overload requests in Activator" on Nov 27, 2018
@vvraskin (Contributor)

@josephburnett As discussed in the WG meeting, here are some recent measurements.

I've been running a similar test using hey and played around with some parameters (container execution time and the number of parallel requests). I was able to reproduce the problem with the latest master and remediate it with #2653. In my test I sent batches of 300 concurrent requests (instead of 100) to highlight the 503s.

Using the latest master:

[screenshot]
The rate of 503s is higher than that of the 200s, presumably because of the retries that Istio and the Activator did. The absolute values are:

hey -c 300 -n 900 -t 60 -host cpu-devourer.default.example.com "http://redacted/memory?duration=100ms"

// truncated
Latency distribution:
  10% in 0.1945 secs
  25% in 0.5582 secs
  50% in 1.1696 secs
  75% in 22.2478 secs
  90% in 28.0494 secs
  95% in 32.2450 secs
  99% in 32.7891 secs
// truncated
Status code distribution:
  [200]	737 responses
  [503]	163 responses

Using the throttling PR (#2653):

[screenshot]
So there are no 503s.

hey -c 300 -n 900 -t 60 -host cpu-devourer.default.example.com "http://redacted/memory?duration=100ms"
// truncated
Latency distribution:
  10% in 0.2987 secs
  25% in 0.3989 secs
  50% in 0.5310 secs
  75% in 25.0532 secs
  90% in 30.7413 secs
  95% in 32.2458 secs
  99% in 33.2804 secs
// truncated
Status code distribution:
  [200]	900 responses

@mattmoor (Member)

With Revision managed activation (#1997) closed and the SKS API available to the KPA, this is now plausible to do via the activator.

Anyone interested in pursuing this in 0.7?

@markusthoemmes (Contributor)

@mattmoor I'd very much be interested, if I'm able to finish my pluggable autoscaler work in time for that. Leaving this unassigned until I know I'll have the cycles.

@vagababov (Contributor)

/assign

@vagababov (Contributor)

@mattmoor (Member)

@vagababov Trying to understand what work is left here. Just tests?

@vagababov (Contributor) commented Jul 11, 2019 via email

@vagababov (Contributor)

/close

@knative-prow-robot (Contributor)

@vagababov: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
