
Buffer overload requests in Activator #1846

Closed · josephburnett opened this issue Aug 13, 2018 · 27 comments
Labels: area/autoscale, area/networking, kind/feature (Well-understood/specified features, ready for coding.), P1

Comments

@josephburnett (Contributor) commented Aug 13, 2018

Problem

Knative Serving provides concurrency controls to limit request concurrency for each pod (currently concurrencyModel, soon to be maxConcurrency). Each pod also has a small queue for pending requests. However, with less-than-perfect load balancing, autoscaler lag and application startup latency, it is possible for the per-pod queues to overflow, in which case requests must be rejected with a 503.
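For illustration only, here is a minimal sketch (not the actual queue-proxy code) of the per-pod behaviour described above: at most maxConcurrency requests run at once, a small pending queue absorbs short bursts, and anything beyond that is rejected with a 503. The port numbers and queue depth are made up for the example.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// breaker admits at most maxConcurrency in-flight requests and queues at most
// queueDepth additional pending requests; anything beyond that is rejected.
type breaker struct {
	pending chan struct{} // tokens for queued + in-flight requests
	active  chan struct{} // tokens for in-flight requests only
}

func newBreaker(queueDepth, maxConcurrency int) *breaker {
	return &breaker{
		pending: make(chan struct{}, queueDepth+maxConcurrency),
		active:  make(chan struct{}, maxConcurrency),
	}
}

// Maybe runs f if capacity is available (possibly after waiting in the
// pending queue) and reports whether it ran.
func (b *breaker) Maybe(f func()) bool {
	select {
	case b.pending <- struct{}{}: // enqueue locally
	default:
		return false // pending queue is full -> overload
	}
	defer func() { <-b.pending }()

	b.active <- struct{}{} // wait for a free concurrency slot
	defer func() { <-b.active }()

	f()
	return true
}

func main() {
	user, _ := url.Parse("http://127.0.0.1:8080") // user container, hypothetical address
	proxy := httputil.NewSingleHostReverseProxy(user)
	b := newBreaker(10, 1) // small pending queue, single-threaded app

	http.ListenAndServe(":8012", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !b.Maybe(func() { proxy.ServeHTTP(w, r) }) {
			http.Error(w, "overload", http.StatusServiceUnavailable)
		}
	}))
}
```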

This problem is most severe when scaling from 0 under heavy load because the first pod takes a few seconds to come online and is immediately overwhelmed with requests forwarded from the Activator. Some optimizations in the metrics pipeline can mitigate this issue, such as forwarding request metrics from the Activator straight to the Autoscaler. But it doesn't solve the problem of what to do with overflowing request queues in general.

We could configure Istio to retry on 503 errors, essentially queueing overflowing requests in Istio retries, but that will result in a lot of extra network traffic and sub-optimal load balancing. This came up in the Scaling Working Group meeting and @markusthoemmes had the idea to overflow to a centralized queue. This proposal is a continuation of that idea.

Proposal

  1. Rename the Activator to KBuffer (dropped because the rename was too complex -- kbuffer rename breaks updating #2509)
  2. Do activation exclusively in the KPA (Autoscaler) based on metrics (Expose auto-scaling metrics from the activator. #1623)
  3. Push capacity metrics from the KPA to the Activator so it can throttle proxied requests based on capacity
  4. Include a NO_QUEUE header in proxied requests from the Activator so the queue-proxy will not enqueue requests locally (the intent is to avoid hotspots during 0->1 scaling; see the sketch after this list)
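A rough sketch of what item 4 could look like on both sides. The header name K-No-Queue and the function names are invented for this sketch; nothing here is the actual implementation.

```go
package sketch

import "net/http"

// Hypothetical header the Activator/KBuffer would set on requests it has
// already buffered, telling the queue-proxy not to queue them again.
const noQueueHeader = "K-No-Queue"

// In the Activator/KBuffer: mark proxied requests so they bypass the
// per-pod queue and fail fast if the pod has no free capacity.
func markNoQueue(r *http.Request) {
	r.Header.Set(noQueueHeader, "true")
}

// In the queue-proxy: requests marked by the Activator are only admitted if
// a concurrency slot is immediately free; otherwise they are bounced back
// (503) so the Activator can re-buffer and redistribute them. The caller
// releases the slot with <-slots once the request completes.
func admit(r *http.Request, slots chan struct{}) bool {
	if r.Header.Get(noQueueHeader) != "" {
		select {
		case slots <- struct{}{}:
			return true
		default:
			return false // no local queuing for Activator traffic
		}
	}
	slots <- struct{}{} // normal traffic may wait for a slot
	return true
}
```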

Non-Goals

  1. Automatic failover from the Istio Mesh (Service) to the Activator. This is the next logical step which would make 0->1 scaling a degenerate case and avoid route reprogramming.
  2. Moving Activator semantics up into Istio Ingress (Envoy). This would be a long-term goal to have only one network path and use Istio / Envoy hooks for activation and queuing.

Diagram

[diagram image]

Original Proposal (rejected)

The idea to overflow to a queue and gracefully degrade from a push model to a pull model is a very compelling one. And we already have a centralized queue in the request path--the Activator.

I propose that we change the role of the Activator to that of a queue (Knative Queue or KQueue). This would require the following changes:

  1. Modify the queue-proxy to also pull requests from the KQueue (formerly Activator). This would require some kind of wrapping / multiplexing since the actual work is an incoming HTTP request (see the sketch after this list).
  2. Leave activation of the Revision / KPA to the Autoscaler. This will be possible anyway once we start forwarding metrics directly there (as linked under problem statement). This leaves the KQueue to be only a queue.
  3. Configure Istio to "fail over" to the KQueue (formerly Activator) when the direct Service path is unavailable. This can be in an idle state or when the service is overwhelmed. @grantr already suggested using this failover mechanism for putting the Activator in the request path, and it's nice because it doesn't require route reconfiguration at idle.
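Purely to make item 1 concrete (this proposal was rejected, so none of this exists), a pull loop on the queue-proxy side might have looked roughly like this. The /pull and /done endpoints, the X-Request-Id header and the serialization format are all invented for the sketch.

```go
package sketch

import (
	"bufio"
	"bytes"
	"io"
	"net/http"
)

// pullLoop: whenever a local concurrency slot is free, long-poll the KQueue
// for one buffered request, replay it against the user container, and hand
// the response back so the KQueue can complete the original client connection.
func pullLoop(kqueue, userContainer string, slots chan struct{}) {
	for {
		slots <- struct{}{} // wait for local capacity

		go func() {
			defer func() { <-slots }()

			// Long-poll for one serialized request (e.g. as dumped by
			// httputil.DumpRequest on the KQueue side).
			resp, err := http.Get(kqueue + "/pull")
			if err != nil || resp.StatusCode != http.StatusOK {
				return
			}
			defer resp.Body.Close()

			req, err := http.ReadRequest(bufio.NewReader(resp.Body))
			if err != nil {
				return
			}
			id := resp.Header.Get("X-Request-Id")

			// Replay the request against the local user container.
			req.RequestURI = ""
			req.URL.Scheme = "http"
			req.URL.Host = userContainer
			out, err := http.DefaultClient.Do(req)
			if err != nil {
				return
			}
			defer out.Body.Close()

			// Return the response body to the KQueue.
			body, _ := io.ReadAll(out.Body)
			http.Post(kqueue+"/done?id="+id, "application/octet-stream", bytes.NewReader(body))
		}()
	}
}
```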

Diagram

[diagram image]

@knative-prow-robot added the kind/feature label Aug 13, 2018
@josephburnett (Contributor Author) commented Aug 13, 2018

@josephburnett (Contributor Author)

@markusthoemmes (Contributor) commented Aug 15, 2018

Been thinking through your proposal again (under the shower, that helped!). I believe we can make this work!

I'd change the drawing up a bit, so that work from the KQueue is retried onto the primary path. That way the pods only ever have to deal with that one path. The "only" question remaining is: How do pods signal free resources to the KQueue? Maybe a combination of an exponentially backed-off retry plus a metrics pipeline from the queue-proxies to the KQueue can solve this without tight coupling? The retry can start pretty tight to recover very short-running functions quickly, but after a second or two the metrics pipeline can signal that things have been worked off and it's worth a second try.
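A minimal sketch of that retry idea, assuming a hypothetical capacityFreed channel fed by the queue-proxy metrics pipeline; the backoff constants are arbitrary.

```go
package sketch

import (
	"context"
	"time"
)

// retryToPrimaryPath retries fn (forwarding one buffered request onto the
// primary path) with exponential backoff. The backoff starts tight so very
// short-running functions are recovered quickly; a signal on capacityFreed
// (hypothetically fed by queue-proxy stats) short-circuits the wait as soon
// as capacity is known to exist.
func retryToPrimaryPath(ctx context.Context, fn func() error, capacityFreed <-chan struct{}) error {
	backoff := 10 * time.Millisecond
	const maxBackoff = 2 * time.Second

	for {
		if err := fn(); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
		case <-capacityFreed: // metrics say a pod freed up; retry immediately
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```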

This also completely leaves the implementation of buffering up to the KQueue. It's only its own concern whether it does buffering in-memory or to an actual queue.

Detecting that the primary path is not viable could possibly even be done in the Ingress, to prevent an overload of retries just returning 503s. Envoy, for instance, has the notion of circuit breaking, which allows it to define how many outstanding requests to send upstream. So in a completely overloaded scenario, the Ingress could tell that all Pods have exhausted their concurrency limits locally (not perfectly, because the ingress pods are distributed as well, but probably good enough not to hammer the system with constant retries). It can also tell once there is actual capacity again (when the circuit breaker lifts) without having to do a downstream retry.

@josephburnett (Contributor Author)

How do pods signal free resources to the KQueue?

Yeah, that's kind of the point of having them reach out to the KQueue. To tell it when they have capacity. Maybe they can still do that and the KQueue can forward the request directly to the pod? Does that work with the Istio mesh?

It's only its own concern whether it does buffering in-memory or to an actual queue.

They are synchronous, stateful requests, so I think it has to do it in memory, no? And it stays in the serving path while the request is being handled since it's a proxy.

All this together, I wonder if we should call it the KLB instead of the KQueue. It does some queueing, but the other half of its value is perfect load balancing with full knowledge of the capacity of each pod.

@josephburnett (Contributor Author)

How do pods signal free resources to the KQueue?

Pods are already reporting their load directly to the Autoscaler, which essentially implies their capacity. They could just as well report capacity explicitly.

Perhaps we should have pods reporting load and capacity to an intermediate endpoint which implements the custom metrics api. Then the autoscaler and activator can both consume these metrics to find pods to route requests to and to calculate the desired scale.
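As an illustration of what such a report could carry (field names and the endpoint are invented here, not the actual Knative stat message):

```go
package sketch

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// Stat is an illustrative per-pod report carrying both observed load and
// remaining capacity, so the autoscaler can compute desired scale and the
// Activator/KQueue can decide where (and whether) to forward requests.
type Stat struct {
	PodName            string    `json:"podName"`
	Time               time.Time `json:"time"`
	ConcurrentRequests float64   `json:"concurrentRequests"` // average over the window
	FreeCapacity       int       `json:"freeCapacity"`       // maxConcurrency minus in-flight
}

// report pushes one stat to a hypothetical intermediate metrics endpoint
// that both the autoscaler and the Activator/KQueue could consume.
func report(endpoint string, s Stat) error {
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	_, err = http.Post(endpoint, "application/json", bytes.NewReader(body))
	return err
}
```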

@glyn (Contributor) commented Aug 16, 2018

@josephburnett What became of the idea of having the service mesh provide scaling metrics? Is that a non-starter or simply deferred?

@josephburnett (Contributor Author)

Some concerns I've heard so far:

  1. complexity of having two request paths
  2. authentication of pull-based requests
  3. how specifically to handle fail-over

I've updated my design to push requests directly to pods, which eliminates the additional complexity in the queue-proxy, as well as the authentication issues. However, it requires that the Activator (KQueue or KLB) be able to connect directly to a Pod's IP address. @tcnghia is working on making this an option, and I believe that Istio is also planning to add that capability without losing the mesh.

[updated design diagram image]

Key features of the design above:

  1. pod metrics which are currently pushed to the KPA will be forwarded to the KQueue / KLB (name undecided) which will maintain only the last stat for the purpose of load balancing.
  2. KQueue / KLB will also push queue stats back to the autoscaler for incorporating queue length into scaling decisions. @markusthoemmes is working on this in Expose auto-scaling metrics from the activator. #1623.

What became of the idea of having the service mesh provide scaling metrics? Is that a non-starter or simply deferred?

@glyn, this is deferred. We can consider other metrics pipelines, including using Istio's metrics, as long as it's fast enough for scaling and routing.

@markusthoemmes (Contributor)

@josephburnett good next step 👍.

On exchanging pull for push and directing the request to the pod directly: as I understand the picture, the KQueue would be informed about pod metrics (like free capacity?) via the pod metrics in your picture. With the way Envoy might work (see circuit breakers above), wouldn't that notion of "free" race with the primary path thinking that the pod is free as well (because an HTTP request got finished on that pod), especially if (like today) the metrics are locally aggregated before being sent out?

What I'm alluding to: how does the system guarantee that the KQueue will find a free pod when it forwards a request to it? If it cannot guarantee that, would it make sense to actually have the KQueue retry the request onto the primary path, since the routers can potentially at least have a rough idea of the current concurrency on a pod (of course only locally per router, but better than blind guessing)? That'd leave the entire concurrency / find-a-good-pod decision up to the routing mesh, and the KQueue would "just" be a backpressure mechanism to wait for new capacity to arrive. I think that would also solve the "does this work with Istio" question.

@josephburnett (Contributor Author)

@markusthoemmes I think that I missed your point earlier about always using the Service route. Here is an updated diagram that does that.

A few points to note:

  1. The KPA forwards capacity stats to the KQueue so it can know when to forward requests to the service and how many (see the sketch after this list).
  2. The KQueue forwards queue length stats to the KPA so it can know how much to scale up.
  3. The KQueue injects a NO_QUEUE header so the queue-proxy will not enqueue the request locally. This is necessary to redistribute requests (e.g. when scaling from 0 under load).
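A sketch of the throttling half of points 1 and 3, with all names invented for illustration (the K-No-Queue header is the same hypothetical header as in the proposal sketch above):

```go
package sketch

import (
	"net/http"
	"sync"
)

// throttler buffers requests in the Activator/KQueue and only releases them
// toward the Service when the KPA-reported capacity says the revision can
// absorb them.
type throttler struct {
	mu   sync.Mutex
	cond *sync.Cond
	free int // free slots across the revision, as last reported by the KPA
}

func newThrottler() *throttler {
	t := &throttler{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// UpdateCapacity is called whenever a capacity stat arrives from the KPA.
func (t *throttler) UpdateCapacity(free int) {
	t.mu.Lock()
	t.free = free
	t.mu.Unlock()
	t.cond.Broadcast()
}

// Forward buffers the request until a slot is believed to be free, then
// proxies it to the Service with the hypothetical no-queue header set.
func (t *throttler) Forward(w http.ResponseWriter, r *http.Request, proxy http.Handler) {
	t.mu.Lock()
	for t.free <= 0 {
		t.cond.Wait() // request sits buffered in the KQueue
	}
	t.free-- // optimistically consume one slot until the next stat arrives
	t.mu.Unlock()

	r.Header.Set("K-No-Queue", "true")
	proxy.ServeHTTP(w, r)
}
```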

This does eliminate the "does this work with Istio" question. Thanks @markusthoemmes!

Open question:

  1. How specifically do we fail over to the KQueue when revision capacity is exhausted? Once we have a reliable way of doing this, we can reduce the local pod queue to zero or near-zero. We need the local queue to survive traffic bursts when not scaled to zero (KQueue not in request path).

[updated diagram image]

@josephburnett (Contributor Author)

Once KQueue forwards stats to KPA, the KPA will solely be responsible for scaling and "activation", so the KQueue will just be a queue.

Since the KQueue would just forward to the underlying Service, it would not be doing any load-balancing. So I prefer the name "KQueue" over "KLB".

@glyn (Contributor) commented Aug 21, 2018

How will this feature affect adding support for gRPC? @dprotaso any thoughts from your work on gRPC?

@dprotaso (Member)

(assuming "Service" refers to a Kubernetes Service that load balances requests across a revision's pods.)

Be mindful that an HTTP/2 & gRPC connection implies a single long-lived connection that contains multiple streams. This implies the KQueue could have many long-lived connections that could at some point exhaust its ability to receive & create any new connections. Thus the KQueue takes on similar characteristics to the Ingress. It's still a load balancer even when you don't want it to be. I believe @cppforlife brought this up as a concern in the WG call when talking about web sockets.

Another item, somewhat related, is that the k8s Service is an L4 load balancer, so it won't spread the load of several gRPC calls across Pods. Thus, ideally, I'd want to make a Revision's k8s Service headless. This pushes the load balancing onto the caller. I know Envoy supports this - so I'm assuming Istio does as well. This would then become a concern for the KQueue.
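For illustration, if the Revision's k8s Service were headless, a Go gRPC caller could do the per-call balancing itself via the dns resolver and the round_robin policy; the service name and port below are placeholders.

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialRevision connects to a hypothetical headless Service for a revision.
// The dns resolver returns the individual pod IPs, and the round_robin
// policy spreads RPCs across them instead of pinning all streams to the
// single connection a ClusterIP Service would give us.
func dialRevision() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///my-revision.default.svc.cluster.local:80",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```

Whether the KQueue (or its sidecar) would have to take on that caller role is exactly the concern raised here.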

@markusthoemmes (Contributor) commented Aug 21, 2018

(assuming "Service" refers to a Kubernetes Service that load balances requests across a revision's pods.)

I think (@josephburnett please keep me honest here) that "Service" in this case is actually a Knative service and that route should actually resemble the "usual" through-the-mesh routing.

It's still a load balancer even when you don't want it to be.

I think the latest revision addresses that concern in that the KQueue will route the requests back through the mesh. It defers balancing completely to the mesh layer. The only concern of the KQueue would be to keep connections to clients open and retry them on the main (through-the-mesh) path once it thinks that's a good idea, i.e. when it got a signal to do so through metrics.

@josephburnett (Contributor Author) commented Aug 21, 2018

I think (@josephburnett please keep me honest here) that "Service" in this case is actually a Knative service and that route should actually resemble the "usual" through-the-mesh routing.

@markusthoemmes yes. We just forward to the Knative Service by using the DNS name. The KQueue's sidecar will do the load balancing. @tcnghia to check me.

How will this feature affect adding support for gRPC?

@glyn this feature doesn't change our situation regarding gRPC. That is, it doesn't make it harder. But we still need to solve that problem.

Be mindful that a HTTP/2 & GRPC connection implies a single long-lived connection that contains multiple streams.

@dprotaso yes, this has come up a couple times. My thinking on HTTP2 and gRPC has been that our proxies (Activator/KQueue and queue-proxy pod) will need to be stream aware in order to: 1) report "concurrency" or "qps" which is defined as total count of streams, 2) limit concurrency/qps and 3) limit connection lifetime (send GOAWAY) in the case of the Activator/KQueue so it can shunt traffic away from the cold-start path.
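For what "stream aware" could mean concretely: in Go's net/http, each HTTP/2 stream (and therefore each gRPC call) is served as its own request even though the streams share one connection, so a proxy can count and limit streams the same way it counts HTTP/1.1 requests. A minimal sketch (limiting connection lifetime / sending GOAWAY is not shown):

```go
package sketch

import (
	"net/http"
	"sync/atomic"
)

// streamCounter reports "concurrency" as the number of in-flight streams.
// With HTTP/2 and gRPC, each stream arrives here as a separate request even
// though they share one underlying connection, so the same counter covers
// HTTP/1.1 requests, h2 streams, and gRPC calls.
type streamCounter struct {
	inFlight int64
	next     http.Handler
}

func (c *streamCounter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&c.inFlight, 1)
	defer atomic.AddInt64(&c.inFlight, -1)
	c.next.ServeHTTP(w, r)
}

// Concurrency is what would be reported to the autoscaler and used to
// enforce per-pod limits.
func (c *streamCounter) Concurrency() int64 {
	return atomic.LoadInt64(&c.inFlight)
}
```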

But we should really open a separate issue to discuss our streaming strategy in depth. @tcnghia or @dprotaso, would you like to spearhead that?

@dprotaso (Member)

@markusthoemmes yes. We just forward to the Knative Service by using the DNS name. The KQueue's sidecar will do the load balancing. @tcnghia to check me.

@josephburnett I think you mean Kubernetes Service. Since you can use Configuration & Routes without a Knative Service.

@josephburnett (Contributor Author)

OMFG #1397

@dprotaso (Member) commented Aug 21, 2018

Just wait for a KQueue issue ;) - https://en.wikipedia.org/wiki/Kqueue

@josephburnett changed the title from "KQueue to replace Activator" to "KBuffer to replace Activator" on Aug 22, 2018
@josephburnett (Contributor Author) commented Aug 28, 2018

Here is how we can play around with some of these different ideas:

  1. patch Emit pod scoped metrics from the autoscaler #1967 or wait for it to land (pod scope metrics)
  2. turn on enable-pod-scope-metrics in config/config-autoscaler.yaml
  3. run an interesting load test: https://github.com/knative/docs/tree/master/serving/samples/autoscale-go
  4. look at stacked pod metrics: https://github.com/knative/serving/blob/master/docs/telemetry.md#metrics-troubleshooting

This is what our problem looks like when we land a crap load of concurrent connections on a single-threaded revision scaled to zero:

Errors due to overload:
[graph image]

The first pod's queues fill up almost immediately and it mostly throws 503s (errors above):
[graph image]

But when we make a deep queue (hardcoded to 100 in this case) the first pod to show up gets all the load and takes forever to work through its backlog:

No errors:
[graph image]

But one pod gets swamped:
[graph image]

New requests are distributed evenly, but the activator already forwarded all the pending requests to one pod:
[graph image]

Test scenario was:

go run ../docs/serving/samples/autoscale-go/test/test.go -qps 1000 -concurrency 100 -sleep 100 -duration '30m'

What we want is no errors, but no pod to get swamped.

@josephburnett changed the title from "KBuffer to replace Activator" to "Buffer all overload requests in Activator" on Nov 26, 2018
@josephburnett changed the title from "Buffer all overload requests in Activator" to "Buffer overload requests in Activator" on Nov 27, 2018
@vvraskin (Contributor)

@josephburnett As discussed in the WG meeting, here are some recent measurements.

I've been running a similar test using hey and played around with some parameters (container execution time and the number of parallel requests). I was able to reproduce the problem with the latest master and remediate it with #2653. In my test I sent batches of 300 concurrent requests (instead of 100) to highlight the 503s.

Using the latest master:

[screenshot]
The rate of 503s is higher than that of the 200s, presumably because of the retries that Istio and the Activator did. The absolute values are:

hey -c 300 -n 900 -t 60 -host cpu-devourer.default.example.com "http://redacted/memory?duration=100ms"

// truncated
Latency distribution:
  10% in 0.1945 secs
  25% in 0.5582 secs
  50% in 1.1696 secs
  75% in 22.2478 secs
  90% in 28.0494 secs
  95% in 32.2450 secs
  99% in 32.7891 secs
// truncated
Status code distribution:
  [200]	737 responses
  [503]	163 responses

Using the throttling PR (#2653):

[screenshot]
So there are no 503s.

hey -c 300 -n 900 -t 60 -host cpu-devourer.default.example.com "http://redacted/memory?duration=100ms"
// truncated
Latency distribution:
  10% in 0.2987 secs
  25% in 0.3989 secs
  50% in 0.5310 secs
  75% in 25.0532 secs
  90% in 30.7413 secs
  95% in 32.2458 secs
  99% in 33.2804 secs
// truncated
Status code distribution:
  [200]	900 responses

@mattmoor (Member)

With Revision managed activation (#1997) closed and the SKS API available to the KPA, this is now plausible to do via the activator.

Anyone interested in pursuing this in 0.7?

@markusthoemmes (Contributor)

@mattmoor I'd very much be interested, if I'm able to finish my pluggable autoscaler work in time for that. Leaving this unassigned until I know I'll have the cycles.

@vagababov (Contributor)

/assign

@vagababov (Contributor)

@mattmoor (Member)

@vagababov Trying to understand what work is left here. Just tests?

@vagababov (Contributor) commented Jul 11, 2019 via email

@vagababov (Contributor)

/close

@knative-prow-robot (Contributor)

@vagababov: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
