
High proxy memory consumption in high concurrency when h2-upgrade disabled #5146

Closed
alpeb opened this issue Oct 27, 2020 · 14 comments

alpeb commented Oct 27, 2020

Bug Report

What is the issue?

When H2 upgrading is disabled in the proxy, high-concurrency scenarios that connect to many different endpoints consume a lot of memory.

How can it be reproduced?

  • Install Linkerd with h2-upgrade disabled, bounded proxy resources, and no add-ons (so that resources remain available for the emojivoto instances):
$ cat config.yml
grafana:
  enabled: false
prometheus:
  enabled: false

$ linkerd install --proxy-cpu-limit 1 --proxy-cpu-request 100m --proxy-memory-limit 250Mi --proxy-memory-request 20Mi --disable-h2-upgrade --config config.yml |kubectl apply -f -
  • Download and extract this chart for emojivoto that:
    • has no vote-bot
    • has minimal resource consumption
    • injects only web-svc
    • parametrizes its namespace (emojivoto-%num)
  • Use it to install 30 instances of emojivoto:
for num in $(seq 0 1 29); do
   { 
         kubectl create namespace emojivoto-$num
         helm install emojivoto-$num --namespace emojivoto-$num emojivoto
   } &
done
wait
  • Download and extract this chart, which creates a job that triggers wrk2 to hit the 30 web-svc instances.
  • Trigger the benchmark with:
$ helm install benchmark --namespace benchmark --create-namespace --set wrk2.serviceMesh=linkerd --set wrk2.app.count=30 --set wrk2.RPS=4000 --set wrk2.duration=240 --set wrk2.connections=128 --set wrk.initDelay=10 benchmark
  • During the benchmark you'll see the memory consumption of the linkerd-proxy container in the wrk2-prometheus pod grow:
$ kubectl -n benchmark exec -ti wrk2-prometheus-xxx -c linkerd-proxy sh
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
262135808

In this particular example using KinD, memory consumption stayed just under the 250Mi limit, but on the Equinix benchmarking boxes (running k8s via lokomotive) the pod gets OOMKilled.
When running with h2-upgrade enabled, consumption only goes up to 35Mi.

Logs, error output, etc

In the Equinix box where the pod gets OOMKilled:

$ kubectl -n benchmark describe po wrk2-prometheus-xxx
...
  linkerd-proxy:                                                                                                                                                                                                                               
    Container ID:  docker://d4fc07e671a1c61680611a48380746e026234f0b88a382675dce6133d0988a45                        
    Image:         ghcr.io/linkerd/proxy:edge-20.10.5                                                                  
    Image ID:      docker-pullable://ghcr.io/linkerd/proxy@sha256:47ae5d9b2467d380cd87d5845ac69f80991cd575b03ad090a5cadfe515061b86
    Ports:         4143/TCP, 4191/TCP                                                                                  
    Host Ports:    0/TCP, 0/TCP                         
    State:         Terminated                                                                                          
      Reason:      OOMKilled                                                                                           
      Message:     3.137.233:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed                                                                                                                 
[    28.253779s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    28.271951s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:33996 target.addr=10.3.157.228:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    28.432310s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:34190 target.addr=10.3.157.228:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    28.441865s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:34232 target.addr=10.3.157.228:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    28.829912s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    29.076363s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:33344 target.addr=10.3.248.42:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    29.083364s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:33354 target.addr=10.3.248.42:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    29.453894s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    30.097276s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    30.940508s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:6312 target.addr=10.3.33.189:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    30.944335s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:55330 target.addr=10.3.33.189:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed

      Exit Code:    137
      Started:      Tue, 27 Oct 2020 18:14:43 +0000
      Finished:     Tue, 27 Oct 2020 18:15:24 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  250Mi
    Requests:
      cpu:      100m
      memory:   20Mi
...

Environment

Local environment (high mem consumption but no OOMKill)

  • Kubernetes Version: 1.18.2
  • Cluster Environment: KinD
  • Host OS: Linux
  • Linkerd version: edge-20.10.5

Equinix box (OOMKilled)

  • Kubernetes Version: 1.18.8
  • Cluster Environment: lokomotive
  • Host OS: Linux
  • Linkerd version: edge-20.10.5
@olix0r olix0r self-assigned this Oct 28, 2020

olix0r commented Oct 30, 2020

  • I was unable to reproduce this leak outside of a Kubernetes environment (even using the same client and server).
  • I was able (after a lot of fighting) to get heaptrack running in k8s to get more data
  • When running under heaptrack, the proxy's memory seems to stabilize around ~140MB -- without heaptrack it grows unbounded.
  • The heaptrack data indicates that leaked data is related to tokio::sync primitives -- perhaps triggered by our access patterns?

[heaptrack screenshots]


olix0r commented Oct 30, 2020

In order to test whether this leak is triggered by something in the profile discovery, I put together a simple TCP echo test.

I set it up with a similar load profile to the wrk2 test (128 client connections to 30 unique services):

:; helm install tcp-echo tcp-echo --create-namespace --namespace=tcp-echo --set=client.concurrency=128 --set=server.services=30

Memory remains stable at <20MB:

:; while linkerd -n tcp-echo metrics deploy/client |grep ^process_res ; do date ; sleep 10 ; done
process_resident_memory_bytes 19251200
Fri Oct 30 22:19:22 UTC 2020
...
process_resident_memory_bytes 19521536
Fri Oct 30 22:26:52 UTC 2020

This gives us a few signals: it's probably not related to service discovery, protocol detection, or load balancing, since the core implementations of these features are shared by both the TCP and HTTP stacks. The access patterns do differ -- we're only transporting ~600 connections per second, as opposed to ~4000 HTTP requests per second in the wrk2 case -- but this may indicate that the issue is caused by something specific to the HTTP stack.

To test these assumptions, we could reduce the wrk2 RPS to ~600 to match the TCP use case's load--if the problem can still be observed at this lesser load, it would confirm this hunch. If this problem can only be observed at higher loads, however, we might want to look more closely at our behavior when buffers fill up. Anecdotally, while I was testing yesterday, I did observe that the buffer in the logical stack reached capacity at times...


olix0r commented Oct 30, 2020

Testing wrk2 at lower volumes, a leak is still apparent:

  • at 300 RPS, the proxy reaches ~90MB in a 5 minute run
  • at 30 RPS, the proxy reaches ~40MB in a 5 minute run

In both cases, the memory growth is steady throughout the run. This points strongly to an issue triggered by HTTP-specific logic.


hawkw commented Oct 30, 2020

@olix0r is that with the proxy on master, or a different build?


olix0r commented Oct 31, 2020

@hawkw Using the last edge release.


olix0r commented Nov 5, 2020

I've written a load tester in Rust, ort -- basically porting strest, with support for
both HTTP/1 and gRPC. Its helm chart makes it simple to deploy a load test configuration to a
Linkerd-enabled cluster:

:; helm install ort ./ort --namespace ort --create-namespace \
   --set load.flags.concurrencyLimit=$c \
   --set load.flags.requestsPerClient=10000 \
   --set load.flags.requestLimit=4000 \
   --set load.linkerd.inject=enabled \
   --set load.threads=$t \
   --set server.linkerd.inject=$s \
   --set server.services=30

In this case, I'm running k3d with the latest linkerd edge-20.11.1.

This configuration approximates the wrk2 environment originally described in this issue:

  • We run 30 target services, each with a single server instance.
  • The load generator iterates through the list of target services, sending 10K requests at a time
    before moving to the next service.
  • The load generator sends at most $r requests per second.
  • There are no more than $c requests in flight at once.
  • When $s is enabled, the server pods are injected with a proxy sidecar and the client will multiplex HTTP/1.1 requests over H2 clients.

The results of these tests don't exactly match the original issue description (probably due to
differences in the load generator), but they do tell an interesting story:

s          r     c    http proxy RSS (MB)   grpc proxy RSS (MB)
disabled   4000  8    28                    23
disabled   4000  24   45                    23
disabled   4000  48   68                    24
disabled   4000  64   85                    24
disabled   4000  128  153                   24
enabled    4000  128  35                    24
disabled   8000  128  165                   28
enabled    8000  128  40                    28

In all cases, the proxy's memory climbs fairly slowly and eventually ~stabilizes. More growth
might occur if the proxy ran for longer, but each of these tests was limited to about 10-20
minutes (by which point the value appears to be fairly stable).

It's also worth noting that H2 isn't a pure win over HTTP/1 -- we appear to use 10%+ more CPU for
outbound H2 gRPC clients than we do with HTTP/1.1 clients.

When the server is meshed, its proxies each consume <20MB.

The gRPC client proxies are stable at <30MB in all cases, while the HTTP proxies grow according to
request concurrency. Furthermore, meshing the HTTP/1.1 traffic so that outbound requests are
multiplexed over HTTP/2 yields a substantial reduction in memory consumption. This all points
pretty strongly to this memory pressure originating from our HTTP/1.1 client: the outbound server
is communicating via HTTP/1.1 in all cases, and requests are not upgraded to HTTP/2 until the
HTTP endpoint stack (which sits below discovery and the load balancer).

While this doesn't exactly replicate the original described behavior, it gives us some strong
pointers about where to start looking.


olix0r commented Nov 6, 2020

Turning the concurrency up even more, the problems become more pronounced:

:; helm upgrade ort ./ort --namespace ort --create-namespace --set server.services=30 --set load.linkerd.inject=enabled --set server.linkerd.inject=disabled --set load.flags.requestsPerTarget=10000 --set load.flags.requestLimit=4000 --set load.flags.concurrencyLimit=1000 --set load.threads=2

The http proxy's memory shoots up in excess of 750MB, while the h2 client stays stable:

# Fri Nov  3 07:11:23 UTC 2020
pod/load-http-7997b84478-gfd9t    794.453MB
pod/load-grpc-b76cb9485-nfqfc      28.605M

This, again, points to something in the HTTP client stack, or potentially the outbound connection stack.

Comparing the number of open fds in the two proxies, we notice immediately that the http proxy has many more connections than we'd expect:

http    24631
grpc    47

The outbound client is maintaining 700+ connections to most services:

server-000.ort.svc.cluster.local        768
server-001.ort.svc.cluster.local        725
server-002.ort.svc.cluster.local        814
server-003.ort.svc.cluster.local        725
server-004.ort.svc.cluster.local        768
server-005.ort.svc.cluster.local        725
server-006.ort.svc.cluster.local        841
server-007.ort.svc.cluster.local        725
server-008.ort.svc.cluster.local        981
server-009.ort.svc.cluster.local        725
server-010.ort.svc.cluster.local        725
server-011.ort.svc.cluster.local        725
server-012.ort.svc.cluster.local        810
server-013.ort.svc.cluster.local        640
server-014.ort.svc.cluster.local        913
server-015.ort.svc.cluster.local        725
server-016.ort.svc.cluster.local        768
server-017.ort.svc.cluster.local        0
server-018.ort.svc.cluster.local        0
server-019.ort.svc.cluster.local        0
server-020.ort.svc.cluster.local        768
server-021.ort.svc.cluster.local        0
server-022.ort.svc.cluster.local        768
server-023.ort.svc.cluster.local        725
server-024.ort.svc.cluster.local        829
server-025.ort.svc.cluster.local        725
server-026.ort.svc.cluster.local        768
server-027.ort.svc.cluster.local        725
server-028.ort.svc.cluster.local        785
server-029.ort.svc.cluster.local        725

This probably points to the hyper connection pool and connection reuse. I'm not sure this is necessarily wrong behavior: in order to serve concurrent requests, the proxy needs to spin up new connections. It would probably be awkward to enforce a maximum concurrency, but we might consider decreasing the idle timeout... this is basically a trade-off between latency and memory cost. The current caching behavior is probably especially costly for applications like Prometheus that connect to many endpoints; and, in Prometheus's case especially, the latency hit of establishing new connections is acceptable.

In fact, it seems preferable to be more conservative about caching connections to unmeshed endpoints in general. We assume the common case is that communication is fully meshed and multiplexed; there, the cost of maintaining a cached connection is fairly low, so we can cache connections more aggressively (though, even then, this could be costly for meshed Prometheus scrapes).

Hyper offers two pool configuration options that we should set:

  • pool_idle_timeout is currently set at the default 90s. We should probably drop this to 10s (or the proxy's configured client cache timeout), in general.
  • pool_max_idle_per_host is currently unlimited. We should probably change this to a small value, like 2.
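
For reference, here's a minimal sketch of how these two pool options are applied on a hyper Client builder (assuming hyper 0.13/0.14; this is not the proxy's actual client-construction code):

use std::time::Duration;
use hyper::Client;

fn build_client() -> Client<hyper::client::HttpConnector> {
    Client::builder()
        // Evict pooled connections that sit idle longer than this
        // (hyper's default is 90s; the proposal above is 10s, or the
        // proxy's configured client cache timeout).
        .pool_idle_timeout(Duration::from_secs(10))
        // Cap the number of idle connections kept per host
        // (hyper's default is unlimited; the proposal above is a small
        // value like 2).
        .pool_max_idle_per_host(2)
        .build_http()
}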

With regard to Prometheus instances, we may want to expose the proxy's cache duration as an externally configurable flag, but this warrants some further thought.


olix0r commented Nov 6, 2020

linkerd/linkerd2-proxy#734 makes the proposed changes to the HTTP/1.1 client.


olix0r commented Nov 6, 2020

Running with the modified connection pool, the workload above that previously ran at ~700MB drops to ~155MB, though it appears that the proxy now uses substantially more CPU...

With concurrency at 128, where, on main, the proxy runs at ~150MB, the new configuration runs at ~35MB (again, with elevated CPU).

I'm inclined to go forward with these changes...


olix0r commented Nov 6, 2020

After further testing, it seems the best approach is to leave the number of idle connections unbounded, but to limit the cache idle age. The current default is actually 60s, not 10s. linkerd/linkerd2-proxy@c470643 changes these defaults so that the inbound timeout is 10s and the outbound timeout is 3s. This will cause additional discovery lookups when traffic is sporadic, but it should dramatically reduce the memory overhead for idle & sporadic connections. The change warrants a bit more testing and thought, so I don't think we should rush it into stable-2.9.0, but it seems like a good candidate for a followup patch release.
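
As a rough illustration of the policy (a simplified sketch, not the proxy's actual cache implementation; the names here are hypothetical), the idea is just that cached entries which go unused for longer than a configured max idle age get evicted:

use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

// Hypothetical sketch of max-idle-age eviction; not linkerd2-proxy code.
struct IdleCache<K, V> {
    max_idle_age: Duration,
    entries: HashMap<K, (V, Instant)>,
}

impl<K: Hash + Eq, V> IdleCache<K, V> {
    fn new(max_idle_age: Duration) -> Self {
        Self { max_idle_age, entries: HashMap::new() }
    }

    // A hit refreshes the entry's last-used time, keeping it alive.
    fn get(&mut self, key: &K) -> Option<&V> {
        match self.entries.get_mut(key) {
            Some((value, last_used)) => {
                *last_used = Instant::now();
                Some(&*value)
            }
            None => None,
        }
    }

    fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (value, Instant::now()));
    }

    // Run periodically: drop anything idle longer than max_idle_age
    // (e.g. 10s inbound / 3s outbound per the defaults above).
    fn evict_idle(&mut self) {
        let max_idle_age = self.max_idle_age;
        self.entries
            .retain(|_, (_, last_used)| last_used.elapsed() <= max_idle_age);
    }
}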


hawkw commented Nov 6, 2020

@olix0r do you think we should consider having different idle ages for HTTP/1 and HTTP/2 services, in order to limit the increased discovery load when idling out services is less expensive?

I suppose that if we wanted to do that, it would be a pretty big change, since there's currently One Big Cache that everything lives in...may not be worth it.


olix0r commented Nov 6, 2020

@hawkw Thinking about the Prometheus case, I think we actually really do want to limit the amount of time idle connections stay around for h2 as well. We could potentially have separate idle timeouts for the services and the connections if we see problems, but I think we should be more eager about evicting idle outbound things...


hawkw commented Nov 6, 2020

Hmm, yeah, that makes sense; I suppose for pretty much any outbound that's hot enough that we do want it in the cache, 10s is plenty!


olix0r commented Nov 10, 2020

Addressed by #5200

@olix0r olix0r closed this as completed Nov 10, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021