
High proxy memory consumption in high concurrency when h2-upgrade disabled #5146

Closed
alpeb opened this issue Oct 27, 2020 · 14 comments

alpeb commented Oct 27, 2020

Bug Report

What is the issue?

When H2 upgrading is disabled in the proxy, high-concurrency scenarios that connect to many different endpoints consume a lot of memory.

How can it be reproduced?

  • Install Linkerd with h2-upgrade disabled, bounded proxy resources, and no add-ons (so that resources remain available for the emojivoto instances):
$ cat config.yml
grafana:
  enabled: false
prometheus:
  enabled: false

$ linkerd install --proxy-cpu-limit 1 --proxy-cpu-request 100m --proxy-memory-limit 250Mi --proxy-memory-request 20Mi --disable-h2-upgrade --config config.yml |kubectl apply -f -
  • Download and extract this chart for emojivoto that:
    • has no vote-bot
    • has minimal resource consumption
    • injects only web-svc
    • parametrizes its namespace (emojivoto-%num)
  • Use it to install 30 instances of emojivoto:
for num in $(seq 0 1 29); do
   { 
         kubectl create namespace emojivoto-$num
         helm install emojivoto-$num --namespace emojivoto-$num emojivoto
   } &
done
wait
  • Download and extract this chart, which creates a job that triggers wrk2 to hit the 30 web-svc instances.
  • Trigger the benchmark with:
$ helm install benchmark --namespace benchmark --create-namespace --set wrk2.serviceMesh=linkerd --set wrk2.app.count=30 --set wrk2.RPS=4000 --set wrk2.duration=240 --set wrk2.connections=128 --set wrk.initDelay=10 benchmark
  • During the benchmark you'll see the memory consumption of the linkerd-proxy container in the wrk2-prometheus pod grow:
$ kubectl -n benchmark exec -ti wrk2-prometheus-xxx -c linkerd-proxy sh
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
262135808

In this particular example using KinD, memory consumption stayed just under the 250Mi limit, but on the Equinix benchmarking boxes (running k8s via lokomotive) the pod gets OOMKilled.
When running with h2-upgrade enabled, consumption only goes up to 35Mi.

Logs, error output, etc

In the Equinix box where the pod gets OOMKilled:

$ kubectl -n benchmark describe po wrk2-prometheus-xxx
...
  linkerd-proxy:                                                                                                                                                                                                                               
    Container ID:  docker://d4fc07e671a1c61680611a48380746e026234f0b88a382675dce6133d0988a45                        
    Image:         ghcr.io/linkerd/proxy:edge-20.10.5                                                                  
    Image ID:      docker-pullable://ghcr.io/linkerd/proxy@sha256:47ae5d9b2467d380cd87d5845ac69f80991cd575b03ad090a5cadfe515061b86
    Ports:         4143/TCP, 4191/TCP                                                                                  
    Host Ports:    0/TCP, 0/TCP                         
    State:         Terminated                                                                                          
      Reason:      OOMKilled                                                                                           
      Message:     3.137.233:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed                                                                                                                 
[    28.253779s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    28.271951s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:33996 target.addr=10.3.157.228:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    28.432310s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:34190 target.addr=10.3.157.228:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    28.441865s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:34232 target.addr=10.3.157.228:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    28.829912s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    29.076363s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:33344 target.addr=10.3.248.42:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    29.083364s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:33354 target.addr=10.3.248.42:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    29.453894s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    30.097276s]  WARN ThreadId(01) outbound:http{v=1.x}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
[    30.940508s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:6312 target.addr=10.3.33.189:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed
[    30.944335s]  INFO ThreadId(01) outbound:accept{peer.addr=10.2.230.93:55330 target.addr=10.3.33.189:80}: linkerd2_app_core::serve: Connection closed error=connection closed before message completed

      Exit Code:    137
      Started:      Tue, 27 Oct 2020 18:14:43 +0000
      Finished:     Tue, 27 Oct 2020 18:15:24 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  250Mi
    Requests:
      cpu:      100m
      memory:   20Mi
...

Environment

Local environment (high mem consumption but no OOMKill)

  • Kubernetes Version: 1.18.2
  • Cluster Environment: KinD
  • Host OS: Linux
  • Linkerd version: edge-20.10.5

Equinix box (OOMKilled)

  • Kubernetes Version: 1.18.8
  • Cluster Environment: lokomotive
  • Host OS: Linux
  • Linkerd version: edge-20.10.5
@olix0r olix0r self-assigned this Oct 28, 2020

olix0r commented Oct 30, 2020

  • I was unable to reproduce this leak outside of a Kubernetes environment (even using the same client and server).
  • I was able (after a lot of fighting) to get heaptrack running in k8s to get more data
  • When running under heaptrack, the proxy's memory seems to stabilize around ~140MB -- without heaptrack it grows unbounded.
  • The heaptrack data indicates that leaked data is related to tokio::sync primitives -- perhaps triggered by our access patterns?

[heaptrack screenshots]


olix0r commented Oct 30, 2020

In order to test whether this leak is triggered by something in the profile discovery, I put together a simple TCP echo test.

I set it up with a similar load profile to the wrk2 test (128 client connections to 30 unique services):

:; helm install tcp-echo tcp-echo --create-namespace --namespace=tcp-echo --set=client.concurrency=128 --set=server.services=30

Memory remains stable at <20MB:

:; while linkerd -n tcp-echo metrics deploy/client |grep ^process_res ; do date ; sleep 10 ; done
process_resident_memory_bytes 19251200
Fri Oct 30 22:19:22 UTC 2020
...
process_resident_memory_bytes 19521536
Fri Oct 30 22:26:52 UTC 2020

This gives us a few signals: it's probably not related to service discovery, protocol detection, or load balancing, since the core implementations of these features are shared by both the TCP and HTTP stacks. The access patterns do differ -- we're only transporting ~600 connections per second, as opposed to ~4000 HTTP requests per second in the wrk2 case -- but this may indicate that the issue is caused by something specific to the HTTP stack.

To test these assumptions, we could reduce the wrk2 RPS to ~600 to match the TCP use case's load--if the problem can still be observed at this lesser load, it would confirm this hunch. If this problem can only be observed at higher loads, however, we might want to look more closely at our behavior when buffers fill up. Anecdotally, while I was testing yesterday, I did observe that the buffer in the logical stack reached capacity at times...


olix0r commented Oct 30, 2020

Testing wrk2 at lower volumes, a leak is still apparent:

  • at 300 RPS, the proxy reaches ~90MB in a 5 minute run
  • at 30 RPS, the proxy reaches ~40MB in a 5 minute run

In both cases, the memory growth is steady throughout the run. This points strongly to an issue triggered by HTTP-specific logic.


hawkw commented Oct 30, 2020

@olix0r is that with the proxy on master, or a different build?


olix0r commented Oct 31, 2020

@hawkw Using the last edge release.


olix0r commented Nov 5, 2020

I've written a load tester in Rust, ort -- basically porting strest, with support for
both HTTP/1 and gRPC. Its helm chart makes it simple to deploy a load test configuration to a
Linkerd-enabled cluster:

:; helm install ort ./ort --namespace ort --create-namespace \
   --set load.flags.concurrencyLimit=$c \
   --set load.flags.requestsPerClient=10000 \
   --set load.flags.requestLimit=4000 \
   --set load.linkerd.inject=enabled \
   --set load.threads=$t \
   --set server.linkerd.inject=$s \
   --set server.services=30

In this case, I'm running k3d with the latest linkerd edge-20.11.1.

This configuration approximates the wrk2 environment originally described in this issue:

  • We run 30 target services, each with a single server instance.
  • The load generator iterates through the list of target services, sending 10K requests at a time
    before moving to the next service.
  • The load generator sends at most $r requests per second.
  • There are no more than $c requests in flight at once.
  • When $s is enabled, the server pods are injected with a proxy sidecar and the client will multiplex HTTP/1.1 requests over H2 clients.

The results of these tests don't exactly match the original issue description (probably due to
differences in the load generator), but they do tell an interesting story:

s          r     c    http proxy RSS (MB)   grpc proxy RSS (MB)
disabled   4000  8    28                    23
disabled   4000  24   45                    23
disabled   4000  48   68                    24
disabled   4000  64   85                    24
disabled   4000  128  153                   24
enabled    4000  128  35                    24
disabled   8000  128  165                   28
enabled    8000  128  40                    28

In all cases, the proxy's memory climbs fairly slowly and eventually ~stabilizes. More growth
might occur if the proxy ran for longer, but each of these tests was limited to about 10-20
minutes (by which point the value appears to be fairly stable).

It's also worth noting that H2 isn't a pure win over HTTP/1 -- we appear to use 10%+ more CPU for
outbound H2 gRPC clients than we do with HTTP/1.1 clients.

When the server is meshed, its proxies each consume <20MB.

The gRPC client proxies are stable at <30MB in all cases, while the HTTP proxies grow according to
request concurrency. Furthermore, meshing the HTTP/1.1 traffic so that outbound requests are
multiplexed over HTTP/2 yields a substantial reduction in memory consumption. This all points
pretty strongly to this memory pressure originating from our HTTP/1.1 client: the outbound server
is communicating via HTTP/1.1 in all cases, and requests are not upgraded to HTTP/2 until the
HTTP endpoint stack (which sits below discovery and the load balancer).

While this doesn't exactly replicate the original described behavior, it gives us some strong
pointers about where to start looking.


olix0r commented Nov 6, 2020

Turning the concurrency up even more, the problems become more pronounced:

:; helm upgrade ort ./ort --namespace ort --create-namespace --set server.services=30 --set load.linkerd.inject=enabled --set server.linkerd.inject=disabled --set load.flags.requestsPerTarget=10000 --set load.flags.requestLimit=4000 --set load.flags.concurrencyLimit=1000 --set load.threads=2

The http proxy's memory shoots up in excess of 750MB, while the h2 client stays stable:

# Fri Nov  3 07:11:23 UTC 2020
pod/load-http-7997b84478-gfd9t    794.453MB
pod/load-grpc-b76cb9485-nfqfc      28.605M

This, again, points to something in the HTTP client stack, or potentially the outbound connection stack.

Comparing the number of open fds in the two proxies, we notice immediately that the http proxy has many more connections than we'd expect:

http    24631
grpc    47

The outbound client is maintaining 700+ connections to most services:

server-000.ort.svc.cluster.local        768
server-001.ort.svc.cluster.local        725
server-002.ort.svc.cluster.local        814
server-003.ort.svc.cluster.local        725
server-004.ort.svc.cluster.local        768
server-005.ort.svc.cluster.local        725
server-006.ort.svc.cluster.local        841
server-007.ort.svc.cluster.local        725
server-008.ort.svc.cluster.local        981
server-009.ort.svc.cluster.local        725
server-010.ort.svc.cluster.local        725
server-011.ort.svc.cluster.local        725
server-012.ort.svc.cluster.local        810
server-013.ort.svc.cluster.local        640
server-014.ort.svc.cluster.local        913
server-015.ort.svc.cluster.local        725
server-016.ort.svc.cluster.local        768
server-017.ort.svc.cluster.local        0
server-018.ort.svc.cluster.local        0
server-019.ort.svc.cluster.local        0
server-020.ort.svc.cluster.local        768
server-021.ort.svc.cluster.local        0
server-022.ort.svc.cluster.local        768
server-023.ort.svc.cluster.local        725
server-024.ort.svc.cluster.local        829
server-025.ort.svc.cluster.local        725
server-026.ort.svc.cluster.local        768
server-027.ort.svc.cluster.local        725
server-028.ort.svc.cluster.local        785
server-029.ort.svc.cluster.local        725

This probably points to the hyper connection pool and connection reuse. I'm not sure this is necessarily wrong behavior: in order to serve concurrent requests, the proxy needs to spin up new connections. It would probably be awkward to enforce a maximum concurrency, but we might consider decreasing the idle timeout... this is basically a trade-off between latency and memory cost. The current caching behavior is probably especially costly for applications like Prometheus that connect to many endpoints; and, in Prometheus's case especially, the latency hit of establishing new connections is acceptable.

In fact, it seems preferable to be more conservative about caching connections to unmeshed endpoints in general. We assume the common case is that communication is fully meshed and multiplexed; there, the cost of maintaining a cached connection is fairly low, so we can cache connections more aggressively (though, even then, this could be costly for meshed Prometheus scrapes).

Hyper offers two pool configuration options that we should set:

  • pool_idle_timeout is currently set at the default 90s. We should probably drop this to 10s (or the proxy's configured client cache timeout), in general.
  • pool_max_idle_per_host is currently unlimited. We should probably change this to a small value, like 2.
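
For reference, here's a minimal sketch of how these two pool options are applied on a hyper Client builder (assuming hyper 0.13/0.14; this is not the proxy's actual client-construction code):

use std::time::Duration;
use hyper::Client;

fn build_client() -> Client<hyper::client::HttpConnector> {
    Client::builder()
        // Evict pooled connections that sit idle longer than this
        // (hyper's default is 90s; the proposal above is 10s, or the
        // proxy's configured client cache timeout).
        .pool_idle_timeout(Duration::from_secs(10))
        // Cap the number of idle connections kept per host
        // (hyper's default is unlimited; the proposal above is a small
        // value like 2).
        .pool_max_idle_per_host(2)
        .build_http()
}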

With regard to Prometheus instances, we may want to expose the proxy's cache duration as an externally configurable flag, but this warrants some further thought.


olix0r commented Nov 6, 2020

linkerd/linkerd2-proxy#734 makes the proposed changes to the HTTP/1.1 client.


olix0r commented Nov 6, 2020

Running with the modified connection pool, the workload above that previously ran at ~700MB drops to ~155MB, though it appears that the proxy now uses substantially more CPU...

With concurrency at 128, where, on main, the proxy runs at ~150MB, the new configuration runs at ~35MB (again, with elevated CPU).

I'm inclined to go forward with these changes...


olix0r commented Nov 6, 2020

After further testing, it seems the best approach is to leave the number of idle connections unbounded, but to limit the cache idle age. The current default is actually 60s, not 10s. linkerd/linkerd2-proxy@c470643 changes these defaults so that the inbound timeout is 10s and the outbound timeout is 3s. This will cause additional discovery lookups when traffic is sporadic, but it should dramatically reduce the memory overhead for idle & sporadic connections. The change warrants a bit more testing and thought, so I don't think we should rush it into stable-2.9.0, but it seems like a good candidate for a followup patch release.
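
As a rough illustration of the policy (a simplified sketch, not the proxy's actual cache implementation; the names here are hypothetical), the idea is just that cached entries which go unused for longer than a configured max idle age get evicted:

use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

// Hypothetical sketch of max-idle-age eviction; not linkerd2-proxy code.
struct IdleCache<K, V> {
    max_idle_age: Duration,
    entries: HashMap<K, (V, Instant)>,
}

impl<K: Hash + Eq, V> IdleCache<K, V> {
    fn new(max_idle_age: Duration) -> Self {
        Self { max_idle_age, entries: HashMap::new() }
    }

    // A hit refreshes the entry's last-used time, keeping it alive.
    fn get(&mut self, key: &K) -> Option<&V> {
        match self.entries.get_mut(key) {
            Some((value, last_used)) => {
                *last_used = Instant::now();
                Some(&*value)
            }
            None => None,
        }
    }

    fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (value, Instant::now()));
    }

    // Run periodically: drop anything idle longer than max_idle_age
    // (e.g. 10s inbound / 3s outbound per the defaults above).
    fn evict_idle(&mut self) {
        let max_idle_age = self.max_idle_age;
        self.entries
            .retain(|_, (_, last_used)| last_used.elapsed() <= max_idle_age);
    }
}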


hawkw commented Nov 6, 2020

@olix0r do you think we should consider having different idle ages for HTTP/1 and HTTP/2 services, in order to limit the increased discovery load when idling out services is less expensive?

I suppose that if we wanted to do that, it would be a pretty big change, since there's currently One Big Cache that everything lives in...may not be worth it.


olix0r commented Nov 6, 2020

@hawkw Thinking about the Prometheus case, I think we actually really do want to limit the amount of time idle connections stay around for h2 as well. We could potentially have separate idle timeouts for the services and the connections if we see problems, but I think we should be more eager about evicting idle outbound things...


hawkw commented Nov 6, 2020

Hmm, yeah, that makes sense; I suppose for pretty much any outbound that's hot enough that we do want it in the cache, 10s is plenty!


olix0r commented Nov 10, 2020

Addressed by #5200

@olix0r olix0r closed this as completed Nov 10, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021