
Distributors using spectacular amount of memory #3188

Closed
amckinley opened this issue Sep 15, 2020 · 23 comments
@amckinley
Contributor

We're ingesting about 2M samples per second with ~105M time series across 75 ingesters and 95 distributors, and we're seeing our distributors balloon past 70GB per pod and keep growing until the pod consumes all physical memory on the node and gets OOM-killed. Ingester memory consumption, CPU time, and storage performance are all hovering comfortably at stable levels. The only errors in the logs for both distributors and ingesters are occasional "duplicate sample for timestamp" messages that we've previously been told to ignore (see #2832).

We're running Cortex v1.3.0 for all components, with the TSDB storage backend. Here are the args we're passing to the distributor pods:

-auth.enabled=false
-consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
-distributor.ha-tracker.cluster=cortex_ha_cluster
-distributor.ha-tracker.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
-distributor.ha-tracker.enable=true
-distributor.ha-tracker.enable-for-all-users=true
-distributor.ha-tracker.prefix=prom_ha/
-distributor.ha-tracker.store=consul
-distributor.health-check-ingesters=true
-distributor.ingestion-burst-size=1000000
-distributor.ingestion-rate-limit=100000
-distributor.ingestion-rate-limit-strategy=global
-distributor.remote-timeout=2s
-distributor.replication-factor=3
-distributor.ring.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
-distributor.ring.prefix=
-distributor.shard-by-all-labels=true
-limits.per-user-override-config=/etc/cortex/overrides.yaml
-log.level=info
-mem-ballast-size-bytes=1073741824
-ring.heartbeat-timeout=10m
-ring.prefix=
-server.grpc-max-recv-msg-size-bytes=104857600
-server.grpc-max-send-msg-size-bytes=104857600
-server.grpc.keepalive.max-connection-age=2m
-server.grpc.keepalive.max-connection-age-grace=5m
-server.grpc.keepalive.max-connection-idle=1m
-target=distributor
-validation.reject-old-samples=true
-validation.reject-old-samples.max-age=12h

And here's our Jsonnet config (built on the Grafana cortex-mixin library):

local cortex = import 'cortex/cortex.libsonnet';
local tsdb = import 'cortex/tsdb.libsonnet';

cortex + tsdb {
  _config+:: {
    namespace: 'cortex-tsdb',

    storage_backend: 'aws',
    storage_engine: 'blocks',
    blocks_storage_bucket_name: '<snip>',
    storage_tsdb_s3_endpoint: '<snip>',
    cortex_store_gateway_data_disk_class: 'gp2',
    cortex_compactor_data_disk_class: 'gp2',
    cortex_ingester_data_disk_class: 'gp2',
    cortex_ingester_data_disk_size: '200Gi',

    aws_region: 'us-east-1',
    s3_bucket_name: '<snip>',

    // XXX: 3 is the default
    replication_factor: 3,

    // only needed for alert manager?
    external_url: '<snip>',

    query_frontend: {
      sharded_queries_enabled: true,
    },

    ruler_enabled: true,
    ruler_client_type: 's3',
    ruler_s3_bucket_name: '<snip>',

    overrides: {
      fake: {
        max_series_per_metric: 0,  // Disabled in favour of the max global limit
        max_series_per_user: 0,  // Disabled in favour of the max global limit

        max_global_series_per_user: 320000000,  // 320M
        max_global_series_per_metric: 10000000,  // 10M

        max_series_per_query: 100000,
        max_samples_per_query: 1000000,

        ingestion_rate: 25000000,  //25M
        ingestion_burst_size: 50000000,  // 50M
      },
    },
  },

  // route incoming traffic to the correct service
  frontend_ingress: {
    apiVersion: 'extensions/v1beta1',
    kind: 'Ingress',
    metadata: {
      name: 'cortex',
      annotations: {
        'kubernetes.io/ingress.class': 'nginx',
      },
    },
    spec: {
      rules:
        [
          {
            host: '<snip>',
            http: {
              paths: [
                {
                  path: '/',
                  backend: {
                    serviceName: 'query-frontend',
                    servicePort: 80,
                  },
                },
                {
                  path: '/compactor',
                  backend: {
                    serviceName: 'compactor',
                    servicePort: 80,
                  },
                },
                {
                  path: '/distributor',
                  backend: {
                    serviceName: 'distributor',
                    servicePort: 80,
                  },
                },
                {
                  path: '/ingester',
                  backend: {
                    serviceName: 'distributor',
                    servicePort: 80,
                  },
                },
                {
                  path: '/ruler',
                  backend: {
                    serviceName: 'ruler',
                    servicePort: 80,
                  },
                },
                {
                  path: '/store-gateway',
                  backend: {
                    serviceName: 'store-gateway',
                    servicePort: 80,
                  },
                },
                {
                  path: '/api/prom/push',
                  backend: {
                    serviceName: 'distributor',
                    servicePort: 80,
                  },
                },
              ],
            },
          },
        ],
    },
  },
} + {
  namespace+: {
    metadata+: {
      annotations: {
        'iam.amazonaws.com/permitted': '<snip>',
      },
      labels: {
        name: 'cortex-tsdb',
      }
    },
  },

  // get rid of the CPU limits because of '<snip>'
  compactor_container+::
    $.util.resourcesLimits(null, '6Gi'),

  // raised from limit 4Gi
  distributor_container+::
    // $.util.resourcesLimits(null, '32Gi') +
    $.util.resourcesLimits(null, null) +
    $.util.resourcesRequests(2, '24Gi'),

  ingester_container+::
    $.util.resourcesLimits(null, '40Gi') +
    $.util.resourcesRequests('4', '20Gi'),

  querier_container+::
    $.util.resourcesLimits(null, '80Gi') +
    $.util.resourcesRequests('4', '40Gi'),

  store_gateway_container+::
    $.util.resourcesLimits(null, '48Gi') +
    $.util.resourcesRequests('2', '24Gi'),

  // disable auth everywhere to turn off multi-tenant
  alertmanager_args+:: {
    'auth.enabled': false,
  },
  distributor_args+:: {
    'auth.enabled': false,
    'log.level': 'info',
    'distributor.ha-tracker.store': 'consul',
    'distributor.ha-tracker.consul.hostname': 'consul.%s.svc.cluster.local:8500' % $._config.namespace,
    'distributor.ha-tracker.etcd.endpoints': null,
    'distributor.remote-timeout': '2s',

    // the 'cluster' label is already in use, so we use something else
    'distributor.ha-tracker.cluster': 'cortex_ha_cluster',

    // 100MB (bumped from default of 4MB)
    'server.grpc-max-recv-msg-size-bytes': 1024 * 1024 * 100,
    'server.grpc-max-send-msg-size-bytes': 1024 * 1024 * 100,
  },
  ingester_args+:: {
    'auth.enabled': false,
    'ingester.max-series-per-metric': 10000000,  // added to make our env work
    'ingester.max-series-per-user': 100000000,  // same. default is 5000000

    // 100MB (bumped from default of 4MB)
    'server.grpc-max-recv-msg-size-bytes': 1024 * 1024 * 100,
    'server.grpc-max-send-msg-size-bytes': 1024 * 1024 * 100,
  },
  ruler_args+:: {
    'auth.enabled': false,
  },
  querier_args+:: {
    'auth.enabled': false,
  },
  query_frontend_args+:: {
    'auth.enabled': false,
  },

  // dont create an etcd cluster
  etcd_cluster(name, size=3, version='3.3.13', env=[]):: {},

}

Ingestion is powered by a cluster of grafana-agent machines, each of which uses remote_write to send samples to an Amazon ELB backed by our production K8s cluster.

If you need any other piece of our configs, logs, manifests, or metrics data, please don't hesitate to ask. We've been working to adopt Cortex for our production metrics infrastructure, and this is the last blocker before we can cut over all queries to Cortex, so we're very interested in solving this.

@amckinley
Contributor Author

Example pprof (original SVG available on request):
[screenshot: pprof heap graph, Screen Shot 2020-09-15 at 4.45.13 PM]

@friedrich-at-adobe
Contributor

How many CPUs do your Kubernetes nodes have? Did you configure CPU limits on the distributors?

We had a similar issue: our nodes had 32 CPUs, and the problem was solved by setting CPU limits on the distributors, or by setting the environment variable GOMAXPROCS=requests.cpu.
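
For reference, a minimal sketch (not Cortex code) of the same idea applied from inside the binary, similar in spirit to uber-go/automaxprocs. CORTEX_CPU_REQUEST is a hypothetical env var you would populate yourself (e.g. via the Kubernetes downward API):

package main

import (
	"os"
	"runtime"
	"strconv"
)

func main() {
	// Cap the number of OS threads running Go code to the container's CPU
	// request instead of defaulting to the node's 96 vCPUs.
	if v, err := strconv.Atoi(os.Getenv("CORTEX_CPU_REQUEST")); err == nil && v > 0 {
		runtime.GOMAXPROCS(v)
	}
	// ... start the distributor as usual
}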

@amckinley
Contributor Author

How many CPUs do your Kubernetes nodes have? Did you configure CPU limits on the distributors?

c5.24xlarge EC2 instances, which have 96 vCPUs. We don't have CPU limits configured on these pods, but we request 2 CPUs. I'll try passing GOMAXPROCS and see if that changes anything.

@amckinley
Contributor Author

Also, last night we tried passing GOGC=30 to these containers, and that significantly slowed the growth of the heap; the maximum heap size now at least appears bounded. On a hunch, I also decreased the number of distributors from 95 to 25, and the total amount of memory consumed by the distributor pods stayed the same.
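
(For anyone experimenting with this: GOGC=30 can also be approximated from inside a process. A minimal sketch, not Cortex code:)

package main

import "runtime/debug"

func main() {
	// GOGC=30: allow the heap to grow only ~30% beyond the live set before
	// the next GC cycle is triggered (the default is 100).
	debug.SetGCPercent(30)

	// ... rest of the program
}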

@weeco
Contributor

weeco commented Sep 26, 2020

I've seen distributors OOMing in two cases:

  1. A single slow ingester (a bug tracked in "Distributors OOM on a single slow ingester in the cluster" #1895, but it hasn't had much attention lately).
  2. After upgrading from v0.4 to v0.6 (tracked in "distributor use too much more memory after upgrade to v0.6.0" #2053). This has never really been fixed either, but it's reproducible for multiple users.

My assumption is that problem 2 only happens for users running compression / multi-zone setups. I've shared some pprofs here and there as well, and they usually pointed to issues in the (de)compression.

@friedrich-at-adobe
Contributor

Quick update: setting GOMAXPROCS=requests.cpu (2 in my case) had unintended consequences, like increasing the latency of the distributors. I am now setting GOMAXPROCS=8, with better latency.

By the way, the metric I use for memory measurements is container_memory_working_set_bytes{container="distributor"} rather than container_memory_usage_bytes{container="distributor"}. The reasoning is explained at https://www.bwplotka.dev/2019/golang-memory-monitoring/.

@bboreham
Contributor

bboreham commented Oct 1, 2020

Can we get the heap profile data please? (not a screen dump or svg)

@amckinley
Contributor Author

Can we get the heap profile data please? (not a screen dump or svg)

Sure, here you go: https://github.com/amckinley/cortex-heap

@amckinley
Contributor Author

@bboreham anything else I can provide on our side? Happy to provide more heap profiles or try any tuning suggestions you have.

@pracucci
Contributor

pracucci commented Oct 9, 2020

Sure, here you go: https://github.com/amckinley/cortex-heap

The heap profile looks good to me. The inuse space is about 6GB. I assume the 1GB allocated in main is the ballast (-mem-ballast-size-bytes). From this profile I personally can't find any symptom of a memory leak.

My suspicion is that what you're seeing is some side effect of the Go GC's behaviour. How often does the GC trigger? I've read above that you set GOGC=30, so I would expect it to trigger frequently, but when running with the default GOGC it may trigger too late. Just a guess.

@bboreham
Contributor

Sorry @amckinley, missed your message.

The profile shows 6GB in use and 42GB allocated since the start of the program. This is hard to square with the symptom of 70GB, but maybe it was taken at a different point in time.

Nearly all of the memory in use is from Snappy.Decode(), and there are 703 buffers live. I think this indicates a problem: each buffer should be unmarshalled and then discarded, so memory from Unmarshal() should at least equal that from Decode().

I wonder if you have 700 active push requests, which would suggest to me they're not getting passed through to ingesters fast enough. Can you get a goroutine dump? (curl .../debug/pprof/goroutine) Do you monitor the throughput and latency on calls into and out from distributors?

I see util.ParseProtoReader() returns the buffer, which we never use. It would be worth removing that, in case it is contributing to the problem.

Other possible points of interest: how many distributors? How many samples per incoming push request?
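
To make the code path under discussion concrete, here is a rough sketch (simplified; not the actual Cortex source) of what happens to each push request body:

package sketch

import (
	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
)

// decodePush mirrors the shape of util.ParseProtoReader: the body is
// snappy-decoded into a fresh buffer and then unmarshalled.
func decodePush(compressed []byte, req proto.Message) error {
	body, err := snappy.Decode(nil, compressed) // the allocation seen in the heap profile
	if err != nil {
		return err
	}
	// If anything in the unmarshalled message keeps strings or byte slices
	// that alias `body`, the whole decoded buffer stays reachable for as
	// long as the message (or anything copied from it) is retained.
	return proto.Unmarshal(body, req)
}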

@chancez

chancez commented Nov 7, 2020

Just reading this makes me think this could be CPU throttling, given all the discussion about CPU requests/limits and GOMAXPROCS. CPU throttling could lead to GCs not completing fast enough. It's easy for a program to exhaust its available CPU CFS periods with a lot of threads/processes, so adjusting GOMAXPROCS could be helping with that.

@amckinley
Contributor Author

@bboreham sorry for the delay; I'm back to working on this now. Here's another heap dump (~50GB, this time of the particular distributor that's at max for our cluster), and here's a goroutine dump of the same distributor.

We're currently running with 30 distributors and 75 ingesters. We're using this very... unorthodox queue_config:

queue_config:
  capacity: 100000
  max_samples_per_send: 20000
  min_shards: 10
  max_shards: 10

We did a tremendous amount of testing with the suggested settings of MUCH higher values for max_shards, and every configuration we tried ended up with the WAL growing without bound on our grafana-agent hosts. In desperation we tried the settings above, and that's when our cluster finally stabilized.

@bboreham
Contributor

Thanks, that adds a bit more detail, and you managed to capture the heap at a point where in-use memory was high.

The memory is still all being allocated by Snappy:

github.com/cortexproject/cortex/pkg/util.ParseProtoReader
/go/src/github.com/cortexproject/cortex/pkg/util/http.go

  Total:           0    33.29GB (flat, cum) 96.85%
    106            .          .           		if err == nil && len(body) <= maxSize { 
    107            .    33.14GB           			body, err = snappy.Decode(nil, body) 
    108            .          .           		} 

This time there are 4,227 blocks allocated, for an average size of about 7.5MB, which is not out of line for 20,000 samples (~390 bytes per sample, including all label names and values).

The other dump says there are a total of 355 goroutines, which doesn't support my theory of push requests getting stuck, but it was taken 8 minutes after the heap dump, so conditions may have changed. It would be better to get two dumps close together in time.

Next theory: something is getting retained from inside the buffer. We've had issues like this before, but that was in the ingester; the distributor is more stateless.

Perhaps it could relate to this change in Go: golang/go@2dcbf8b, whereby sync.Pool is not cleared completely on every GC. We do pool outgoing requests to ingesters, and I can believe that those have pointers into the incoming buffer. If that's the case, taking more care to zero out those pointers would fix it.
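
A minimal illustration (not Cortex code) of that retention pattern: an object returned to a sync.Pool that still references the decoded buffer keeps the buffer alive, and since the Go change linked above, pools are no longer emptied on every GC.

package sketch

import "sync"

type pooledReq struct {
	labels []string // may alias a large snappy-decoded buffer
}

var reqPool = sync.Pool{New: func() interface{} { return &pooledReq{} }}

// release returns a request object to the pool. Zeroing the references first
// is the "taking more care to zero out those pointers" fix mentioned above;
// without it, the pooled object pins whatever the strings point into.
func release(r *pooledReq) {
	for i := range r.labels {
		r.labels[i] = ""
	}
	r.labels = r.labels[:0]
	reqPool.Put(r)
}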

I'm struck that your 20,000 samples per send is much higher than I use (1,000), but you say you did a lot of experimenting.
Certainly if those incoming buffers were smaller then it wouldn't matter to have 4,000 of them retained.

Ideally we would walk the chain of pointers from those buffers back to their roots, so we know what is keeping them alive. But I don't know of a way to do that in Go - we'd need a tool like goheapdump that has been updated to match current Go heap formats.

@chancez

chancez commented Nov 20, 2020

This still sounds like it could be the GC not running often enough or not keeping up, especially if container CPU limits are set. Is there a good way to verify that?

@bboreham
Contributor

@chancez that doesn't match my understanding of how Go works.

You can set the environment variable GODEBUG=gctrace=1 to get a line about each garbage-collection in the log file.

Note from earlier "We don't have CPU limits configured on these pods".
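
If parsing gctrace output is inconvenient, roughly the same information can be sampled in-process; a small sketch (not Cortex code):

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		// NumGC counts completed GC cycles; HeapInuse is heap memory currently in use.
		fmt.Printf("gc_cycles=%d heap_inuse_mib=%d\n", m.NumGC, m.HeapInuse>>20)
	}
}

The Go client's default collectors on the distributor's /metrics endpoint (go_memstats_*, go_gc_duration_seconds) should already expose most of this, which may be enough to see how often the GC runs.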

@chancez

chancez commented Nov 20, 2020

@bboreham I wasn't referring strictly to limits, but to CPU requests as well. It's all CFS quotas, just hard vs. soft limits. They said they use CPU requests:

We don't have CPU limits configured on these pods, but we request 2 CPUs.

CPU requests get translated to --cpu-shares under Docker, which will result in less CPU time if there's other CPU activity on the host besides the distributor. The Go program might be getting less CPU time than it needs to keep up with GC, especially when it has a high GOMAXPROCS, which means there are a lot of threads competing for a small amount of CPU time. This might also be why a few people had success lowering GOMAXPROCS.

@bboreham
Contributor

I can believe that those have pointers into the incoming buffer

Well, I checked, and we take care to blank out those pointers. Strike that theory.

for i := 0; i < len(ts.Labels); i++ {
	ts.Labels[i].Name = ""
	ts.Labels[i].Value = ""
}

@amckinley I see you have 'ha' settings - do you have a lot of clusters? (i.e. could you have thousands of distinct values of cortex_ha_cluster and __replica__?)

We should copy the strings returned by findHALabels(), so they don't point into the gRPC buffer when retained by the ha tracker.
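
A sketch of that fix (the helper name is mine, not Cortex's): copy the label value into a fresh backing array before handing it to anything long-lived such as the HA tracker.

// detachString copies s so the result no longer aliases the (much larger)
// gRPC/snappy buffer it was sliced from.
func detachString(s string) string {
	return string(append([]byte(nil), s...))
}

// e.g. cluster := detachString(clusterLabel); replica := detachString(replicaLabel)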

@amckinley
Contributor Author

@bboreham We have 8 clusters, each of which has 2 replicas. (Actually, we have one huge cluster, but in order to make grafana-agent work, we had to create 8 distinct scrape configs and use the Prometheus hashmod relabel operator to get each agent to only scrape 1/8 of our total metrics).
[screenshot: Screen Shot 2020-12-01 at 12.35.57 PM]

@anarcher

I have a similar issue with the ha-tracker enabled on 1.7.0. Would the distributor's memory usage be lower without the ha-tracker?

@bboreham
Contributor

bboreham commented Apr 8, 2021

@anarcher it is quite possible there is a bug causing this high memory usage in the HA-tracker code.
Without pinpointing the true cause we cannot say for sure.

@bboreham
Contributor

Returning to this:

could you have thousands of distinct values of cortex_ha_cluster and __replica__?

These metric calls use strings that point into the incoming buffer and hence cause it to be retained:

d.dedupedSamples.WithLabelValues(userID, cluster).Add(float64(numSamples))

c.kvCASCalls.WithLabelValues(userID, cluster).Inc()

So it's really "thousands of distinct combinations of tenant (user) ID and cluster". But I still don't think it matches what was seen.

Still, we should copy the string before using it in a place that may be retained.
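
For illustration, a sketch of why those calls can retain the request buffer (the metric name and wiring here are illustrative, not the exact Cortex definitions): a *Vec keeps its label values for the lifetime of the child metric, so a cluster string that aliases the decoded request pins that whole buffer unless it is copied first.

package sketch

import "github.com/prometheus/client_golang/prometheus"

var dedupedSamples = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "distributor_deduped_samples_total", // illustrative name
		Help: "Sketch counter.",
	},
	[]string{"user", "cluster"},
)

func recordDeduped(userID, cluster string, numSamples int) {
	// Copy `cluster` so the counter vec's internal map does not keep the
	// snappy-decoded request buffer alive via an aliasing string.
	cluster = string(append([]byte(nil), cluster...))
	dedupedSamples.WithLabelValues(userID, cluster).Add(float64(numSamples))
}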

@alvinlin123
Contributor

Hi, this is most likely caused by a memory leak that was fixed in #4739.

Another issue reported lower memory usage after deploying 1.13.0.

I will resolve this issue, but feel free to re-open if the problem is still there after upgrading to 1.13.0.
