Distributors using spectacular amount of memory #3188
Comments
How many CPUs do your Kubernetes nodes have? Did you configure CPU limits on the distributors? We had a similar issue with 32-CPU machines, and it was solved by setting CPU limits on the distributors or by setting the environment variable GOMAXPROCS=requests.cpu.
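For anyone checking their own deployment: the Go runtime reads the GOMAXPROCS environment variable itself, so no code change is needed, but the effective value can be logged at startup. A minimal sketch (not Cortex code; go.uber.org/automaxprocs can also derive the value from a CFS quota when limits are set):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it.
	// The runtime initialises it from the GOMAXPROCS env var, or from
	// the number of visible CPUs (96 on a c5.24xlarge) if unset.
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}
```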
c5.24xlarge EC2 instances, which have 96 vCPUs. We don't have CPU limits configured on these pods, but we request 2 CPUs. I'll try passing
Also, last night we tried passing
I've seen Distributors OOMing on
My assumption is that problem #2 only happens for users running compression / multi-zone setups. I shared some pprofs here and there as well, and they usually pointed to issues in the (de)compression.
Quick update: setting GOMAXPROCS=requests.cpu (2 in my case) had unintended consequences, like increasing distributor latency. I am now setting GOMAXPROCS=8, with better latency. By the way, the memory metric I use for my measurements is container_memory_working_set_bytes{container="distributor"}, not container_memory_usage_bytes{container="distributor"}. The reasoning is explained at https://www.bwplotka.dev/2019/golang-memory-monitoring/
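As a hedged illustration of why the container working-set number and the Go heap can diverge (the runtime holds onto freed spans and returns them to the OS lazily), something like this dumps the relevant runtime counters; it is a sketch, not part of Cortex:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapInuse is roughly what a heap profile reports as in-use space;
	// HeapSys/Sys is closer to what the kernel charges the container,
	// which is why container_memory_working_set_bytes can sit well above
	// the in-use heap shown by pprof.
	fmt.Printf("HeapInuse=%dMiB HeapSys=%dMiB Sys=%dMiB\n",
		m.HeapInuse>>20, m.HeapSys>>20, m.Sys>>20)
}
```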
Can we get the heap profile data please? (not a screen dump or svg)
Sure, here you go: https://github.com/amckinley/cortex-heap
@bboreham anything else I can provide on our side? Happy to provide more heap profiles or try any tuning suggestions you have.
The heap profile looks good to me. The in-use space is about 6GB. I assume the 1GB allocated in main is the ballast.
My suspicion is that what you see is some sort of side effect of Go GC behaviour. How often does the GC trigger? I've read above you set
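For context, a heap ballast is usually implemented as a large allocation that is never touched, inflating the live heap so that GOGC-triggered collections run less often. A minimal sketch of the pattern (assumed, not the actual Cortex ballast code):

```go
package main

import (
	"runtime"
	"time"
)

func main() {
	// A 1 GiB ballast: the pages are never written, so the OS doesn't
	// back them with real memory, but the GC counts the allocation as
	// live heap. With GOGC=100 the next collection only triggers once
	// the heap grows by roughly another 1 GiB, so bursts of small
	// allocations no longer cause frequent GC cycles.
	ballast := make([]byte, 1<<30)

	// ... run the real workload here ...
	time.Sleep(time.Minute)

	runtime.KeepAlive(ballast) // stop the compiler from eliding the ballast
}
```

The trade-off is that tools reading the heap profile see the ballast as allocated in main, which matches what the profile above shows.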
Sorry @amckinley, missed your message. The profile shows 6GB in use, and 42GB allocated since the start of the program. This is hard to square with the symptom of 70GB, but maybe it was taken at a different point in time. Nearly all of the memory in use is from
I wonder if you have 700 active push requests, which would suggest to me they're not getting passed through to ingesters fast enough. Can you get a goroutine dump?
I see
Other possible points of interest: how many distributors? How many samples per incoming push request?
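If the distributor exposes the standard net/http/pprof handlers on its HTTP port (the address and port below are assumptions, adjust for your deployment), both dumps can be pulled with something like this sketch:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

// fetch saves one pprof endpoint to a local file.
func fetch(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://distributor:80/debug/pprof" // assumed address and port
	// debug=2 gives full stacks for every goroutine, which is what is
	// needed to see whether push requests are piling up.
	if err := fetch(base+"/goroutine?debug=2", "goroutines.txt"); err != nil {
		panic(err)
	}
	if err := fetch(base+"/heap", "heap.pprof"); err != nil {
		panic(err)
	}
}
```

Taking the goroutine and heap dumps back to back avoids the problem noted below of the two being several minutes apart.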
Just reading this makes me think this could be CPU throttling, given all the discussion about CPU requests/limits and GOMAXPROCS. CPU throttling could be leading to GCs not being completed fast enough. It's easy for a program to exhaust its available CPU CFS periods with a lot of threads/processes, so adjusting GOMAXPROCS could be helping with that.
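One way to check the throttling theory directly is to look at the CPU controller's stat counters from inside the container. This is a sketch assuming cgroup v1 paths; note that with no CPU limit set there is no quota to throttle against:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// On cgroup v1 this file reports nr_periods, nr_throttled and
	// throttled_time (ns); a growing nr_throttled means the CFS quota
	// is being hit. Under cgroup v2 the equivalent counters live in
	// cpu.stat at the cgroup's own directory.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.stat")
	if err != nil {
		fmt.Println("could not read cpu.stat:", err)
		return
	}
	fmt.Print(string(data))
}
```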
@bboreham sorry for the delay; I'm back to working on this now. Here's another heap dump (~50GB, this time of the particular distributor that's at max for our cluster), and here's a goroutine dump of the same distributor. We're currently running with 30 distributors and 75 ingesters. We're using this very... unorthodox
We did a tremendous amount of testing using the suggested settings of MUCH higher values for
Thanks, that adds a bit more detail, and you managed to snap the heap at a point where in-use was high. The memory is still all being allocated by Snappy:
This time there are 4,227 blocks allocated, so an average size of 7.5MB, which is not out of line for 20,000 samples (~390 bytes per sample including all label names and values).
The other dump says there are a total of 355 goroutines, which doesn't support my theory of push requests getting stuck, but it was taken 8 minutes after the heap dump, so maybe conditions had changed. It would be better to get two dumps close together in time.
Next theory: something is getting retained from inside the buffer. We've had issues like this before, but that was in the ingester; the distributor is more stateless. Perhaps it could relate to this change in Go: golang/go@2dcbf8b, whereby
I'm struck that your 20,000 samples per send is much higher than I use (1,000), but you say you did a lot of experimenting.
Ideally what we should really do is walk the chain of pointers to those buffers up to the root, so we know what is keeping them alive. But I don't know of a way to do that in Go - we need a tool like goheapdump which has been updated to match current Go formats.
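To make the buffer-retention theory concrete, here is a small self-contained illustration (not Cortex code) of how a tiny subslice that aliases a decode buffer keeps the whole buffer reachable:

```go
package main

import "fmt"

// retained stands in for anything long-lived (a metric label, an
// HA-tracker entry) that ends up referencing request data.
var retained []byte

func handle(buf []byte) {
	// Subslicing does not copy: this 9-byte slice shares buf's 8 MiB
	// backing array, so the GC must keep the whole buffer alive for as
	// long as the slice is reachable.
	retained = buf[0:9]

	// Copying just those bytes would let the big buffer be collected
	// once the request finishes, e.g.:
	//   retained = append([]byte(nil), buf[0:9]...)
}

func main() {
	buf := make([]byte, 8<<20) // stand-in for one ~7.5MB decoded push request
	copy(buf, "cluster-a")
	handle(buf)
	fmt.Printf("holding %d bytes of backing array via a %d-byte slice\n",
		cap(retained), len(retained))
}
```

The same applies to strings created without copying from a decode buffer, which is why the follow-up comments below focus on where cluster and user strings end up.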
This still sounds like it could be the GC not running fast enough/keeping up, especially if container CPU limits are set. Is there a good way to verify that?
@chancez that doesn't match my understanding of how Go works. You can set the environment variable
Note from earlier: "We don't have CPU limits configured on these pods".
@bboreham I wasn't referring strictly to limits, but to CPU requests as well. It's all CFS quotas, just hard/soft limits. They said they use CPU requests:
CPU requests get translated to
Well, I checked, and we take care to blank out those pointers. Strike that theory.
cortex/pkg/ingester/client/timeseries.go, lines 280 to 283 in 7f85f92
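For readers without the source open, the general pattern being referenced is a pooled slice whose elements are blanked before being returned, so nothing in the pool keeps request buffers alive. A hedged sketch of that pattern (illustrative types, not the actual timeseries.go code):

```go
package main

import "sync"

// series stands in for a decoded time series whose label strings may
// point into the request's decode buffer.
type series struct {
	labels  []string
	samples []float64
}

var slicePool = sync.Pool{
	New: func() interface{} { return make([]series, 0, 1024) },
}

// reuseSlice returns a slice to the pool after clearing every element,
// so that nothing held by the pool keeps the underlying request buffer
// alive. This mirrors the intent of the blanking in timeseries.go, but
// is not the actual Cortex code.
func reuseSlice(ts []series) {
	for i := range ts {
		ts[i] = series{} // drop references to labels/samples
	}
	slicePool.Put(ts[:0])
}

func main() {
	ts := slicePool.Get().([]series)
	ts = append(ts, series{labels: []string{"cluster", "a"}})
	reuseSlice(ts)
}
```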
@amckinley I see you have 'ha' settings - do you have a lot of clusters? (I.e. could you have thousands of distinct values of
We should copy the strings returned by
@bboreham We have 8 clusters, each of which has 2 replicas. (Actually, we have one huge cluster, but in order to make
I have a similar issue with
@anarcher it is quite possible there is a bug causing this high memory usage in the HA-tracker code.
Returning to this:
These metrics use strings which point into the incoming buffer, hence will cause it to be retained:
cortex/pkg/distributor/distributor.go, line 605 in 523dde1
cortex/pkg/distributor/ha_tracker.go, line 392 in 523dde1
So it's really "thousands of distinct combinations of tenant (user) ID and cluster". But I still don't think that matches what was seen. Still, we should copy the string before using it in a place where it may be retained.
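A hedged sketch of that suggested fix: copy the tenant/cluster strings before storing them anywhere long-lived. The names below are illustrative, not the actual Cortex functions:

```go
package main

import "fmt"

// copyString forces s onto its own backing array so that storing the
// result does not pin the request buffer the original may point into.
// (strings.Clone does the same in Go >= 1.18.)
func copyString(s string) string {
	return string([]byte(s))
}

// seen stands in for a long-lived structure such as the HA tracker's
// per-user cluster cache or a metric label set.
var seen = map[string]string{}

func recordSeen(userID, cluster string) {
	seen[copyString(userID)] = copyString(cluster)
}

func main() {
	recordSeen("tenant-1", "cluster-a")
	fmt.Println(len(seen), "entries retained without pinning request buffers")
}
```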
We're ingesting about 2M samples per second with ~105M time series across 75 ingesters and 95 distributors, and we're seeing our distributors balloon past 70GB/pod and continue to grow until the pod consumes all physical memory on the node and gets OOM-killed. Ingester memory consumption, CPU time, and storage performance are all hovering comfortably at a stable level. The only errors in the logs for both distributors and ingesters are occasional "duplicate sample for timestamp" messages that we've previously been told to ignore (see #2832).
We're running Cortex v1.3.0 for all components, with the TSDB storage backend. Here's the args we're passing to the distributor pods:
And here's our JSONnet config (powered by the grafana cortex-mixin library):
Ingestion is powered by a cluster of grafana-agent machines, each of which is remote_write-ing to an Amazon ELB backed by our production K8S cluster.
If you need any other piece of our configs, logs, manifests, or metrics data, please don't hesitate to ask. We've been working to adopt Cortex for our production metrics infrastructure, and this is the last blocker before we can cut over all queries to Cortex, so we're very interested in solving this.