Potential memory leak with the prometheus metrics exporter #3621
Comments
That does indeed look like a memory leak; however, I'm having some trouble replicating it on my end. 🤔 Could you provide some more info on the metrics that you're recording? A known attribute configuration that can cause memory leaks is attributes with many possible values (for instance, a timestamp or a user ID); see #2997. Could this be what's causing your issue? |
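As a hedged illustration of the high-cardinality attribute pattern mentioned above (the instrument and attribute names here are hypothetical, not taken from the original report):

```ts
import { metrics } from '@opentelemetry/api';

// Hypothetical instrument, for illustration only.
const meter = metrics.getMeter('example-service');
const requestCounter = meter.createCounter('http.server.requests');

function onRequest(userId: string): void {
  // Anti-pattern: an attribute whose values are effectively unbounded
  // (user IDs, timestamps, request IDs) creates a new metric stream per
  // distinct value, and those streams are kept for the life of the process.
  requestCounter.add(1, { 'user.id': userId });

  // Bounded alternative: attributes drawn from a small, fixed set of values.
  // requestCounter.add(1, { 'http.route': '/checkout' });
}
```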
@bruce-y any update? 🤔 |
I'm currently tracking an ever increasing CPU usage by aws-otel-collector while using the otlp exporter. So I'm wondering if this is two bugs, the same bug in two implementations, or a continuous accumulation of data that's not being cleared during an export (and/or being sent multiple times). Spiralling data would result in memory increases, and spiralling data sets would result in increasing CPU. |
@bruce-y what's your export interval? Also are you deduping metrics with the same name? One of the first problems I ran into with this ecosystem was treating stats like StatsD, where you can just keep throwing the same name at the API over and over with no consequences. |
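A minimal sketch of the StatsD-style usage this comment warns about, under the assumption of a standard @opentelemetry/api setup (names are illustrative):

```ts
import { metrics } from '@opentelemetry/api';

function recordRequest(route: string): void {
  // StatsD-style anti-pattern: asking the meter for a "new" instrument with
  // the same name on every call. The SDK keeps per-instrument state, so the
  // intended usage is to create each instrument once and reuse it (see the
  // singleton sketch a few comments below).
  const counter = metrics
    .getMeter('example-service')
    .createCounter('http.server.requests');
  counter.add(1, { 'http.route': route });
}
```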
Apologies for the late reply.
We aren't dealing with super high cardinality for a single metric, but we are exporting many histograms that each contain a large number of buckets. I would say there are potentially 80k lines of metrics when exported. We're exporting the same amount via https://github.com/siimon/prom-client without issue, though.
The scrape interval we have set locally is every 30 seconds.
We're keeping the metric references as a singleton in our application, so anytime something is observed we just use the existing metric objects rather than instantiating new ones with the same name. |
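A sketch of the singleton pattern described here, assuming a typical module layout (file and metric names are hypothetical):

```ts
// metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('example-service');

// Created once at module load; every call site imports and reuses these.
export const requestDuration = meter.createHistogram('http.server.duration', {
  unit: 'ms',
});

// elsewhere in the app:
// import { requestDuration } from './metrics';
// requestDuration.record(elapsedMs, { 'http.route': '/checkout' });
```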
I'm experiencing something similar, but only when I register TWO MetricReaders (PrometheusExporter and PeriodicExportingMetricReader). Either one by itself seems to work fine. |
Hmm, interesting. Does it just never run out of memory with just one reader, or does it just take longer? 🤔 Which exporter are you using with the PeriodicExportingMetricReader? What's its temporality preference? Maybe there's some combination of temporalities that's causing a memory leak in the SDK 🤔 |
PeriodicExportingMetricReader is using Cumulative temporality as well. |
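For reference, a minimal sketch of the two-reader setup being described, both using cumulative temporality. The OTLP endpoint and interval are assumptions, and option names vary slightly across @opentelemetry/sdk-metrics versions (newer releases take a `readers` option in the MeterProvider constructor instead of `addMetricReader()`):

```ts
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const meterProvider = new MeterProvider();

// Reader 1: pull-based Prometheus endpoint (always cumulative).
meterProvider.addMetricReader(new PrometheusExporter({ port: 9464 }));

// Reader 2: push-based periodic export, also cumulative by default.
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    exportIntervalMillis: 30_000,
  })
);
```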
Thanks for the additional info @cyberw, I'll try to reproduce. |
From my own investigations, I've seen that every combination of labels is saved as a data structure inside a Metric, in perpetuity. So as our app warms up and the whole cluster eventually sees every variation of every combination of labels, the memory footprint climbs and climbs for hours to days. From conversations with others (such as the one yesterday on Hacker News), some people are implying that this is not the case for OpenTelemetry implementations in other programming languages. If that's true, I'd very much like for this implementation to reach feature parity. I have some not very kind things to say about OTel, and most of them stem from the scalability implications of total persistent state tracking (vs. say StatsD, which forgets everything the moment the stat is sent). The other is the antagonism toward GIL or single-thread-per-process languages, which end up running 16-40 copies of the data structures per machine. The aggregate stat size is a huge blind spot, and if that's an accident of implementation instead of a design goal then it definitely needs to be addressed. But from the code I don't see how it could be an accident. |
@jdmarshall that's true, some language implementations do have an experimental feature to evict unused metric streams. Agreed, having it in the JS SDK would be great too (opened #4095 a few days ago to track it). |
I fiddled with the delta-based reporting, which seems like the most likely route to dealing with zebras, but WebStorm refuses to breakpoint in opentelemetry-js. And the code is very... Spring-like, so single-stepping is an exercise in very deep patience, along with copious note-keeping or a truly cavernous short-term memory. I was unable to get to the bottom of why our collector went to 0 stats when I attempted to use it. |
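For context, "delta-based reporting" here refers to requesting delta temporality from the push exporter. A hedged sketch; the exact option and enum names have varied across exporter versions (`temporalityPreference` has accepted either `AggregationTemporality` or an `AggregationTemporalityPreference` value):

```ts
import { AggregationTemporality } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

// Request delta temporality so streams report the change since the last
// export instead of accumulating cumulatively.
const exporter = new OTLPMetricExporter({
  temporalityPreference: AggregationTemporality.DELTA,
});
```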
Is this fixed by #4163? |
Unfortunately, no. It should solve the problem you had (#4115), but I was never able to reproduce this exact problem reported by @bruce-y (single Prometheus exporter/metric reader). The fix from #4163 strictly addresses problems with 2+ metric readers. Any help with reproducing this issue would be greatly appreciated. |
I know in my case I had set up the memory reader for some debugging output. Since I had some data points from before adding that code, I had suspicions that it was involved, but it took some pretty deep stepping into the code to see how it mattered. I don't know about @bruce-y, but I anticipate that this fixed my issue. |
I'm closing this issue as cannot reproduce (single Prometheus exporter setup). |
What happened?
Steps to Reproduce
We've instrumented some metrics via the OTel metrics SDK and are exposing them with the Prometheus exporter.
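The reporter's actual setup code was not captured here; purely as an illustration of the kind of setup described (single Prometheus reader, many histograms, 30-second scrapes), it might look roughly like this:

```ts
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// Expose a /metrics endpoint for Prometheus to scrape (default port 9464).
const meterProvider = new MeterProvider();
meterProvider.addMetricReader(new PrometheusExporter({ port: 9464 }));

const meter = meterProvider.getMeter('example-service');
// Many histograms with many buckets each, roughly matching the ~80k exported
// lines mentioned in the thread.
const latency = meter.createHistogram('http.server.duration', { unit: 'ms' });
latency.record(12.3, { 'http.route': '/checkout' });
```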
Expected Result
Prometheus scrapes against this service should succeed without memory continuously increasing.
Actual Result
We are seeing a very steady increase of memory usage until it hits the heap limit, at which point the service terminates and restarts.
Additional Details
OpenTelemetry Setup Code
package.json
Relevant log output
No response