Memory-leak related to the resourcetotelemetry codepath? #33383
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Attaching the relevant memorylimiter logs...
Could there be some metrics that have a very large number of resource attributes? We end up allocating extra memory roughly proportional to the number of resource attributes times the number of datapoints they are copied onto. That said, I am a bit skeptical of this being an issue: the copy happens in opentelemetry-collector-contrib/pkg/resourcetotelemetry/resource_to_telemetry.go (lines 108 to 112 at e7cf560), and it looks like it allocates exactly what it needs.
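For context, the copy those lines perform has roughly the following shape (a simplified sketch assuming the pdata pcommon API, not the exact source):

```go
// Simplified sketch: pkg/resourcetotelemetry conceptually copies every
// resource attribute onto every datapoint's attribute map, so the extra
// allocation grows roughly with (resource attributes) x (datapoints).
package sketch

import "go.opentelemetry.io/collector/pdata/pcommon"

func joinAttributeMaps(from, to pcommon.Map) {
	// Pre-size the destination map, then copy each resource attribute in.
	to.EnsureCapacity(from.Len() + to.Len())
	from.Range(func(k string, v pcommon.Value) bool {
		v.CopyTo(to.PutEmpty(k))
		return true
	})
}
```

The destination map is pre-sized, so each call allocates only what it needs, but the copy is repeated for every datapoint that carries those resource attributes.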
@mx-psi,
That said... what this feels like is some kind of an issue where the GC is unable to clean up the data when the memory_limiter is tripped. So not an ongoing memory leak, but perhaps just a stuck pointer or something that prevents the data from being collected in some situations?
How about we just process the resource attributes in prometheusremotewrite.FromMetrics instead of copying them onto every datapoint via pkg/resourcetotelemetry?
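A minimal sketch of that idea, assuming pdata's pcommon maps and the prompb label type (the helper name is hypothetical, not the exporter's actual code):

```go
// Hypothetical sketch of the suggestion: turn resource attributes into
// Prometheus labels while building each series, instead of first
// duplicating them onto every datapoint's attribute map.
package sketch

import (
	"github.com/prometheus/prometheus/prompb"
	"go.opentelemetry.io/collector/pdata/pcommon"
)

// labelsFromAttrs merges datapoint and resource attributes into one label
// slice without mutating either map. Real code would also sanitize label
// names and resolve collisions.
func labelsFromAttrs(resAttrs, dpAttrs pcommon.Map) []prompb.Label {
	labels := make([]prompb.Label, 0, resAttrs.Len()+dpAttrs.Len())
	appendAttr := func(k string, v pcommon.Value) bool {
		labels = append(labels, prompb.Label{Name: k, Value: v.AsString()})
		return true
	}
	dpAttrs.Range(appendAttr)
	resAttrs.Range(appendAttr) // resource attrs become labels once per series
	return labels
}
```

Building the labels at series-construction time would avoid holding a second copy of the resource attributes on every datapoint while batches sit in the exporter queue.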
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners directly. Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hi @diranged, what (if any) workaround did you implement?
@grandwizard28 It would be useful if you could share more details about your setup, to find out similarities and differences with @diranged's setup.
@grandwizard28, I had to go digging to find it ... but here's the comment in our code for what we did:
This has mostly worked for us ... rather than trying to let OTel fix itself, we just let it die and restart. We'd rather have it fail fast than get stuck in a bad state.
Ahhh thanks for this @diranged. I ended up refactoring our custom exporter to remove the dependency on pkg/resourcetotelemetry.
Hey @mx-psi, our custom exporter is based on a very early version of the prometheusremotewrite exporter.
Component(s)
exporter/prometheusremotewrite, pkg/resourcetotelemetry
What happened?
Description
We're troubleshooting an issue where a single otel-collector-... pod out of a group begins to turn away metrics because the memorylimiter is tripped. In this situation, we have dozens or hundreds of clients pushing otlp metrics (Prometheus-collected, but sent over the otlp grpc exporter) to multiple otel-collector-... pods. The metrics are routed with the loadbalancer exporter using routing_key: resource.
The behavior we see is that one collector suddenly starts running out of memory and being limited, while the other collectors use half or even less memory to process the same number of events. Here are graphs of the ingestion and the success rate:
The two dips in the "Percentage of Metrics Accepted by Receiver" graphs are different pods in a StatefulSet. Here's the graph of actual memory usage of these three pods:
In the first dip, from 8:30AM to 9:30AM, I manually restarted the pod to recover it. It's fine now ... but a few hours later, a different one of the pods becomes overloaded. Grabbing a heap dump from the pprof endpoint on a "good" and a "bad" pod shows some stark differences:
Bad Pod Pprof: otel-collector-metrics-processor-collector-0.pb.gz
Good Pod Pprof: otel-collector-metrics-processor-collector-1.pb.gz
I should note that this isn't remotely our largest environment - and these pods are handling ~12-15k datapoints/sec, while our larger environments are doing ~20k/sec/pod... so this doesn't feel like a fundamental scale issue. All pods are sized the same across all of our environments.
Collector version
v0.101.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response