[exporter/datadogexporter] Potential memory leak in Datadog exporter #15720
Pinging code owners: @KSerrania @mx-psi @gbbr @knusbaum @amenasria @dineshg13. See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hi Indrek! Thanks for reporting this. It's quite an interesting edge-case I've never seen before. Some additional information would help in trying to reproduce this:
I realise it's asking for a lot, but otherwise it will be hard to reproduce.
Additionally, for no apparent reason other than a hunch: can you try removing the … ? Lastly, is something missing from the YAML config you posted? I see a …
I will start going through things one by one.
You're correct. We're building the config file using Helm and I overlooked a shared file. I've edited the original post. Only the attributes/upsert was missing.
No change.
CPU (generated with …). Memory (generated with …). I hope I generated the profiles correctly. I also attached the images I got with it.
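For reference, profiles like these can be pulled from the collector's pprof endpoint. The sketch below is an assumption-laden illustration: it presumes the pprof extension is enabled on its default localhost:1777 address and simply fetches the standard /debug/pprof paths. The saved files can then be inspected with go tool pprof.

```go
package main

import (
	"io"
	"net/http"
	"os"
)

// save fetches a pprof profile from the given URL and writes it to path.
func save(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Assumed pprof extension endpoint; adjust to your deployment.
	// 30-second CPU profile and a point-in-time heap profile.
	_ = save("http://localhost:1777/debug/pprof/profile?seconds=30", "cpu.pprof")
	_ = save("http://localhost:1777/debug/pprof/heap", "heap.pprof")
}
```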
Anything specific I could look into there myself? I can currently only reproduce this easily in a production environment and the tcpdump would contain sensitive information, so I'd prefer not to share it.
I see. Thanks for checking. What about the output of the fileexporter? Would that also be sensitive? Would it help at all if you sent it privately instead of posting it here?
That might be sensitive as well, yes. I haven't tried it out to see what exactly it contains. For the tcpdump, would … ? Any tips on what you would look for in those files? I could try to analyze them myself, but I'm not sure exactly what to look for. EDIT: FYI, I see traffic to …
What's really weird to me is that I'm unable to find who/when starts the …
@indrekj it's started by …

To share potentially sensitive data with us, I recommend reaching out via our support. Feel free to mention my name for a faster turnaround.
I'm dumb. I did a git pull against my old forked repo and that's why it didn't include that part of the code. Thanks.

I think I made some progress though. I was investigating the Concentrator flushNow and addNow functions. It seems impossible to have old buckets that are not flushed, but there's no upper bound: theoretically you can add spans that end in the far future, and those won't be flushed until that far-future time. So I looked more into the … Currently, it seems that this is not a Datadog issue, though it would be nice if the … I'll keep the issue open until I verify that the timestamp is indeed the actual problem.
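To make the unbounded-growth argument concrete, here is a minimal, hypothetical sketch (not the actual datadog-agent Concentrator code; names and bucket size are invented) of a stats aggregator that keys buckets by span end time and only ever flushes buckets older than now. Any span with a far-future end timestamp occupies a bucket that is never flushed:

```go
package main

import (
	"fmt"
	"time"
)

// bucketSize mimics aggregating span stats into fixed-width time buckets.
const bucketSize = 10 * time.Second

// toyConcentrator is a toy stand-in: it keys stats buckets by the span's
// end timestamp, truncated to the bucket size.
type toyConcentrator struct {
	buckets map[int64]int // bucket start (unix nanos) -> span count
}

// add records a span into the bucket containing its end time.
func (c *toyConcentrator) add(spanEnd time.Time) {
	key := spanEnd.Truncate(bucketSize).UnixNano()
	c.buckets[key]++
}

// flush drops buckets older than now. Future buckets are kept so they can
// "complete" later; there is no upper bound on how far ahead they may be.
func (c *toyConcentrator) flush(now time.Time) {
	cutoff := now.Truncate(bucketSize).UnixNano()
	for key := range c.buckets {
		if key < cutoff {
			delete(c.buckets, key)
		}
	}
}

func main() {
	c := &toyConcentrator{buckets: map[int64]int{}}
	// A span whose end timestamp is decades in the future (the issue showed
	// year 2200) lands in a bucket that flush() never removes, so memory
	// grows with every such span.
	c.add(time.Now().AddDate(178, 0, 0))
	c.flush(time.Now())
	fmt.Println("buckets still held:", len(c.buckets)) // 1
}
```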
Thanks @indrekj. Please keep us posted; we'd be happy to help debug this further.
I think I figured it out: #16062 |
…estamps (#16062)

I think this issue was introduced in d31a5c3. I [noticed a memory leak][1] in the Datadog exporter, but after a lot of debugging it turned out that the OpenTelemetry gateway was receiving invalid timestamps:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID   : 0000000000000000c20a2b82c179228a
    Parent ID  :
    ID         : 7c3415ed370f1777
    Name       : [redacted]
    Kind       : SPAN_KIND_SERVER
    Start time : 2200-11-16 22:32:41.14035456 +0000 UTC
    End time   : 2200-11-16 22:33:02.68735456 +0000 UTC
```

The year is almost 200 years in the future.

Before we send a span to the gateway, it goes through an OpenTelemetry collector sidecar. We use the Zipkin V1 endpoint to send the spans to the sidecar. In the same pod, we're also using Zipkin V2, which did not have any issues.

The problem itself: Zipkin V1 uses microseconds. These have to be multiplied by 1e3 to get nanoseconds, not by 1e6.

[1]: #15720
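The fix boils down to a unit conversion. A hedged illustration in Go follows; the helper name is made up and this is not the receiver's actual code, just the arithmetic the commit message describes:

```go
package main

import (
	"fmt"
	"time"
)

// microsToTime converts a Zipkin V1 timestamp (microseconds since the Unix
// epoch) into a time.Time. One microsecond is 1e3 nanoseconds; multiplying
// by 1e6 instead (the bug described above) inflates the value a thousandfold.
func microsToTime(micros int64) time.Time {
	return time.Unix(0, micros*int64(time.Microsecond)) // time.Microsecond == 1000ns
}

func main() {
	micros := time.Date(2022, 11, 16, 22, 32, 41, 0, time.UTC).UnixMicro()
	fmt.Println(microsToTime(micros).UTC()) // 2022-11-16 22:32:41 +0000 UTC
}
```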
Thanks for the PR and the investigation @indrekj 🙇 Is there anything else to be done on this issue? I guess … this is something we could do on the exporter; I will defer to @gbbr's opinion on how we should handle it.
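As a sketch of what such a guard on the exporter side could look like (purely illustrative: the threshold and function name are invented, and this is not the exporter's actual behaviour), spans whose end time is implausibly far in the future could be rejected or flagged before they reach the stats concentrator:

```go
package main

import (
	"fmt"
	"time"
)

// maxFutureSkew is an arbitrary illustrative threshold; a span ending more
// than this far past "now" is treated as having a broken timestamp.
const maxFutureSkew = 10 * time.Minute

// hasPlausibleEndTime is a hypothetical check an exporter could apply
// before bucketing a span for stats computation.
func hasPlausibleEndTime(end, now time.Time) bool {
	return !end.After(now.Add(maxFutureSkew))
}

func main() {
	now := time.Now()
	fmt.Println(hasPlausibleEndTime(now.Add(5*time.Second), now)) // true
	fmt.Println(hasPlausibleEndTime(now.AddDate(178, 0, 0), now)) // false: year ~2200
}
```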
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping ….

Pinging code owners: … See Adding Labels via Comments if you do not have permissions to add labels yourself.
Is this leak fixed?
Yes, this issue should be closed since the underlying issue on the zipkin receiver was fixed. If you are experiencing something similar, please open a new issue.
What happened?
We have an issue where the memory usage of our OpenTelemetry collector (the agent) is high.
See memory usage:
When the memory usage gets high, the collector starts rejecting spans:
The traffic does not change much over the period shown in the graphs. I've also increased the memory limits ~3x without success (the pictures use the latest and largest values). When the pod eventually gets OOM-killed, it works fine until the memory gets high again.
I also ran a pprof heap report:
The pprof report seems to indicate that something is up with the datadog-agent Concentrator: it seems to be growing without bound. I tried looking into the otel-contrib and datadog-agent codebases, but I didn't really even figure out who exactly calls it.
Collector version
v0.63.0
Environment information
Environment
Kubernetes, running in a pod (2 replicas atm).
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response