Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/datadogexporter] Potential memory leak in Datadog exporter #15720

Closed
indrekj opened this issue Oct 31, 2022 · 19 comments
Closed

[exporter/datadogexporter] Potential memory leak in Datadog exporter #15720

indrekj opened this issue Oct 31, 2022 · 19 comments
Labels
bug Something isn't working exporter/datadog Datadog components priority:p1 High Stale

Comments

@indrekj
Copy link
Contributor

indrekj commented Oct 31, 2022

What happened?

We have an issue where our OpenTelemetry collector (the agent) memory usage is high.

See memory usage:
Screenshot from 2022-10-31 18-28-03

When the memory usage gets high then the collector starts rejecting spans:
Screenshot from 2022-10-31 18-29-03

The traffic is not changing much during the graphs. I've also increased the memory limits ~3x without no success (the pictures use the latest and largest values). When the pod is OOM killed at some point, then it works fine until the memory gets high again.

I also ran pprof heap report:
heap

The pprof report seems to indicate that something is up with datadog-agent Concentrator. It seems to be growing without bounds. I tried looking into the otel contrib and datadog-agent codebase, but I didn't really even figure out who exactly calls it.

Collector version

v0.63.0

Environment information

Environment

Kubernetes, running in a pod (2 replicas atm).

  resources:
    requests:
      cpu: 400m
      memory: 2048Mi
    limits:
      cpu: 800m
      memory: 2304Mi

OpenTelemetry Collector configuration

config:
    receivers:
      otlp/sidecar:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            tls:
              cert_file: /etc/certs/opentelemetry/tls.crt
              key_file: /etc/certs/opentelemetry/tls.key
              client_ca_file: /etc/certs/opentelemetry/ca.crt
    processors:
      attributes/upsert:
        actions:
          - key: deployment.environment
            value: 'prod-us'
            action: upsert
      memory_limiter:
        check_interval: 1s
        limit_mib: 2000
        spike_limit_mib: 400
      resourcedetection:
        detectors:
          - env
        override: false
      k8sattributes: {}
      batch:
        # Datadog APM Intake limit is 3.2MB. Let's make sure the batches do not
        # go over that.
        send_batch_max_size: 1000
        send_batch_size: 100
        # This has to be exactly 10 seconds in case DataDog exporter is used:
        # https://docs.datadoghq.com/tracing/setup_overview/open_standards/#ingesting-opentelemetry-traces-with-the-collector
        timeout: 10s
    exporters:
      datadog:
        api:
          key: "${DATADOG_API_KEY}"
          site: datadoghq.com.
        traces:
          span_name_remappings:
            # From ruby services:
            opentelemetry_instrumentation_faraday.client: faraday.client
            opentelemetry_instrumentation_pg.client: pg.client
            opentelemetry_instrumentation_que.consumer: que.consumer
            opentelemetry_instrumentation_que.producer: que.producer
            opentelemetry_instrumentation_rack.server: rack.server
            opentelemetry_instrumentation_redis.client: redis.client
            opentelemetry_instrumentation_sidekiq.consumer: sidekiq.consumer
            opentelemetry_instrumentation_sidekiq.producer: sidekiq.producer
            opentelemetry_instrumentation_sinatra.server: sinatra.server

            # From elixir services:
            opentelemetry_oban.consumer: oban.consumer
            opentelemetry_oban.internal: oban.internal
            opentelemetry_oban.producer: oban.producer
            opentelemetry_tesla.client: tesla.client
            opentelemetry_phoenix.server: phoenix.server
            opentelemetry_ecto.client: ecto.client
    # Extensions are available primarily for tasks that do not involve processing telemetry data.
    # Examples of extensions include health monitoring, service discovery, and data forwarding.
    extensions:
      pprof:
        endpoint: "0.0.0.0:1777"
        block_profile_fraction: 3
        mutex_profile_fraction: 5
      memory_ballast:
        size_mib: 600
    # The service section is used to configure what components are enabled in the Collector
    # based on the configuration found in the receivers, processors, exporters, and extensions sections.
    service:
      # Extensions consist of a list of all extensions to enable.
      extensions:
        - pprof
        - memory_ballast
      # A pipeline consists of a set of receivers, processors and exporters.
      # Pipelines can be of the following types: traces, metrics and logs
      pipelines:
        traces:
          receivers:
            - otlp/sidecar
          processors:
            # The memory_limiter processor should be the first processor
            # defined in the pipeline (immediately after the receivers). This
            # is to ensure that backpressure can be sent to applicable
            # receivers and minimize the likelihood of dropped data when the
            # memory_limiter gets triggered.
            - memory_limiter
            # resourcedetection must come before k8sattributes
            - resourcedetection
            - k8sattributes
            - attributes/upsert
            - batch
          exporters:
            - datadog

Log output

No response

Additional context

No response

@indrekj indrekj added bug Something isn't working needs triage New item requiring triage labels Oct 31, 2022
@indrekj indrekj changed the title Potential memory leak in Datadog exporter [exporter/datadogexporter] Potential memory leak in Datadog exporter Nov 1, 2022
@fatsheep9146 fatsheep9146 added exporter/datadog Datadog components and removed needs triage New item requiring triage labels Nov 1, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Nov 1, 2022

Pinging code owners: @KSerrania @mx-psi @gbbr @knusbaum @amenasria @dineshg13. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@gbbr
Copy link
Member

gbbr commented Nov 1, 2022

Hi Indrek!

Thanks for reporting this. It's quite an interesting edge-case I've never seen before. Some additional information would help in trying to reproduce this:

  • A tcpdump
  • The output of using the fileexporter
  • Memory & CPU profile files

I realise it's asking for a lot, but otherwise it will be hard to reproduce.

@gbbr
Copy link
Member

gbbr commented Nov 1, 2022

Additionally, for not apparent reason other than a hunch - can you try removing the span_name_remappings and see if that changes anything?

Lastly, is something missing from the yaml config you posted? I see a attributes/upsert processor being used but the definition is not there.

@indrekj
Copy link
Contributor Author

indrekj commented Nov 1, 2022

I will start going through things one by one.

Lastly, is something missing from the yaml config you posted?

You're correct. We're building the config file using Helm and I overlooked a shared file. I've edited the original post. Only the attributes/upsert was missing.

@indrekj
Copy link
Contributor Author

indrekj commented Nov 1, 2022

Additionally, for not apparent reason other than a hunch - can you try removing the span_name_remappings and see if that changes anything?

No change.

@indrekj
Copy link
Contributor Author

indrekj commented Nov 1, 2022

Memory & CPU profile files

CPU (generated with: go tool pprof -png "localhost:1777/debug/pprof/profile?seconds=60" > ./cpu2.png)
pprof.otelcol-contrib.samples.cpu.002.pb.gz
cpu2

Memory (generated with go tool pprof -png localhost:1777/debug/pprof/heap > ./heap9.png):
pprof.otelcol-contrib.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
heap9

I hope I generated the profiles correctly. I also attached the images I got with it.

@indrekj
Copy link
Contributor Author

indrekj commented Nov 1, 2022

A tcpdump

Anything specific I could look into there myself? I can currently only reproduce this easily in production environment and the tcpdump would contain sensitive information, so I'd prefer not to share it.

@gbbr
Copy link
Member

gbbr commented Nov 1, 2022

I see. Thanks for checking. What about the output of the fileexporter? Would that also be sensitive? Would it help at all if you would send it privately instead of posting it here?

1 similar comment
@gbbr
Copy link
Member

gbbr commented Nov 1, 2022

I see. Thanks for checking. What about the output of the fileexporter? Would that also be sensitive? Would it help at all if you would send it privately instead of posting it here?

@indrekj
Copy link
Contributor Author

indrekj commented Nov 2, 2022

That might be sensitive as well yes. I haven't tried it out to see what it exactly contains.

For the tcpdump, would tcpdump port 443 be useful? That would only contain the requests between the collector and datadog VPC Endpoint.

Any tips on what would you look for from those files? I could try to analyze them myself but I'm not sure exactly what to look for.

EDIT: FYI I see traffic to api.datadoghq.com and trace.agent.datadoghq.com

@indrekj
Copy link
Contributor Author

indrekj commented Nov 2, 2022

What's really weird to me is that I'm unable to find who/when starts the Concentrator. Concentrator is part of the datadog-agent package. But I don't see anything starting it or using it in the datadogexporter. Do you know what starts it and for what it is being used?

@gbbr
Copy link
Member

gbbr commented Nov 3, 2022

@indrekj it's started by agnt.Run in the traces_exporter.go file. Thanks for looking into this yourself.

To share potentially sensitive data with us, I recommend reaching out via our support. Feel free to mention my name for a faster turnaround.

@mx-psi mx-psi added the priority:p1 High label Nov 3, 2022
@indrekj
Copy link
Contributor Author

indrekj commented Nov 3, 2022

@indrekj it's started by agnt.Run in the traces_exporter.go file

I'm dumb. I did a git pull against my old forked repo and that's why it didn't include that part of the code. Thanks.

I think I made some progress though. I was investigating the Contentrator flushNow and addNow functions. It seems impossible to have old buckets that are not flushed. But there's no upper bound. So theoretically you can add spans that end in the far future, and those won't be flushed until the far future.

So I looked more into the fileexporter suggestion you gave. And indeed, there are many odd spans from ingress nginx controller that have startTimeUnixNano over 7278037323140354560. That's the year 2200.

Currently, it seems that this is not Datadog issue, though it would be nice if the addNow function would log a warning if the span date is in the far future.

I'll keep the issue open until I verify that the timestamp is indeed the actual problem.

@dineshg13
Copy link
Member

Thanks @indrekj . Please us posted, would have to help to debug this further.

@indrekj
Copy link
Contributor Author

indrekj commented Nov 3, 2022

I think I figured it out: #16062

indrekj added a commit to indrekj/opentelemetry-collector-contrib that referenced this issue Nov 3, 2022
…estamps

I think this issue was introduced in d31a5c3.

I [noticed a memory leak][1] in DataDog exporter but after a lot of
debugging, it turned out that the OpenTelemetry gateway was receiving
invalid timestamps:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : 0000000000000000c20a2b82c179228a
    Parent ID      :
    ID             : 7c3415ed370f1777
    Name           : [redacted]
    Kind           : SPAN_KIND_SERVER
    Start time     : 2200-11-16 22:32:41.14035456 +0000 UTC
    End time       : 2200-11-16 22:33:02.68735456 +0000 UTC
```

The year is almost 100 years in the future.

Before we send a span to the gateway, it goes through an OpenTelemetry
collector sidecar. We use Zipkin V1 endpoint to send the spans to the
sidecar.

In the same pod, we're also using Zipkin V2 which did not have any
issues.

The problem itself: zipkin uses microseconds. This has to be multipled
with 1e3 to get nanoseconds, not with 1e6.

[1]: open-telemetry#15720
bogdandrutu pushed a commit that referenced this issue Nov 7, 2022
…estamps (#16062)

I think this issue was introduced in d31a5c3.

I [noticed a memory leak][1] in DataDog exporter but after a lot of
debugging, it turned out that the OpenTelemetry gateway was receiving
invalid timestamps:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : 0000000000000000c20a2b82c179228a
    Parent ID      :
    ID             : 7c3415ed370f1777
    Name           : [redacted]
    Kind           : SPAN_KIND_SERVER
    Start time     : 2200-11-16 22:32:41.14035456 +0000 UTC
    End time       : 2200-11-16 22:33:02.68735456 +0000 UTC
```

The year is almost 100 years in the future.

Before we send a span to the gateway, it goes through an OpenTelemetry
collector sidecar. We use Zipkin V1 endpoint to send the spans to the
sidecar.

In the same pod, we're also using Zipkin V2 which did not have any
issues.

The problem itself: zipkin uses microseconds. This has to be multipled
with 1e3 to get nanoseconds, not with 1e6.

[1]: #15720
@mx-psi
Copy link
Member

mx-psi commented Nov 8, 2022

Thanks for the PR and the investigation @indrekj 🙇 Is there anything else to be done on this issue? I guess

it would be nice if the addNow function would log a warning if the span date is in the far future.

this is something we could do on the exporter, I will defer to @gbbr's opinion on how should we handle it

dineshg13 pushed a commit to DataDog/opentelemetry-collector-contrib that referenced this issue Nov 21, 2022
…estamps (open-telemetry#16062)

I think this issue was introduced in d31a5c3.

I [noticed a memory leak][1] in DataDog exporter but after a lot of
debugging, it turned out that the OpenTelemetry gateway was receiving
invalid timestamps:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : 0000000000000000c20a2b82c179228a
    Parent ID      :
    ID             : 7c3415ed370f1777
    Name           : [redacted]
    Kind           : SPAN_KIND_SERVER
    Start time     : 2200-11-16 22:32:41.14035456 +0000 UTC
    End time       : 2200-11-16 22:33:02.68735456 +0000 UTC
```

The year is almost 100 years in the future.

Before we send a span to the gateway, it goes through an OpenTelemetry
collector sidecar. We use Zipkin V1 endpoint to send the spans to the
sidecar.

In the same pod, we're also using Zipkin V2 which did not have any
issues.

The problem itself: zipkin uses microseconds. This has to be multipled
with 1e3 to get nanoseconds, not with 1e6.

[1]: open-telemetry#15720
shalper2 pushed a commit to shalper2/opentelemetry-collector-contrib that referenced this issue Dec 6, 2022
…estamps (open-telemetry#16062)

I think this issue was introduced in d31a5c3.

I [noticed a memory leak][1] in DataDog exporter but after a lot of
debugging, it turned out that the OpenTelemetry gateway was receiving
invalid timestamps:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : 0000000000000000c20a2b82c179228a
    Parent ID      :
    ID             : 7c3415ed370f1777
    Name           : [redacted]
    Kind           : SPAN_KIND_SERVER
    Start time     : 2200-11-16 22:32:41.14035456 +0000 UTC
    End time       : 2200-11-16 22:33:02.68735456 +0000 UTC
```

The year is almost 100 years in the future.

Before we send a span to the gateway, it goes through an OpenTelemetry
collector sidecar. We use Zipkin V1 endpoint to send the spans to the
sidecar.

In the same pod, we're also using Zipkin V2 which did not have any
issues.

The problem itself: zipkin uses microseconds. This has to be multipled
with 1e3 to get nanoseconds, not with 1e6.

[1]: open-telemetry#15720
@github-actions
Copy link
Contributor

github-actions bot commented Jan 9, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 9, 2023
@cforce
Copy link

cforce commented Jan 25, 2023

Is this leak fixed ?

@mx-psi
Copy link
Member

mx-psi commented Jan 25, 2023

Is this leak fixed ?

Yes, this issue should be closed since the underlying issue on the zipkin receiver was fixed. If you are experiencing something similar please open a new issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working exporter/datadog Datadog components priority:p1 High Stale
Projects
None yet
Development

No branches or pull requests

6 participants