
Memory-leak related to the resourcetotelemetry codepath? #33383

Open · diranged opened this issue Jun 4, 2024 · 11 comments

Labels: bug, exporter/prometheusremotewrite, never stale, pkg/resourcetotelemetry, priority:p2

diranged commented Jun 4, 2024

Component(s)

exporter/prometheusremotewrite, pkg/resourcetotelemetry

What happened?

Description

We're troubleshooting an issue where a single otel-collector-... pod out of a group begins to turn away metrics because the memorylimiter is tripped. In this situation, we have dozens or hundreds of clients pushing OTLP metrics (Prometheus-collected, but sent over the OTLP gRPC exporter) to multiple otel-collector-... pods. The metrics are routed with the loadbalancer exporter using routing_key: resource.

The behavior we see is that one collector suddenly starts running out of memory and being limited... while the other collectors are using half or even less memory to process the same number of events. Here are graphs of the ingestion and the success rate:

[image: ingestion rate and percentage of metrics accepted by receiver]

The two dips in the Percentage of Metrics Accepted by Receiver graphs are different pods in a StatefulSet. Here's the graph of actual memory usage of these three pods:

[image: memory usage of the three collector pods]

In the first dip, from 8:30AM to 9:30AM, I manually restarted the pod to recover it. It's fine now... but a few hours later, a different one of the pods became overloaded. Grabbing a heap dump from the pprof endpoint on a "good" and a "bad" pod shows some stark differences:

Bad Pod Pprof: otel-collector-metrics-processor-collector-0.pb.gz

Good Pod Pprof: otel-collector-metrics-processor-collector-1.pb.gz
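As an aside, here is a minimal Go sketch (not part of the original report) of how a heap profile like the two attachments can be pulled from the pprof extension, which is configured on :1777 later in this issue; the output filename is an arbitrary choice:

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// The pprof extension exposes the standard net/http/pprof handlers,
	// so the heap profile is served at /debug/pprof/heap.
	resp, err := http.Get("http://localhost:1777/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Save the gzip-compressed profile proto, ready for `go tool pprof`.
	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}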

I should note that this isn't remotely our largest environment - and these pods are handling ~12-15k datapoints/sec, while our larger environments are doing ~20k/sec/pod... so this doesn't feel like a fundamental scale issue. All pods are sized the same across all of our environments.

Collector version

v0.101.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 128
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          client_ca_file: /tls/ca.crt
          key_file: /tls/tls.key
exporters:
  debug:
    sampling_initial: 15
    sampling_thereafter: 60
  debug/verbose:
    sampling_initial: 15
    sampling_thereafter: 60
    verbosity: detailed
  prometheusremotewrite/amp:
    add_metric_suffixes: true
    auth:
      authenticator: sigv4auth
    endpoint: https://...api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
  prometheusremotewrite/central:
    add_metric_suffixes: true
    endpoint: https://..../api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
    tls:
      ca_file: /tls/ca.crt
      cert_file: /tls/tls.crt
      insecure_skip_verify: true
      key_file: /tls/tls.key
  prometheusremotewrite/staging:
    add_metric_suffixes: true
    endpoint: https://.../api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
    tls:
      ca_file: /tls/ca.crt
      cert_file: /tls/tls.crt
      insecure_skip_verify: true
      key_file: /tls/tls.key
processors:
  attributes/common:
    actions:
      - action: upsert
        key: k8s.cluster.name
        value: ...
  batch/otlp:
    send_batch_max_size: 10000
  batch/prometheus:
    send_batch_max_size: 12384
    send_batch_size: 8192
    timeout: 15s
  filter/drop_unknown_source:
    error_mode: ignore
    metrics:
      exclude:
        match_type: regexp
        metric_names: .*
        resource_attributes:
          - key: _meta.source.type
            value: unknown
  filter/find_unknown_source:
    error_mode: ignore
    metrics:
      include:
        match_type: regexp
        metric_names: .*
        resource_attributes:
          - key: _meta.source.type
            value: unknown
  filter/only_prometheus_metrics:
    error_mode: ignore
    metrics:
      include:
        match_type: regexp
        resource_attributes:
          - key: _meta.source.type
            value: prometheus
  k8sattributes:
    extract:
      labels:
        - from: pod
          key: app.kubernetes.io/name
          tag_name: app.kubernetes.io/name
        - from: pod
          key: app.kubernetes.io/instance
          tag_name: app.kubernetes.io/instance
        - from: pod
          key: app.kubernetes.io/component
          tag_name: app.kubernetes.io/component
        - from: pod
          key: app.kubernetes.io/part-of
          tag_name: app.kubernetes.io/part-of
        - from: pod
          key: app.kubernetes.io/managed-by
          tag_name: app.kubernetes.io/managed-by
      metadata:
        - container.id
        - container.image.name
        - container.image.tag
        - k8s.container.name
        - k8s.cronjob.name
        - k8s.daemonset.name
        - k8s.deployment.name
        - k8s.job.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.replicaset.name
        - k8s.statefulset.name
    passthrough: false
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 10
  transform/clean_metadata:
    log_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
    trace_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
  transform/default_source:
    log_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
    metric_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
    trace_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
  transform/prometheus_label_clean:
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^datadog.*")
          - delete_matching_keys(attributes, "^host.cpu.*")
          - delete_matching_keys(attributes, "^host.image.id")
          - delete_matching_keys(attributes, "^host.type")
          - replace_all_patterns(attributes, "key", "^(endpoint|http\\.scheme|net\\.host\\.name|net\\.host\\.port)", "scrape.$$1")
      - context: datapoint
        statements:
          - replace_all_patterns(attributes, "key", "^(endpoint)", "scrape.$$1")
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: :1777
  sigv4auth:
    region: eu-west-1
service:
  extensions:
    - health_check
    - pprof
    - sigv4auth
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
  pipelines:
    metrics/prometheus:
      exporters:
        - prometheusremotewrite/amp
        - prometheusremotewrite/central
      processors:
        - memory_limiter
        - transform/default_source
        - filter/only_prometheus_metrics
        - transform/prometheus_label_clean
        - transform/clean_metadata
        - attributes/common
        - batch/prometheus
      receivers:
        - otlp

Log output

No response

Additional context

No response

diranged added the bug and needs triage labels on Jun 4, 2024
github-actions bot commented Jun 4, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


diranged commented Jun 4, 2024

Attaching the relevant memorylimiter logs...
Explore-logs-2024-06-04 13_14_33.txt


mx-psi commented Jun 5, 2024

Could there be some metrics that have a very large number of resource attributes? We end up allocating extra memory of size roughly $\textrm{number of metrics} \times \textrm{avg number of resource attributes}$, so if the number of metrics is not too big, maybe the number of resource attributes explains this.

I am a bit skeptical of this being an issue in pkg/resourcetotelemetry at first; there may be room for improvement, but the logic there is pretty simple:

to.EnsureCapacity(from.Len() + to.Len())
from.Range(func(k string, v pcommon.Value) bool {
	v.CopyTo(to.PutEmpty(k))
	return true
})

and it looks like it allocates exactly what it needs.
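To make the per-datapoint cost concrete, here is a self-contained sketch of that same join applied once per datapoint; the attribute count and names are made up for illustration, and only the joinAttributeMaps body is taken from the snippet above:

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// joinAttributeMaps mirrors the logic quoted above: every resource attribute
// is copied into the target (datapoint) attribute map.
func joinAttributeMaps(from, to pcommon.Map) {
	to.EnsureCapacity(from.Len() + to.Len())
	from.Range(func(k string, v pcommon.Value) bool {
		v.CopyTo(to.PutEmpty(k))
		return true
	})
}

func main() {
	// A hypothetical resource with ~40 attributes, roughly matching the
	// label list shown further down in this issue.
	resource := pcommon.NewMap()
	for i := 0; i < 40; i++ {
		resource.PutStr(fmt.Sprintf("resource.attr.%d", i), "value")
	}

	// The conversion runs once per datapoint, so every datapoint ends up
	// carrying its own copy of all resource attributes.
	for dp := 0; dp < 3; dp++ {
		dpAttrs := pcommon.NewMap()
		dpAttrs.PutStr("metric.label", "x")
		joinAttributeMaps(resource, dpAttrs)
		fmt.Printf("datapoint %d now carries %d attributes\n", dp, dpAttrs.Len())
	}
}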

mx-psi added the priority:p2 label and removed the needs triage label on Jun 5, 2024

diranged commented Jun 5, 2024

@mx-psi,
Thanks for the quick response. We do add a decent number of resource attributes to every metric. At this time, we're only processing metrics collected by the prometheusreceiver. These metrics each get a bunch of standard labels applied by the k8sattributesprocessor:

app_env="xxx",
app_group="xxx",
app_kubernetes_io_instance="xxx",
app_kubernetes_io_managed_by="Helm",
app_kubernetes_io_name="xx",
cloud_account_id="xxx",
cloud_availability_zone="xxx",
cloud_platform="aws_eks",
cloud_provider="aws",
cloud_region="eu-west-1",
cluster="eu1",
component="proxy",
container="istio-proxy",
container_id="xxx",
container_image_name="xxx/istio/proxyv2",
container_image_tag="1.20.6",
host_arch="amd64",
host_id="i-xxx",
instance="xxx:15090",
job="istio-system/envoy-stats-monitor-raw",
k8s_cluster_name="eu1",
k8s_container_name="istio-proxy",
k8s_deployment_name="xxx",
k8s_namespace_name="xxx",
k8s_node_name="xxxeu-west-1.compute.internal",
k8s_node_uid="54084d95-ecc4-406b-8ac7-11c9a9a6bf57",
k8s_pod_name="xxx-6bzl6",
k8s_pod_uid="ca256889-747f-493d-b262-c8c2ae728f5e",
k8s_replicaset_name="xxx",
namespace="xxx",
node_name="xxx.eu-west-1.compute.internal",
os_type="linux",
otel="true",
pod="xxx-6bzl6",
scrape_endpoint="http-envoy-prom",
scrape_http_scheme="http",
scrape_net_host_name="100.64.179.192",
scrape_net_host_port="15090",
service_instance_id="100.64.179.192:15090",

That said... what this feels like is some kind of an issue where the GC is unable to clean up the data when the memory_limiter is tripped. So not an ongoing memory leak, but perhaps just a stuck pointer or something that prevents the data from being collected in some situations?
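For scale, a rough back-of-envelope that plugs the label list above into the formula from earlier in the thread, assuming each of the ~40 resource attributes averages about 40 bytes of key plus value (both numbers are eyeballed, not measured): $40 \times 40\,\textrm{B} \times 15{,}000\,\textrm{datapoints/s} \approx 24\,\textrm{MB/s}$ of attribute copies per exporter with resource_to_telemetry_conversion enabled, and this pipeline fans out to two such exporters.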

@philchia

How about we just process the resource attributes in prometheusremotewrite.FromMetrics instead of pkg/resourcetotelemetry?

github-actions bot commented

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Aug 26, 2024
mx-psi added the never stale label and removed the Stale label on Aug 26, 2024

grandwizard28 commented Jan 25, 2025

Hi @diranged,
We seem to be running into the same issue. I see the same "good" and "bad" pprof dumps.

What (if any) workaround did you implement?


mx-psi commented Jan 27, 2025

@grandwizard28 It would be useful if you can share more details about your setup to find out similarities/differences with @diranged's

@diranged

Hi @diranged, We seem to be running into the same issue. I see the same "good" and "bad" pprof dumps.

What (if any) workaround did you implement?

@grandwizard28, I had to go digging to find it ... but here's the comment in our code for what we did:

    metrics/prometheus:
      receivers: [otlp]
      processors:
        # https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33383
        #
        # I am temporarily disabling this to allow the pod to OOM itself if we
        # run out of memory, rather than getting into a long-lived state where
        # it can't seem to GC itself properly. This is a test, and ideally goes
        # away when the healthcheckv2 is available.
        #
        # - memory_limiter
        #
        - transform/default_source

This has mostly worked for us ... rather than trying to let Otel fix itself, we just let it die and restart. We'd rather have it fail fast than get stuck in a bad state.

@grandwizard28

Ahhh thanks for this @diranged.

I ended up refactoring our custom exporter to remove the dependency on resourcetotelemetry.


grandwizard28 commented Jan 28, 2025

@grandwizard28 It would be useful if you can share more details about your setup to find out similarities/differences with @diranged's

Hey @mx-psi,
We are running a custom exporter which is built on top of the prometheusremotewriteexporter. I saw the exact same heap profile as posted in the description of this issue. Removing the dependency on resourcetotelemetry worked for me.

Our custom exporter is based on a much earlier version of the prometheusremotewriteexporter (v0.55.0).
