
Memory-leak related to the resourcetotelemetry codepath? #33383

Open · diranged opened this issue Jun 4, 2024 · 11 comments

Labels: bug, exporter/prometheusremotewrite, never stale, pkg/resourcetotelemetry, priority:p2

diranged commented Jun 4, 2024

Component(s)

exporter/prometheusremotewrite, pkg/resourcetotelemetry

What happened?

Description

We're troubleshooting an issue where a single otel-collector-... pod out of a group begins to turn away metrics because the memorylimiter is tripped. In this situation, we have dozens or hundreds of clients pushing OTLP metrics (Prometheus-collected, but sent over the OTLP gRPC exporter) to multiple otel-collector-... pods. The metrics are routed with the loadbalancer exporter using routing_key: resource.

The behavior we see is that one collector suddenly starts running out of memory and being limited... while the other collectors are using half or even less memory to process the same number of events. Here are graphs of the ingestion and the success rate:

[image: ingestion rate and percentage of metrics accepted by receiver]

The two dips in the Percentage of Metrics Accepted by Receiver graphs are different pods in a StatefulSet. Here's the graph of actual memory usage of these three pods:

[image: memory usage of the three collector pods]

In the first dip, from 8:30AM to 9:30AM, I manually restarted the pod to recover it. It's fine now... but a few hours later, a different one of the pods became overloaded. Grabbing a heap dump from the pprof endpoint on a "good" and a "bad" pod shows some stark differences:

Bad Pod Pprof: otel-collector-metrics-processor-collector-0.pb.gz

Good Pod Pprof: otel-collector-metrics-processor-collector-1.pb.gz
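As an aside, here is a minimal Go sketch (not part of the original report) of how a heap profile like the two attachments can be pulled from the pprof extension, which is configured on :1777 later in this issue; the output filename is an arbitrary choice:

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// The pprof extension exposes the standard net/http/pprof handlers,
	// so the heap profile is served at /debug/pprof/heap.
	resp, err := http.Get("http://localhost:1777/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Save the gzip-compressed profile proto, ready for `go tool pprof`.
	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}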

I should note that this isn't remotely our largest environment - and these pods are handling ~12-15k datapoints/sec, while our larger environments are doing ~20k/sec/pod... so this doesn't feel like a fundamental scale issue. All pods are sized the same across all of our environments.

Collector version

v0.101.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 128
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          client_ca_file: /tls/ca.crt
          key_file: /tls/tls.key
exporters:
  debug:
    sampling_initial: 15
    sampling_thereafter: 60
  debug/verbose:
    sampling_initial: 15
    sampling_thereafter: 60
    verbosity: detailed
  prometheusremotewrite/amp:
    add_metric_suffixes: true
    auth:
      authenticator: sigv4auth
    endpoint: https://...api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
  prometheusremotewrite/central:
    add_metric_suffixes: true
    endpoint: https://..../api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
    tls:
      ca_file: /tls/ca.crt
      cert_file: /tls/tls.crt
      insecure_skip_verify: true
      key_file: /tls/tls.key
  prometheusremotewrite/staging:
    add_metric_suffixes: true
    endpoint: https://.../api/v1/remote_write
    max_batch_size_bytes: "1000000"
    remote_write_queue:
      num_consumers: 5
      queue_size: 50000
    resource_to_telemetry_conversion:
      enabled: true
    retry_on_failure:
      enabled: true
      initial_interval: 200ms
      max_elapsed_time: 60s
      max_interval: 5s
    send_metadata: false
    target_info:
      enabled: false
    timeout: 90s
    tls:
      ca_file: /tls/ca.crt
      cert_file: /tls/tls.crt
      insecure_skip_verify: true
      key_file: /tls/tls.key
processors:
  attributes/common:
    actions:
      - action: upsert
        key: k8s.cluster.name
        value: ...
  batch/otlp:
    send_batch_max_size: 10000
  batch/prometheus:
    send_batch_max_size: 12384
    send_batch_size: 8192
    timeout: 15s
  filter/drop_unknown_source:
    error_mode: ignore
    metrics:
      exclude:
        match_type: regexp
        metric_names: .*
        resource_attributes:
          - key: _meta.source.type
            value: unknown
  filter/find_unknown_source:
    error_mode: ignore
    metrics:
      include:
        match_type: regexp
        metric_names: .*
        resource_attributes:
          - key: _meta.source.type
            value: unknown
  filter/only_prometheus_metrics:
    error_mode: ignore
    metrics:
      include:
        match_type: regexp
        resource_attributes:
          - key: _meta.source.type
            value: prometheus
  k8sattributes:
    extract:
      labels:
        - from: pod
          key: app.kubernetes.io/name
          tag_name: app.kubernetes.io/name
        - from: pod
          key: app.kubernetes.io/instance
          tag_name: app.kubernetes.io/instance
        - from: pod
          key: app.kubernetes.io/component
          tag_name: app.kubernetes.io/component
        - from: pod
          key: app.kubernetes.io/part-of
          tag_name: app.kubernetes.io/part-of
        - from: pod
          key: app.kubernetes.io/managed-by
          tag_name: app.kubernetes.io/managed-by
      metadata:
        - container.id
        - container.image.name
        - container.image.tag
        - k8s.container.name
        - k8s.cronjob.name
        - k8s.daemonset.name
        - k8s.deployment.name
        - k8s.job.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.replicaset.name
        - k8s.statefulset.name
    passthrough: false
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 10
  transform/clean_metadata:
    log_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
    trace_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^_meta.*")
  transform/default_source:
    log_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
    metric_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
    trace_statements:
      - context: resource
        statements:
          - set(attributes["_meta.source.type"], "unknown") where attributes["_meta.source.type"] == nil
          - set(attributes["_meta.source.name"], "unknown") where attributes["_meta.source.name"] == nil
  transform/prometheus_label_clean:
    metric_statements:
      - context: resource
        statements:
          - delete_matching_keys(attributes, "^datadog.*")
          - delete_matching_keys(attributes, "^host.cpu.*")
          - delete_matching_keys(attributes, "^host.image.id")
          - delete_matching_keys(attributes, "^host.type")
          - replace_all_patterns(attributes, "key", "^(endpoint|http\\.scheme|net\\.host\\.name|net\\.host\\.port)", "scrape.$$1")
      - context: datapoint
        statements:
          - replace_all_patterns(attributes, "key", "^(endpoint)", "scrape.$$1")
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: :1777
  sigv4auth:
    region: eu-west-1
service:
  extensions:
    - health_check
    - pprof
    - sigv4auth
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
  pipelines:
    metrics/prometheus:
      exporters:
        - prometheusremotewrite/amp
        - prometheusremotewrite/central
      processors:
        - memory_limiter
        - transform/default_source
        - filter/only_prometheus_metrics
        - transform/prometheus_label_clean
        - transform/clean_metadata
        - attributes/common
        - batch/prometheus
      receivers:
        - otlp

Log output

No response

Additional context

No response

diranged added the bug and needs triage labels on Jun 4, 2024
github-actions bot commented Jun 4, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


diranged commented Jun 4, 2024

Attaching the relevant memorylimiter logs...
Explore-logs-2024-06-04 13_14_33.txt


mx-psi commented Jun 5, 2024

Could there be some metrics that have a very large number of resource attributes? We end up allocating extra memory of size roughly $\textrm{number of metrics} \times \textrm{avg number of resource attributes}$, so if the number of metrics is not too big, maybe the number of resource attributes explains this.

I am a bit skeptical of this being an issue in pkg/resourcetotelemetry at first; there may be room for improvement, but the logic there is pretty simple:

to.EnsureCapacity(from.Len() + to.Len())
from.Range(func(k string, v pcommon.Value) bool {
	v.CopyTo(to.PutEmpty(k))
	return true
})

and it looks like it allocates exactly what it needs.
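To make the per-datapoint cost concrete, here is a self-contained sketch of that same join applied once per datapoint; the attribute count and names are made up for illustration, and only the joinAttributeMaps body is taken from the snippet above:

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// joinAttributeMaps mirrors the logic quoted above: every resource attribute
// is copied into the target (datapoint) attribute map.
func joinAttributeMaps(from, to pcommon.Map) {
	to.EnsureCapacity(from.Len() + to.Len())
	from.Range(func(k string, v pcommon.Value) bool {
		v.CopyTo(to.PutEmpty(k))
		return true
	})
}

func main() {
	// A hypothetical resource with ~40 attributes, roughly matching the
	// label list shown further down in this issue.
	resource := pcommon.NewMap()
	for i := 0; i < 40; i++ {
		resource.PutStr(fmt.Sprintf("resource.attr.%d", i), "value")
	}

	// The conversion runs once per datapoint, so every datapoint ends up
	// carrying its own copy of all resource attributes.
	for dp := 0; dp < 3; dp++ {
		dpAttrs := pcommon.NewMap()
		dpAttrs.PutStr("metric.label", "x")
		joinAttributeMaps(resource, dpAttrs)
		fmt.Printf("datapoint %d now carries %d attributes\n", dp, dpAttrs.Len())
	}
}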

mx-psi added the priority:p2 label and removed the needs triage label on Jun 5, 2024

diranged commented Jun 5, 2024

@mx-psi,
Thanks for the quick response. We do add a decent number of resource attributes to every metric. At this time, we're only processing metrics collected by the prometheusreceiver. These metrics each get a bunch of standard labels applied by the k8sattributesprocessor:

app_env="xxx",
app_group="xxx",
app_kubernetes_io_instance="xxx",
app_kubernetes_io_managed_by="Helm",
app_kubernetes_io_name="xx",
cloud_account_id="xxx",
cloud_availability_zone="xxx",
cloud_platform="aws_eks",
cloud_provider="aws",
cloud_region="eu-west-1",
cluster="eu1",
component="proxy",
container="istio-proxy",
container_id="xxx",
container_image_name="xxx/istio/proxyv2",
container_image_tag="1.20.6",
host_arch="amd64",
host_id="i-xxx",
instance="xxx:15090",
job="istio-system/envoy-stats-monitor-raw",
k8s_cluster_name="eu1",
k8s_container_name="istio-proxy",
k8s_deployment_name="xxx",
k8s_namespace_name="xxx",
k8s_node_name="xxxeu-west-1.compute.internal",
k8s_node_uid="54084d95-ecc4-406b-8ac7-11c9a9a6bf57",
k8s_pod_name="xxx-6bzl6",
k8s_pod_uid="ca256889-747f-493d-b262-c8c2ae728f5e",
k8s_replicaset_name="xxx",
namespace="xxx",
node_name="xxx.eu-west-1.compute.internal",
os_type="linux",
otel="true",
pod="xxx-6bzl6",
scrape_endpoint="http-envoy-prom",
scrape_http_scheme="http",
scrape_net_host_name="100.64.179.192",
scrape_net_host_port="15090",
service_instance_id="100.64.179.192:15090",

That said... what this feels like is some kind of an issue where the GC is unable to clean up the data when the memory_limiter is tripped. So not an ongoing memory leak, but perhaps just a stuck pointer or something that prevents the data from being collected in some situations?
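For scale, a rough back-of-envelope that plugs the label list above into the formula from earlier in the thread, assuming each of the ~40 resource attributes averages about 40 bytes of key plus value (both numbers are eyeballed, not measured): $40 \times 40\,\textrm{B} \times 15{,}000\,\textrm{datapoints/s} \approx 24\,\textrm{MB/s}$ of attribute copies per exporter with resource_to_telemetry_conversion enabled, and this pipeline fans out to two such exporters.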

@philchia

How about we just process the resource attributes in prometheusremotewrite.FromMetrics instead of pkg/resourcetotelemetry?

github-actions bot commented

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Aug 26, 2024
mx-psi added the never stale label and removed the Stale label on Aug 26, 2024

grandwizard28 commented Jan 25, 2025

Hi @diranged,
We seem to be running into the same issue. I see the same "good" and "bad" pprof dumps.

What (if any) workaround did you implement?


mx-psi commented Jan 27, 2025

@grandwizard28 It would be useful if you can share more details about your setup to find out similarities/differences with @diranged's

@diranged

Hi @diranged, We seem to be running into the same issue. I see the same "good" and "bad" pprof dumps.

What (if any) workaround did you implement?

@grandwizard28, I had to go digging to find it ... but here's the comment in our code for what we did:

    metrics/prometheus:
      receivers: [otlp]
      processors:
        # https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33383
        #
        # I am temporarily disabling this to allow the pod to OOM itself if we
        # run out of memory, rather than getting into a long-lived state where
        # it can't seem to GC itself properly. This is a test, and ideally goes
        # away when the healthcheckv2 is available.
        #
        # - memory_limiter
        #
        - transform/default_source

This has mostly worked for us ... rather than trying to let Otel fix itself, we just let it die and restart. We'd rather have it fail fast than get stuck in a bad state.

@grandwizard28

Ahhh thanks for this @diranged.

I ended up refactoring our custom exporter to remove the dependency on resourcetotelemetry.


grandwizard28 commented Jan 28, 2025

@grandwizard28 It would be useful if you can share more details about your setup to find out similarities/differences with @diranged's

Hey @mx-psi,
We are running a custom exporter which is built on top of the prometheusremotewriteexporter. I saw the exact same heap profile as posted in the description of this issue. Removing the dependency on resourcetotelemetry worked for me.

Our custom exporter is based on a much earlier version of the prometheusremotewriteexporter (v0.55.0).
