Load-balancing exporter k8s resolver continuously invokes the OnUpdate() command in the handler #35658

Closed
Tarmander opened this issue Oct 7, 2024 · 0 comments · Fixed by #36505
Labels
bug (Something isn't working), needs triage (New item requiring triage)

Tarmander commented Oct 7, 2024

Component(s)

exporter/loadbalancingexporter

What happened?

Description

When configuring our load-balancing collector to target our backend collectors via the k8s resolver, we noticed that DNS resolution worked fine and the collectors received evenly distributed traffic, but the load balancer recycled its endpoints at a fixed cadence (roughly every 3 minutes) even though the endpoints themselves were unchanged.

We added some log statements to the k8s resolver/handler, and they revealed that the OnUpdate() function in the handler was being invoked. This would imply that some event was triggering the update, but `k get endpoints opentelemetry-global-gateway-collector --watch --output-watch-events=true` returned no events for several hours when run manually.

The net result was that the service endpoints never actually changed, yet the exporter kept disposing of its existing exporters and constructing new ones.

Steps to Reproduce

Configure the k8s resolver to point to a service representing

Expected Result

The OnUpdate() call in k8s handler only runs when updates occur in the service endpoints pointed to by the k8s resolver.

Actual Result

OnUpdate() is invoked at a recurring frequency of around every 3 minutes, regardless of changes to the service it points to.

Collector version

v0.105.0

Environment information

Environment

OS: Ubuntu 22.04
Compiler: go1.22.6

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        sending_queue:
          queue_size: 100000
          num_consumers: 25
    resolver:
      k8s:
        service: opentelemetry-global-gateway-collector-headless.opentelemetry-global-collector
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
  pprof:
    endpoint: localhost:1777
service:
  extensions: [health_check, zpages, pprof]
  telemetry:
    logs:
      level: info
      encoding: json
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]

Log output

Sample Log Output:
 {"stream":"stderr","timestamp":1727987229349,"log":{"name":"loadbalancing","ts":1.7279872293494482E9,"data_type":"traces","oldEps":"&Endpoints{ObjectMeta:{opentelemetry-global-gateway-collector-headless  opentelemetry-global-collector  6382211d-bb57-4141-8bed-165f8f002e94 2635273389 0 2024-07-18 16:38:27 +0000 UTC <nil> <nil> map[app.kubernetes.io/component:opentelemetry-collector app.kubernetes.io/instance:opentelemetry-global-collector.opentelemetry-global-gateway app.kubernetes.io/managed-by:opentelemetry-operator app.kubernetes.io/name:opentelemetry-global-gateway-collector app.kubernetes.io/part-of:opentelemetry app.kubernetes.io/version:0.105.0 operator.opentelemetry.io/collector-headless-service:Exists operator.opentelemetry.io/collector-service-type:headless service.kubernetes.io/headless:] map[endpoints.kubernetes.io/last-change-trigger-time:2024-10-02T13:28:28Z] [] [] [{kube-controller-manager Update v1 2024-10-02 13:28:28 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:endpoints.kubernetes.io/last-change-trigger-time\":{}},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/component\":{},\"f:app.kubernetes.io/instance\":{},\"f:app.kubernetes.io/managed-by\":{},\"f:app.kubernetes.io/name\":{},\"f:app.kubernetes.io/part-of\":{},\"f:app.kubernetes.io/version\":{},\"f:operator.opentelemetry.io/collector-headless-service\":{},\"f:operator.opentelemetry.io/collector-service-type\":{},\"f:service.kubernetes.io/headless\":{}}},\"f:subsets\":{}} 
}]},Subsets:[]EndpointSubset{EndpointSubset{Addresses:[]EndpointAddress{EndpointAddress{IP:10.100.148.213,TargetRef:&ObjectReference{Kind:Pod,Namespace:opentelemetry-global-collector,Name:opentelemetry-global-gateway-collector-55695567c-rgz8b,UID:61cdd493-8900-408a-a0fd-8df916f790d7,APIVersion:,ResourceVersion:,FieldPath:,},Hostname:,NodeName:*ip-10-100-157-74.ec2.internal,},EndpointAddress{IP:10.100.181.244,TargetRef:&ObjectReference{Kind:Pod,Namespace:opentelemetry-global-collector,Name:opentelemetry-global-gateway-collector-55695567c-lk86p,UID:31c74954-6e14-4fe7-8a33-9ce1e48013e2,APIVersion:,ResourceVersion:,FieldPath:,},Hostname:,NodeName:*ip-10-100-187-149.ec2.internal,},},NotReadyAddresses:[]EndpointAddress{},Ports:[]EndpointPort{EndpointPort{Name:otlp-grpc,Port:4317,Protocol:TCP,AppProtocol:*grpc,},EndpointPort{Name:otlp-http,Port:4318,Protocol:TCP,AppProtocol:*http,},},},},}","resolver":"k8s service","msg":"OnUpDate: Old endpoints > 0, deleting them from endpoints. First callback to 'resolve' invoked.","kind":"exporter","caller":"loadbalancingexporter/resolver_k8s_handler.go:60","epRemove":["10.100.148.213","10.100.181.244"],"level":"info"}}
{"stream":"stderr","timestamp":1727987229349,"log":{"name":"loadbalancing","ts":1.7279872293496487E9,"epAdd":["10.100.148.213","10.100.181.244"],"data_type":"traces","newEps":"&Endpoints{ObjectMeta:{opentelemetry-global-gateway-collector-headless  opentelemetry-global-collector  6382211d-bb57-4141-8bed-165f8f002e94 2635273389 0 2024-07-18 16:38:27 +0000 UTC <nil> <nil> map[app.kubernetes.io/component:opentelemetry-collector app.kubernetes.io/instance:opentelemetry-global-collector.opentelemetry-global-gateway app.kubernetes.io/managed-by:opentelemetry-operator app.kubernetes.io/name:opentelemetry-global-gateway-collector app.kubernetes.io/part-of:opentelemetry app.kubernetes.io/version:0.105.0 operator.opentelemetry.io/collector-headless-service:Exists operator.opentelemetry.io/collector-service-type:headless service.kubernetes.io/headless:] map[endpoints.kubernetes.io/last-change-trigger-time:2024-10-02T13:28:28Z] [] [] [{kube-controller-manager Update v1 2024-10-02 13:28:28 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:endpoints.kubernetes.io/last-change-trigger-time\":{}},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/component\":{},\"f:app.kubernetes.io/instance\":{},\"f:app.kubernetes.io/managed-by\":{},\"f:app.kubernetes.io/name\":{},\"f:app.kubernetes.io/part-of\":{},\"f:app.kubernetes.io/version\":{},\"f:operator.opentelemetry.io/collector-headless-service\":{},\"f:operator.opentelemetry.io/collector-service-type\":{},\"f:service.kubernetes.io/headless\":{}}},\"f:subsets\":{}} 
}]},Subsets:[]EndpointSubset{EndpointSubset{Addresses:[]EndpointAddress{EndpointAddress{IP:10.100.148.213,TargetRef:&ObjectReference{Kind:Pod,Namespace:opentelemetry-global-collector,Name:opentelemetry-global-gateway-collector-55695567c-rgz8b,UID:61cdd493-8900-408a-a0fd-8df916f790d7,APIVersion:,ResourceVersion:,FieldPath:,},Hostname:,NodeName:*ip-10-100-157-74.ec2.internal,},EndpointAddress{IP:10.100.181.244,TargetRef:&ObjectReference{Kind:Pod,Namespace:opentelemetry-global-collector,Name:opentelemetry-global-gateway-collector-55695567c-lk86p,UID:31c74954-6e14-4fe7-8a33-9ce1e48013e2,APIVersion:,ResourceVersion:,FieldPath:,},Hostname:,NodeName:*ip-10-100-187-149.ec2.internal,},},NotReadyAddresses:[]EndpointAddress{},Ports:[]EndpointPort{EndpointPort{Name:otlp-grpc,Port:4317,Protocol:TCP,AppProtocol:*grpc,},EndpointPort{Name:otlp-http,Port:4318,Protocol:TCP,AppProtocol:*http,},},},},}","resolver":"k8s service","msg":"OnUpDate: endpoint changes detected, second callback to 'resolve' invoked.","kind":"exporter","caller":"loadbalancingexporter/resolver_k8s_handler.go:77","level":"info"}}
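
Both log entries above appear to carry the same resourceVersion (2635273389) in the old and the new Endpoints object, which is what one would expect from a re-list rather than a real change. The client-go `ResourceEventHandler` documentation notes that `OnUpdate` is also called on a re-list even if nothing changed, so one possible guard is to compare resource versions before doing any work. The following is only a minimal sketch under that assumption: `endpointsMeta` and `isResync` are hypothetical names used for illustration, not types or functions from client-go or from `resolver_k8s_handler.go`.

```go
package main

import "fmt"

// endpointsMeta is a hypothetical stand-in for the metadata of a Kubernetes
// Endpoints object. Only the field needed for the comparison is modelled.
type endpointsMeta struct {
	Name            string
	ResourceVersion string
}

// isResync reports whether an update event carries two copies of the same
// object version, which is what a re-list with no changes would deliver.
func isResync(oldObj, newObj endpointsMeta) bool {
	return oldObj.ResourceVersion == newObj.ResourceVersion
}

func main() {
	// Name and resource version copied from the sample log output.
	oldEps := endpointsMeta{
		Name:            "opentelemetry-global-gateway-collector-headless",
		ResourceVersion: "2635273389",
	}
	newEps := oldEps

	if isResync(oldEps, newEps) {
		fmt.Println("same resourceVersion, skipping update")
	}

	// A made-up newer version, to show the other branch.
	newEps.ResourceVersion = "2635273390"
	if !isResync(oldEps, newEps) {
		fmt.Println("resourceVersion changed, processing update")
	}
}
```

A resource version check only filters out pure re-lists. An update that changes labels or annotations but not the addresses would still pass it, so comparing the address lists themselves is the more thorough guard.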

Additional context

No response

Tarmander added the bug (Something isn't working) and needs triage (New item requiring triage) labels Oct 7, 2024
jpkrohling pushed a commit that referenced this issue Nov 27, 2024
…e update events (#36505)

#### Description
The load balancing exporter's k8sresolver was not handling update events
properly. The `callback` function was being executed once after the cleanup
of old endpoints and again after adding new endpoints. This caused exporter
churn whenever the old and new endpoint lists in an event shared elements.
See the
[documentation](https://pkg.go.dev/k8s.io/client-go/tools/cache#ResourceEventHandler)
for examples where the state might change but the IP addresses would
not, including the regular re-list events that might have zero changes.
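
To illustrate the idea described above, here is a minimal Go sketch of an update handler that compares the old and new address lists and invokes the callback at most once, and only when something actually changed. It is not the code merged in #36505: `diffEndpoints`, `onUpdate`, and the callback signature are hypothetical names chosen for this example, and the address `10.100.0.1` is made up.

```go
package main

import (
	"fmt"
	"sort"
)

// diffEndpoints returns the addresses that appear only in newEps (added)
// and only in oldEps (removed). Both results are sorted for stable output.
func diffEndpoints(oldEps, newEps []string) (added, removed []string) {
	oldSet := make(map[string]struct{}, len(oldEps))
	for _, ep := range oldEps {
		oldSet[ep] = struct{}{}
	}
	newSet := make(map[string]struct{}, len(newEps))
	for _, ep := range newEps {
		newSet[ep] = struct{}{}
	}
	for ep := range newSet {
		if _, ok := oldSet[ep]; !ok {
			added = append(added, ep)
		}
	}
	for ep := range oldSet {
		if _, ok := newSet[ep]; !ok {
			removed = append(removed, ep)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// onUpdate calls callback once if the address set changed and reports
// whether it did so. An update with identical lists is a no-op.
func onUpdate(oldEps, newEps []string, callback func(added, removed []string)) bool {
	added, removed := diffEndpoints(oldEps, newEps)
	if len(added) == 0 && len(removed) == 0 {
		return false
	}
	callback(added, removed)
	return true
}

func main() {
	calls := 0
	cb := func(added, removed []string) {
		calls++
		fmt.Println("added:", added, "removed:", removed)
	}

	// Addresses from the sample log output: old and new lists are identical.
	eps := []string{"10.100.148.213", "10.100.181.244"}
	onUpdate(eps, eps, cb)

	// One backend replaced by a made-up address.
	onUpdate(eps, []string{"10.100.148.213", "10.100.0.1"}, cb)

	fmt.Println("callback invocations:", calls) // prints "callback invocations: 1"
}
```

With the identical lists from the issue's logs the callback never fires, so the exporters for shared endpoints would be left alone.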

#### Link to tracking issue
Fixes #35658. May be related to #35810 as well.

#### Testing
Added tests for no-change onChange call.

shivanthzen pushed a commit to shivanthzen/opentelemetry-collector-contrib that referenced this issue Dec 5, 2024
…e update events (open-telemetry#36505)

ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this issue Dec 6, 2024
…e update events (open-telemetry#36505)

sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
…e update events (open-telemetry#36505)

AkhigbeEromo pushed a commit to sematext/opentelemetry-collector-contrib that referenced this issue Jan 13, 2025
…e update events (open-telemetry#36505)
