
Latest release breaks collector autoscaler #2018

Closed
cmergenthaler opened this issue Aug 14, 2023 · 9 comments
Labels
bug (Something isn't working), needs-info

Comments

@cmergenthaler
Contributor

cmergenthaler commented Aug 14, 2023

After upgrading to the latest release 0.82.0, I have noticed that the operator scaled my otel-collector down to a replica count lower than the configured autoscaler.minReplicas. The underlying HPA keeps showing the minReplicas count as both the desired and current replica count and also shows the following events:

invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: 
failed to get cpu utilization: did not receive metrics for any ready pods 

My autoscaler is configured as follows, while the actual number of pods is 2:

spec:
  autoscaler:
    minReplicas: 3
    maxReplicas: 6

Any ideas what could cause this issue?
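For reference, this is roughly how the HPA state and the events quoted above can be inspected (the HPA name otel-collector and the monitoring namespace below are assumptions; substitute your own):

# Current vs. desired replicas and the metric the HPA is evaluating
kubectl -n monitoring get hpa otel-collector

# Full status, including the "failed to get cpu utilization" events
kubectl -n monitoring describe hpa otel-collector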

@jaronoff97
Contributor

Looking now

jaronoff97 added the bug and needs-info labels on Aug 14, 2023
@jaronoff97
Contributor

I tested v0.82.0 with the spec you provided and everything worked as expected. Are your collector pods up and healthy? Do they have resource requests and limits set?

(screenshots attached)
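A quick way to check both of those things (pod health, and whether CPU requests are set, which the HPA's cpu utilization metric needs) is something like the following; the label selector and pod name are assumptions based on the operator's usual naming:

# Are the collector pods up and healthy?
kubectl -n monitoring get pods -l app.kubernetes.io/component=opentelemetry-collector

# Do they carry CPU requests? (required for the cpu utilization metric)
kubectl -n monitoring get pod otel-collector-0 -o jsonpath='{.spec.containers[0].resources}'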

@jaronoff97
Contributor

(the label is incorrect, but I'm running an otel-operator pod w/ version 0.82.0)

@cmergenthaler
Contributor Author

cmergenthaler commented Aug 16, 2023

@jaronoff97 Thanks for having a look! So after updating from 0.81.0 to 0.82.0, one of my otel-collector pods gets terminated even though my HPA shows 3 current & desired pods.
The remaining 2 pods are running fine and healthy. Yes, I do have CPU/memory requests and a memory limit set (no CPU limit).
My Pods:
(screenshot attached)
HPA:
(screenshot attached)

Status of my OpenTelemetryCollector CR (note it says 2/2 replicas here even though the HPA desires 3):
(screenshot attached)

I don't understand why the HPA says 3 replicas are running when the OpenTelemetryCollector only displays 2 replicas.

@jaronoff97
Contributor

This is indeed very, very odd... The only thing I can imagine happening is that replicas: 2 is being set on the CRD, which is somehow overriding what you have set for the HPA. Are there any logs from the operator?
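Two quick checks for that theory (resource names below are placeholders):

# Is an explicit spec.replicas set on the CR that could fight the HPA?
kubectl -n <namespace> get opentelemetrycollector <name> -o jsonpath='{.spec.replicas}'

# What does the HPA itself report?
kubectl -n <namespace> get hpa <hpa-name> -o jsonpath='current={.status.currentReplicas} desired={.status.desiredReplicas}'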

@jaronoff97
Contributor

From a quick glance, I don't see anything between the releases that would be causing this, but I'm going to do some more testing on my clusters to check.

@cmergenthaler
Contributor Author

> This is indeed very, very odd... The only thing I can imagine happening is that replicas: 2 is being set on the CRD, which is somehow overriding what you have set for the HPA. Are there any logs from the operator?

When the operator scales down the otel-collector it logs the following:

  ~ k logs -n monitoring opentelemetry-operator-5475947c7b-lznrz
  { "level": "info","ts": "2023-08-17T06:21:02Z","logger": "collector-upgrade","msg": "no upgrade routines are needed for the OpenTelemetry instance","name": "otel","namespace": "monitoring","version": "0.82.0","latest": "0.61.0" }
  { "level": "info","ts": "2023-08-17T06:21:02Z","logger": "collector-upgrade","msg": "skipping upgrade for OpenTelemetry Collector instance","name": "otel","namespace": "monitoring" }
  { "level": "info","ts": "2023-08-17T06:21:02Z","logger": "instrumentation-upgrade","msg": "no instances to upgrade" }
  { "level": "info","ts": "2023-08-17T06:21:03Z","msg": "Starting workers","controller": "opentelemetrycollector","controllerGroup": "opentelemetry.io","controllerKind": "OpenTelemetryCollector","worker count": 1 }
  { "level": "debug","ts": "2023-08-17T06:21:03Z","logger": "events","msg": "OpenTelemetry Config changed - monitoring/otel-targetallocator","type": "Normal","object": { "kind": "ConfigMap","namespace": "monitoring","name": "otel-targetallocator","uid": "744996b4-0e66-4abe-8ccf-22b363952e50","apiVersion": "v1","resourceVersion": "515704176" },"reason": "ConfigUpdate " }

Here are also my settings for the OpenTelemetryCollector (without the config):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: statefulset
  autoscaler:
    minReplicas: 3
    maxReplicas: 6
  resources:
    requests:
      cpu: 900m
      memory: 800Mi
    limits:
      memory: 2Gi
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    replicas: 2
    resources:
      requests:
        cpu: 150m
        memory: 200Mi
      limits:
        cpu: 1000m
        memory: 500Mi
    filterStrategy: relabel-config
    prometheusCR:
      enabled: true

@moh-osman3
Contributor

moh-osman3 commented Aug 18, 2023

Hmm, I've been looking into what could be causing this issue, but I haven't had any luck reproducing it with the provided config either. I'm wondering what images you're using for your targetallocator and collector?

I'm also wondering how exactly you upgraded from v0.81.0 to v0.82.0? Sometimes when I see issues with my Kubernetes resources I try to delete the resources and do a fresh install. To do this I usually delete my namespace (e.g. kubectl delete ns <namespace>). Also, if the admission webhook is enabled, I manually delete the operator's admission webhook objects (see $ kubectl get MutatingWebhookConfiguration and $ kubectl get ValidatingWebhookConfiguration); rough commands are sketched below.

In the past, lingering resources from an old install have given me trouble when upgrading to a new version.
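Roughly, those fresh-install steps look like this (resource names are placeholders; double-check what the get commands return before deleting anything):

# Remove the old install completely
kubectl delete ns <namespace>

# List the operator's leftover webhook configurations (these are cluster-scoped), then delete them
kubectl get mutatingwebhookconfiguration,validatingwebhookconfiguration
kubectl delete mutatingwebhookconfiguration <name>
kubectl delete validatingwebhookconfiguration <name>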

@cmergenthaler
Contributor Author

cmergenthaler commented Aug 18, 2023

@moh-osman3 Thanks for having a look!
I am deploying and upgrading the operator with the helm chart.

I was finally able to track down the error and noticed that an error event occurs on the underlying StatefulSet:
create Pod otel-collector-2 in StatefulSet otel-collector failed error: Pod "otel-collector-2" is invalid: spec.containers[0].ports[3].containerPort: Required value.

The ports in the Pod template look like the following:
Ports: 8888/TCP, 4317/TCP, 4318/TCP, 0/TCP
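For anyone trying to reproduce this, the rejected event and the rendered ports can be pulled straight from the StatefulSet (names/namespace as in my logs above):

# The "containerPort: Required value" event shows up on the StatefulSet itself
kubectl -n monitoring describe statefulset otel-collector

# Dump the ports the operator rendered into the pod template; the 0/TCP entry is the one the API server rejects
kubectl -n monitoring get statefulset otel-collector -o jsonpath='{.spec.template.spec.containers[0].ports}'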

What could cause port zero to be added here? I do use otlp and prometheus receivers.

EDIT: After investigating this, it seems like the zero port is only added if I use a prometheusremotewrite exporter. I'm not sure why this happens, though; maybe something odd happens here?
https://github.com/open-telemetry/opentelemetry-operator/blob/main/internal/manifests/collector/container.go#L217
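Not the operator's actual code, just a minimal Go sketch of the kind of guard that would prevent this: if container ports are inferred from receiver/exporter endpoints, an endpoint without an explicit port can come out as 0 and needs to be dropped before the pod spec is rendered (all names below are made up for the example).

// Hypothetical sketch, not the operator's actual code: filter out any port
// whose value could not be inferred, so the rendered pod spec never contains
// the invalid containerPort: 0 that the API server rejects.
package main

import "fmt"

// ContainerPort mirrors the fields of corev1.ContainerPort that matter here.
type ContainerPort struct {
	Name string
	Port int32
}

// filterValidPorts keeps only ports in the valid 1-65535 range.
func filterValidPorts(ports []ContainerPort) []ContainerPort {
	valid := make([]ContainerPort, 0, len(ports))
	for _, p := range ports {
		if p.Port <= 0 || p.Port > 65535 {
			continue // e.g. the 0/TCP entry derived from the prometheusremotewrite exporter
		}
		valid = append(valid, p)
	}
	return valid
}

func main() {
	ports := []ContainerPort{
		{Name: "metrics", Port: 8888},
		{Name: "otlp-grpc", Port: 4317},
		{Name: "otlp-http", Port: 4318},
		{Name: "prw-exporter", Port: 0}, // the invalid entry from the pod template above
	}
	fmt.Println(filterValidPorts(ports)) // prints only the three valid ports
}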

EDIT 2: Just saw this is the same as #2016 and has already been fixed with #2017. It would be nice to have a patch release for this fix. Closing this, thanks for your help!
