
Latest release breaks collector autoscaler #2018

Closed
cmergenthaler opened this issue Aug 14, 2023 · 9 comments
Labels
bug (Something isn't working), needs-info

Comments

@cmergenthaler
Contributor

cmergenthaler commented Aug 14, 2023

After upgrading to the latest release 0.82.0, I have noticed that the operator scaled my otel-collector down to a replica count lower than the configured autoscaler.minReplicas. The underlying HPA keeps showing the minReplicas count as both the desired and current replica count and also shows the following events:

invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: 
failed to get cpu utilization: did not receive metrics for any ready pods 

My autoscaler is configured as follows, while the actual number of pods is 2:

spec:
  autoscaler:
    minReplicas: 3
    maxReplicas: 6

Any ideas what could cause this issue?
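For reference, this is roughly how the HPA state and the events quoted above can be inspected (the HPA name otel-collector and the monitoring namespace below are assumptions; substitute your own):

# Current vs. desired replicas and the metric the HPA is evaluating
kubectl -n monitoring get hpa otel-collector

# Full status, including the "failed to get cpu utilization" events
kubectl -n monitoring describe hpa otel-collector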

@jaronoff97
Contributor

Looking now

jaronoff97 added the bug and needs-info labels on Aug 14, 2023
@jaronoff97
Contributor

I tested v0.82.0 with the spec you provided and everything worked as expected. Are your collector pods up and healthy? Do they have resource requests and limits set?

(screenshots attached)
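A quick way to check both of those things (pod health, and whether CPU requests are set, which the HPA's cpu utilization metric needs) is something like the following; the label selector and pod name are assumptions based on the operator's usual naming:

# Are the collector pods up and healthy?
kubectl -n monitoring get pods -l app.kubernetes.io/component=opentelemetry-collector

# Do they carry CPU requests? (required for the cpu utilization metric)
kubectl -n monitoring get pod otel-collector-0 -o jsonpath='{.spec.containers[0].resources}'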

@jaronoff97
Contributor

(the label is incorrect, but I'm running an otel-operator pod w/ version 0.82.0)

@cmergenthaler
Contributor Author

cmergenthaler commented Aug 16, 2023

@jaronoff97 Thanks for having a look! So after updating from 0.81.0 to 0.82.0, one of my otel-collector pods gets terminated even though my HPA shows 3 current & desired pods.
The remaining 2 pods are running fine and healthy. Yes, I do have CPU/memory requests and a memory limit set (no CPU limit).
My Pods:
(screenshot attached)
HPA:
(screenshot attached)

Status of my OpenTelemetryCollector CR (note it says 2/2 replicas here even though the HPA desires 3):
(screenshot attached)

I don't understand why the HPA says 3 replicas are running when the OpenTelemetryCollector only displays 2 replicas.

@jaronoff97
Contributor

This is indeed very, very odd... The only thing I can imagine happening is that replicas: 2 is being set on the CRD, which is somehow overriding what you have set for the HPA. Are there any logs from the operator?
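Two quick checks for that theory (resource names below are placeholders):

# Is an explicit spec.replicas set on the CR that could fight the HPA?
kubectl -n <namespace> get opentelemetrycollector <name> -o jsonpath='{.spec.replicas}'

# What does the HPA itself report?
kubectl -n <namespace> get hpa <hpa-name> -o jsonpath='current={.status.currentReplicas} desired={.status.desiredReplicas}'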

@jaronoff97
Contributor

From a quick glance, I don't see anything between the releases that would be causing this, but I'm going to do some more testing on my clusters to check.

@cmergenthaler
Contributor Author

> This is indeed very, very odd... The only thing I can imagine happening is that replicas: 2 is being set on the CRD, which is somehow overriding what you have set for the HPA. Are there any logs from the operator?

When the operator scales down the otel-collector it logs the following:

  ~ k logs -n monitoring opentelemetry-operator-5475947c7b-lznrz
  { "level": "info","ts": "2023-08-17T06:21:02Z","logger": "collector-upgrade","msg": "no upgrade routines are needed for the OpenTelemetry instance","name": "otel","namespace": "monitoring","version": "0.82.0","latest": "0.61.0" }
  { "level": "info","ts": "2023-08-17T06:21:02Z","logger": "collector-upgrade","msg": "skipping upgrade for OpenTelemetry Collector instance","name": "otel","namespace": "monitoring" }
  { "level": "info","ts": "2023-08-17T06:21:02Z","logger": "instrumentation-upgrade","msg": "no instances to upgrade" }
  { "level": "info","ts": "2023-08-17T06:21:03Z","msg": "Starting workers","controller": "opentelemetrycollector","controllerGroup": "opentelemetry.io","controllerKind": "OpenTelemetryCollector","worker count": 1 }
  { "level": "debug","ts": "2023-08-17T06:21:03Z","logger": "events","msg": "OpenTelemetry Config changed - monitoring/otel-targetallocator","type": "Normal","object": { "kind": "ConfigMap","namespace": "monitoring","name": "otel-targetallocator","uid": "744996b4-0e66-4abe-8ccf-22b363952e50","apiVersion": "v1","resourceVersion": "515704176" },"reason": "ConfigUpdate " }

Here are also my settings for the OpenTelemetryCollector (without the config):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: statefulset
  autoscaler:
    minReplicas: 3
    maxReplicas: 6
  resources:
    requests:
      cpu: 900m
      memory: 800Mi
    limits:
      memory: 2Gi
  targetAllocator:
    enabled: true
    allocationStrategy: consistent-hashing
    replicas: 2
    resources:
      requests:
        cpu: 150m
        memory: 200Mi
      limits:
        cpu: 1000m
        memory: 500Mi
    filterStrategy: relabel-config
    prometheusCR:
      enabled: true

@moh-osman3
Contributor

moh-osman3 commented Aug 18, 2023

Hmm, I've been looking into what could be causing this issue, but I haven't had any luck reproducing it with the provided config either. I'm wondering what images you're using for your targetallocator and collector?

I'm also wondering how exactly you upgraded from v0.81.0 to v0.82.0? Sometimes when I see issues with my Kubernetes resources I try to delete the resources and do a fresh install. To do this I usually delete my namespace (e.g. kubectl delete ns <namespace>). Also, if the admission webhook is enabled, I manually delete the operator's admission webhook objects (see $ kubectl get MutatingWebhookConfiguration and $ kubectl get ValidatingWebhookConfiguration); rough commands are sketched below.

In the past, lingering resources from an old install have given me trouble when upgrading to a new version.
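Roughly, those fresh-install steps look like this (resource names are placeholders; double-check what the get commands return before deleting anything):

# Remove the old install completely
kubectl delete ns <namespace>

# List the operator's leftover webhook configurations (these are cluster-scoped), then delete them
kubectl get mutatingwebhookconfiguration,validatingwebhookconfiguration
kubectl delete mutatingwebhookconfiguration <name>
kubectl delete validatingwebhookconfiguration <name>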

@cmergenthaler
Contributor Author

cmergenthaler commented Aug 18, 2023

@moh-osman3 Thanks for having a look!
I am deploying and upgrading the operator with the helm chart.

I was finally able to track down the error and noticed that an error event occurs on the underlying StatefulSet:
create Pod otel-collector-2 in StatefulSet otel-collector failed error: Pod "otel-collector-2" is invalid: spec.containers[0].ports[3].containerPort: Required value.

The ports in the Pod template look like the following:
Ports: 8888/TCP, 4317/TCP, 4318/TCP, 0/TCP
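For anyone trying to reproduce this, the rejected event and the rendered ports can be pulled straight from the StatefulSet (names/namespace as in my logs above):

# The "containerPort: Required value" event shows up on the StatefulSet itself
kubectl -n monitoring describe statefulset otel-collector

# Dump the ports the operator rendered into the pod template; the 0/TCP entry is the one the API server rejects
kubectl -n monitoring get statefulset otel-collector -o jsonpath='{.spec.template.spec.containers[0].ports}'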

What could cause port zero to be added here? I do use otlp and prometheus receivers.

EDIT: After investigating this, it seems like the zero port is only added if I use a prometheusremotewrite exporter. I'm not sure why this happens, though; maybe something odd happens here?
https://github.com/open-telemetry/opentelemetry-operator/blob/main/internal/manifests/collector/container.go#L217
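Not the operator's actual code, just a minimal Go sketch of the kind of guard that would prevent this: if container ports are inferred from receiver/exporter endpoints, an endpoint without an explicit port can come out as 0 and needs to be dropped before the pod spec is rendered (all names below are made up for the example).

// Hypothetical sketch, not the operator's actual code: filter out any port
// whose value could not be inferred, so the rendered pod spec never contains
// the invalid containerPort: 0 that the API server rejects.
package main

import "fmt"

// ContainerPort mirrors the fields of corev1.ContainerPort that matter here.
type ContainerPort struct {
	Name string
	Port int32
}

// filterValidPorts keeps only ports in the valid 1-65535 range.
func filterValidPorts(ports []ContainerPort) []ContainerPort {
	valid := make([]ContainerPort, 0, len(ports))
	for _, p := range ports {
		if p.Port <= 0 || p.Port > 65535 {
			continue // e.g. the 0/TCP entry derived from the prometheusremotewrite exporter
		}
		valid = append(valid, p)
	}
	return valid
}

func main() {
	ports := []ContainerPort{
		{Name: "metrics", Port: 8888},
		{Name: "otlp-grpc", Port: 4317},
		{Name: "otlp-http", Port: 4318},
		{Name: "prw-exporter", Port: 0}, // the invalid entry from the pod template above
	}
	fmt.Println(filterValidPorts(ports)) // prints only the three valid ports
}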

EDIT 2: Just saw this is the same as #2016 and has already been fixed with #2017. It would be nice to have a patch release for this fix. Closing this, thanks for your help!
