[receiver/k8scluster] Detecting API deprecation in k8s cluster receiver #27907

Closed
omrozowicz-splunk opened this issue Oct 23, 2023 · 7 comments

@omrozowicz-splunk
Contributor

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

The problem is that some k8s clusters (GKE) can be auto-upgraded only if none of the soon-to-be-removed APIs are called. As stated in the GKE documentation:

If GKE detects usage of a deprecated feature or API, GKE pauses automatic upgrades to prevent your cluster from being upgraded into a broken state. Upgrades to the next Kubernetes minor version are paused, but GKE continues to deliver patch upgrades to the cluster on the current minor version

GKE can resume automatic upgrades only if there were no calls to deprecated endpoints for the last 30 days, or once the current version reaches end-of-life.

We had such issues on the project that uses the contrib repo:
signalfx/splunk-otel-collector-chart#918
signalfx/splunk-otel-collector-chart#897

So if GKE users run an OTel collector on their k8s cluster and use the k8s cluster receiver, it might block the automatic upgrade feature for them. We had this problem when v2beta2/horizontalpodautoscaler and v1beta1/cronjob were enabled in the k8s cluster receiver (they were disabled in this PR: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/26516/files).

One way to fix this problem, which we already used for the beta APIs above, is to simply drop the removed kinds. That approach has one drawback: the receiver stops collecting metrics for those kinds on older clusters. For example, users of k8s <1.21 can no longer gather any HPA or CronJob metrics.

Describe the solution you'd like

Right now we don't have any beta APIs in the versions we support, but in case this situation happens again we could provide one of these solutions:

  1. Always use the newest API group: We can provide a few groups - usually, that would be two, one beta and a future one - but poll metrics only from the latest one. Only in case the latest one is not yet supported on the current cluster, we use the older one.

  2. Provide functionality to disable certain API groups: Another option that would probably be easier to implement is providing a way to configure API groups that we want to exclude from polling. That might be something like:

k8scluster:
  excludeVersionKind:
    - cronjob/v1beta1

The drawback would be that the user has to know what to exclude, so in the case of GKE, they’d need to wait for an additional 30 days after disabling the soon-to-be-removed version.
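To make the two options more concrete, here is a rough sketch of how the selection could work. It is not receiver code: `pickVersion` and the plain maps are hypothetical stand-ins, and in a real receiver the set of served versions would presumably come from the Kubernetes discovery API and the exclusion set from something like the `excludeVersionKind` setting proposed above. Candidates are listed newest first, so the older version is only used when the newer one is missing.

```go
package main

import "fmt"

// pickVersion is a hypothetical helper. It returns the first group/version in
// candidates (ordered newest first) that the cluster serves and that the user
// has not excluded. The second result is false when nothing qualifies, in
// which case the receiver would skip the kind entirely.
func pickVersion(candidates []string, served, excluded map[string]bool) (string, bool) {
	for _, gv := range candidates {
		if excluded[gv] {
			continue // option 2: the user asked us never to call this version
		}
		if served[gv] {
			return gv, true // option 1: newest served version wins
		}
	}
	return "", false
}

func main() {
	// Candidate versions for CronJob, newest first.
	cronJob := []string{"batch/v1", "batch/v1beta1"}

	// Assumed discovery results: an older cluster that only serves the beta
	// version, and a newer cluster that still serves both.
	oldCluster := map[string]bool{"batch/v1beta1": true}
	newCluster := map[string]bool{"batch/v1": true, "batch/v1beta1": true}

	// User-provided exclusions (option 2).
	excludeBeta := map[string]bool{"batch/v1beta1": true}

	gv, ok := pickVersion(cronJob, oldCluster, nil)
	fmt.Println("old cluster:", gv, ok)

	gv, ok = pickVersion(cronJob, newCluster, nil)
	fmt.Println("new cluster:", gv, ok)

	gv, ok = pickVersion(cronJob, oldCluster, excludeBeta)
	fmt.Println("old cluster, beta excluded:", gv, ok)
}
```

With this shape the deprecated version is never called on a cluster that already serves its replacement, while older clusters keep their metrics unless the user excludes the version explicitly.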

Describe alternatives you've considered

No response

Additional context

No response

@omrozowicz-splunk omrozowicz-splunk added the enhancement and needs triage labels Oct 23, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jvoravong
Contributor

The root issue here could potentially impact the auto-upgrade feature for any major Kubernetes distribution, not just GKE.

@crobert-1
Member

Sounds like a valid enhancement to me. I'll have to defer to code owners to decide which option is the best going forward as I don't have much experience here.

@crobert-1 crobert-1 removed the needs triage label Nov 13, 2023
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 15, 2024
@crobert-1 crobert-1 removed the Stale label Jan 16, 2024
@povilasv
Contributor

Hey, this makes sense. I like the first approach:

Always use the newest API group: We can provide a few groups - usually, that would be two, one beta and a future one - but poll metrics only from the latest one. Only in case the latest one is not yet supported on the current cluster, we use the older one.

But given that we don't have beta APIs anymore, let's not add this now, as it would be hard to test and write code for a problem we don't have right now.

@github-actions github-actions bot added the Stale label Mar 18, 2024
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned May 17, 2024