Skip to content

Commit

Permalink
Merge pull request #4406 from sanposhiho/ga-graduate
Browse files Browse the repository at this point in the history
KEP-1610: graduate ContainerResource to stable
  • Loading branch information
k8s-ci-robot authored Feb 8, 2024
2 parents dfded09 + 07db9a1 commit b3fadb9
Show file tree
Hide file tree
Showing 3 changed files with 38 additions and 21 deletions.
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-autoscaling/1610.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
kep-number: 1610
beta:
approver: "@johnbelamaric"
stable:
approver: "@johnbelamaric"
49 changes: 32 additions & 17 deletions keps/sig-autoscaling/1610-container-resource-autoscaling/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -733,7 +733,7 @@ You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->

No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))
No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/123189))

### Rollout, Upgrade and Rollback Planning

Expand Down Expand Up @@ -767,12 +767,14 @@ What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

- The container resource metric takes much longer time compared to other metrics.
which can be monitored via the 1st metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
- Increase the overall performance of HPA controller
which can be monitored via the 2nd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
- Many error occurrence on the container resource metrics
which can be monitored via the 3rd metrics described in [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) section.
- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.
- You should rollback if you see an increase in the overall performance of HPA controller
- `metric_computation_duration_seconds{metric_type=ContainerResource}`: The time(seconds) that the HPA controller takes to calculate one metric.
- You should rollback if you see the container resource metric takes much longer time compared to other metrics.
- `reconciliations_total{error=internal}`: Number of internal errors in reconciliation of HPA controller.
- You should rollback if you see many error occurrence on the reconciliation.
- `metric_computation_total{error=internal,{metric_type=ContainerResource}`: Number of internal errors in the calculation of `type: ContainerResource`.
- You should rollback if you see many error occurrence on the container resource metrics

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Expand All @@ -782,7 +784,6 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

Not yet.
But, as described in [Are there any tests for feature enablement/disablement?](#Are-there-any-tests-for-feature-enablement/disablement?), the tests to confirm the behavior on switching the feature gate will be added. ([issue](https://github.com/kubernetes/kubernetes/issues/115467))

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Expand Down Expand Up @@ -826,7 +827,6 @@ Recall that end users cannot usually observe component logs or access metrics.

- [x] Events
- `SuccessfulRescale` event with `memory/cpu/etc resource utilization (percentage of request) above/below target`
- Note that we cannot know if this reason is due to the `Resource` metric or `ContainerResource` in the current implementation. We'll change this reason for `ContainerResource` to `memory/cpu/etc container resource utilization (percentage of request) above/below target` so that we can distinguish.
- [x] API .status
- When something wrong with the container metrics, `ScalingActive` condition will be false with `FailedGetContainerResourceMetric` reason.

Expand Down Expand Up @@ -861,14 +861,11 @@ Pick one more of these and delete the rest.
- Details:
-->

HPA controller have no metrics in it now.
The following metrics will be implemented by beta. ([issue](https://github.com/kubernetes/kubernetes/issues/115639))
1. How long does each metric type take to compute the ideal replica num.
- so that users can confirm the container resource metric doesn't take long time compared to other metrics.
2. How long does the HPA controller take to complete reconcile one HPA object.
- so that users can confirm the container resource metric doesn't increse the whole time of scaling.
3. Provide the metric to show error occurrence for each metric.
- so that users can confirm no much error occurrence on the container resource metric.
- [x] Metrics
- `metric_computation_duration_seconds`: The time(seconds) that the HPA controller takes to calculate one metric.
- `metric_computation_total`: Number of metric computations.
- `reconciliations_total`: Number of reconciliation of HPA controller.
- `reconciliation_duration_seconds`: The time(seconds) that the HPA controller takes to reconcile once.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Expand Down Expand Up @@ -996,6 +993,20 @@ This through this both in small and large cases, again with respect to the

No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?

Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->

No.

### Troubleshooting

<!--
Expand Down Expand Up @@ -1037,6 +1048,10 @@ For each of them, fill in the following information by copying the below templat

###### What steps should be taken if SLOs are not being met to determine the problem?

Check `metric_computation_duration_seconds` or `reconciliation_duration_seconds` to see which metric encountered the latency issue.
And, if it is a latency problem only specific in `type: ContainerResource`,
you can opt-out this feature by removing the `type: ContainerResource` metric from HPA(s).

## Implementation History

* 2020-04-03 Initial KEP merged
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,16 +12,16 @@ approvers:
- "@josephburnett"
- "@gjtempleton"
creation-date: 2020-02-18
last-updated: 2023-02-02
last-updated: 2024-01-15
status: implementable

latest-milestone: "1.27"
stage: "beta"
latest-milestone: "1.30"
stage: "stable"

milestone:
alpha: "v1.20"
beta: "v1.27"
stable: "v1.29"
stable: "v1.30"

feature-gates:
- name: HPAContainerMetrics
Expand Down

0 comments on commit b3fadb9

Please sign in to comment.