KEP 1790: Update recover resize failure KEP for going beta. #3188
Conversation
/assign @deads2k
074d5ab to 1d0746e
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- controller expansion operation duration:
do we want a metric specifically for reducing the size?
Yeah, I think we need a new metric from the external resize controller for this operation, in addition to the reducing-size feature.
Added a metric for counting volumes that have been recovered.
@@ -224,8 +224,7 @@ The complete expansion and recovery flow of both control-plane and kubelet is do

### Risks and Mitigations

- Once expansion is initiated, the lowering of requested size is only allowed upto a value *greater* than `pvc.Status`. It is not possible to entirely go back to previously requested size. This should not be a problem however in-practice because user can retry expansion with slightly higher value than `pvc.Status` and still recover from previously failing expansion request.

## Graduation Criteria

* *Alpha* in 1.23 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
It would be nice if we can roll back all the way to the original size before beta, because that would involve updating API validation. I need to think more deeply about whether we would need the new validation logic to soak for a release before going to beta.
* **How can a rollout fail? Can it impact already running workloads?**
  This change should not impact existing workloads and requires user interaction via reducing pvc capacity.

* **What specific metrics should inform a rollback?**
  No specific metric but if expansion of PVCs are being stuck (can be verified from `pvc.Status.Conditions`)
In the volume expansion PRR, I saw metrics related to errors during certain operations. This seems like a good spot to have similar metrics.
I mentioned a metric for counting expansion failures. The operation duration for recovery is not going to be separate from general expansion, but counting expansion successes and failures is useful, and hence I have included it.
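As a rough illustration of the `pvc.Status.Conditions` check mentioned in the rollback answer above (this sketch is not part of the KEP; it assumes client-go and a kubeconfig at the default location), an operator could list PVCs that still carry resize-related conditions:

```go
// Sketch: print PVCs that still carry resize-related conditions, which is the
// signal suggested above for deciding whether a rollback is needed.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust for in-cluster use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// An empty namespace lists PVCs across all namespaces.
	pvcs, err := clientset.CoreV1().PersistentVolumeClaims("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pvc := range pvcs.Items {
		for _, cond := range pvc.Status.Conditions {
			// A PVC that keeps one of these conditions for a long time is likely stuck.
			if cond.Type == v1.PersistentVolumeClaimResizing ||
				cond.Type == v1.PersistentVolumeClaimFileSystemResizePending {
				fmt.Printf("%s/%s: %s since %s\n", pvc.Namespace, pvc.Name, cond.Type, cond.LastTransitionTime)
			}
		}
	}
}
```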
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.

We have not fully tested upgrade and rollback but as part of beta process we will have it tested.
you mean as a requirement before going to beta?
yes, as a requirement for going beta. I have not yet fully tested it (i.e. upgrade and rollback), but I will test it and add e2e tests, since upgrading and rolling back has some k8s and external-resizer version compatibility issues, as noted in the KEP.
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high error rates on the feature:

The reason of using PV name as a label is - we do not expect this feature to be used in a cluster very often
Please get an opinion from sig-instrumentation on your PR. I think even knowing the overall success and failure counts is useful.
cc @dgrisonnet. What are your opinions on emitting a metric that can potentially contain the volume name as a label? The motivation for using the volume name here is that this metric should be emitted relatively rarely (people are not expected to need to recover from volume expansion failures every day), and hence use of the volume name should be okay.
I agree with David, knowing the overall successes and failures is definitely useful; however, I am wondering if we really need the volume name here. I don't have much knowledge of storage, so could you perhaps walk me through a scenario where volume expansion failures are only happening for one specific volume, and how a cluster administrator could mitigate this issue?
- The user could have made a typo while editing the PVC. Say you have a 10GB PVC and you want to expand it to 100GB, but instead type 1000GB. Now expansion is stuck forever because there may not be enough space in the backend to satisfy 1000GB, and hence this feature allows users to retry expansion with a lower value (say 100GB).
- Maybe it was not a typo, but the available capacity was not obvious to the user and they increased the size to a value which can't be fulfilled.

So those are the scenarios. It still should not be that high.
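For illustration, a minimal sketch of the recovery step in that scenario (the namespace, PVC name, and sizes are placeholders; it assumes client-go and that the `RecoverExpansionFailure` feature gate is enabled, since the request can only be lowered to a value greater than `pvc.Status`):

```go
// Sketch: shrink the requested size of a PVC from a mistyped 1000Gi back to 100Gi.
// The same change can of course be made with kubectl edit / kubectl patch.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Lower spec.resources.requests.storage back to an achievable value.
	patch := []byte(`{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}`)
	_, err = clientset.CoreV1().PersistentVolumeClaims("default").Patch(
		context.TODO(), "my-pvc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```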
Awesome, thank you for the insight. Based on the scenarios you mentioned, I think it should be fine to add a `volume` label to the counter metrics.
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.

Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
what are the values that the `state` label can take? If there are only `success` and `failure`, I would advise splitting the metric in two, as it would make it easier to compute error rates:
- `storage_operation_volume_recovery_total{volume='pvc-abce'}`
- `storage_operation_volume_recovery_failures_total{volume='pvc-abce'}`
ack.
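For reference, a minimal client_golang sketch of the split counters suggested above (the metric names and the `volume` label follow the reviewer's suggestion; the `recordRecovery` helper is hypothetical and the external-resizer may register its metrics differently):

```go
// Sketch of the two counters proposed above. Splitting them means the error rate
// is simply the failures counter divided by the total counter, with no state label filtering.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	volumeRecoveryTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_volume_recovery_total",
			Help: "Total number of volume expansion recovery attempts.",
		},
		[]string{"volume"},
	)
	volumeRecoveryFailuresTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_volume_recovery_failures_total",
			Help: "Number of failed volume expansion recovery attempts.",
		},
		[]string{"volume"},
	)
)

func init() {
	prometheus.MustRegister(volumeRecoveryTotal, volumeRecoveryFailuresTotal)
}

// recordRecovery would be called once per recovery attempt.
func recordRecovery(volumeName string, err error) {
	volumeRecoveryTotal.WithLabelValues(volumeName).Inc()
	if err != nil {
		volumeRecoveryFailuresTotal.WithLabelValues(volumeName).Inc()
	}
}
```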
Suggested change:
- Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
+ Any volume that has been recovered will emit a metric: `storage_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
- Components exposing the metric: kube-controller-manager
- controller expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=expand_volume}
- [Optional] Aggregation method: cumulative counter
For SLIs, it would be better to use the error counter to compute error rates. I am not sure if we have this metric yet, but you would need `storage_operation_total` in addition to `storage_operation_errors_total`.
- Components exposing the metric: kubelet
- node expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=volume_fs_resize}
- [Optional] Aggregation method: cumulative counter
Same comment as for controller expansion operation errors.
actually let me update this section. These error metrics were recently removed and replaced by adding a `status` field to the `volume_operation_total_seconds` metric.
that was definitely the wrong thing to do. Moving forward, I would be in favor of reverting kubernetes/kubernetes#98332 and adding metrics to compute the error rates.
Essentially, the change that was made increased the number of metrics exposed and made it harder to compute the error rate, for no actual benefit.
We consolidated those metrics with the guidance from sig-instrumentation. I think we should take the separate error metric vs consolidated metric discussion offline. For the purposes of this feature, I think we should keep the metrics consistent with how we collect metrics for all other storage operations, and move to a different model later based on the discussion.
Sounds good to me. I wouldn't block this effort because of that; I just wanted to point out that the original error metric approach taken in this KEP was the correct one.
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- controller expansion operation duration:
I have a feeling that having the name of the operation as a label will make the metrics harder to use compared to having metrics dedicated to each operation, such as `storage_operation_expand_volume_duration_seconds`.
From a user perspective, I think it will be harder to know what the different types of operations are on which SLIs can be computed.
Do you perhaps have a list of all the possible operations, so that I can put it into perspective?
These metrics are already there and have been there for a long time (since maybe the 1.11 release or earlier).
Ok
Looking for OperationCompleteHook in pkg/volume, it looks like we have about 14. I think it's been a lot easier to manage with a single metric with a status field; our SLO processing within Google, for instance, would be a lot more toil-y if we had to pull a metric per operation.
- "verify_volumes_are_attached_per_node"
- "verify_volumes_are_attached"
- "volume_attach"
- DetachOperationName
- "volume_mount"
- "volume_unmount"
- "unmount_device"
- "map_volume"
- "unmap_volume"
- "unmap_device"
- "verify_controller_attached_volume"
- "expand_volume"
- "expand_volume"
- "volume_fs_resize"
Yeah, since there are a lot of possible operations, it should be easier to have all of them under the same metric. Just wanted to check whether that was the case here or not.

> I think it's been a lot easier to manage with a single metric with a status field

The expression should be very similar whether we have one or multiple metrics. The difference is that fitting a lot of information in a histogram is very expensive and tends to be counterintuitive to users, who would expect counter metrics to also be present and to have more granularity if necessary.
Also, when looking into the duration of your operations, the status information is superfluous, so we will be exposing it in all the buckets of the histograms even though it will not be useful.
> The expression should be very similar whether we have one or multiple metrics. The difference is that fitting a lot of information in a histogram is very expensive and tends to be counterintuitive to users, who would expect counter metrics to also be present and to have more granularity if necessary.

oh, that's an interesting point.

> Also, when looking into the duration of your operations, the status information is superfluous, so we will be exposing it in all the buckets of the histograms even though it will not be useful.

I'm not sure about this point: if, e.g., certain errors have higher latency, that might be an interesting thing to know?
> I'm not sure about this point: if, e.g., certain errors have higher latency, that might be an interesting thing to know?

I would agree that it may be useful under certain circumstances, but in the majority of cases you only care about the average and the 99th percentile when using the histogram, and these operations tend not to be impacted by errors since they are uncommon and a majority of them return earlier than normal code paths. As such, considering the cost of this information compared to its value, I think logs are better suited, since they would make this information cheaper.
Also, for SLOs, errors are already counted as part of the unavailability, so you don't need to define duration thresholds for them.
I think the choice of what to put into histograms really depends on the scenario, but for Kubernetes in particular, where we try to keep the number of metrics exposed under control, I think the cost vs value of adding the status label isn't necessarily worth it for the histogram metric.
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.

Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
Suggested change:
- Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
+ Tentative name of metric is - `storage_operation_volume_recovery_total{state='success', volume='pvc-abce'}`
PRR is complete for beta. /approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, gnufied, msau42.
Add sections for PRR
xref #1790