KEP 1790: Update recover resize failure KEP for going beta. #3188
Conversation
/assign @deads2k
074d5ab to 1d0746e
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- controller expansion operation duration:
do we want a metric specifically for reducing the size?
Yeah, I think we need a new metric from the external resize controller for this operation, in addition to the reducing-size feature.
Added a metric for counting volumes that have been recovered.
@@ -224,8 +224,7 @@ The complete expansion and recovery flow of both control-plane and kubelet is do

### Risks and Mitigations

- Once expansion is initiated, the lowering of requested size is only allowed upto a value *greater* than `pvc.Status`. It is not possible to entirely go back to previously requested size. This should not be a problem however in-practice because user can retry expansion with slightly higher value than `pvc.Status` and still recover from previously failing expansion request.

## Graduation Criteria

* *Alpha* in 1.23 behind `RecoverExpansionFailure` feature gate with set to a default of `false`.
It would be nice if we can roll back all the way to the original size before beta, because that would involve updating API validation. I need to think more deeply about whether we would need the new validation logic to soak for a release before going to beta.
* **How can a rollout fail? Can it impact already running workloads?**
  This change should not impact existing workloads and requires user interaction via reducing pvc capacity.

* **What specific metrics should inform a rollback?**
  No specific metric but if expansion of PVCs are being stuck (can be verified from `pvc.Status.Conditions`)
In the volume expansion PRR, I saw metrics related to errors during certain operations. This seems like a good spot to have similar metrics.
I mentioned a metric for counting expansion failures. The operation duration for recovery is not going to be separate from general expansion, but counting expansion successes and failures is useful, and hence I have included it.
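As a rough illustration of the `pvc.Status.Conditions` check mentioned in the rollback answer above (this sketch is not part of the KEP; it assumes client-go and a kubeconfig at the default location), an operator could list PVCs that still carry resize-related conditions:

```go
// Sketch: print PVCs that still carry resize-related conditions, which is the
// signal suggested above for deciding whether a rollback is needed.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust for in-cluster use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// An empty namespace lists PVCs across all namespaces.
	pvcs, err := clientset.CoreV1().PersistentVolumeClaims("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pvc := range pvcs.Items {
		for _, cond := range pvc.Status.Conditions {
			// A PVC that keeps one of these conditions for a long time is likely stuck.
			if cond.Type == v1.PersistentVolumeClaimResizing ||
				cond.Type == v1.PersistentVolumeClaimFileSystemResizePending {
				fmt.Printf("%s/%s: %s since %s\n", pvc.Namespace, pvc.Name, cond.Type, cond.LastTransitionTime)
			}
		}
	}
}
```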
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.

We have not fully tested upgrade and rollback but as part of beta process we will have it tested.
you mean as a requirement before going to beta?
yes, as a requirement for going beta. I have not yet fully tested it (i.e. upgrade and rollback), but I will test it and add e2e tests, since upgrading and rolling back has some k8s and external-resizer version compatibility issues, as noted in the KEP.
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high error rates on the feature:

The reason of using PV name as a label is - we do not expect this feature to be used in a cluster very often
Please get an opinion from sig-instrumentation on your PR. I think even knowing the overall success and failure counts is useful.
cc @dgrisonnet. What are your opinions on emitting a metric that can potentially contain the volume name as a label? The motivation for using the volume name here is that this metric should be emitted relatively rarely (people are not expected to need to recover from volume expansion failures every day), and hence use of the volume name should be okay.
I agree with David, knowing the overall successes and failures is definitely useful; however, I am wondering if we really need the volume name here. I don't have much knowledge of storage, so could you perhaps walk me through a scenario where volume expansion failures are only happening for one specific volume, and how a cluster administrator could mitigate this issue?
- The user could have made a typo while editing the PVC. Say you have a 10GB PVC and you want to expand it to 100GB, but instead type 1000GB. Now expansion is stuck forever because there may not be enough space in the backend to satisfy 1000GB, and hence this feature allows users to retry expansion with a lower value (say 100GB).
- Maybe it was not a typo, but the available capacity was not obvious to the user and they increased the size to a value which can't be fulfilled.

So those are the scenarios. It still should not be that high.
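For illustration, a minimal sketch of the recovery step in that scenario (the namespace, PVC name, and sizes are placeholders; it assumes client-go and that the `RecoverExpansionFailure` feature gate is enabled, since the request can only be lowered to a value greater than `pvc.Status`):

```go
// Sketch: shrink the requested size of a PVC from a mistyped 1000Gi back to 100Gi.
// The same change can of course be made with kubectl edit / kubectl patch.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Lower spec.resources.requests.storage back to an achievable value.
	patch := []byte(`{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}`)
	_, err = clientset.CoreV1().PersistentVolumeClaims("default").Patch(
		context.TODO(), "my-pvc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```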
Awesome, thank you for the insight. Based on the scenarios you mentioned, I think it should be fine to add a `volume` label to the counter metrics.
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.

Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
what are the values that the `state` label can take? If there are only `success` and `failure`, I would advise splitting the metric in two, as it would make it easier to compute error rates:
- `storage_operation_volume_recovery_total{volume='pvc-abce'}`
- `storage_operation_volume_recovery_failures_total{volume='pvc-abce'}`
ack.
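For reference, a minimal client_golang sketch of the split counters suggested above (the metric names and the `volume` label follow the reviewer's suggestion; the `recordRecovery` helper is hypothetical and the external-resizer may register its metrics differently):

```go
// Sketch of the two counters proposed above. Splitting them means the error rate
// is simply the failures counter divided by the total counter, with no state label filtering.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	volumeRecoveryTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_volume_recovery_total",
			Help: "Total number of volume expansion recovery attempts.",
		},
		[]string{"volume"},
	)
	volumeRecoveryFailuresTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_volume_recovery_failures_total",
			Help: "Number of failed volume expansion recovery attempts.",
		},
		[]string{"volume"},
	)
)

func init() {
	prometheus.MustRegister(volumeRecoveryTotal, volumeRecoveryFailuresTotal)
}

// recordRecovery would be called once per recovery attempt.
func recordRecovery(volumeName string, err error) {
	volumeRecoveryTotal.WithLabelValues(volumeName).Inc()
	if err != nil {
		volumeRecoveryFailuresTotal.WithLabelValues(volumeName).Inc()
	}
}
```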
Suggested change:
- Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
+ Any volume that has been recovered will emit a metric: `storage_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
- Components exposing the metric: kube-controller-manager
- controller expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=expand_volume}
- [Optional] Aggregation method: cumulative counter
For SLIs, it would be better to use the error counter to compute error rates. I am not sure if we have this metric yet, but you would need `storage_operation_total` in addition to `storage_operation_errors_total`.
- Components exposing the metric: kubelet
- node expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=volume_fs_resize}
- [Optional] Aggregation method: cumulative counter
Same comment as for controller expansion operation errors.
actually let me update this section. These error metrics were recently removed and replaced by adding a `status` field to the `volume_operation_total_seconds` metric.
that was definitely the wrong thing to do. Moving forward, I would be in favor of reverting kubernetes/kubernetes#98332 and adding metrics to compute the error rates.
Essentially, the change that was made increased the number of metrics exposed and made it harder to compute the error rate, for no actual benefit.
We consolidated those metrics with the guidance from sig-instrumentation. I think we should take the separate error metric vs consolidated metric discussion offline. For the purposes of this feature, I think we should keep the metrics consistent with how we collect metrics for all other storage operations, and move to a different model later based on the discussion.
Sounds good to me. I wouldn't block this effort because of that; I just wanted to point out that the original error metric approach taken in this KEP was the correct one.
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- controller expansion operation duration:
I have a feeling that having the name of the operation as a label will make the metrics harder to use compared to having metrics dedicated to each operation, such as `storage_operation_expand_volume_duration_seconds`.
From a user perspective, I think it will be harder to know what the different types of operations are on which SLIs can be computed.
Do you perhaps have a list of all the possible operations, so that I can put it into perspective?
These metrics are already there and have been there for a long time (since maybe the 1.11 release or earlier).
Ok
Looking for OperationCompleteHook in pkg/volume, it looks like we have about 14. I think it's been a lot easier to manage with a single metric with a status field; our SLO processing within Google, for instance, would be a lot more toil-y if we had to pull a metric per operation.
- "verify_volumes_are_attached_per_node"
- "verify_volumes_are_attached"
- "volume_attach"
- DetachOperationName
- "volume_mount"
- "volume_unmount"
- "unmount_device"
- "map_volume"
- "unmap_volume"
- "unmap_device"
- "verify_controller_attached_volume"
- "expand_volume"
- "expand_volume"
- "volume_fs_resize"
Yeah, since there are a lot of possible operations, it should be easier to have all of them under the same metric. Just wanted to check whether that was the case here or not.

> I think it's been a lot easier to manage with a single metric with a status field

The expression should be very similar whether we have one or multiple metrics. The difference is that fitting a lot of information in a histogram is very expensive and tends to be counterintuitive to users, who would expect counter metrics to also be present and to have more granularity if necessary.
Also, when looking into the duration of your operations, the status information is superfluous, so we will be exposing it in all the buckets of the histograms even though it will not be useful.
> The expression should be very similar whether we have one or multiple metrics. The difference is that fitting a lot of information in a histogram is very expensive and tends to be counterintuitive to users, who would expect counter metrics to also be present and to have more granularity if necessary.

oh, that's an interesting point.

> Also, when looking into the duration of your operations, the status information is superfluous, so we will be exposing it in all the buckets of the histograms even though it will not be useful.

I'm not sure about this point: if, e.g., certain errors have higher latency, that might be an interesting thing to know?
> I'm not sure about this point: if, e.g., certain errors have higher latency, that might be an interesting thing to know?

I would agree that it may be useful under certain circumstances, but in the majority of cases you only care about the average and the 99th percentile when using the histogram, and these operations tend not to be impacted by errors since they are uncommon and a majority of them return earlier than normal code paths. As such, considering the cost of this information compared to its value, I think logs are better suited, since they would make this information cheaper.
Also, for SLOs, errors are already counted as part of the unavailability, so you don't need to define duration thresholds for them.
I think the choice of what to put into histograms really depends on the scenario, but for Kubernetes in particular, where we try to keep the number of metrics exposed under control, I think the cost vs value of adding the status label isn't necessarily worth it for the histogram metric.
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.

Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
Suggested change:
- Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
+ Tentative name of metric is - `storage_operation_volume_recovery_total{state='success', volume='pvc-abce'}`
PRR is complete for beta. /approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, gnufied, msau42.
Add sections for PRR
xref #1790