KEP 1790: Update recover resize failure KEP for going beta. #3188

Merged: 3 commits into kubernetes:master from add-recover-resizer-prr on Feb 1, 2022

Conversation

gnufied (Member) commented Jan 27, 2022:

Add sections for PRR

xref #1790

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 27, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jan 27, 2022
gnufied (Member, Author) commented Jan 27, 2022:

/assign @deads2k

@gnufied gnufied changed the title Add recover resizer prr KEP 1790: Update recover resize failure KEP for going beta. Jan 27, 2022
@gnufied gnufied force-pushed the add-recover-resizer-prr branch from 074d5ab to 1d0746e on January 27, 2022 13:11
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- controller expansion operation duration:
Member:

Do we want a metric specifically for reducing the size?

gnufied (Member, Author), Jan 31, 2022:

Yeah, I think we need a new metric from the external resize controller for this operation, in addition to one for the size-reduction feature.

gnufied (Member, Author):

Added a metric for counting volumes that have been recovered.

@@ -224,8 +224,7 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
### Risks and Mitigations

- Once expansion is initiated, lowering the requested size is only allowed down to a value *greater* than `pvc.Status`. It is not possible to go entirely back to the previously requested size. In practice, however, this should not be a problem, because the user can retry the expansion with a value slightly higher than `pvc.Status` and still recover from a previously failing expansion request.
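
For illustration, a minimal Go sketch of the rule above, using apimachinery's `resource.Quantity` (the helper name `allowShrink` is hypothetical):

```go
// Illustrative only: a minimal sketch of the lowering rule described
// above, using apimachinery's resource.Quantity.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// allowShrink reports whether a lowered request is acceptable: the new
// request must stay strictly greater than the size recorded in pvc.Status.
func allowShrink(newRequest, statusSize resource.Quantity) bool {
	return newRequest.Cmp(statusSize) > 0
}

func main() {
	status := resource.MustParse("10Gi")    // size in pvc.Status
	retry := resource.MustParse("100Gi")    // lowered retry after a failed 1000Gi request
	fmt.Println(allowShrink(retry, status)) // true: 100Gi > 10Gi
}
```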



## Graduation Criteria

* *Alpha* in 1.23 behind the `RecoverExpansionFailure` feature gate, which defaults to `false`.
Member:

It would be nice if we could roll back all the way to the original size before beta, because that would involve updating API validation. I need to think more deeply about whether we would need the new validation logic to soak for a release before going to beta.

* **How can a rollout fail? Can it impact already running workloads?**
This change should not impact existing workloads and requires user interaction via reducing pvc capacity.

* **What specific metrics should inform a rollback?**
No specific metric, but a rollback is warranted if expansions of PVCs are getting stuck (this can be verified from `pvc.Status.Conditions`).
Contributor:

In the volume expansion PRR, I saw metrics related to errors during certain operations. This seems like a good spot to have similar metrics.

gnufied (Member, Author), Jan 31, 2022:

I mentioned a metric for counting expansion failures. The operation duration for recovery is not going to be separate from general expansion, but counting expansion successes and failures is useful, and hence I have included it.

Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
We have not fully tested upgrade and rollback, but we will have it tested as part of the beta process.
Contributor:

You mean as a requirement before going to beta?

gnufied (Member, Author), Jan 31, 2022:

Yes, as a requirement for going beta. I have not yet fully tested it (i.e., upgrade and rollback), but I will test it and add e2e tests, since upgrading and rolling back has some k8s and external-resizer version compatibility issues, as noted in the KEP.

- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high error rates on the feature:
The reason for using the PV name as a label is that we do not expect this feature to be used in a cluster very often
Contributor:

Please get an opinion from sig-instrumentation on your PR. I think even knowing the overall success and failure counts is useful.

gnufied (Member, Author):

cc @dgrisonnet. What are your opinions on emitting a metric that can potentially contain the volume name as a label? The motivation for using the volume name here is that this metric should be emitted relatively rarely (people are not expected to need to recover from volume expansion failures every day), and hence the use of the volume name should be okay.

Member:

I agree with David: knowing the overall successes and failures is definitely useful. However, I am wondering whether we really need the volume name here. I don't have much knowledge of storage, so could you perhaps walk me through a scenario where volume expansion failures happen only for one specific volume, and how a cluster administrator could mitigate that issue?

gnufied (Member, Author):

1. The user could have made a typo while editing the PVC. Say you have a 10GB PVC and want to expand it to 100GB, but instead type 1000GB. Now the expansion is stuck forever, because there may not be enough space in the backend to satisfy 1000GB; this feature allows users to retry the expansion with a lower value (say 100GB).
2. Maybe it was not a typo, but the available capacity was not obvious to the user, and they increased the size to a value that cannot be fulfilled.

So those are the scenarios. The metric's cardinality still should not be that high.
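
For illustration, a hedged client-go sketch of that recovery step; the namespace "default" and claim name "data-claim" are hypothetical, and error handling is minimal:

```go
// Illustrative only: lowering a mistyped 1000Gi request back to 100Gi
// with client-go, as described in the scenarios above.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	pvcs := client.CoreV1().PersistentVolumeClaims("default")

	pvc, err := pvcs.Get(context.TODO(), "data-claim", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Retry with a value that is still above pvc.Status (10Gi here) but
	// below the stuck 1000Gi request.
	pvc.Spec.Resources.Requests[corev1.ResourceStorage] = resource.MustParse("100Gi")
	if _, err := pvcs.Update(context.TODO(), pvc, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```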

Member:

Awesome, thank you for the insight. Based on the scenarios you mentioned, I think it should be fine to add a volume label to the counter metrics.

checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.

Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
Member:

What are the values that the state label can take?

If there are only success and failure, I would advise splitting the metric in two, as that would make it easier to compute error rates:

- storage_operation_volume_recovery_total{volume='pvc-abce'}
- storage_operation_volume_recovery_failures_total{volume='pvc-abce'}
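
For illustration, a minimal client_golang sketch of that split-counter shape; metric and label names come from the suggestion above, and the registration details are assumptions:

```go
// Illustrative only: the two-counter shape suggested above.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	recoveryTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_volume_recovery_total",
			Help: "Total number of volume expansion recovery operations.",
		},
		[]string{"volume"},
	)
	recoveryFailures = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "storage_operation_volume_recovery_failures_total",
			Help: "Total number of failed volume expansion recovery operations.",
		},
		[]string{"volume"},
	)
)

func init() {
	prometheus.MustRegister(recoveryTotal, recoveryFailures)
}

// On a successful recovery:
//   recoveryTotal.WithLabelValues("pvc-abce").Inc()
// On a failed attempt, increment both:
//   recoveryTotal.WithLabelValues("pvc-abce").Inc()
//   recoveryFailures.WithLabelValues("pvc-abce").Inc()
//
// The error rate is then a single PromQL ratio:
//   rate(storage_operation_volume_recovery_failures_total[5m])
//     / rate(storage_operation_volume_recovery_total[5m])
```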

gnufied (Member, Author):

ack.

checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.

Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
Member:

Suggested change
Any volume that has been recovered will emit a metric: `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
Any volume that has been recovered will emit a metric: `storage_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.

- Components exposing the metric: kube-controller-manager
- controller expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=expand_volume}
- [Optional] Aggregation method: cumulative counter
Member:

For SLIs, it would be better to use the error counter to compute error rates. I am not sure if we have this metric yet, but you would need storage_operation_total in addition to storage_operation_errors_total.

- Components exposing the metric: kubelet
- node expansion operation errors:
- Metric name: storage_operation_errors_total{operation_name=volume_fs_resize}
- [Optional] Aggregation method: cumulative counter
Member:

Same comment as for controller expansion operation errors.

gnufied (Member, Author):

Actually, let me update this section. These error metrics were recently removed and replaced by adding a status field to the volume_operation_total_seconds metric.
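
For illustration, a hedged sketch of that consolidated shape, one duration histogram carrying operation and status labels, via client_golang; the bucket choice, exact label set, and status values are assumptions:

```go
// Illustrative only: a single duration histogram with a status label,
// as described above.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var volumeOperationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "volume_operation_total_seconds",
		Help:    "Duration of storage operations, by operation name and status.",
		Buckets: prometheus.DefBuckets, // bucket choice is an assumption
	},
	[]string{"operation_name", "status"},
)

func init() {
	prometheus.MustRegister(volumeOperationSeconds)
}

// Error rates then come from the histogram's _count series instead of a
// dedicated error counter, e.g. (status value "failed" is an assumption):
//   rate(volume_operation_total_seconds_count{operation_name="expand_volume", status="failed"}[5m])
//     / rate(volume_operation_total_seconds_count{operation_name="expand_volume"}[5m])
```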

Member:

That was definitely the wrong thing to do. Moving forward, I would be in favor of reverting kubernetes/kubernetes#98332 and adding metrics to compute the error rates.

Essentially, the change that was made increased the number of metrics exposed and made it harder to compute the error rate, for no actual benefit.


Member:

We consolidated those metrics with guidance from sig-instrumentation. I think we should take the separate-error-metric vs. consolidated-metric discussion offline. For the purposes of this feature, I think we should keep the metrics consistent with how we collect metrics for all other storage operations, and move to a different model later based on that discussion.

Member:

Sounds good to me. I wouldn't block this effort because of that; I just wanted to point out that the original error-metric approach taken in this KEP was the correct one.

- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- controller expansion operation duration:
Member:

I have a feeling that having the name of the operation as a label will make the metrics harder to use than if we had a dedicated metric for each operation, such as storage_operation_expand_volume_duration_seconds. From a user perspective, I think it will be harder to know which types of operations SLIs can be computed on.

Do you perhaps have a list of all the possible operations, so that I can put it into perspective?

gnufied (Member, Author):

These metrics are already there and have been there for a long time (since maybe the 1.11 release or earlier).

Member:

Ok

Contributor:

Looking for OperationCompleteHook in pkg/volume, it looks like we have about 14. I think it's been a lot easier to manage with a single metric with a status field; our SLO processing within Google, for instance, would be a lot more toil-y if we had to pull a metric per operation.

"verify_volumes_are_attached_per_node"
"verify_volumes_are_attached"
"volume_attach"
DetachOperationName
"volume_mount"
"volume_unmount"
"unmount_device"
"map_volume"
"unmap_volume"
"unmap_device"
"verify_controller_attached_volume"
"expand_volume"
"expand_volume"
"volume_fs_resize"

Member:

Yeah, since there are a lot of possible operations, it should be easier to have all of them under the same metric. I just wanted to check whether that was the case here or not.

I think it's been a lot easier to manage with a single metric with a status field

The expression should be very similar whether we have one or multiple metrics. The difference is that fitting a lot of information into a histogram is very expensive, and it tends to be counterintuitive to users, who would expect counter metrics to also be present and to have more granularity if necessary.

Also, when looking at the duration of your operations, the status information is superfluous, so we will be exposing it in all the buckets of the histograms even though it will not be useful.

Contributor:

The expression should be very similar whether we have one or multiple metrics. The difference is that fitting a lot of information into a histogram is very expensive, and it tends to be counterintuitive to users, who would expect counter metrics to also be present and to have more granularity if necessary.

oh, that's an interesting point.

Also, when looking at the duration of your operations, the status information is superfluous, so we will be exposing it in all the buckets of the histograms even though it will not be useful.

I'm not sure about this point: if, e.g., certain errors have higher latency, that might be an interesting thing to know?

Member:

I'm not sure about this point: if, e.g., certain errors have higher latency, that might be an interesting thing to know?

I would agree that it may be useful under certain circumstances, but in the majority of cases you only care about the average and the 99th percentile when using the histogram, and these operations tend not to be impacted by errors, since errors are uncommon and a majority of them return earlier than normal code paths. As such, considering the cost of this information compared to its value, I think logs are better suited, since they would make this information cheaper to obtain.

Also, for SLOs, errors are already counted as part of the unavailability, so you don't need to define duration thresholds for them.

I think the choice of what to put into histograms really depends on the scenario, but for Kubernetes in particular, where we try to keep the number of metrics exposed under control, I think the cost vs. value of adding the status label isn't necessarily worth it in the histogram metric.
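
For rough scale (illustrative arithmetic, assuming Prometheus' default 11-boundary buckets): each histogram label combination exposes 12 bucket series plus _sum and _count, i.e. 14 series, so the ~14 operations listed earlier already produce about 14 × 14 = 196 series per component. A two-valued status label doubles that to roughly 392 series, whereas a pair of plain counters per operation would add only 28.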

optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
Member:

Suggested change
Tentative name of metric is - `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`
Tentative name of metric is - `storage_operation_volume_recovery_total{state='success', volume='pvc-abce'}`


deads2k (Contributor) commented Feb 1, 2022:

PRR is complete for beta

/approve

msau42 (Member) commented Feb 1, 2022:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2022
k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, gnufied, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9ffa073 into kubernetes:master Feb 1, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 1, 2022