KEP-3762: PersistentVolume last phase transition time #3796
From verify CI job:
Looks good enough for alpha otherwise.
Can you run update-toc.sh to fix the toc syntax error?
Please add a file in the following directory for PRR approver:
last used, which is when the volume transitioned to `Released` phase.

We can approach the solution in a more generic way and record a timestamp of when the volume transitioned to any phase,
not just to `Released` phase. This allows anyone, incl. our perf tests, to measure time e.g. between a PV `Pending` and `Bound`
also for providing metrics/SLO
Added.
Changes required for this KEP:

* kube-apiserver
  * extend PersistentVolumeStatus type with `LastPhaseTransitionTime` field
Can you add the proposed type changes?
Added.
d56fd5c to 2491dff
1) Introduce a new status field in PersistentVolumes.
2) Update the new field with a timestamp every time a volume transitions to a different phase (`pv.Status.Phase`).
3) Improve general observability and allow SLO definitions for Attach/Detach volume operations.
I'm going to take back the use case for an SLO. I think an SLO may be better served by metrics improvements rather than by adding timestamps to API objects, especially if we want to consider deletion cases.
Ack, removing this from goals.
37fd7d7 to adee430
#### Beta

- Allowing time for feedback (at least 2 releases between beta and GA).
- Add unit tests covering feature enablement/disablement.
Can you move it to Alpha? If we want anyone to enable it in alpha (if not then it's not particularly useful), this one seems important.
Indeed, moved.
No change in cluster upgrade / downgrade process.

When downgrading from a version that added the new timestamp field PVs we need to make sure that after downgrade the
values of the disabled field are removed.
How exactly are you going to achieve it?
I'm assuming you're really talking about "eventual removal" here, right?
[the controller will unset it on the next processing of that object if it realizes the FG is disabled and the field is set]
Yes, but the term "next processing" is still quite broad; it would help to specify it more. The field presence check and resetting could be done:
- While updating volume status (`updateVolumePhase()`) - the downside is that disabling the FG will have no effect until the volume changes phase again. This does not seem optimal.
- At any PV update - some PRs (e.g. this one) implement this by resetting the value of the new field to `nil` during validation. This approach seems OK; is it a viable option for this enhancement as well? It also presents a similar caveat as when the FG is enabled - the effect won't be immediate.
- Make the PV controller actively (on each sync) search for PVs with the new field and reset it if the FG is disabled. This seems like overkill. If the value is correct and reflects the real last phase change, there's no risk in persisting it in the PV objects even if the FG is disabled.
I agree that option (3) is probably overkill - this is what I meant above by "eventual removal".
I'm not sure (1) isn't good enough, but whatever decision you make, it should be explicitly mentioned in the KEP.
Also cc @msau42 for her thoughts on it.
When a feature is disabled, the api server strategy implementation should drop disabled fields on write, so approach 2) should cover it.
automations, so be extremely careful here.
-->

No. It only adds a new informative field to PV status.
So technically the answer is yes.
You're not changing anything in your cluster, and your PVs will start to contain a new field set.
Ok, I'll change it.
Done.
###### What happens if we reenable the feature if it was previously rolled back?

No issues expected, after rollback the field can be `nil` and validation should allow updates from `nil`.
I guess I missed that in the proposal, but when exactly will this field be set?
I'm assuming that the status computed in the controller will be different and we will be patching the object.
If so, what happens after enabling the feature for the first time?
We have no idea when the last transition happened, so what are we going to do:
- will the field remain unset?
- are we going to (somewhat misleadingly) set it to now?
[I guess the former, but then let's also add to Caveats section that the field will remain unset until the first change in PV status even after enabling the feature.]
> If so, what happens after enabling the feature for the first time?

If the feature is enabled for the first time, there are two sub-cases where the phase is modified that we need to think about:
- While the provisioner creates a PV, its phase is set to `Bound` - at this point I'd rather not set the field because it would be misleading, as you said. This field is meant for the "last phase change", so it does not make sense to insert something else (like the current time) just for the sake of having some value.
- Updating volume status (`updateVolumePhase()`) - this is when the new `LastPhaseTransitionTime` field and its value appear in the PV object for the first time. So it's a good idea to write down the caveats, since enabling the FG won't have an immediate effect.
OK - so please put that explicitly into your KEP so it's clear what exactly we can expect from this field.
Updated this section. Also adding one more design detail to explicitly state that we want to allow timestamp updates, not just from `nil`.
The KEP lgtm once the comments are addressed
adee430 to 2d16991
Just one minor nit - other than that LGTM.
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->

Yes. This will result in the timestamp value being set to `nil`. Mentioned in "Upgrade / Downgrade Strategy" section.
... timestamp value being eventually set to `nil`. More details in ...
Done.
When downgrading from a version that added the new timestamp field PVs we need to make sure that after downgrade the
values of the disabled field are removed. We intend to use API server strategy implementation, more specifically PV
validation, to remove the values - each time a PV gets validated we will set the value to `nil` if feature gate is disabled.
Actually one more thing - k8s validation isn't generally supposed to change anything.
I think what you really want is to add that to the PrepareForCreate/PrepareForUpdate methods:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/persistentvolume/strategy.go
Right, this is where I was aiming but shouldn't have used the term "validation" this way. I changed this to be specific about the strategy and the concrete methods.
Based on exploratory testing we will define an appropriate time tolerance which will represent maximum time limit for
the volume to transition phase.
Won't this result in a flaky test?
Technically yes but this kind of polling is widely used in tests already. For example here. So this will not flake more than any other test that has timeout for some specific action to happen.
- Feature implemented behind a feature flag
- Unit tests completed and enabled
- Add unit tests covering feature enablement/disablement.
- Initial e2e tests completed and enabled
How do you add e2e tests for an alpha feature when the feature is behind a feature gate?
There are dedicated alpha suites for this purpose.
- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
Might be worth measuring the time it takes to transition phases.
But that depends on what exactly is happening (e.g. type of volume, etc.)
I wouldn't bundle it together with this KEP, at least at this point.
2d16991 to eb33be8
Comments are addressed. /lgtm
/lgtm Thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: RomanBednar, wojtek-t, xing-yang
PersistentVolume last phase transition time #3762