Skip to content

Commit

Permalink
Update recovery kep with new APIs
Browse files Browse the repository at this point in the history
  • Loading branch information
gnufied committed Feb 2, 2023
1 parent cc74f30 commit 6705700
Show file tree
Hide file tree
Showing 2 changed files with 64 additions and 25 deletions.
84 changes: 62 additions & 22 deletions keps/sig-storage/1790-recover-resize-failure/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,28 +107,68 @@ As part of this proposal, we are mainly proposing three changes:
- NodeExpansionFailed // state set when expansion has failed in kubelet with a terminal error. Transient errors don't set NodeExpansionFailed.
3. Update quota code to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` when evaluating usage for PVC.

### Making resizeStatus more general

After some discussion with sig-storage folks and to accommodate changes coming from https://github.com/kubernetes/enhancements/issues/3751 we are proposing that we rename `pvc.Status.ResizeStatus` to `pvc.Status.AllocatedResourceStatus` and make it a map.

So basically new API will look like:

```go
type ClaimResourceResizeStatus string

const (
PersistentVolumeClaimControllerExpansionInProgress ClaimResourceResizeStatus = "ControllerExpansionInProgress"
PersistentVolumeClaimControllerExpansionFailed ClaimResourceResizeStatus = "ControllerExpansionFailed"

PersistentVolumeClaimNodeExpansionPending ClaimResourceResizeStatus = "NodeExpansionPending"
PersistentVolumeClaimNodeExpansionInProgress ClaimResourceResizeStatus = "NodeExpansionInProgress"
PersistentVolumeClaimNodeExpansionFailed ClaimResourceResizeStatus = "NodeExpansionFailed"
)

// PersistentVolumeClaimStatus represents the status of PV claim
type PersistentVolumeClaimStatus struct {
// Phase represents the current phase of PersistentVolumeClaim
// +optional
Phase PersistentVolumeClaimPhase
// AccessModes contains all ways the volume backing the PVC can be mounted
// +optional
AccessModes []PersistentVolumeAccessMode
// Represents the actual resources of the underlying volume
// +optional
Capacity ResourceList
// +optional
AllocatedResources ResourceList
// AllocatedResourceStatus stores a map of resource that is being expanded
// and possible status that the resource exists in.
// Some examples may be:
// pvc.status.allocatedResourceStatus["storage"] = "ControllerExpansionInProgress"
AllocatedResourceStatus map[ResourceName]ClaimResourceResizeStatus
}
```


### Implementation

We propose that by relaxing validation on PVC update to allow users to reduce `pvc.Spec.Resources`, it becomes possible to cancel previously issued expansion requests or retry expansion with a lower value if previous request has not been successful. In general - we know that volume plugins are designed to never perform actual shrinking of the volume, for both in-tree and CSI volumes. Moreover if a previously issued expansion has been successful and user
reduces the PVC request size, for both CSI and in-tree plugins they are designed to return a successful response with NO-OP. So, reducing requested size will be a safe operation and will never result in data loss or actual shrinking of volume.

We however do have a problem with quota calculation because if a previously issued expansion is successful but is not recorded(or partially recorded) in api-server and user reduces requested size of the PVC, then quota controller will assume it as actual shrinking of volume and reduce used storage size by the user(incorrectly). Since we know actual size of the volume only after performing expansion(either on node or controller), allowing quota to be reduced on PVC size reduction will allow an user to abuse the quota system.

To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When user expands the PVC, and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion and it will set `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.
To solve aforementioned problem - we propose that, a new field will be added to PVC, called `pvc.Status.AllocatedResources`. When user expands the PVC, and when expansion-controller starts volume expansion - it will set `pvc.Status.AllocatedResources` to user requested value in `pvc.Spec.Resources` before performing expansion and it will set `pvc.Status.AllocatedResourceStatus[storage]` to `ControllerExpansionInProgress`. The quota calculation will be updated to use `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` which will ensure that abusing quota will not be possible.

Resizing operation in external resize controller will always work towards full-filling size recorded in `pvc.Status.AllocatedResources` and only when previous operation has finished(i.e `pvc.Status.ResizeStatus` is nil) or when previous operation has failed with a terminal error - it will use new user requested value from `pvc.Spec.Resources`.
Resizing operation in external resize controller will always work towards full-filling size recorded in `pvc.Status.AllocatedResources` and only when previous operation has finished(i.e `pvc.Status.AllocatedResourceStatus[storage]` is nil) or when previous operation has failed with a terminal error - it will use new user requested value from `pvc.Spec.Resources`.

Kubelet on the other hand will only expand volumes for which `pvc.Status.ResizeStatus` is in `NodeExpansionPending` or `NodeExpansionInProgress` state and `pv.Spec.Cap > pvc.Status.Cap`. If a volume expansion fails in kubelet with a terminal error(which will set `NodeExpansionFailed` state) - then it must wait for resize controller in external-resizer to reconcile the state and put it back in `NodeExpansionPending`.
Kubelet on the other hand will only expand volumes for which `pvc.Status.AllocatedResourceStatus[storage]` is in `NodeExpansionPending` or `NodeExpansionInProgress` state and `pv.Spec.Cap > pvc.Status.Cap`. If a volume expansion fails in kubelet with a terminal error(which will set `NodeExpansionFailed` state) - then it must wait for resize controller in external-resizer to reconcile the state and put it back in `NodeExpansionPending`.

When user reduces `pvc.Spec.Resources`, expansion-controller will set `pvc.Status.AllocatedResources` to lower value only if one of the following is true:

1. If `pvc.Status.ResizeStatus` is `ControllerExpansionFailed` (indicating that previous expansion to last known `allocatedResources` failed with a final error) and previous control-plane has not succeeded.
2. If `pvc.Status.ResizeStatus` is `NodeExpansionFailed` and SP supports node-only expansion (indicating that previous expansion to last known `allocatedResources` failed on node with a final error).
3. If `pvc.Status.ResizeStatus` is `nil` or `empty` and previous `ControllerExpandVolume` has not succeeded.
1. If `pvc.Status.AllocatedResourceStatus[storage]` is `ControllerExpansionFailed` (indicating that previous expansion to last known `allocatedResources` failed with a final error) and previous control-plane has not succeeded.
2. If `pvc.Status.AllocatedResourceStatus[storage]` is `NodeExpansionFailed` and SP supports node-only expansion (indicating that previous expansion to last known `allocatedResources` failed on node with a final error).
3. If `pvc.Status.AllocatedResourceStatus[storage]` is `nil` or `empty` and previous `ControllerExpandVolume` has not succeeded.

![Determining new size](./get_new_size.png)

**Note**: Whenever resize controller or kubelet modifies `pvc.Status` (such as when setting both `AllocatedResources` and `ResizeStatus`) - it is expected that all changes to `pvc.Status` are submitted as part of same patch request to avoid race conditions.
**Note**: Whenever resize controller or kubelet modifies `pvc.Status` (such as when setting both `AllocatedResources` and `pvc.Status.AllocatedResourceStatus`) - it is expected that all changes to `pvc.Status` are submitted as part of same patch request to avoid race conditions.

The complete expansion and recovery flow of both control-plane and kubelet is documented in attached PDF. The attached pdf - documents complete volume expansion flow via state diagrams and is much more exhaustive than the text above.

Expand All @@ -142,9 +182,9 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionInProgress`.
- Expansion to 100Gi fails and hence `pv.Spec.Capacity` and `pvc.Status.Capacity `stays at 10Gi.
- Expansion controller sets `pvc.Status.ResizeStatus` to `ControllerExpansionFailed`.
- Expansion controller sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionFailed`.
- User requests size to 20Gi.
- Expansion controler notices that previous expansion to last known `allocatedresources` failed, so it sets new `allocatedResources` to `20G`
- Expansion succeeds and `pvc.Status.Capacity` and `pv.Spec.Capacity` report new size as `20Gi`.
Expand All @@ -154,31 +194,31 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
- User increases 10Gi PVC to 100Gi by changing - `pvc.spec.resources.requests["storage"] = "100Gi"`
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi`.
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionInProgress`.
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `100G`.
- Expansion controller also sets `pvc.Status.ResizeStatus` to `NodeExpansionPending`.
- Expansion starts on the node and kubelet sets `pvc.Status.ResizeStatus` to `NodeExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeExpansionPending`.
- Expansion starts on the node and kubelet sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeExpansionInProgress`.
- Expansion fails on the node with a final error.
- Kubelet sets `pvc.Status.ResizeStatus` to `NodeExpansionFailed`.
- Since pvc has `pvc.Status.ResizeStatus` set to `NodeExpansionFailed` - kubelet will stop retrying node expansion.
- At this point Kubelet will wait for `pvc.Status.ResizeStatus` to be `NodeExpansionPending`.
- Kubelet sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeExpansionFailed`.
- Since pvc has `pvc.Status.AllocatedResourceStatus['storage']` set to `NodeExpansionFailed` - kubelet will stop retrying node expansion.
- At this point Kubelet will wait for `pvc.Status.AllocatedResourceStatus['storage']` to be `NodeExpansionPending`.
- User requests size to 20Gi.
- Expansion controller starts expanding the volume and sees that last expansion failed on the node and driver does not have control-plane expansion.
- Expansion controller sets `pvc.Status.AllocatedResources` to `20G`.
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionInProgress`.
- Since expansion operations in control-plane are NO-OP, expansion in control-plane succeeds and `pv.Spec` is set to `20G`.
- Expansion succeed on the node with latest `allocatedResources` and `pvc.Status.Size` is set to `20G`.
- Expansion controller also sets `pvc.Status.ResizeStatus` to `NodeExpansionPending`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `NodeExpansionPending`.
- Kubelet can now retry expansion and expansion on node succeeds.
- Kubelet sets `pvc.Status.ResizeStatus` to empty string and `pvc.Status.Capacity` to new value.
- Kubelet sets `pvc.Status.AllocatedResourceStatus['storage']` to empty string and `pvc.Status.Capacity` to new value.
- Quota controller sees a reduction in used quota because `max(pvc.Spec.Resources, pvc.Status.AllocatedResources)` is 20Gi.


##### Case 3 (Malicious user)
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionInProgress`.
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
- While the storage backend is re-sizing the volume, user requests size 20Gi by changing `pvc.spec.resources.requests["storage"] = "20Gi"`
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
Expand All @@ -193,7 +233,7 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
- User expands expands the PVC to 100GB.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `89.9GB` to used quota.
- Expansion controller starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100GB` (before expanding).
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionInProgress`.
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10.1GB until the resize is finished.
- while resize was in progress - expansion controler crashes and loses state.
- User reduces the size of PVC to 10.5GB.
Expand All @@ -207,11 +247,11 @@ The complete expansion and recovery flow of both control-plane and kubelet is do
- User increases 10Gi PVC to 100Gi by changing `pvc.spec.resources.requests["storage"] = "100Gi"`
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `90Gi` to used quota.
- Expansion controller slowly starts expanding the volume and sets `pvc.Status.AllocatedResources` to `100Gi` (before expanding).
- Expansion controller also sets `pvc.Status.ResizeStatus` to `ControllerExpansionInProgress`.
- Expansion controller also sets `pvc.Status.AllocatedResourceStatus['storage']` to `ControllerExpansionInProgress`.
- At this point -`pv.Spec.Capacity` and `pvc.Status.Capacity` stays at 10Gi until the resize is finished.
- While the storage backend is re-sizing the volume, user requests size 200Gi by changing `pvc.spec.resources.requests["storage"] = "200Gi"`
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and adds `100Gi` to used quota.
- Since `pvc.Status.ResizeStatus` is in `ControllerExpansionInProgress` - expansion controller still chooses last `pvc.Status.AllocatedResources` as new size.
- Since `pvc.Status.AllocatedResourceStatus['storage']` is in `ControllerExpansionInProgress` - expansion controller still chooses last `pvc.Status.AllocatedResources` as new size.
- User reduces size back to `20Gi`.
- Quota controller uses `max(pvc.Status.AllocatedResources, pvc.Spec.Resources)` and *returns* `100Gi` to used quota.
- Expansion controller notices that previous expansion to last known `allocatedresources` is still in-progress.
Expand Down
5 changes: 2 additions & 3 deletions keps/sig-storage/1790-recover-resize-failure/kep.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,11 @@ see-also:
replaces:
superseded-by:

latest-milestone: "v1.24"
stage: "beta"
latest-milestone: "v1.27"
stage: "alpha"

milestone:
alpha: "v1.23"
beta: "v1.24"


feature-gates:
Expand Down

0 comments on commit 6705700

Please sign in to comment.