
[VPA] KEP-4902: Delete OOM Pods #4902

Conversation

RuriRyan
Contributor

Which component does this PR apply to?

vertical-pod-autoscaler

What type of PR is this?

/kind documentation

What this PR does / why we need it:

KEP for #4730 and #4898

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 20, 2022
@RuriRyan RuriRyan force-pushed the PSC-2673/vpa_delete_pods_enhancement branch from b4f429f to 240b1fa Compare May 20, 2022 13:38
@RuriRyan RuriRyan changed the title [VPA] Enhancement proposal: Delete OOM Pods [VPA] KEP-4902: Delete OOM Pods May 20, 2022
@RuriRyan RuriRyan force-pushed the PSC-2673/vpa_delete_pods_enhancement branch from 240b1fa to 45a6608 Compare May 20, 2022 13:41
Contributor

@voelzmo voelzmo left a comment

Thanks for the KEP PR, we've also run into CrashLoopBackoff situations before where an existing PDB prevented the VPA from evicting and applying the new recommendations, so I'm really looking forward to this as a potential solution!


## Proposal

The proposal is to add `--experimental-deletion` to the VPA to enable deletion
Contributor

Putting the experimental nature of this feature in the flag's name doesn't seem to be very future-proof to me. Maybe something like `--delete-on-eviction-error` is more descriptive of what this flag enables?

The proposal is to add `--experimental-deletion` to the VPA to enable deletion
of pods. Currently only as an experimental, or beta, feature.
To add a bit of configuration, an additional flag,
`--experimental-deletion-threshold`, should be added.
Contributor

Can you elaborate a bit on what this flag is doing?

@mwielgus
Contributor

cc: @jbartosik

@RuriRyan
Contributor Author

RuriRyan commented Jun 8, 2022

bump


## Proposal

The proposal is to add `--delete-on-eviction-error` to the VPA to enable
Collaborator

Why flags and not API configuration? To make the experiment easier to implement? Do you think there is little benefit to allowing this to be configured on the VPA-object level instead of on the cluster level? Something else?

Contributor Author

To make it easier to implement. Also, in our use case we would enable this for all our VPA resources, so there was no reason to add this to the API.
But in general I'm a fan of making things more configurable. I'll work out a v2 describing the API changes.

Contributor

> Do you think there is little benefit to allowing this to be configured on the VPA-object level instead of on the cluster level?

For us it would also be the case that we'd enable this for specific clusters entirely, not for single VPA objects. Even in the case we would make this configurable per VPA, having a global option (that people could override per VPA) would be preferred – having to touch each and every VPA is kind of cumbersome.

Collaborator

> For us it would also be the case that we'd enable this for specific clusters entirely, not for single VPA objects. Even in the case we would make this configurable per VPA, having a global option (that people could override per VPA) would be preferred – having to touch each and every VPA is kind of cumbersome.

On the other hand, there are people who don't manage their clusters (for example on GKE, but also when many teams share a cluster), and they can't change the value of the flag.

While I understand it might be less convenient in some cases, I think it's more important to make it possible at all.

Contributor

> On the other hand, there are people who don't manage their clusters (for example on GKE, but also when many teams share a cluster), and they can't change the value of the flag.
>
> While I understand it might be less convenient in some cases, I think it's more important to make it possible at all.

That makes perfect sense. Are these two options mutually exclusive or would we want to have a global switch and the option to configure this per VPA? There are already multiple cases where this pattern exists, right?

Contributor Author

> That makes perfect sense. Are these two options mutually exclusive or would we want to have a global switch and the option to configure this per VPA? There are already multiple cases where this pattern exists, right?

Have you seen the update I just pushed yesterday? I also added a field on the VPA resource, in addition to the flag, to the proposal. It's very similar to the minReplicas setting. I think this will be the best solution for everyone.

Collaborator

I think it's better to just do the API field. With both API field and flag we get the following semantics:

* `DeleteOomingOnEvictionError` = true - delete the pod if eviction failed and it's OOMing,
* `DeleteOomingOnEvictionError` = false - don't delete the pod even if it's OOMing and eviction fails (current behavior),
* `DeleteOomingOnEvictionError` not set - talk to your administrator or run a test to see what happens. But it can change.

I think having clean semantics on what happens by default is better than making (one time!) migration easier.

Contributor Author

If it were only a one-time migration it would be OK, but if we think of other open-source projects that potentially have to adjust their deployment mechanisms to support such a setting, it might take a very long time to be done with this. So I would still vote for having both.
Also, asking your cluster administrator or looking into some documentation for your clusters doesn't sound that bad to me.
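For illustration, a minimal Go sketch of the "both" variant being argued for here, with the per-VPA field taking precedence over a cluster-wide flag. The flag name and the helper are hypothetical, not part of the proposal text:

```go
package main

import "flag"

// Cluster-wide default, set by whoever operates the updater.
// The flag name is hypothetical.
var deleteOomingOnEvictionError = flag.Bool(
	"delete-ooming-on-eviction-error", false,
	"allow deleting OOMing pods when eviction fails, unless overridden per VPA")

// deletionAllowed resolves the per-VPA field against the global flag:
// an explicitly set field wins; an unset (nil) field falls back to the
// cluster-wide default.
func deletionAllowed(perVPAField *bool) bool {
	if perVPAField != nil {
		return *perVPAField
	}
	return *deleteOomingOnEvictionError
}
```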

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 20, 2022
@RuriRyan RuriRyan force-pushed the PSC-2673/vpa_delete_pods_enhancement branch from 1b660ae to d2859ee Compare June 20, 2022 15:25
- [Update the eviction API](#update-the-eviction-api)
<!-- /toc -->


Collaborator

Nit: extra empty line (also a few times later)

any further disruptions.

This proposal addresses the problem by allowing users to enable the deletion of
pods as a backup if the eviction fails.
Collaborator

> allowing users to enable the deletion of **OOMing** pods as a backup if the eviction fails.
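For illustration, a minimal sketch of what this eviction-with-deletion-fallback could look like in the updater, using real client-go calls but hypothetical function and helper names:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictOrDelete tries the eviction API first and falls back to a plain
// delete only when eviction is blocked (a PDB violation surfaces as
// HTTP 429) and the pod is OOMing.
func evictOrDelete(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, allowDeletion bool) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	}
	err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
	if err == nil {
		return nil
	}
	if allowDeletion && apierrors.IsTooManyRequests(err) && isOOMing(pod) {
		return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}
	return err
}

// isOOMing stands in for however the updater detects a stuck pod; here
// it just checks the last termination reason of each container.
func isOOMing(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
			return true
		}
	}
	return false
}
```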


Instead of implementing this change on the client side (the VPA in this case),
it could be implemented on the API side. This would have the advantage that it
would work for all clients. On the other hand this would introduce breaking
Collaborator

Why would doing this change in the PDB be a breaking change? I think you could just add a new field, with the default being the current behavior, and it would be fine.

updater to enable the new feature globally.

Additionally, a new field in the VPA resource
(`Spec.UpdatePolicy.DeleteOnEvictionError`) which takes precedence over the
Collaborator

I don't think `DeleteOnEvictionError` is a good name - we won't simply delete on error. `DeleteOomingOnEvictionError`?
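Sketched in Go, the renamed field could sit next to the existing update-policy fields; this is illustrative only, not the merged API:

```go
package v1

// UpdateMode mirrors the existing VPA API type.
type UpdateMode string

// PodUpdatePolicy shows where the new field could live; only the last
// field is new, and its name follows the suggestion above.
type PodUpdatePolicy struct {
	// Existing fields, shown for context.
	UpdateMode  *UpdateMode `json:"updateMode,omitempty"`
	MinReplicas *int32      `json:"minReplicas,omitempty"`

	// DeleteOomingOnEvictionError, when true, lets the updater delete an
	// OOMing pod after a failed eviction. Using *bool keeps three-valued
	// semantics: nil defers to the cluster-wide default.
	DeleteOomingOnEvictionError *bool `json:"deleteOomingOnEvictionError,omitempty"`
}
```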


@jbartosik
Collaborator

One more thing: we probably should only delete OOMing pods if we plan to increase their memory request.

We have information about the current request and about the recommendation target in the updater, so it's doable. I'm not sure how much work it will be. But I think we should do it (or at least note to do it later).

@RuriRyan
Contributor Author

RuriRyan commented Jul 1, 2022

> One more thing: we probably should only delete OOMing pods if we plan to increase their memory request.
>
> We have information about the current request and about the recommendation target in the updater, so it's doable. I'm not sure how much work it will be. But I think we should do it (or at least note to do it later).

Line https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/updater/logic/updater.go#L260 returns a list of pods and could potentially be changed to return more information. But this is only a surface-level view and more or less the first thing I found.

I'll add a section describing this and try out a couple of implementations to see how easy it is; if it gets too complicated, I'll leave a note.
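A rough Go sketch of that check, simplifying the recommendation to one ResourceList per pod (the real updater tracks per-container recommendations) and with a made-up function name:

```go
package main

import corev1 "k8s.io/api/core/v1"

// shouldDeleteOoming returns true only when the recommended memory
// target is higher than some container's current request, i.e. when
// deleting the pod would actually give it more memory afterwards.
func shouldDeleteOoming(pod *corev1.Pod, recommended corev1.ResourceList) bool {
	target := recommended[corev1.ResourceMemory]
	for _, c := range pod.Spec.Containers {
		current := c.Resources.Requests[corev1.ResourceMemory]
		if target.Cmp(current) > 0 {
			return true
		}
	}
	return false
}
```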

@jbartosik
Collaborator

Hi,
I'm back. It looks like we're stuck a bit on whether we should have just the API change, or the API change and a flag. I'd like to ask for help.

@RuriRyan @voelzmo Can you make it to the SIG meeting today? If you can, maybe we can resolve it over there? Sorry for asking at the last moment.

Collaborator

@jbartosik jbartosik left a comment

/kind api-change

Looks good to me.

I think we need someone to check the API.

@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 25, 2022
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

Contributor

@mwielgus mwielgus left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 30, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik, mwielgus, RuriRyan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit ef2d9e7 into kubernetes:master Aug 30, 2022
navinjoy pushed a commit to navinjoy/autoscaler that referenced this pull request Oct 26, 2022
…te_pods_enhancement

[VPA] KEP-4902: Delete OOM Pods