[VPA] KEP-4902: Delete OOM Pods #4902
Thanks for the KEP PR, we've also run into `CrashLoopBackOff` situations before where an existing PDB prevented the VPA from evicting and applying the new recommendations, so I'm really looking forward to this as a potential solution!
## Proposal

The proposal is to add `--experimental-deletion` to the VPA to enable deletion
Putting the `experimental` nature of this feature in the flag's name doesn't seem to be very future-proof to me. Maybe something like `--delete-on-eviction-error` is more descriptive of what this flag enables?
The proposal is to add `--experimental-deletion` to the VPA to enable deletion
of pods. Currently only as an experimental, or beta feature.
To add a bit of configuration, an additional flag,
`--experimental-deletion-threshold`, should be added.
Can you elaborate a bit on what this flag does?
vertical-pod-autoscaler/enhancements/4902-delete-oom-pods/README.md
…ME.md Co-authored-by: Marco Voelz <voelzmo@users.noreply.github.com>
cc: @jbartosik
bump
## Proposal

The proposal is to add `--delete-on-eviction-error` to the VPA to enable
Why flags rather than API configuration? To make the experiment easier to implement? Do you think there is little benefit in allowing this to be configured at the VPA-object level instead of at the cluster level? Something else?
To make it easier to implement. Also, in our use case we would enable this for all our VPA resources, so there was no reason to add this to the API.
But in general I'm a fan of making things more configurable. I'll work out a v2 describing the API changes.
> Do you think there is little benefit in allowing this to be configured at the VPA-object level instead of at the cluster level

For us it would also be the case that we'd enable this for specific clusters entirely, not for single VPA objects. Even in the case we would make this configurable per VPA, having a global option (that people could override per VPA) would be preferred – having to touch each and every VPA is kind of cumbersome.
> For us it would also be the case that we'd enable this for specific clusters entirely, not for single VPA objects. Even in the case we would make this configurable per VPA, having a global option (that people could override per VPA) would be preferred – having to touch each and every VPA is kind of cumbersome.

On the other hand, there are people who don't manage their clusters (for example on GKE, but also when many teams share a cluster), and then they can't change the value of the flag.
While I understand it might be less convenient in some cases, I think it's more important to make it possible at all.
> On the other hand, there are people who don't manage their clusters (for example on GKE, but also when many teams share a cluster), and then they can't change the value of the flag.
> While I understand it might be less convenient in some cases, I think it's more important to make it possible at all.

That makes perfect sense. Are these two options mutually exclusive or would we want to have a global switch and the option to configure this per VPA? There are already multiple cases where this pattern exists, right?
> That makes perfect sense. Are these two options mutually exclusive or would we want to have a global switch and the option to configure this per VPA? There are already multiple cases where this pattern exists, right?

Have you seen the update I pushed yesterday? I also added a field for the VPA resource to the proposal, in addition to the flag. It's very similar to the `minReplicas` setting. I think this will be the best solution for everyone.
I think it's better to just do the API field. With both API field and flag we get the following semantics:
- `DeleteOomingOnEvictionError = true`: delete pod if eviction failed and it's OOMing,
- `DeleteOomingOnEvictionError = false`: don't delete pod even if it's OOMing and eviction fails (current behavior),
- `DeleteOomingOnEvictionError` not set: talk to your administrator or run a test to see what happens. But it can change.
I think having clean semantics on what happens by default is better than making (one time!) migration easier.
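
For illustration, the three cases discussed in this thread could be resolved roughly as in the sketch below. This is a minimal sketch, not the VPA implementation: the function name `resolveDeleteOnEvictionError` is hypothetical, the per-VPA field is modelled as a `*bool` (with `nil` meaning "not set"), and `globalDefault` stands in for the value of the proposed updater flag.

```go
package main

import "fmt"

// resolveDeleteOnEvictionError returns the effective setting for one VPA.
// perVPA models the hypothetical per-object field (nil means "not set");
// globalDefault models the hypothetical cluster-wide updater flag.
func resolveDeleteOnEvictionError(perVPA *bool, globalDefault bool) bool {
	if perVPA != nil {
		// An explicit per-VPA value always wins over the flag.
		return *perVPA
	}
	// Not set: the outcome depends on how the administrator set the flag.
	return globalDefault
}

func main() {
	enabled, disabled := true, false
	for _, global := range []bool{false, true} {
		fmt.Printf("flag=%v: set true -> %v, set false -> %v, not set -> %v\n",
			global,
			resolveDeleteOnEvictionError(&enabled, global),
			resolveDeleteOnEvictionError(&disabled, global),
			resolveDeleteOnEvictionError(nil, global))
	}
}
```

Only the "not set" case varies with the flag, which is the ambiguity the comment above points out. With an API field alone, the `nil` case would have one fixed, documented default.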
If it were only a one-time migration it would be OK, but if we think of other open-source projects that potentially have to adjust their deployment mechanism to support such a setting, it might take a very long time to get this done. So I would still vote for having both.
Also, asking your cluster administrator or looking into some documentation for your clusters doesn't sound that bad to me.
- [Update the eviction API](#update-the-eviction-api)
<!-- /toc -->


Nit: extra empty line (also a few times later)
any further disruptions.

This proposal addresses the problem by allowing users to enable the deletion of
pods as a backup if the eviction fails.
allowing users to enable the deletion of OOMing pods as a backup if the eviction fails.
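
The "deletion as a backup" flow in the suggestion above might look roughly like the following sketch. It assumes a hypothetical `podDisruptor` interface in place of the real Kubernetes eviction and delete calls; `evictOrDelete`, `fakeDisruptor`, and `errPDB` are illustrative names that do not come from the KEP or the VPA code base.

```go
package main

import (
	"errors"
	"fmt"
)

// podDisruptor is a hypothetical abstraction over the two API calls involved.
type podDisruptor interface {
	Evict(pod string) error
	Delete(pod string) error
}

// evictOrDelete tries a normal eviction first. Only if that fails, the
// feature is enabled, and the pod is OOMing does it fall back to deletion.
func evictOrDelete(d podDisruptor, pod string, oomKilled, deleteEnabled bool) (string, error) {
	err := d.Evict(pod)
	if err == nil {
		return "evicted", nil
	}
	if !deleteEnabled || !oomKilled {
		// Current behavior: report the eviction failure and leave the pod alone.
		return "", fmt.Errorf("evicting %s: %w", pod, err)
	}
	if delErr := d.Delete(pod); delErr != nil {
		return "", fmt.Errorf("deleting %s after failed eviction: %w", pod, delErr)
	}
	return "deleted", nil
}

// fakeDisruptor simulates an API server for the example.
type fakeDisruptor struct {
	evictErr error
	deleted  []string
}

func (f *fakeDisruptor) Evict(string) error { return f.evictErr }

func (f *fakeDisruptor) Delete(pod string) error {
	f.deleted = append(f.deleted, pod)
	return nil
}

var errPDB = errors.New("eviction blocked by PodDisruptionBudget")

func main() {
	d := &fakeDisruptor{evictErr: errPDB}
	action, err := evictOrDelete(d, "web-0", true, true)
	fmt.Println(action, err)
}
```

Keeping eviction as the first attempt means that a PodDisruptionBudget is still honored whenever it can be, and deletion stays limited to the OOMing case.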
Instead of implementing this change on the client side, the VPA in this case,
it could be implemented on the API side. This would have the advantage that it
would work for all clients. On the other hand this would introduce breaking
Why would making this change in the PDB be a breaking change? I think you could just add a new field, with the default being the current behavior, and it would be fine.
updater to enable the new feature globally.

Additionally a new field in the VPA resource
(`Spec.UpdatePolicy.DeleteOnEvictionError`) which takes precedence to the
I don't think `DeleteOnEvictionError` is a good name - we won't simply delete on error. `DeleteOomingOnEvictionError`?
## Proposal

The proposal is to add `--delete-on-eviction-error` to the VPA to enable
One more thing: we should probably only delete OOMing pods if we plan to increase their memory request. We have information about the current request and the recommendation target in the updater, so it's doable. I'm not sure how much work it will be, but I think we should do it (or at least add a note to do it later).
The line at https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/updater/logic/updater.go#L260 returns a list of pods and could potentially be changed to return more information. But this is only a surface-level view and more or less the first thing I found. I'll add a section describing this and try out a couple of implementations to see how easy it is; if it gets too complicated I'll leave a note.
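
The extra condition raised in this thread (only delete when the recommendation would actually raise the memory request) could be expressed as a simple predicate. The sketch below is an assumption-laden illustration: `podMemoryState` and `shouldDeleteAfterFailedEviction` are hypothetical names, memory is reduced to a single value per pod, and the real updater would have to consider each container and read these values from the pod spec and the VPA recommendation.

```go
package main

import "fmt"

// podMemoryState is a hypothetical summary of what the updater knows about
// one pod: whether it was OOM-killed, its current memory request, and the
// recommended target, both in bytes.
type podMemoryState struct {
	OOMKilled    bool
	RequestBytes int64
	TargetBytes  int64
}

// shouldDeleteAfterFailedEviction reports whether deleting the pod is
// worthwhile: the feature must be enabled, the pod must be OOMing, and the
// recommendation must raise the memory request.
func shouldDeleteAfterFailedEviction(enabled bool, p podMemoryState) bool {
	return enabled && p.OOMKilled && p.TargetBytes > p.RequestBytes
}

func main() {
	const mib = int64(1) << 20
	pods := []podMemoryState{
		{OOMKilled: true, RequestBytes: 256 * mib, TargetBytes: 512 * mib},
		{OOMKilled: true, RequestBytes: 512 * mib, TargetBytes: 512 * mib},
		{OOMKilled: false, RequestBytes: 256 * mib, TargetBytes: 512 * mib},
	}
	for _, p := range pods {
		fmt.Printf("%+v -> delete: %v\n", p, shouldDeleteAfterFailedEviction(true, p))
	}
}
```

Without the target comparison, a pod could be deleted and recreated with the same memory request, which would disrupt the workload without fixing the OOM.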
/kind api-change
Looks good to me.
I think we need someone to check API.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jbartosik, mwielgus, RuriRyan
…te_pods_enhancement [VPA] KEP-4902: Delete OOM Pods
Which component this PR applies to?
vertical-pod-autoscaler
What type of PR is this?
/kind documentation
What this PR does / why we need it:
KEP for #4730 and #4898