Adds feature to delete crashlooping pods after eviction fails #4898
Conversation
This PR adds a new experimental feature, enabled via a command-line flag, which attempts a pod deletion should the eviction fail for whatever reason. The deletion is bound to various conditions so as not to blindly delete a pod where the eviction failure might be justified. In general, the deletion only occurs if at least one container was OOMKilled, is currently in CrashLoopBackOff, and the number of restarts exceeds the configured threshold (also behind a flag).
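For readers skimming the thread, here is a minimal sketch of the kind of eligibility check described above. The helper name, package, and exact field checks are assumptions for illustration, not the PR's actual code:

```go
package updater

import apiv1 "k8s.io/api/core/v1"

// qualifiesForDeletion sketches the eligibility test described in the PR:
// at least one container must currently be in CrashLoopBackOff, must have
// been OOMKilled on its last termination, and must have restarted more
// often than the configured threshold.
func qualifiesForDeletion(pod *apiv1.Pod, restartThreshold int32) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		crashLooping := cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff"
		oomKilled := cs.LastTerminationState.Terminated != nil &&
			cs.LastTerminationState.Terminated.Reason == "OOMKilled"
		if crashLooping && oomKilled && cs.RestartCount > restartThreshold {
			return true
		}
	}
	return false
}
```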
Welcome @RuriRyan!
Thanks. I think it would be good to have an enhancement proposal for this. In the enhancement proposal I'd like to see:
I was writing in a hurry. Let me try once more. Thanks for sharing this. I'll take a look when I have some time (I expect next week). Reading the PR will help me understand how much complexity implementing this would add to VPA (which is one concern). I'd still like to understand in some more detail why implementing this as part of PDB is a problem. If we could have support for evicting pods in PDB, that would be better. But that's "if". On the other hand, if implementing this in PDB is not an option, I'd like to know why, but I guess it makes sense to make an improvement in VPA.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: RuriRyan The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/kind api-change
/kind documentation
LGTM
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
I meant to leave those comments on the PR with the KEP
@jbartosik I'm considering this as "done" now and am patiently awaiting your review :D
Thanks, I did a first pass, mostly focusing on what was hard for me to read.
Did you test if this works as intended end-to-end?
@@ -383,6 +383,11 @@ spec:
        pods. If not specified, all fields in the `PodUpdatePolicy` are
        set to their default values.
      properties:
        deleteOomingOnEvictionError:
          description: Wheather to try to delete the pod when eviction fails
s/Wheather/Whether/
Or maybe rewrite to something simpler, like:
When true VPA will try to delete OOMing pods when eviction fails. When False it won't do that.
@@ -5,3 +5,5 @@ updater-arm64
updater-arm
updater-ppc64le
updater-s390x
Please drop the empty line
@@ -26,6 +26,9 @@ Threshold for evicting pods is specified by recommended min/max values from VPA
Priority of evictions within a set of replicated pods is proportional to sum of percentages of changes in resources
(i.e. pod with 15% memory increase 15% cpu decrease recommended will be evicted
before pod with 20% memory increase and no change in cpu).
* Deleting pods if the eviction fails and if the corresponding feature flag (`--experimental-deletion`) is enabled.
  Deletion is guarded by a treshold (`--experimental-deletion-threshold`) of how many restarts are required. The pod
  needs to be in `CrashLoopBackOff` and the LastTerminationReason needs to be `OOMKilled`.
LastTerminationReason?
deleted = true
eventRecorder.Event(podToEvict, apiv1.EventTypeNormal, "DeletedByVPA",
	"Pod was deleted by VPA Updater to apply resource recommendation.")
break
If I read this loop right it will attempt to delete 0 or 1 containers.
- 0 containers if canDelete doesn't return true for any container,
- 1 container otherwise (returns error if delete fails, breaks if it succeeds)
Is this the intended behavior (as opposed to attempting to delete all containers until we run into an error)?
If so I'd make it into its own function that returns (bool, error)
to make this easier to read.
It is the intended behaviour, because we cannot delete individual containers, but only the whole pod. So as soon as the pod is deleted, there is no reason to continue for the rest of the containers in the pod.
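If the loop were pulled out into its own function returning (bool, error), as suggested above, it could look roughly like this. This is only a sketch: `canDelete`'s real signature and how the delete call is wired in the PR may differ, so both are passed in as parameters here:

```go
package updater

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// attemptDeleteOnce deletes the whole pod as soon as any single container
// qualifies, and reports whether a delete was attempted. Because deletion
// removes the entire pod, there is nothing left to do for the remaining
// containers once it succeeds.
func attemptDeleteOnce(pod *apiv1.Pod, canDelete func(apiv1.ContainerStatus) bool,
	deletePod func(*apiv1.Pod, record.EventRecorder) error,
	eventRecorder record.EventRecorder) (bool, error) {
	for _, cs := range pod.Status.ContainerStatuses {
		if !canDelete(cs) {
			continue
		}
		if err := deletePod(pod, eventRecorder); err != nil {
			return true, err // a delete was attempted but failed
		}
		return true, nil // pod deleted; stop iterating over its containers
	}
	return false, nil // no container qualified, nothing attempted
}
```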
// EvictViaDelete sends deletion instruction to api client. Returns error if pod cannot be deleted or if client returned error
// Does not check if pod was actually deleted after termination grace period.
func (e *podsEvictionRestrictionImpl) EvictViaDelete(podToEvict *apiv1.Pod, eventRecorder record.EventRecorder) error {
	cr, present := e.podToReplicaCreatorMap[getPodID(podToEvict)]
At first I was confused why we set cr here but use it much later. Please add a comment like:
// Make sure that we can map pod to replica so any problems prevent deleting the pod.
for _, pod := range pods[:1] {
	err := eviction.EvictViaDelete(pod, test.FakeEventRecorder())
	assert.Nil(t, err, "Should evict with no error")
Please use assert.NoError
}
for _, pod := range pods[1:] {
	err := eviction.EvictViaDelete(pod, test.FakeEventRecorder())
	assert.Error(t, err, "Error expected")
Why do we expect error here?
This is effectively just a copy & paste of the unit test for the "normal" eviction. I found out that only one eviction is allowed per "replica controller" for every "tick" of the updater. So when multiple pods of the same replica set should get evicted, it throws an error until the next tick.
// only try to delete the pod if the feature is enabled and if we would increase
// the resource requests or limits
if priority.ScaleUp && DeleteOomingOnEvictionError {
I'd add a function that returns bool and rewrite the condition as
if attemptEvictViaDelete(&priority, &vpa.Spec) {
//...
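A sketch of what such a helper could look like follows. The names are illustrative only; the reviewer's version also passes the priority struct itself, and the real PR might additionally consult a per-VPA setting from the spec:

```go
package updater

import (
	vpa_types "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

// deleteOomingOnEvictionError would mirror the --delete-ooming-on-eviction-error flag.
var deleteOomingOnEvictionError bool

// attemptEvictViaDelete keeps the fallback condition out of the main update
// loop: delete only when the recommendation would scale the pod up and the
// experimental feature is enabled. The spec is passed in so a per-VPA
// override (e.g. updatePolicy.deleteOomingOnEvictionError) could be
// consulted here as well.
func attemptEvictViaDelete(scaleUp bool, spec *vpa_types.VerticalPodAutoscalerSpec) bool {
	return scaleUp && deleteOomingOnEvictionError
}
```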
@@ -65,6 +65,8 @@ var (
namespace = os.Getenv("NAMESPACE")
vpaObjectNamespace = flag.String("vpa-object-namespace", apiv1.NamespaceAll, "Namespace to search for VPA objects. Empty means all namespaces will be used.")
deleteOomingOnEvictionError = flag.Bool("delete-ooming-on-eviction-error", false, "If true, updater will try to delete ooming pods when the eviction fails.")
I think there should be "by default" somewhere in the flag description
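Concretely, the flag line could read something like the following (just one possible wording, not the PR's final text):

```go
deleteOomingOnEvictionError = flag.Bool("delete-ooming-on-eviction-error", false,
	"If true, updater will try to delete OOMing pods when the eviction fails. Disabled by default.")
```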
prometheus.CounterOpts{
	Namespace: metricsNamespace,
	Name:      "deleted_pods_total",
	Help:      "Number of Pods delete by Updater to apply a new recommendation.",
Number of Pods delete**d**
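For reference, the corrected counter definition would look something like this; the variable name is an assumption and `metricsNamespace` is the constant from the quoted snippet above:

```go
var deletedCount = prometheus.NewCounter(
	prometheus.CounterOpts{
		Namespace: metricsNamespace,
		Name:      "deleted_pods_total",
		Help:      "Number of Pods deleted by Updater to apply a new recommendation.",
	},
)
```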
/hold
While I'm working on VPA 0.12.0 I don't want to make any changes to VPA that are not related to the release (I think the code is good to go, but E2E tests are timing out). I'll remove the hold after I release VPA 0.12.0.
/hold cancel
@RuriRyan I see this has a few open comments, mostly small things.
@jbartosik AFAICT all open review comments have either been fixed or answered.
We'd also love to see this PR move forward and eventually get merged. /cc @jbartosik
@timoreimann looks good to me. There's one thing remaining - checking if this works end to end. Did you run any (manual?) tests to verify this works as intended? Can you provide an e2e test that will ensure this keeps working? I thought that I could take the OOM test scenario and modify it:
I'm not sure when I'll be able to do that. I'm worried about merging this without any verification that this works as intended end to end.
@timoreimann I want to merge this but need some verification that this works e2e. I see the following options:
I can contribute e2e tests but I'm not sure when. I don't think 2. is significantly faster than 1.
Writing an e2e test sounds like the right thing to do. I'm happy to look into that when I have some spare time unless someone beats me to it.
I just saw the KEP for handling this via PDB. We also saw the original issue happen quite a few times and would want to get this fixed, but the change to PDB seems to solve the problem in a more "k8s native" way instead of working around it with the delete call. @jbartosik @avorima @timoreimann wdyt?
@voelzmo thanks for sharing the KEP. I wasn't aware of it, this does sound like the right approach going forward. That said, I think there's value in driving this PR to completion regardless: the solution in the KEP will only be available in Kubernetes 1.26 in alpha. Users on older versions will still want to have some kind of solution (at least we do given a large portion of our fleet is affected by the problem regularly). Just my 2 cents of course, final call is to be made by @jbartosik. Coincidentally, I have started working on extending the end-to-end tests as proposed just earlier today, so I could probably add the last missing piece fairly soon.
On second thought, the feature from this PR will likely only be backported so far (if at all), and/or more recent versions of VPA may be incompatible with older Kubernetes versions? So maybe there is only so much value users of older Kubernetes releases could get out of this PR? I'm not super familiar with VPA's/CA's release policy and compatibility guarantees, so throwing out additional thoughts mostly. @jbartosik still to the rescue for more clarity and preferences. 🙂 EDIT: had to remind myself that VPA is versioned separately from CA and based on its own CRD, so perhaps my second-thought concerns aren't as concerning after all, and continuing with the PR would still be beneficial.
Thanks for sharing @voelzmo, that looks promising. This PR and the VPA KEP were just created because, at the time, it seemed like this was never going to make it into the PDB spec, so I would actually be fine with closing it now. The short time that it would be useful would probably be outweighed by the effort it takes to maintain it. Only speaking for myself of course.
I'm happy to use PDB features instead of working around PDB in VPA.
Ok, what about the KEP? I'm not familiar with the process, but I suppose it needs to be removed or updated to document this decision.
I added an item about this to the next SIG meeting.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/close
@avorima: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which component this PR applies to?
vertical-pod-autoscaler
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a new experimental feature, which can be enabled via
a command line flag which attempts a pod deletion should the eviction
fail for whatever reason.
The deletion is bound to various conditions to not blindly delete a pod
where the eviction failure might be justified.
In general the deletion only occurs if at least one container was
OOMKilled and is currently in CrashLoopBackOff and the number of
restarts exceeds the configured threshold (also behind a flag).
Which issue(s) this PR fixes:
Fixes #4730
Special notes for your reviewer:
Does this PR introduce a user-facing change?