PodDisruptionBudget causing inability to gracefully drain a node with tekton-pipelines-webhook pod #3654
Comments
This definitely does sound like something we need to document and communicate to operators better. The PodDisruptionBudget was put in place to prevent users from being unable to create PipelineRuns/TaskRuns/etc. when the webhook Pod is unavailable, e.g., during an upgrade. Unfortunately, the way it did this basically turned into "never let Kubernetes make the Pod unavailable" 😬 There are basically two options, I think:
I think the bug here is to, at the very least, document this behavior for operators. We might want to make minAvailable 50% so that only two replicas are needed to trigger the rolling upgrade behavior. |
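To make the arithmetic behind this concrete, here is a minimal sketch of how a percentage-based minAvailable turns into a number of allowed voluntary disruptions. It assumes the documented Kubernetes behavior of rounding a percentage up to the next whole Pod; the function name `allowed_disruptions` is made up for illustration and is not part of any Kubernetes or Tekton API.

```python
def allowed_disruptions(healthy_pods: int, min_available_percent: int) -> int:
    """Estimate how many Pods may be evicted under a percentage minAvailable.

    Assumption: the percentage is rounded UP to a whole number of Pods,
    as the Kubernetes PodDisruptionBudget documentation describes.
    """
    # Integer ceiling division, avoiding floating-point rounding surprises.
    desired_healthy = -((-healthy_pods * min_available_percent) // 100)
    # Evictions are allowed only while healthy Pods exceed the required minimum.
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    for replicas, percent in [(1, 80), (2, 80), (1, 50), (2, 50)]:
        print(
            f"replicas={replicas} minAvailable={percent}% "
            f"-> allowed disruptions: {allowed_disruptions(replicas, percent)}"
        )
```

Under that assumption, a single replica allows zero disruptions at both 80% and 50%, so a drain blocks either way; with 50%, two replicas are enough for one eviction to proceed.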
1 and/or 2 are what I did to get around it, but it burned a day or so w/ our k8s PaaS support trying to triage why the upgrade of the cluster was halted/stuck due to this. |
Agreed, it's more of a documentation heads-up, clear-warning kind of thing. |
In my opinion, defaults should be reasonable. The current default values cause trouble when operating a cluster and require manual intervention during normal operations like draining nodes. In this case I see these options:
|
I agree, the default configuration should be safe and easy to use for basic usage. cc @afrittoli Worth noting that Knative seems to have the same configuration (autoscaling, HPA), I wonder if they've got the same issue, or whether they have a solution: https://github.com/knative/serving/blob/master/config/core/deployments/webhook-hpa.yaml added in knative/serving#9444 |
The Knative team is handling this with the operator, by patching the PDB on the fly. |
This issue also seems to be related: knative/operator#376 If we change our PDB to |
@nak3 seems to be responsible for both knative/operator#376 and the In general I'd prefer to solve this by making Tekton's default configuration easy to use even if it means being slightly less HA-capable, and then document HA and maybe make operators responsible for automating it. Anyway, curious for @nak3's thoughts. |
Yes, we use |
@abhinavkhanna-sf |
Turns out, #3784 probably won't actually fix the issue. 😢 Based on investigation from @nikhil-thomas (🙇♂️ ), even with We basically have two options in the near-term:
Given those options, I think (2) is the least surprising/painful to users, and I'll send a PR today to Make It So. Longer-term, this is something the operator can handle for us, in a couple possible ways:
In any case, these operator changes will be more involved than just documentation, so they probably deserve a TEP of their own, and won't happen soon. |
Great adventure!
This looks like a good first step! |
Expected Behavior
It is possible to gracefully drain nodes that contain Tekton pods.
Actual Behavior
With a single running tekton-pipelines-webhook pod (which is possible due to tekton-pipelines-webhook: HorizontalPodAutoscaler.spec.minReplicas: 1) it's impossible to gracefully drain a node because the PodDisruptionBudget is set to minAvailable: 80%.
Steps to Reproduce the Problem
tekton-pipelines-webhook pod (scaled down)
tekton-pipelines-webhook pod
Additional Info