Disable Webhook PDB by default, document enabling it #3787

Merged (1 commit) on Mar 10, 2021

Conversation

imjasonh
Member

Fixes #3654

Changes

This change disables PodDisruptionBudget for the webhook deployment, and documents how to re-enable it in docs/enabling-ha. It also makes some edits to enabling-ha.md to streamline and recommend best practices.

/kind bug

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • [n] Includes tests (if functionality changed/added)
  • [y] Includes docs (if user facing)
  • [y] Commit messages follow commit message best practices
  • [y] Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Disable PodDisruptionBudget for the webhook deployment by default

cc @vdemeester @nikhil-thomas @bobcatfish

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Feb 24, 2021
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 24, 2021
@imjasonh changed the title from "Disable HA webhook by default, document enabling it" to "Disable Webhook PDB by default, document enabling it" on Feb 24, 2021
@imjasonh
Member Author

cc @raballew as well, who added the PDB in #3391

@nikhil-thomas
Member

/lgtm

@tekton-robot
Collaborator

@nikhil-thomas: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@vdemeester vdemeester left a comment

/meow
Thanks for doing this @imjasonh (and @nikhil-thomas for the exploration)

@tekton-robot
Collaborator

@vdemeester: cat image


@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2021
@zhangtbj
Contributor

zhangtbj commented Mar 2, 2021

Hi Jason,

A quick question: would it be possible to remove the PDB setting from the Tekton deployment and instead document how to configure it? :)

Then users can choose to enable or disable it themselves.

We enable PDB in our production Tekton environment. If MinAvailable is set to 0 or some other value by default, I am afraid it may override our existing settings, or those of other users.

@sbose78
Contributor

sbose78 commented Mar 5, 2021

I see @zhangtbj's point here. Shipping the hpa manifests might override existing values. We may consider skipping the hpa manifests altogether and including documentation on them instead?

@vdemeester
Member

I see @zhangtbj's point here. Shipping the hpa manifests might override existing values. We may consider skipping the hpa manifests altogether and including documentation on them instead?

And move that knowledge/management to the operator 👼🏼

@pritidesai pritidesai added this to the Pipelines 0.22 milestone Mar 9, 2021
@imjasonh
Member Author

imjasonh commented Mar 9, 2021

Sorry for letting this PR slip through the cracks. Let's get this in before 0.22.

To clarify, the specific ask is to remove the PDB from webhook-hpa.yaml from the default Tekton installation bundle, and instead document how to enable it, with example YAML in the docs. Does that sound right to you @zhangtbj @sbose78 ?
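
For reference, a PodDisruptionBudget along the lines discussed here might look like the sketch below. This is an illustration, not the YAML that ended up in the docs: the resource name, namespace, and label selector are assumptions and should be checked against the labels on the installed webhook Deployment. It uses `policy/v1`; clusters older than Kubernetes 1.21 would need `policy/v1beta1`.

```yaml
# Hypothetical PDB for the webhook. Apply it only after scaling the
# webhook Deployment to more than one replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tekton-pipelines-webhook   # assumed name
  namespace: tekton-pipelines      # assumed install namespace
spec:
  # At least one webhook pod must stay up during voluntary disruptions.
  minAvailable: 1
  selector:
    matchLabels:
      # Assumed labels; these must match the webhook pods' actual labels.
      app.kubernetes.io/name: webhook
      app.kubernetes.io/part-of: tekton-pipelines
```

With a single replica, `minAvailable: 1` permits zero voluntary disruptions, so a node drain blocks on the webhook pod, which is the behavior reported in #3654. Pairing the PDB with two or more replicas avoids that.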

@sbose78
Contributor

sbose78 commented Mar 9, 2021 via email

@imjasonh
Member Author

imjasonh commented Mar 9, 2021

That's right, Jason.

Done. 👍

@pritidesai
Member

@sbose78 please help review the changes 🙏 (looking for /lgtm 😉 )

@sbose78
Contributor

sbose78 commented Mar 9, 2021 via email

@tekton-robot
Collaborator

@sbose78: changing LGTM is restricted to collaborators

In response to this:

/lgtm


@pritidesai
Member

Thanks @sbose78, we will have to add you to the org. I will create a separate PR in the community repo.

@pritidesai
Member

You should get the /lgtm privilege after this PR in the community repo is merged 😄 Until then,

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2021
@sbose78
Contributor

sbose78 commented Mar 9, 2021 via email

@tekton-robot tekton-robot merged commit 10870df into tektoncd:master Mar 10, 2021
@zhangtbj
Contributor

Cool, thank you Jason! :)

imjasonh added a commit to imjasonh/pipeline that referenced this pull request Jul 28, 2021
The safe-to-evict annotation tells the cluster autoscaler whether the
pod can be evicted to allow the node it's on to scale down.

This was set to false (by me!) 2 years ago in tektoncd@fc6ef39
to prevent service unreliability during scale-down events. If no
webhook replicas are available, users can't create/update/delete
Tekton objects; if no controller replicas are available, status updates
from Pod events, etc., won't be processed.

Unfortunately, blocking node eviction means the node that the pod(s) get
scheduled to can't be scaled down. Furthermore, the nodes can't be fully
drained when updating the cluster. This can leave a cluster in a
mid-upgrade state that can make issues difficult to diagnose and reason
about.

With this change, a cluster scale-down event might cause temporary
service unreliability with the default single-replica configuration. As
with tektoncd#3787 if a user/operator wants to prevent this, they should
configure more replicas for HA.
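
As a rough illustration of the two knobs this commit message mentions, the fragment below sketches a strategic-merge patch for the webhook Deployment. It is not a complete manifest, and the Deployment name, namespace, and replica count are assumptions. The commit message does not say whether the annotation was removed or set to "true"; the explicit form is shown here only to make the setting visible.

```yaml
# Hypothetical patch fragment, not a full Deployment manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tekton-pipelines-webhook   # assumed name
  namespace: tekton-pipelines      # assumed install namespace
spec:
  # More than one replica, so that evicting a single pod does not take
  # the webhook offline (value chosen for illustration).
  replicas: 3
  template:
    metadata:
      annotations:
        # Lets the cluster autoscaler evict this pod when scaling down.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```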
tekton-robot pushed a commit that referenced this pull request Jul 29, 2021
vdemeester pushed a commit to openshift/tektoncd-pipeline that referenced this pull request Aug 17, 2021

(cherry picked from commit 5350069)
Signed-off-by: Vincent Demeester <vdemeest@redhat.com>
Development

Successfully merging this pull request may close these issues.

PodDisruptionBudget causing inability to gracefully drain a node with tekton-pipelines-webhook pod
7 participants