Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replicate webhook pods for resiliency #3391

Merged
merged 4 commits into from
Nov 5, 2020
Merged

Replicate webhook pods for resiliency #3391

merged 4 commits into from
Nov 5, 2020

Conversation

raballew
Copy link
Contributor

@raballew raballew commented Oct 15, 2020

Changes

Closes #3386

  • Avoid node becoming single point of failure for webhook
  • Ensure availability at node failure or high load
  • Protect webhook in case of disruption

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Add HorizontalPodAutoscaler, PodDisruptionBudget and podAntiAffinity to webhook

@tekton-robot tekton-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 15, 2020
@tekton-robot tekton-robot requested review from dibyom and a user October 15, 2020 07:46
@tekton-robot
Copy link
Collaborator

Hi @raballew. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 15, 2020
@imjasonh
Copy link
Member

/kind feature
/ok-to-test

This looks good so far, let's see what the tests think. :)

Can you squash your commits so this PR only contains one commit?

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 15, 2020
@afrittoli afrittoli closed this Oct 15, 2020
@afrittoli afrittoli reopened this Oct 15, 2020
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 15, 2020

CLA Check
The committers are authorized under a signed CLA.

@afrittoli
Copy link
Member

@raballew Thank you for your contribution!!
You (or your company) will need to sign the CLA - see link in above for instructions.

Could you include release notes in the PR description?
It's in the new PR template and it needs to be filled. Once you do so, the do-not-merge/release-note-label-needed will go away automatically.

Thank you again!

At the moment, the webhook is a SPOF in certain
scenarios. Under high load or when a node failure
occurs the webhook becomes unavailable. Defining
a HPA, PDB and affinity rules solves this issue.
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 15, 2020
@raballew
Copy link
Contributor Author

raballew commented Oct 15, 2020

@afrittoli CLA signed
@imjasonh commits are squashed now

Thanks for guiding me trough the PR!

@afrittoli
Copy link
Member

afrittoli commented Oct 16, 2020

I've not really worked with the HPA before, so I tried this on my kind cluster.

After I got the metrics server running with:

          - --kubelet-insecure-tls
          - --kubelet-preferred-address-types=InternalIP

The HPA was still not happy, because of missing resource requests in the webhook pod:

Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: missing request for cpu

After adding resources from in the deployment:

    resources:
      requests:
        cpu: 250m

It is now happy:

$ k get hpa -n tekton-pipelines
NAME                       REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
tekton-pipelines-webhook   Deployment/tekton-pipelines-webhook   3%/100%   1         5         1          135m

$ k describe hpa -n tekton-pipelines
Name:                                                  tekton-pipelines-webhook
Namespace:                                             tekton-pipelines
Labels:                                                app.kubernetes.io/component=webhook
                                                       app.kubernetes.io/instance=default
                                                       app.kubernetes.io/name=webhook
                                                       app.kubernetes.io/part-of=tekton-pipelines
                                                       app.kubernetes.io/version=devel
                                                       pipeline.tekton.dev/release=devel
                                                       version=devel
Annotations:                                           <none>
CreationTimestamp:                                     Fri, 16 Oct 2020 08:23:24 +0100
Reference:                                             Deployment/tekton-pipelines-webhook
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  3% (8m) / 100%
Min replicas:                                          1
Max replicas:                                          5
Deployment pods:                                       1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range

According to the HPA docs:

Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric. See the algorithm details section below for more information about how the autoscaling algorithm works.

So I think that for this to work we need to set resource requests too.
The good news is that without a metrics service the HPA will be un-happy, but the deployment will still work fine, so for dev environments this is pretty much transparent. So I think it should be ok to have this the "default" deployment.

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I think we should clarify if we need to add resources.
TBH I don't know what a good value for CPU would be? @imjasonh wdyt?

@@ -0,0 +1,65 @@
# Copyright 2019 The Tekton Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: s/2019/2020

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I will bump the year.

@afrittoli I also looked through all resource definitions and most of the files were modified through the course of this year and still use 2019 copyright year. If I am not wrong, technically the copyright year should be updated when you made contributions to a file within that year.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, the practise that I've always followed and that I've seen being used in several open source projects is it to set the year to that of the creation of the file, and do not bother updating the date again.
There are a lot of different opinions about this in the internet. The feeling I get after reading a few opinions is that the year is not so critical in the copyright notice; however company lawyers usually insist on having it.

We could add guidance about this in the community repo @vdemeester @bobcatfish @imjasonh @abayer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the guidance I've seen is to bump the year if you're touching the file. Bulk commits to update the year everywhere aren't great for history/blame tracking, but keeping it up to date is slightly better than not...

config/webhook-hpa.yaml Show resolved Hide resolved
Technically, the copyright year should have a value equal to the year of
the last contribution made to a file.
@imjasonh
Copy link
Member

Thanks for this. I think we should clarify if we need to add resources.

TBH I don't know what a good value for CPU would be? @imjasonh wdyt?

Two options:

  1. Run a Tekton installation on your cluster and give it some load, and observe how many resources it uses, then give it 2x or so to give us some headroom.
  2. Crib from Knative even further, and copy their resource requests: https://github.com/knative/serving/blob/master/config/core/deployments/webhook.yaml

I'd probably just borrow from Knative, I don't think either of us do anything very resource intensive in our webhook checks, and if it turns out to be too low we can bump it up in the future.

@afrittoli
Copy link
Member

2. https://github.com/knative/serving/blob/master/config/core/deployments/webhook.yaml

ok, (2) sounds reasonable to me, also 100m seems small enough for a validation service.
Do we needs limits? If the limit is reached, the HPA will spawn a new instance?

        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 500Mi

@raballew
Copy link
Contributor Author

If the limit is reached, the HPA will spawn a new instance?

According to HPA docs:

 [...] the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each Pod.

When a new replica should be spawned, is described here.

A resource request is required for autoscaler to take any action for a
metric.
@raballew
Copy link
Contributor Author

/retest

@afrittoli
Copy link
Member

Thanks for the updates!

This looks good to me know, the only thing missing is documentation, but I'm ok doing that as a separate PR if you prefer.
We need to document that a metrics server is required for this to work, and in general it would be good to have an HA setup dedicated document or paragraph.

CC'ing @qu1queee since he's been looking into HA documentation for Tekton.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020
@raballew
Copy link
Contributor Author

@afrittoli I will do it in this PR, so everything related to webhooks and HPA is bundled here. Could you point me to the correct file where to add the documentation?

@afrittoli
Copy link
Member

I think the entry point might be the install and configuration doc.
It could be a chapter in that page, the content could be in the install page directly or perhaps in a dedicated ha.md with the HA concept and how to configure it.

If you decide to go for a dedicated new file, could you also file a PR to the website repo, so that it shows up on tekton.dev too?

@imjasonh imjasonh added this to the Pipelines v0.18 milestone Oct 19, 2020
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020
@raballew
Copy link
Contributor Author

@afrittoli I have added the HA docs to the install page as the amount of documentation needed for this topic is rather small. Once more components such as the controller implement a scaling mechanism as well, this section should be moved to a dedicated ha.md file.

@raballew raballew mentioned this pull request Oct 19, 2020
@raballew
Copy link
Contributor Author

/retest

@raballew
Copy link
Contributor Author

/assign @afrittoli

@raballew raballew requested a review from afrittoli October 27, 2020 17:58
Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the docs on this!
/approve

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2020
@afrittoli
Copy link
Member

/cc @vdemeester

@vdemeester
Copy link
Member

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replicate webhook pods for resiliency
6 participants