Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip #2138

Merged
merged 3 commits into from
Nov 16, 2022

Conversation

bwagner5
Copy link
Contributor

@bwagner5 bwagner5 commented Nov 11, 2022

What type of PR is this?

  • feature

Which issue does this PR fix:
awslabs/amazon-eks-ami#1099

What does this PR do / Why do we need it:

When a node is starting, kube-proxy and the VPC CNI DaemonSets get scheduled and started on the node at roughly the same time. This causes a race in the VPC CNI since it currently depends on kube-proxy to finish setting up cluster-ip routes on the node. If the VPC CNI hits the k8s server check first, then there is a 5 sec delay since that is the current timeout value set. This delays the node from getting to a Ready state.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

Installed with and without setting env.CLUSTER_ENDPOINT:

helm upgrade --install aws-node-ce -n kube-system charts/aws-vpc-cni/ --set image.override=xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/vpc-cni:ce --set init.image.override=xxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/vpc-cni:init-ce --set env.CLUSTER_ENDPOINT=https://xxxxx.xxx.us-west-2.eks.amazonaws.com

This log shows a race where kube-proxy wasn't finished when aws-node reached out to the api-server through the cluster ip, resulting in a delay in initialization:

|     Event                     | Timestamp | t  |
|-------------------------------|-----------|----|
| kube-proxy Started            | 07:30:54  | 0  |
| AWS Node Container Starts     | 07:30:55  | 1  |
| VPC CNI Init Container Starts | 07:30:55  | 1  |
| VPC CNI Plugin Initialized    | 07:31:05  | 10 |
{"level":"info","ts":"2022-11-11T07:30:56.231Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-11-11T07:30:56.235Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2022-11-11T07:30:56.321Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-11-11T07:30:56.329Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"error","ts":"2022-11-11T07:31:01.329Z","caller":"wait/wait.go:211","msg":"Unable to reach API Server, Get \"https://10.100.0.1:443/version?timeout=5s\": context deadline exceeded"}
{"level":"info","ts":"2022-11-11T07:31:03.351Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.15-eks-fb459a0. git tree state: clean. commit: be82fa628e60d024275efaa239bfe53a9119c2d9. platform: linux/amd64"}

This PR adds an option CLUSTER_ENDPOINT environment variable that can be passed which allows aws-node to initialize without waiting on kube-proxy. Kube-proxy is still required before aws-node finishes initializing the CNI plugin.

This log shows the components starting at the same time where kube-proxy is not completed, but aws-node can continue bootstrapping with the CLUSTER_ENDPOINT:

|     Event                     | Timestamp | t  |
|-------------------------------|-----------|----|
| AWS Node Container Starts     | 07:40:39  | 0  |
| kube-proxy Started            | 07:40:39  | 0  |
| VPC CNI Init Container Starts | 07:40:39  | 0  |
| VPC CNI Plugin Initialized    | 07:40:41  | 2  |
{"level":"info","ts":"2022-11-11T07:40:39.878Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-11-11T07:40:39.881Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2022-11-11T07:40:39.974Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-11-11T07:40:39.982Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2022-11-11T07:40:40.014Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.15-eks-fb459a0. git tree state: clean. commit: be82fa628e60d024275efaa239bfe53a9119c2d9. platform: linux/amd64"}

...

{"level":"info","ts":"2022-11-11T07:40:41.407Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.15-eks-fb459a0. git tree state: clean. commit: be82fa628e60d024275e
faa239bfe53a9119c2d9. platform: linux/amd64"}

The first api-server connectivity check uses the CLUSTER_ENDPOINT passed in and the second one uses the cluster ip to make sure that kube-proxy is finished. The second check blocks the CNI from initializing.

Automation added to e2e:

Will this PR introduce any new dependencies?:

NO

Will this break upgrades or downgrades. Has updating a running cluster been tested?:

Have not tested upgrading a cluster, but this change only affects the endpoint being used.

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

CLUSTER_ENDPOINT can now be specified to allow the VPC CNI to initialize before kube-proxy has finished setting up cluster IP routes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@achevuru
Copy link
Contributor

@ bwagner5 We were exploring on removing the API Server connectivity check dependency from CNI altogether. We just do it once during boot up and it is not the CNIs job to setup and/or validate the worker nodes to CPI connectivity anyways...So, we can just remove that check altogether.

@bwagner5
Copy link
Contributor Author

@ bwagner5 We were exploring on removing the API Server connectivity check dependency from CNI altogether. We just do it once during boot up and it is not the CNIs job to setup and/or validate the worker nodes to CPI connectivity anyways...So, we can just remove that check altogether.

That would be nice, although it doesn't solve the latency problem since something will try to connect to the apiserver within the VPC CNI and then likely timeout if kube-proxy isn't done yet. AFAIK nothing else guarantees that kube-proxy is done setting up routes though right? So pods could schedule and not have connectivity which could result in weird crashing behavior or higher latency pod start times like the VPC CNI runs into.

@achevuru
Copy link
Contributor

VPC CNI only checks for API Server connectivity during the initial start up phase and so once we remove it we shouldn't run in to any additional init latency due to this race condition. CNI itself runs a controller when it is operating in custom networking mode but at that point it is like any other controller running on worker nodes.

As far as other application pods running in to issues (because kube-proxy for some reason takes long time to setup Kubernetes service endpoints), those pods should retry just like CNI does currently and I guess I'm trying to say CNI isn't responsible for validating API Server connectivity. Should be a kube-proxy issue. For example, if API server connectivity breaks post CNI startup, these pods will run in to the same issue. Currently, we only check it once during init and don't do periodic checks.

Also, the pods will always rely on the service endpoints to reach to API Server and if we're bypassing that in CNI by providing cluster endpoint to improve the start up time of VPC CNI, then we're hiding the kube proxy problem from VPC CNI but other application pods will now run in to the same issue while trying to connect to API Server as you outlined above.

@bwagner5
Copy link
Contributor Author

bwagner5 commented Nov 11, 2022

VPC CNI only checks for API Server connectivity during the initial start up phase and so once we remove it we shouldn't run in to any additional init latency due to this race condition.

Even when removing the API Server connectivity check, I believe the K8s cached client would still hang on synchronization blocking the rest of the initialization flow:

cache.WaitForCacheSync(stopChan)

I agree that it's no worse than what happens today with the connectivity check, but it still adds unneeded latency when initializing the node.

@achevuru
Copy link
Contributor

Yeah, I can see the problem there. But, I do remember someone from NW team tested without the check for API Server connectivity to make sure the init goes fine when Kubernetes service endpoints are missing. I'll try to check what was tested...

But, the kube-proxy startup latency issue should've been mitigated to a large extent from K8S v1.22 onwards. We identified one of the reasons contributing to this behavior and that was addressed from v1.22+. I'll share the internal tickets with you. Beyond that, only other scenario where we saw the race condition emerge is when the cluster has a huge number of services (100+) which can result kube-proxy taking up additional time during startup.

@bwagner5 bwagner5 marked this pull request as ready for review November 15, 2022 23:17
@bwagner5 bwagner5 requested a review from a team as a code owner November 15, 2022 23:17
achevuru
achevuru previously approved these changes Nov 16, 2022
README.md Outdated
@@ -410,6 +410,17 @@ Specifies the cluster name to tag allocated ENIs with. See the "Cluster Name tag

---

#### `CLUSTER_ENDPOINT`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Lets mention the release version, CLUSTER_ENDPOINT (v1.12.1+)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

Copy link
Contributor

@jayanthvn jayanthvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a minor nit, lgtm.

@bwagner5 bwagner5 requested review from jayanthvn and achevuru and removed request for jayanthvn and achevuru November 16, 2022 00:36
@jayanthvn jayanthvn merged commit 1aac6d0 into aws:master Nov 16, 2022
@jayanthvn jayanthvn added this to the v1.12.1 milestone Nov 23, 2022
jdn5126 added a commit that referenced this pull request Dec 12, 2022
* create publisher with logger (#2119)

* Add missing rules when NodePort support is disabled (#2026)

* Add missing rules when NodePort support is disabled

* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes #2025

Co-authored-by: Quan Tian <qtian@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>

* Fix typos and unit tests

Signed-off-by: Antonin Bas <abas@vmware.com>

* Minor improvement to code comment

Signed-off-by: Antonin Bas <abas@vmware.com>

* Address review comments

* Delete legacy nat rule
* Fix an unrelated log message

Signed-off-by: Antonin Bas <abas@vmware.com>

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>

* downgrade test go.mod to align with root go.mod (#2128)

* skip addon installation when addon info is not available (#2131)

* Merging test/Makefile and test/go.mod to the root Makefil and go.mod, adjust the .github/workflows and integration test instructions (#2129)

* update troubleshooting docs for CNI image (#2132)

fix location where make command is run

* fix env name in test script (#2136)

* optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip (#2138)

* optionally allow CLUSTER_ENDPOINT to be used rather than the kubernetes cluster ip

* remove check for kube-proxy

* add version to readme

* Add resources config option to cni metrics helper (#2141)

* Add resources config option to cni metrics helper

* Remove default-empty resources block; replace with conditional

* Add metrics for ec2 api calls made by CNI and expose via prometheus (#2142)

Co-authored-by: Jay Deokar <jsdeokar@amazon.com>

* increase workflow role duration to 4 hours (#2148)

* Update golang 1.19.2 EKS-D (#2147)

* Update golang

* Move to EKS distro builds

* [HELM]: Move CRD resources to a separate folder as per helm standard (#2144)

Co-authored-by: Jay Deokar <jsdeokar@amazon.com>

* VPC-CNI minimal image builds (#2146)

* VPC-CNI minimal image builds

* update dependencies for ginkgo when running integration tests

* address review comments and break up init main function

* review comments for sysctl

* Simplify binary installation, fix review comments

Since init container is required to always run, let binary installation
for external plugins happen in init container. This simplifies the main
container entrypoint and the dockerfile for each image.

* when IPAMD connection fails, try to teardown pod network using prevResult (#2145)

* add env var to enable nftables (#2155)

* fix failing weekly cron tests (#2154)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER and remove no-op setter (#2153)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER

* update release version comments

Signed-off-by: Antonin Bas <abas@vmware.com>
Co-authored-by: Jeffrey Nelson <jdnelson@amazon.com>
Co-authored-by: Antonin Bas <antonin.bas@gmail.com>
Co-authored-by: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com>
Co-authored-by: Sushmitha Ravikumar <58063229+sushrk@users.noreply.github.com>
Co-authored-by: Jerry He <37866862+jerryhe1999@users.noreply.github.com>
Co-authored-by: Brandon Wagner <wagnerbm@amazon.com>
Co-authored-by: Jonathan Ogilvie <679297+jcogilvie@users.noreply.github.com>
Co-authored-by: Jay Deokar <jsdeokar@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants