Introduce a nutanix prism client cache #415

thunderboltsid · 2024-04-23T17:51:12Z

During a recent incident, it was observed that creating a new Nutanix client for each request implies basic authentication for every request. This places unnecessary stress on IAM services. This stress was particularly problematic when the IAM services were already in a degraded state, thereby prolonging recovery efforts. Each basic auth request gets processed through the entire IAM stack, significantly increasing the load and impacting performance.

It's recommend that the client use session-auth cookies instead of basic auth for requests to Prism Central where possible. Given how the CAPX controller works currently, a new client is created per reconcile cycle. In #398 we switched to using Session-Auth instead of Basic-Auth. However, switching from Basic-Auth to Session-Auth alone wouldn’t solve the problem of consistent Basic-Auth calls. This is because each time a client is created, which is every reconcile cycle, it will still result in one Basic-Auth call to /users/me to fetch the session cookie. To alleviate this, we are adding a cache of clients and reusing the client from the cache across reconciliation cycles of the same cluster for both the NutanixCluster and NutanixMachine reconciliation.

In a large-scale setup of 40+ clusters w/ 4 nodes each, we were able to see a noticeable drop in QPS to the IAM stack for the oidc/token calls. Before the client caching, a controller restart led to 10+ QPS on oidc/token endpoint with a steady state at around 0.5 QPS. After deploying the client cache changes, we saw a peak of ~3 QPS as caches warmed up and dropped to 0 QPS afterwards with sporadic requests only when session token refresh was needed. As we can see, with the changes proposed in this document, we were able to reduce the number of high-impact calls to IAM significantly.

codecov · 2024-04-23T18:00:55Z

Codecov Report

Attention: Patch coverage is 35.08772% with 37 lines in your changes are missing coverage. Please review.

Project coverage is 28.37%. Comparing base (8228f42) to head (00a4b6c).

Files	Patch %	Lines
controllers/nutanixmachine_controller.go	0.00%	17 Missing ⚠️
controllers/helpers.go	50.00%	14 Missing ⚠️
controllers/nutanixcluster_controller.go	25.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
+ Coverage   26.91%   28.37%   +1.46%     
==========================================
  Files          19       14       -5     
  Lines        1360     1304      -56     
==========================================
+ Hits          366      370       +4     
+ Misses        994      934      -60

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

thunderboltsid · 2024-04-23T20:33:21Z

/retest

controllers/nutanixcluster_controller.go

thunderboltsid · 2024-05-02T22:53:24Z

/retest

thunderboltsid · 2024-05-02T23:18:24Z

/test e2e-k8s-conformance

thunderboltsid · 2024-05-03T00:09:21Z

/test e2e-nutanix-features

thunderboltsid · 2024-05-03T11:16:26Z

/test e2e-nutanix-features

thunderboltsid · 2024-05-03T12:03:15Z

/retest

The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between nutanixcluster and nutanixmachine controllers.

thunderboltsid · 2024-05-03T16:31:56Z

/retest

dkoshkin

Thanks for all the new tests!

controllers/helpers.go

pkg/client/cache.go

controllers/helpers_test.go

controllers/nutanixcluster_controller.go

thunderboltsid · 2024-05-06T12:02:39Z

/retest

thunderboltsid · 2024-05-06T12:49:12Z

/test e2e-k8s-conformance
/test e2e-capx-controller-upgrade

adiantum

Overall looks good, but I think we should plan some refactoring in separate PR. Please consider my comments.
/lgtm
/approve

controllers/helpers.go

pkg/client/cache.go

thunderboltsid · 2024-05-06T13:06:47Z

/test e2e-k8s-conformance

adiantum

/lgtm
/approve

nutanix-cn-prow-bot · 2024-05-06T14:29:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adiantum, dkoshkin, thunderboltsid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [adiantum,thunderboltsid]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

thunderboltsid · 2024-05-06T14:30:45Z

/retest

thunderboltsid · 2024-05-06T15:16:38Z

/test e2e-capx-scaling

* Introduce a nutanix prism client cache The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between nutanixcluster and nutanixmachine controllers. * Address review comments

* Ensure fallback config is only read when prismCentral is absent (#403) Skip reading fallback config file from /etc/nutanix/config/prismCentral if NutanixCluster has prismCentral set. Co-authored-by: Sid Shukla <sid.shukla@nutanix.com> * Add a provision for handling image based bootstrap (#406) * Add a provision for handling image based bootstrap AHV has a limit of 32KB for cloud-init userdata. In Openshift, the ignition can be rather large (a magnitude over the limit). In order to support larger userdata files, we allow mounting the customization as an image. * Only set guestcustomization explicitly when bootstrap ref is secret * Use lowercase for data_source_reference kind. --------- Co-authored-by: Sid Shukla <sid.shukla@nutanix.com> * Introduce a nutanix prism client cache (#415) * Introduce a nutanix prism client cache The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between nutanixcluster and nutanixmachine controllers. * Address review comments * Update CAPI version in go.mod This is to ensure newer versions of interfaces from SharedIndexInformers don't cause compile failures. Update go version to v1.22 becasue cmp.Or is only available in go v1.22 update prism-go-client --------- Co-authored-by: Deepak Muley <deepak.muley@nutanix.com> Co-authored-by: Yanhua Li <yanhua.li@nutanix.com>

* Introduce a nutanix prism client cache The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between nutanixcluster and nutanixmachine controllers. * Address review comments

* Introduce a nutanix prism client cache (#415) * Introduce a nutanix prism client cache The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between nutanixcluster and nutanixmachine controllers. * Address review comments * Build fixes to keep the build passing Changes to fix linting issues and version dependencies to keep the build passing

thunderboltsid mentioned this pull request Apr 23, 2024

Switch Nutanix Client to use Session Auth instead of Basic Auth #398

Merged

nutanix-cn-prow-bot bot requested review from deepakm-ntnx and yannickstruyf3 April 23, 2024 18:54

thunderboltsid requested review from adiantum and dkoshkin and removed request for yannickstruyf3 April 23, 2024 20:33

dkoshkin reviewed Apr 23, 2024

View reviewed changes

controllers/nutanixcluster_controller.go Show resolved Hide resolved

thunderboltsid force-pushed the jira/krbn-8098 branch 4 times, most recently from 877f63c to 9ffeed4 Compare May 2, 2024 22:36

thunderboltsid requested a review from dkoshkin May 2, 2024 23:05

thunderboltsid force-pushed the jira/krbn-8098 branch from 9ffeed4 to 1d94d8b Compare May 3, 2024 10:58

thunderboltsid added the do-not-merge/work-in-progress label May 3, 2024

Introduce a nutanix prism client cache

87f0011

The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between nutanixcluster and nutanixmachine controllers.

thunderboltsid force-pushed the jira/krbn-8098 branch from 1d94d8b to 87f0011 Compare May 3, 2024 16:16

thunderboltsid removed the do-not-merge/work-in-progress label May 3, 2024

dkoshkin approved these changes May 4, 2024

View reviewed changes

controllers/helpers.go Outdated Show resolved Hide resolved

pkg/client/cache.go Show resolved Hide resolved

controllers/helpers_test.go Outdated Show resolved Hide resolved

controllers/nutanixcluster_controller.go Show resolved Hide resolved

adiantum approved these changes May 6, 2024

View reviewed changes

controllers/helpers.go Outdated Show resolved Hide resolved

pkg/client/cache.go Show resolved Hide resolved

nutanix-cn-prow-bot bot assigned adiantum May 6, 2024

nutanix-cn-prow-bot bot added lgtm approved labels May 6, 2024

thunderboltsid force-pushed the jira/krbn-8098 branch from 5e3954f to ec9c33e Compare May 6, 2024 14:04

Address review comments

00a4b6c

thunderboltsid force-pushed the jira/krbn-8098 branch from ec9c33e to 00a4b6c Compare May 6, 2024 14:05

adiantum approved these changes May 6, 2024

View reviewed changes

thunderboltsid merged commit fac5b21 into main May 6, 2024
11 of 13 checks passed

thunderboltsid deleted the jira/krbn-8098 branch May 6, 2024 16:38

thunderboltsid mentioned this pull request May 7, 2024

Cherry-pick PR #403, #406, and #415 to release-v1.2 #430

Merged

thunderboltsid mentioned this pull request May 10, 2024

Cherry-pick PR #415 to release-v1.3 #433

Merged

thunderboltsid added the enhancement New feature or request label Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce a nutanix prism client cache #415

Introduce a nutanix prism client cache #415

thunderboltsid commented Apr 23, 2024 •

edited

Loading

codecov bot commented Apr 23, 2024 •

edited

Loading

thunderboltsid commented Apr 23, 2024

thunderboltsid commented May 2, 2024

thunderboltsid commented May 2, 2024

thunderboltsid commented May 3, 2024

thunderboltsid commented May 3, 2024

thunderboltsid commented May 3, 2024

thunderboltsid commented May 3, 2024

dkoshkin left a comment

thunderboltsid commented May 6, 2024

thunderboltsid commented May 6, 2024

adiantum left a comment

thunderboltsid commented May 6, 2024

adiantum left a comment

nutanix-cn-prow-bot bot commented May 6, 2024

thunderboltsid commented May 6, 2024

thunderboltsid commented May 6, 2024

Introduce a nutanix prism client cache #415

Introduce a nutanix prism client cache #415

Conversation

thunderboltsid commented Apr 23, 2024 • edited Loading

codecov bot commented Apr 23, 2024 • edited Loading

Codecov Report

thunderboltsid commented Apr 23, 2024

thunderboltsid commented May 2, 2024

thunderboltsid commented May 2, 2024

thunderboltsid commented May 3, 2024

thunderboltsid commented May 3, 2024

thunderboltsid commented May 3, 2024

thunderboltsid commented May 3, 2024

dkoshkin left a comment

Choose a reason for hiding this comment

thunderboltsid commented May 6, 2024

thunderboltsid commented May 6, 2024

adiantum left a comment

Choose a reason for hiding this comment

thunderboltsid commented May 6, 2024

adiantum left a comment

Choose a reason for hiding this comment

nutanix-cn-prow-bot bot commented May 6, 2024

thunderboltsid commented May 6, 2024

thunderboltsid commented May 6, 2024

thunderboltsid commented Apr 23, 2024 •

edited

Loading

codecov bot commented Apr 23, 2024 •

edited

Loading