Introduce a nutanix prism client cache #415
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #415 +/- ##
==========================================
+ Coverage 26.91% 28.37% +1.46%
==========================================
Files 19 14 -5
Lines 1360 1304 -56
==========================================
+ Hits 366 370 +4
+ Misses 994 934 -60

☔ View full report in Codecov by Sentry.
/retest
force-pushed from 877f63c to 9ffeed4
/retest
/test e2e-k8s-conformance
/test e2e-nutanix-features
force-pushed from 9ffeed4 to 1d94d8b
/test e2e-nutanix-features
/retest
The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between the nutanixcluster and nutanixmachine controllers.
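For illustration, here is a minimal sketch of what such a per-cluster client cache could look like. The shape is inferred from the description above, not taken from the actual CAPX implementation: the ClientCache type, its GetOrCreate/Delete methods, and the "namespace/name" key format are all assumptions, and only the prism-go-client v3 import path is taken from the real module.

```go
package client

import (
	"sync"

	prismv3 "github.com/nutanix-cloud-native/prism-go-client/v3"
)

// ClientCache holds one Prism v3 client per NutanixCluster, keyed by the
// cluster's "namespace/name". Both the nutanixcluster and nutanixmachine
// controllers can hold a reference to the same cache instance.
// (Sketch only; names and layout are assumptions, not the CAPX types.)
type ClientCache struct {
	mu      sync.RWMutex
	clients map[string]*prismv3.Client
}

func NewClientCache() *ClientCache {
	return &ClientCache{clients: make(map[string]*prismv3.Client)}
}

// GetOrCreate returns the cached client for key, invoking create on a miss
// and caching the result. create is the only place a new client (and hence
// a fresh authentication round trip) is made.
func (c *ClientCache) GetOrCreate(key string, create func() (*prismv3.Client, error)) (*prismv3.Client, error) {
	c.mu.RLock()
	cl, ok := c.clients[key]
	c.mu.RUnlock()
	if ok {
		return cl, nil
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check under the write lock: another reconcile may have raced us.
	if cl, ok := c.clients[key]; ok {
		return cl, nil
	}
	cl, err := create()
	if err != nil {
		return nil, err
	}
	c.clients[key] = cl
	return cl, nil
}

// Delete evicts a cluster's client, e.g. when the NutanixCluster is deleted
// or its Prism Central credentials change.
func (c *ClientCache) Delete(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.clients, key)
}
```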
force-pushed from 1d94d8b to 87f0011
/retest
Thanks for all the new tests!
/retest
/test e2e-k8s-conformance
Overall looks good, but I think we should plan some refactoring in a separate PR. Please consider my comments.
/lgtm
/approve
/test e2e-k8s-conformance
force-pushed from 5e3954f to ec9c33e
force-pushed from ec9c33e to 00a4b6c
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adiantum, dkoshkin, thunderboltsid

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
/test e2e-capx-scaling
* Introduce a nutanix prism client cache

  The cache stores a prismgoclient.V3 client instance for each NutanixCluster instance. The cache is shared between the nutanixcluster and nutanixmachine controllers.

* Address review comments
During a recent incident, we observed that creating a new Nutanix client for each request means every request uses basic authentication. This places unnecessary stress on IAM services, which was particularly problematic when the IAM services were already in a degraded state, prolonging recovery efforts. Each basic-auth request is processed through the entire IAM stack, significantly increasing the load and impacting performance.
It's recommended that the client use session-auth cookies instead of basic auth for requests to Prism Central where possible. Given how the CAPX controller currently works, a new client is created per reconcile cycle. In #398 we switched from Basic-Auth to Session-Auth. However, that switch alone doesn't eliminate the steady stream of Basic-Auth calls: each time a client is created, which is every reconcile cycle, it still makes one Basic-Auth call to /users/me to fetch the session cookie. To alleviate this, we are adding a cache of clients and reusing the client from the cache across reconciliation cycles of the same cluster, for both NutanixCluster and NutanixMachine reconciliation.

In a large-scale setup of 40+ clusters with 4 nodes each, we saw a noticeable drop in QPS to the IAM stack for the oidc/token calls. Before client caching, a controller restart led to 10+ QPS on the oidc/token endpoint, with a steady state of around 0.5 QPS. After deploying the client cache changes, we saw a peak of ~3 QPS as the caches warmed up, dropping to 0 QPS afterwards with sporadic requests only when a session token refresh was needed. With the changes proposed in this document, we significantly reduced the number of high-impact calls to IAM.
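As a rough sketch of how the reconcile path changes, the hypothetical helper below reuses the ClientCache from the earlier sketch: the factory that performs the single Basic-Auth call to /users/me runs only on a cache miss, and every later reconcile of the same cluster reuses the cached client's session cookie. The names here (reconcileWithCachedClient, newClient) are illustrative, not the actual CAPX code.

```go
package client

import (
	"context"
	"fmt"

	prismv3 "github.com/nutanix-cloud-native/prism-go-client/v3"
)

// reconcileWithCachedClient illustrates the caching pattern described above.
// clusterKey is the NutanixCluster's "namespace/name"; newClient stands in
// for whatever factory builds a v3 client from the cluster's credentials
// (it performs the one Basic-Auth call that fetches the session cookie).
func reconcileWithCachedClient(
	ctx context.Context,
	cache *ClientCache,
	clusterKey string,
	newClient func(ctx context.Context) (*prismv3.Client, error),
) error {
	client, err := cache.GetOrCreate(clusterKey, func() (*prismv3.Client, error) {
		// Cache miss: this is the only point where Basic-Auth happens.
		return newClient(ctx)
	})
	if err != nil {
		return fmt.Errorf("getting Prism Central client for %s: %w", clusterKey, err)
	}

	// Subsequent reconcile cycles for the same cluster (in both the
	// nutanixcluster and nutanixmachine controllers) hit the cache and reuse
	// the session cookie, keeping /users/me and oidc/token traffic flat.
	_ = client // ... perform Prism Central API calls with client ...
	return nil
}
```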