🐛 Requeue health check during the creation of a new cluster accessor #8122
Conversation
Signed-off-by: Aljoscha Poertner <aljoscha.poertner@posteo.de>
Welcome @AljoschaP!
Hi @AljoschaP. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
fyi, while analysing the code we figured out that the bug triggered by it (accessors not having a running health check) is probably very common for all users, no matter which providers are used. It is, however, not seen/noticed by most users because the kubeconfig used for the accessor never becomes invalid. This can easily be verified by adding a log to the health check that prints out that a health check is performed (see the sketch below): you'll see that it stops doing health checks after the first invocation.
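A minimal, self-contained sketch of that verification idea, assuming a plain polling loop; `healthCheckLoop`, its parameters, and the simulated `check` callback are hypothetical placeholders rather than the actual ClusterCacheTracker code in cluster-api:

```go
package main

import (
	"log"
	"time"
)

// healthCheckLoop repeatedly runs check until it reports done.
func healthCheckLoop(cluster string, check func() (done bool), interval time.Duration) {
	for {
		// With the bug described above, this line would be printed once and
		// then never again, because the very first check exits the loop.
		log.Printf("performing health check for cluster %s", cluster)
		if check() {
			return
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	healthCheckLoop("my-cluster", func() bool {
		calls++
		return calls >= 3 // simulate the health check finishing after a few rounds
	}, 200*time.Millisecond)
}
```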
Thx, very nice catch!! Great work!
/lgtm
LGTM label has been added. Git tree hash: c133431490be719dac8dbbab1e99d3095e247111
/cherry-pick release-1.3
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you. In response to this:
> /cherry-pick release-1.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-1.2
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.2 in a new PR and assign it to you. In response to this:
> /cherry-pick release-1.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold
/test pull-cluster-api-e2e-full-main
Awesome finding 👍
/lgtm
/hold cancel
/assign @fabriziopandini
@AljoschaP that's a great first PR!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
independent flake
/retest
@AljoschaP if you have some time, can you please open a follow-up issue to add a unit test to cover this case? Thank you! |
@sbueringer: new pull request created: #8158
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@sbueringer: new pull request created: #8159
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Of course 🙂 I will create an issue and start working on it at the beginning of next week 🙂 Thank you all!
Perfect, thank you very much!!
What this PR does / why we need it:
When a cluster accessor is created, the health check is started before the accessor is stored. If the health check hits `loadAccessor` before the accessor is stored, it returns `true`, leaves the health check and never retries it. Together with expiring kubeconfigs (e.g. EKS), this will lead to a situation where a kubeconfig will never be refreshed and the cluster is basically dead until we restart the controller. This PR uses `TryLock` to check if the cluster is currently locked, which means that we're probably in the middle of the creation of a new cluster accessor, and requeues the health check until the creation is done.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #7989
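The description above can be read as the following pattern. This is a hedged sketch under simplified assumptions (one shared mutex and a plain map of accessors); `tracker` and `healthCheckOnce` are hypothetical names, not the real cluster-api ClusterCacheTracker types:

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a stand-in for the accessor-tracking structure; the real
// implementation keeps one lock per cluster, a single mutex is enough here.
type tracker struct {
	lock      sync.Mutex
	accessors map[string]struct{} // stored cluster accessors keyed by cluster name
}

// healthCheckOnce returns done=true when the health check should stop retrying.
func (t *tracker) healthCheckOnce(cluster string) (done bool) {
	// If we cannot take the lock, an accessor for this cluster is most likely
	// being created right now: requeue the health check instead of giving up.
	if !t.lock.TryLock() {
		fmt.Println("accessor creation in progress, requeueing health check for", cluster)
		return false
	}
	defer t.lock.Unlock()

	if _, ok := t.accessors[cluster]; !ok {
		// No accessor stored and nobody is creating one: stop the health check.
		return true
	}

	// ... perform the actual health check against the workload cluster here ...
	return false
}

func main() {
	t := &tracker{accessors: map[string]struct{}{"my-cluster": {}}}
	fmt.Println("done:", t.healthCheckOnce("my-cluster"))
}
```

The key design point in this sketch is that a failed `TryLock` is interpreted as "accessor creation in progress", so the health check is requeued rather than terminated prematurely.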