🌱 Allow Machines in unreachable Clusters to do initial reconciliation #7719
Conversation
cb7da5e to 99dad5f (Compare)
@@ -44,6 +44,11 @@ var (
func (r *Reconciler) reconcileNode(ctx context.Context, cluster *clusterv1.Cluster, machine *clusterv1.Machine) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)

	// Create a watch on the nodes in the Cluster.
	if err := r.watchClusterNodes(ctx, cluster); err != nil {
@fabriziopandini I think we should be alright to add the tracker where it's needed instead of ignoring the error earlier in the reconcile.
Alternatively we could ignore the error but check in this function if the watch is actually set and exit early if not.
I don't have strong objections to this specific change
However, as a follow-up, I think we should think more holistically about how failures to get a client from the tracker are handled, because right now every reconciler trying to get an accessor that is currently locked goes into exponential backoff, and this has two downsides:
- we have a spike of errors on the logs
- exponential backoff grows fast, quickly going over the resync period.
I can see two options here:
- a quick win, which is requeueing after the timeout for creating a new accessor
- a better solution, where the cluster tracker enqueues events for the reconcilers when a new accessor is created.
@sbueringer opinions?
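As a rough sketch, the second option could be wired with controller-runtime's source.Channel (using the builder signature from the controller-runtime versions in use at the time of this PR). The channel and the mapping function below are illustrative names, not existing Cluster API code:

```go
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// setupWithAccessorEvents wires a Machine controller to a channel that the
// cluster tracker would write to whenever it finishes creating a new accessor.
func setupWithAccessorEvents(mgr ctrl.Manager, r reconcile.Reconciler, accessorCreated chan event.GenericEvent) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		Watches(
			// The tracker would send event.GenericEvent{Object: cluster} here
			// once an accessor for that Cluster exists.
			&source.Channel{Source: accessorCreated},
			// Map the Cluster back to the objects waiting on its accessor.
			handler.EnqueueRequestsFromMapFunc(func(o client.Object) []reconcile.Request {
				// A real mapper would list the Machines belonging to this
				// Cluster and enqueue one request per Machine; elided here.
				return nil
			}),
		).
		Complete(r)
}
```

A real version would also need the tracker to remember which objects hit the locked accessor for each cluster, which is the bookkeeping concern raised further down in this thread.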
> we have a spike of errors on the logs
I think the exponential backoff leads to the reverse: fewer log messages. If we reconcile more often we get more log messages (btw, we log them as info today at log level 5, not as an error).
> exponential backoff grows fast, quickly going over the resync period.
But if the exponential backoff is higher than the resync period we still get a reconcile after the resync period, right?
Usually exponential backoffs have a maximum; what is the maximum in CR?
> a quick win, which is requeueing after the timeout for creating a new accessor
Just to clarify: instead of `ctrl.Result{Requeue: true}` (which leads to the exponential backoff) we would return `ctrl.Result{Requeue: true, RequeueAfter: 10 * time.Second}` (I think 10s is our timeout) when we get the `ErrClusterLocked` error?
Sounds fine to me. I'm personally not concerned about getting that info message every 10 seconds on log level 5 during the time when the accessor is (usually) initially created.
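For concreteness, a minimal sketch of that handling: `remote.ErrClusterLocked` and `ClusterCacheTracker.GetClient` are existing Cluster API names, while the helper itself and the 10-second value are only illustrative:

```go
package example

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
	"sigs.k8s.io/cluster-api/util"
)

// getRemoteClient is an illustrative helper, not existing Cluster API code.
// When the tracker reports that the accessor is still being created, it
// requeues after a fixed delay instead of returning an error, avoiding the
// workqueue's exponential backoff.
func getRemoteClient(ctx context.Context, tracker *remote.ClusterCacheTracker, cluster *clusterv1.Cluster) (client.Client, ctrl.Result, error) {
	remoteClient, err := tracker.GetClient(ctx, util.ObjectKey(cluster))
	if err != nil {
		if errors.Is(err, remote.ErrClusterLocked) {
			// Roughly the accessor-creation timeout discussed above.
			return nil, ctrl.Result{RequeueAfter: 10 * time.Second}, nil
		}
		return nil, ctrl.Result{}, err
	}
	return remoteClient, ctrl.Result{}, nil
}
```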
> a better solution, where the cluster tracker enqueues events for the reconcilers when a new accessor is created.
This means the cluster tracker has to keep track of which reconciler asked for an accessor for a specific cluster while reconciling a certain object, and then trigger reconciles for all those objects once the accessor has been created?
To be honest, I would go with reconciling more often. This seems too complex to me if the main benefit is that we have fewer log messages at log level 5. (Might be I'm missing something.)
The change in this PR looks good to me
The default max exponential backoff in controller-runtime is roughly 16 minutes 45 seconds, and we currently use that default across our controllers. I'm going to open an issue that points to this thread so we can discuss this in a more visible place.
I do think we should be concerned with how often we return errors/requeues and resort to exponential backoff, weighing total time for operations to complete against stability. There may be cases where we want to err on the side of less or more frequent reconciles depending on the operation.
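For reference, a sketch of how the backoff could be capped for a controller, assuming client-go's workqueue rate limiters and controller-runtime's `controller.Options`; the 2-minute cap is an arbitrary example, not a recommendation:

```go
package example

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// setupWithCappedBackoff builds a Machine controller whose per-item
// exponential backoff is capped well below the controller-runtime default
// (which tops out around 1000s, i.e. the ~16-17 minutes mentioned above).
func setupWithCappedBackoff(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		WithOptions(controller.Options{
			// Same 5ms base delay as the default, but cap retries at 2 minutes.
			RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 2*time.Minute),
		}).
		Complete(r)
}
```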
/hold
Hi @killianmuldoon, this PR could be related to the problem I discussed with you in Slack. I investigated the problem yesterday, and what we saw is that new nodes don't get recognized by the capi-controller-manager after the external config becomes invalid (approx. after 15 minutes). The log shows many failed attempts to create a new cluster accessor.
If we understand it correctly, a failed attempt gets ignored and the corresponding node will not be reconciled, so maybe this PR will fix our issue as well. One thing we don't understand is the use of TryLock (cluster-api/controllers/remote/cluster_cache_tracker.go, lines 215 to 221 at 6a4e90c).
Could you explain why you decided to use try-lock-and-ignore instead of try-lock-and-wait? We ask ourselves what would happen if the first of multiple concurrent attempts takes longer: then all following attempts would fail, right? Is it possible that this can happen in every reconciliation, so that node reconciliation loops never get a cluster accessor? That would explain our observations, where machines get stuck in the provisioned state. If you need more information (e.g. logs/stack trace), just contact me :-) Thank you very much!
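For readers following the TryLock question, a minimal illustration of the per-cluster try-lock pattern under discussion; the type and method names are invented for this sketch and are not the actual cluster_cache_tracker.go code:

```go
package example

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// keyedMutex hands out at most one lock per cluster key without blocking:
// if the key is already held, TryLock reports failure immediately so the
// caller can requeue instead of tying up a worker while another goroutine
// (possibly slowly) creates the accessor.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[client.ObjectKey]struct{}
}

func (k *keyedMutex) TryLock(key client.ObjectKey) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, held := k.locks[key]; held {
		// Someone else is already creating the accessor for this cluster.
		return false
	}
	if k.locks == nil {
		k.locks = map[client.ObjectKey]struct{}{}
	}
	k.locks[key] = struct{}{}
	return true
}

func (k *keyedMutex) Unlock(key client.ObjectKey) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.locks, key)
}
```

The trade-off behind the question: blocking ("try lock and wait") would park reconcile workers on a cluster whose accessor is slow or impossible to create, while returning immediately frees the worker but means concurrent reconciles see the lock failure and have to requeue, which is the behaviour this thread is about.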
Signed-off-by: killianmuldoon <kmuldoon@vmware.com>
99dad5f to 68763b7 (Compare)
/retest
/lgtm
LGTM label has been added. Git tree hash: 0a22e2a987d8976a934a85fe491d102b7d98af94
/lgtm
+1 also to open an issue about possible improvements to how we handle tracker lock errors
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: killianmuldoon <kmuldoon@vmware.com>
This change ignores errors in Machine reconciliation when the ClusterCacheTracker is not able to provide a client. This allows additional parts of the reconciliation (setting ownerReferences, reconciling external Bootstrap and Infrastructure objects) to be done before failing when the Machine controller is unable to get a lock on the ClusterAccessor, or when the Cluster is otherwise unreachable.
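A minimal sketch of the pattern the description implies, assuming `remote.ErrClusterLocked` is the sentinel error the tracker returns when the accessor lock is held; this mirrors the shape of the change rather than reproducing the merged code verbatim:

```go
package example

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
)

// handleNodeWatchError shows the general shape of the handling: a locked or
// unreachable cluster is treated as a transient condition that requeues the
// Machine rather than an error that aborts the reconcile, so the earlier
// steps (ownerReferences, Bootstrap/Infrastructure reconciliation) have
// already run by the time the node watch is attempted.
func handleNodeWatchError(ctx context.Context, cluster *clusterv1.Cluster, watchErr error) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx)
	if watchErr == nil {
		return ctrl.Result{}, nil
	}
	if errors.Is(watchErr, remote.ErrClusterLocked) {
		// Another worker holds the accessor lock for this cluster; retry soon.
		log.V(5).Info("Requeueing because another worker has the lock on the ClusterCacheTracker", "Cluster", cluster.Name)
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, watchErr
}
```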