failed to wait for object to sync in-cache after patching context deadline exceeded #1017

Open
pkit opened this issue Jul 1, 2024 · 7 comments

@pkit

pkit commented Jul 1, 2024

wut?
Really, what does it mean?
Why are there no other logs that describe what's going on?

2024-07-01T20:04:53.761Z info HelmRelease/something.flux-system - release out-of-sync with desired state: release config values changed 
2024-07-01T20:04:53.791Z info HelmRelease/something.flux-system - running 'upgrade' action with timeout of 5m0s 
2024-07-01T20:04:54.720Z info HelmRelease/something.flux-system - release is in a failed state 
2024-07-01T20:04:54.789Z info HelmRelease/something.flux-system - running 'rollback' action with timeout of 5m0s 
2024-07-01T20:05:05.069Z error HelmRelease/something.flux-system - failed to wait for object to sync in-cache after patching context deadline exceeded
@stefanprodan
Member

failed to wait for object to sync in-cache after patching context deadline exceeded

This means the controller stopped receiving data from the Kubernetes API. I suspect your Kubernetes control plane is having issues.

@fcuello-fudo

We are having the same problem, and in addition the helm-controller pod is in CrashLoopBackOff because of repeated failed liveness probes.

Probably the Liveness probe should still work even if there are problems contacting the control plane

@stefanprodan
Member

stefanprodan commented Oct 18, 2024

Probably the Liveness probe should still work even if there are problems contacting the control plane

Not if you build your controller with Kubernetes controller-runtime. Having the controller running and DDOSing the API endpoint would do you no good; kubelet will restart the controller with an exponential backoff, which prevents the API server from being overloaded once it starts.

@fcuello-fudo

Having the controller running and DDOSing the API endpoint would do you no good,

We downgraded the control-plane (GKE rapid channel) and now everything seems to be fine again. I still haven't really found the root cause, but my point was that if the controller is behaving properly, but the k8s API is overloaded or unresponsive for some other reason than the controller, the liveness probe on the controller should still pass the checks, right?

@stefanprodan
Member

the liveness probe on the controller should still pass the checks, right?

Not if the CNI is failing, kubelet can't reach the port. There is nothing special about the liveness probe; it's the standard controller-runtime ping handler: https://github.com/fluxcd/pkg/blob/ac1007b57e37838e73b8bc95365dab9a0e856e8e/runtime/probes/probes.go#L45

@fcuello-fudo

Not if the CNI is failing, kubelet can't reach the port.

That's not the case, as there are several other applications running in the same cluster (and on the same node as the Flux controllers) and none of them have any problems, either communicating with the internet or with each other.

Also, the liveness port of the flux controllers is reachable, but it just doesn't respond.

What I think is happening is that the problematic version of the control plane has changed something related to rate limiting of API queries, and that this only affects Flux because in our case it's the app that queries the k8s API the most.

I'm pretty sure we can reproduce the issue easily by switching the control plane back to the problematic version if you are willing to debug this together.

@stefanprodan
Member

@fcuello-fudo if Flux runs into rate limits there must be error logs; posting those would be helpful. We use the Kubernetes API Priority and Fairness flow control to make our controllers comply with Kubernetes API rate limits. If the flow control API is buggy, this could lead to a disconnect: https://github.com/fluxcd/pkg/blob/ac1007b57e37838e73b8bc95365dab9a0e856e8e/runtime/client/client.go#L76
