[BUG] Azure CNS K8s API errors on Windows nodes: failed to watch and failed to list #4679
Running this script:
I am getting:
I see the service account token is configured to rotate approximately every hour:
My impression is that the service account token is not being renewed in the pods residing on the Windows node, so the issue is probably not in the CNI itself.
I can confirm that when the pod starts failing to auth, the token inside it has properly been renewed. My first thought is that, for whatever reason, the CNS pod has a quirk in its Windows implementation (or an upstream library) and does not update the token, instead using the original one injected into the pod, which is now stale.
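If you want to verify this yourself, here is a minimal sketch in Go (illustrative only; the token path inside a Windows HostProcess container may resolve differently) that decodes the projected service account token and prints its issue and expiry times, making the hourly rotation visible:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// Default projection path for the service account token; run this from
	// inside the pod.
	raw, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}
	parts := strings.Split(strings.TrimSpace(string(raw)), ".")
	if len(parts) != 3 {
		panic("not a JWT")
	}
	// JWT payloads are base64url-encoded without padding.
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		panic(err)
	}
	var claims struct {
		Iat int64 `json:"iat"`
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		panic(err)
	}
	fmt.Printf("issued:  %s\nexpires: %s\n",
		time.Unix(claims.Iat, 0), time.Unix(claims.Exp, 0))
}
```

Rerunning this before and after the rotation window shows the file's contents changing, even while a process that read the token once at startup keeps using its stale copy.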
On the Windows pod, I am also getting this after recreating the pod:
Facing the same issue.
Thanks for highlighting this @ganastasiou14 @david-garcia-garcia. We're looking at this currently and will come back shortly with what we come up with.
The fix for this is rolling out now. The issue is that CNS on Windows compiles a custom, static kubeconfig at Pod startup (setkubeconfigpath.ps1). This was necessary to get a valid kubeconfig because client-go had hardcoded paths incompatible with HostProcess containers on Windows prior to ContainerD 1.7 (used in AKS <=1.27). The script runs at startup and never re-runs, so the token that exists at Pod start is the token CNS will try to use forever.

Kubernetes 1.30 introduced a change where service account tokens refresh every hour when OIDC is enabled; previously, and when OIDC is not enabled, they are valid for 1 year. No CNS Pod is expected to live that long due to Node updates, CNS patches, etc., so treating the token as immortal never presented an issue until 1.30 + OIDC. This is resolved by bypassing the startup script and using the Pod's InClusterConfig instead of a static kubeconfig when CNS is running on AKS >1.27.
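For reference, here is a minimal sketch of the distinction, assuming client-go (illustrative only, not the actual CNS change): rest.InClusterConfig re-reads the projected token file as kubelet rotates it, whereas a static kubeconfig bakes in whatever token existed when the file was written.

```go
package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// newKubeClient prefers the in-cluster config, which picks up rotated
// service account tokens automatically, and only falls back to a static
// kubeconfig when the in-cluster paths are unavailable.
func newKubeClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		// Fallback: a static kubeconfig (e.g. one generated at Pod startup).
		// Its embedded token is frozen at whatever value it held when the
		// file was written, and will eventually go stale.
		cfg, err = clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
		if err != nil {
			return nil, fmt.Errorf("building kubeconfig: %w", err)
		}
	}
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newKubeClient(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("kubernetes client created")
}
```

The fallback branch mirrors the old startup-script behavior; the fix is essentially to stop taking it on environments where the in-cluster paths work.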
The fix for this has been released.
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure |
Describe the bug
Using image: mcr.microsoft.com/containernetworking/azure-cns:v1.6.13
The Azure CNS pods that run on Windows nodes work for a limited amount of time, then lose their connection to the K8s API and start issuing "Failed to list" and "Failed to watch" errors:
To Reproduce
Set up a cluster with Pod Subnet and place pods and nodes in different subnets. Make sure you add a Windows node pool.
After several minutes of working, the CNS pods on the Windows nodes start to fail. I presume the impact of this is that networking is not updated when pods are rescheduled on the Windows nodes.
Expected behavior
No API authentication errors.
Screenshots
Environment (please complete the following information):
Additional context