[BUG] Azure CNS K8S api errors on windows nodes: failed to watch and failed to list #4679

Open
david-garcia-garcia opened this issue Dec 3, 2024 · 8 comments
Labels: action-required, bug, Needs Attention 👋

Comments

@david-garcia-garcia

Describe the bug

Using image: mcr.microsoft.com/containernetworking/azure-cns:v1.6.13

The Azure CNS pods that run on Windows nodes work for a limited amount of time, then lose their connection to the Kubernetes API and start emitting "failed to list" and "failed to watch" errors:

│ {"level":"info","ts":"2024-12-03T06:51:32.658Z","caller":"v2/monitor.go:127","msg":"NNC already at target IPs, no scaling required","component":"ipam-pool-monitor"}                                          │
│ W1203 06:51:52.491443   20992 reflector.go:547] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: failed to list *v1.Pod: Unauthorized                                                           │
│ E1203 06:51:52.491443   20992 reflector.go:150] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized                                  │
│ W1203 06:52:07.369414   20992 reflector.go:547] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: failed to list *v1alpha.NodeNetworkConfig: Unauthorized                                        │
│ E1203 06:52:07.369414   20992 reflector.go:150] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: Failed to watch *v1alpha.NodeNetworkConfig: failed to list *v1alpha.NodeNetworkConfig: Unautho │
│ W1203 06:52:28.044035   20992 reflector.go:547] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: failed to list *v1.Pod: Unauthorized                                                           │
│ E1203 06:52:28.044035   20992 reflector.go:150] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized                                  │
│ {"level":"info","ts":"2024-12-03T06:52:32.659Z","caller":"v2/monitor.go:124","msg":"calculated new request","component":"ipam-pool-monitor","demand":8,"batch":16,"max":50,"buffer":0.5,"target":16}          │
│ {"level":"info","ts":"2024-12-03T06:52:32.659Z","caller":"v2/monitor.go:127","msg":"NNC already at target IPs, no scaling required","component":"ipam-pool-monitor"} 

To Reproduce
Set up a cluster with Pod Subnet networking and place pods and nodes in different subnets. Make sure to add a Windows node pool.

After working for several minutes, the CNS pods on the Windows nodes start to fail. I presume the impact is that networking is not updated when pods are rescheduled onto the Windows nodes.

Expected behavior
No API authentication errors.

Screenshots

Environment (please complete the following information):

  • Kubernetes version 1.30.5

Additional context

@david-garcia-garcia
Author

Running this script:

if (-not("dummy" -as [type])) {
    add-type -TypeDefinition @"
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public static class Dummy {
    public static bool ReturnTrue(object sender,
        X509Certificate certificate,
        X509Chain chain,
        SslPolicyErrors sslPolicyErrors) { return true; }

    public static RemoteCertificateValidationCallback GetDelegate() {
        return new RemoteCertificateValidationCallback(Dummy.ReturnTrue);
    }
}
"@
}

[System.Net.ServicePointManager]::ServerCertificateValidationCallback = [dummy]::GetDelegate()


$Token = Get-Content -Path "C:\var\run\secrets\kubernetes.io\serviceaccount\token"

Invoke-RestMethod -Uri "https://$env:KUBERNETES_SERVICE_HOST/apis/acn.azure.com/v1alpha/namespaces/kube-system/nodenetworkconfigs" -Headers @{Authorization = "Bearer $Token"} -Method Get

I am getting:

apiVersion            items
----------            -----
acn.azure.com/v1alpha {@{apiVersion=acn.azure.com/v1alpha; kind=NodeNetworkConfig; metadata=; spec...

I see the service account token is configured to rotate approximately every hour:

  - name: kube-api-access-lj2hx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt

My impression is that the service account token is not being renewed in the pods residing on the Windows nodes, so the issue is probably not in the CNI itself.

@david-garcia-garcia
Author

david-garcia-garcia commented Dec 3, 2024

I can confirm that when the pod starts failing to authenticate, the token inside it has been properly renewed. My first thought is that, for whatever reason, the CNS pod has a quirk in its Windows implementation (or in an upstream library) and does not reload the token, so it keeps using the original one injected at pod start, which is now stale.
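
For reference, a minimal Go sketch (hypothetical, not part of CNS) of the same check: it decodes the iat/exp claims of the projected token file from inside the HostProcess container.

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// Path of the projected service account token inside the Windows HostProcess container.
	raw, err := os.ReadFile(`C:\var\run\secrets\kubernetes.io\serviceaccount\token`)
	if err != nil {
		panic(err)
	}

	// A service account token is a JWT: header.payload.signature, base64url-encoded.
	parts := strings.Split(strings.TrimSpace(string(raw)), ".")
	if len(parts) != 3 {
		panic("unexpected token format")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		panic(err)
	}

	var claims struct {
		Iat int64 `json:"iat"`
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		panic(err)
	}

	// If the kubelet is rotating the projected token, iat/exp advance roughly every hour.
	fmt.Printf("issued at: %s\nexpires:   %s\n",
		time.Unix(claims.Iat, 0).UTC(), time.Unix(claims.Exp, 0).UTC())
}

If the iat/exp values keep advancing while the pod keeps logging Unauthorized, the file on disk is being rotated and the stale token is being held in the process, as described above.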

@david-garcia-garcia
Author

On the Windows pod I am also getting this after recreating the pod:

{"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:37.519Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}

@ganastasiou14

Facing the same issue
cc @rbtr

@chasewilson
Contributor

Thanks for highlighting this, @ganastasiou14 @david-garcia-garcia. We're looking into this currently and will come back shortly with what we find.

@rbtr

rbtr commented Dec 13, 2024

The fix for this is rolling out now.

The issue is that CNS on Windows compiles a custom, static kubeconfig at Pod startup (setkubeconfigpath.ps1). This was necessary to get a valid kubeconfig because client-go had hardcoded paths that were incompatible with HostProcess Containers on Windows prior to containerd 1.7 (used in AKS <= 1.27).

The script runs at startup and never re-runs, so the token that exists at Pod start is the token CNS will try to use forever.

Kubernetes 1.30 has a change where service account tokens refresh every hour when OIDC is enabled. Previously, and when OIDC is not enabled, they are valid for 1 year. No CNS Pod is expected to live that long due to Node updates, CNS patches, etc., so treating the token as immortal never presented an issue until 1.30+OIDC.

This is resolved by bypassing the startup script and using the Pod InClusterConfig instead of a static kubeconfig when CNS is running on AKS >1.27.
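
For illustration, a minimal client-go sketch (not the actual CNS code) of the in-cluster approach: rest.InClusterConfig points BearerTokenFile at the projected service account token, so the client re-reads it from disk as the kubelet rotates it, instead of pinning the value that existed at Pod start.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// InClusterConfig discovers the API server from KUBERNETES_SERVICE_HOST/PORT
	// and sets BearerTokenFile to the projected service account token, so the
	// transport reloads the token as it is rotated rather than keeping the
	// value read at pod start (the failure mode of a static kubeconfig).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The same kind of list call that the CNS reflectors were failing on.
	pods, err := client.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{Limit: 5})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods\n", len(pods.Items))
}

A static kubeconfig keeps working only as long as its embedded token is valid, which is why the failures start roughly an hour after Pod start once 1.30+OIDC shortens the token lifetime.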

microsoft-github-policy-service bot added the action-required and Needs Attention 👋 labels on Jan 8, 2025
@rbtr

rbtr commented Jan 13, 2025

The fix for this has been released.


Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

microsoft-github-policy-service bot added the Needs Attention 👋 label on Feb 13, 2025