[BUG] Azure CNS K8S api errors on windows nodes: failed to watch and failed to list #4679

Open
david-garcia-garcia opened this issue Dec 3, 2024 · 8 comments
Labels: action-required, bug, Needs Attention 👋

Comments

@david-garcia-garcia

Describe the bug

Using image: mcr.microsoft.com/containernetworking/azure-cns:v1.6.13

The Azure CNS pods that run on Windows nodes work for a limited amount of time, then lose their connection to the Kubernetes API and start emitting "failed to list" and "failed to watch" errors:

│ {"level":"info","ts":"2024-12-03T06:51:32.658Z","caller":"v2/monitor.go:127","msg":"NNC already at target IPs, no scaling required","component":"ipam-pool-monitor"}                                          │
│ W1203 06:51:52.491443   20992 reflector.go:547] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: failed to list *v1.Pod: Unauthorized                                                           │
│ E1203 06:51:52.491443   20992 reflector.go:150] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized                                  │
│ W1203 06:52:07.369414   20992 reflector.go:547] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: failed to list *v1alpha.NodeNetworkConfig: Unauthorized                                        │
│ E1203 06:52:07.369414   20992 reflector.go:150] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: Failed to watch *v1alpha.NodeNetworkConfig: failed to list *v1alpha.NodeNetworkConfig: Unautho │
│ W1203 06:52:28.044035   20992 reflector.go:547] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: failed to list *v1.Pod: Unauthorized                                                           │
│ E1203 06:52:28.044035   20992 reflector.go:150] pkg/mod/k8s.io/client-go@v0.30.5/tools/cache/reflector.go:232: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized                                  │
│ {"level":"info","ts":"2024-12-03T06:52:32.659Z","caller":"v2/monitor.go:124","msg":"calculated new request","component":"ipam-pool-monitor","demand":8,"batch":16,"max":50,"buffer":0.5,"target":16}          │
│ {"level":"info","ts":"2024-12-03T06:52:32.659Z","caller":"v2/monitor.go:127","msg":"NNC already at target IPs, no scaling required","component":"ipam-pool-monitor"} 

To Reproduce
Set up a cluster with Pod Subnet networking and place pods and nodes in different subnets. Make sure to add a Windows node pool.

After working for several minutes, the CNS pods on the Windows nodes start to fail. I presume the impact is that networking is not updated when pods are rescheduled onto the Windows nodes.

Expected behavior
No API authentication errors.

Screenshots

Environment (please complete the following information):

  • Kubernetes version 1.30.5

Additional context

@david-garcia-garcia
Author

Running this script:

if (-not("dummy" -as [type])) {
    add-type -TypeDefinition @"
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public static class Dummy {
    public static bool ReturnTrue(object sender,
        X509Certificate certificate,
        X509Chain chain,
        SslPolicyErrors sslPolicyErrors) { return true; }

    public static RemoteCertificateValidationCallback GetDelegate() {
        return new RemoteCertificateValidationCallback(Dummy.ReturnTrue);
    }
}
"@
}

[System.Net.ServicePointManager]::ServerCertificateValidationCallback = [dummy]::GetDelegate()


$Token = Get-Content -Path "C:\var\run\secrets\kubernetes.io\serviceaccount\token"

Invoke-RestMethod -Uri "https://$env:KUBERNETES_SERVICE_HOST/apis/acn.azure.com/v1alpha/namespaces/kube-system/nodenetworkconfigs" -Headers @{Authorization = "Bearer $Token"} -Method Get

I am getting:

apiVersion            items
----------            -----
acn.azure.com/v1alpha {@{apiVersion=acn.azure.com/v1alpha; kind=NodeNetworkConfig; metadata=; spec...

I see the service account token is configured to rotate approximately every hour:

  - name: kube-api-access-lj2hx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt

My impression is that the service account token is not being renewed in the pods residing on the Windows nodes, so the issue is probably not in the CNI itself.

@david-garcia-garcia
Author

david-garcia-garcia commented Dec 3, 2024

I can confirm that when the pod starts failing to authenticate, the token inside it has been properly renewed. My first thought is that, for whatever reason, the CNS pod has a quirk in its Windows implementation (or in an upstream library) and does not reload the token, so it keeps using the original one injected at pod start, which is now stale.
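
For reference, a minimal Go sketch (hypothetical, not part of CNS) of the same check: it decodes the iat/exp claims of the projected token file from inside the HostProcess container.

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// Path of the projected service account token inside the Windows HostProcess container.
	raw, err := os.ReadFile(`C:\var\run\secrets\kubernetes.io\serviceaccount\token`)
	if err != nil {
		panic(err)
	}

	// A service account token is a JWT: header.payload.signature, base64url-encoded.
	parts := strings.Split(strings.TrimSpace(string(raw)), ".")
	if len(parts) != 3 {
		panic("unexpected token format")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		panic(err)
	}

	var claims struct {
		Iat int64 `json:"iat"`
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		panic(err)
	}

	// If the kubelet is rotating the projected token, iat/exp advance roughly every hour.
	fmt.Printf("issued at: %s\nexpires:   %s\n",
		time.Unix(claims.Iat, 0).UTC(), time.Unix(claims.Exp, 0).UTC())
}

If the iat/exp values keep advancing while the pod keeps logging Unauthorized, the file on disk is being rotated and the stale token is being held in the process, as described above.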

@david-garcia-garcia
Author

On the Windows pod I am also getting this after recreating the pod:

{"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:34.959Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}                                                                        │
│ {"level":"info","ts":"2024-12-04T08:40:37.519Z","caller":"pod/reconciler.go:68","msg":"rate limit exceeded","component":"pod-watcher"}

@ganastasiou14

Facing the same issue
cc @rbtr

@chasewilson
Contributor

Thanks for highlighting this, @ganastasiou14 @david-garcia-garcia. We're looking into this currently and will come back shortly with what we find.

@rbtr

rbtr commented Dec 13, 2024

The fix for this is rolling out now.

The issue is that CNS on Windows compiles a custom, static kubeconfig at Pod startup (setkubeconfigpath.ps1). This was necessary to get a valid kubeconfig because client-go had hardcoded paths that were incompatible with HostProcess Containers on Windows prior to containerd 1.7 (used in AKS <= 1.27).

The script runs at startup and never re-runs, so the token that exists at Pod start is the token CNS will try to use forever.

Kubernetes 1.30 has a change where service account tokens refresh every hour when OIDC is enabled. Previously, and when OIDC is not enabled, they are valid for 1 year. No CNS Pod is expected to live that long due to Node updates, CNS patches, etc., so treating the token as immortal never presented an issue until 1.30+OIDC.

This is resolved by bypassing the startup script and using the Pod InClusterConfig instead of a static kubeconfig when CNS is running on AKS >1.27.
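
For illustration, a minimal client-go sketch (not the actual CNS code) of the in-cluster approach: rest.InClusterConfig points BearerTokenFile at the projected service account token, so the client re-reads it from disk as the kubelet rotates it, instead of pinning the value that existed at Pod start.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// InClusterConfig discovers the API server from KUBERNETES_SERVICE_HOST/PORT
	// and sets BearerTokenFile to the projected service account token, so the
	// transport reloads the token as it is rotated rather than keeping the
	// value read at pod start (the failure mode of a static kubeconfig).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The same kind of list call that the CNS reflectors were failing on.
	pods, err := client.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{Limit: 5})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods\n", len(pods.Items))
}

A static kubeconfig keeps working only as long as its embedded token is valid, which is why the failures start roughly an hour after Pod start once 1.30+OIDC shortens the token lifetime.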

microsoft-github-policy-service bot added the action-required and Needs Attention 👋 labels on Jan 8, 2025
@rbtr

rbtr commented Jan 13, 2025

The fix for this has been released.


Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

microsoft-github-policy-service bot added the Needs Attention 👋 label on Feb 13, 2025