Scaling AKS cluster puts nodes in NotReady state #274

Closed

Zimmergren opened this issue Mar 27, 2018 · 45 comments
Labels: Needs Information, SR-Support Request

Comments

@Zimmergren

I've got a set of clusters running in AKS. Recently I've seen that triggering a scaling operation on the cluster (on this sample cluster, from 1 to 2 nodes) causes the new node to end up in the NotReady state.

kubectl get nodes

NAME                       STATUS     ROLES     AGE       VERSION
aks-nodepool1-13497206-1   Ready      agent     17d       v1.9.2
aks-nodepool1-13497206-2   NotReady   agent     36m       v1.9.2

Additional info:

  • Cluster node size: A4 (also happening on another one I have with D5 and a few more nodes)
  • Location: West Europe
  • Reproducible: Happens intermittently.

Questions:

  • Are there any timeouts for a failing scaling operation?
  • Are there ways to say "Kill any failing or NotReady nodes, and re-try scaling"?
  • Are there any reliable self-healing mechanisms coupled with these operations?

40 minutes of semi-downtime due to a node scale that fails intermittently is going to make it hard to define any SLA ;)
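
For reference, the scale operation described here is the standard Azure CLI one; a minimal sketch, with the resource group and cluster name as placeholders:

# Scale the node pool from 1 to 2 nodes (placeholder names)
az aks scale --resource-group <resource group> --name <cluster name> --node-count 2

# Watch the new node register (or get stuck in NotReady)
kubectl get nodes --watch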

@mooperd

mooperd commented Mar 27, 2018

@Zimmergren Please could you show the output of kubectl describe node aks-nodepool1-13497206-2?

Maybe its linked to this: kubernetes/kubernetes#55867

You can kill the "not ready" virtual machines using the gui console or CLI. A new node will be provisioned.

@Zimmergren
Author

Hey @mooperd,

Sure thing.

Command:

λ kubectl describe node aks-nodepool1-13497206-2

Output:

Name:               aks-nodepool1-13497206-2
Roles:              agent
Labels:             agentpool=nodepool1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_A4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=westeurope
                    failure-domain.beta.kubernetes.io/zone=1
                    kubernetes.azure.com/cluster=MC_rg-weu-ac-aks2_cloudcluster2_westeurope
                    kubernetes.io/hostname=aks-nodepool1-13497206-2
                    kubernetes.io/role=agent
                    storageprofile=managed
                    storagetier=Standard_LRS
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 27 Mar 2018 09:08:44 +0200
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                     Message
  ----                 ------    -----------------                 ------------------                ------                     -------
  NetworkUnavailable   False     Tue, 27 Mar 2018 09:08:57 +0200   Tue, 27 Mar 2018 09:08:57 +0200   RouteCreated               RouteController created a route
  OutOfDisk            False     Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:08:44 +0200   KubeletHasSufficientDisk   kubelet has sufficient disk space available
  MemoryPressure       Unknown   Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:09:36 +0200   NodeStatusUnknown          Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:09:36 +0200   NodeStatusUnknown          Kubelet stopped posting node status.
  Ready                Unknown   Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:09:36 +0200   NodeStatusUnknown          Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.240.0.5
  Hostname:    aks-nodepool1-13497206-2
Capacity:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             8
 memory:                          14339648Ki
 pods:                            110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             8
 memory:                          14237248Ki
 pods:                            110
System Info:
 Machine ID:                 40beb5eb909e171860ceee669da56e1d
 System UUID:                3654B5A6-DEC4-4540-8E39-1CDD23AA265C
 Boot ID:                    ad8076b4-fb3e-4738-9f8b-9afa34cbc59d
 Kernel Version:             4.13.0-1007-azure
 OS Image:                   Debian GNU/Linux 9 (stretch)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.13.1
 Kubelet Version:            v1.9.2
 Kube-Proxy Version:         v1.9.2
PodCIDR:                     10.244.12.0/24
ExternalID:                  a6b55436-b4de-4045-8e39-1bdd23aa265b
Non-terminated Pods:         (3 in total)
  Namespace                  Name                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                       ------------  ----------  ---------------  -------------
  default                    omsagent-hwqbv             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-proxy-kbp4d           100m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-svc-redirect-v9tlw    0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  100m (1%)     0 (0%)      0 (0%)           0 (0%)
Events:         <none>

@mooperd

mooperd commented Mar 27, 2018

2/3 nodes in my 1.8.7 AKS cluster are currently broken.

Events:
  Type     Reason         Age                 From                               Message
  ----     ------         ----                ----                               -------
  Warning  ImageGCFailed  3m (x1270 over 4d)  kubelet, aks-nodepool1-34207704-1  failed to get image stats: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Events:
  Type     Reason                            Age                 From                               Message
  ----     ------                            ----                ----                               -------
  Warning  FailedNodeAllocatableEnforcement  5m (x2282 over 5d)  kubelet, aks-nodepool1-34207704-0  Failed to update Node Allocatable Limits "": failed to set supported cgroup subsystems for cgroup : Failed to set config for supported subsystems : failed to write 3585630208 to memory.limit_in_bytes: write /var/lib/docker/overlay2/b09109ad9b444cfd6ea70dba4a375c1548357ee2f1dd45191ea62931e1bf6aee/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: invalid argument

Full output here: https://gist.github.com/mooperd/c0bece0ba011a57d60030612de09a3f1
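
The ImageGCFailed / "connection is unavailable" gRPC errors above typically mean the kubelet can no longer reach the container runtime on that node. A minimal way to confirm, assuming SSH access to the agent VM (the exact commands are an assumption, not taken from the report):

# On the affected node:
systemctl status kubelet docker
journalctl -u kubelet --since "1 hour ago" | tail -n 100

# Restarting the runtime and kubelet sometimes brings the node back to Ready
sudo systemctl restart docker
sudo systemctl restart kubelet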

@mooperd

mooperd commented Mar 27, 2018

I'm using this command to prune dead nodes:

for i in $(kubectl get nodes | grep NotReady | awk '{print $1}'); do az vm delete --name $i --resource-group <resource group> -y & done

@mooperd

mooperd commented Mar 27, 2018

Similar issue: #256

@Zimmergren
Author

Thanks @mooperd for your comments.

I've also done the whole "delete nodes by myself" routine in the past, but I would really hope to avoid that and have it self-heal and deal with any such dead nodes.

Right now I'm automating that with some calls to the Azure Resource Manager APIs: I delete the VM if it's dead, then trigger a new scale-out.
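
A rough sketch of that kind of automation using the Azure CLI rather than raw ARM calls (the resource group names, cluster name, and node count are placeholders):

# Delete the VMs backing NotReady nodes, then scale the pool back out
for node in $(kubectl get nodes | grep NotReady | awk '{print $1}'); do
  az vm delete --name "$node" --resource-group <node resource group> --yes
done
az aks scale --resource-group <cluster resource group> --name <cluster name> --node-count 2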

Let's see what comes out of MS on this topic; I'm seeing lots of people in my community experiencing the same thing.

@rfum

rfum commented Apr 7, 2018

Same issue here. I've looked at the node logs too and nothing seems suspicious; only the SSH connection is slower than before.

  • kubelet seems alive
  • Disk usage seems fine
  • Memory usage also seems fine
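
The three checks above correspond roughly to the following commands run on the node over SSH (a sketch; the exact commands are an assumption, not taken from the report):

systemctl is-active kubelet                 # kubelet seems alive
df -h /var/lib/docker /var/lib/kubelet      # disk usage seems fine
free -m                                     # memory usage also seems fine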

@rfum

rfum commented Apr 7, 2018

Same issue as #102 here; you can find my logs in my comment there.

@gkli

gkli commented May 8, 2018

+1, same issue, can repro 100% of the time: create a simple deployment, set the replica count to a small number (5?), and once it's started and working, change the replica count to a significantly larger number (100+) -- watch the system fail. Kubernetes version: 1.9.6
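
That repro translates roughly into the following kubectl commands (the deployment name and image are placeholders):

kubectl create deployment scale-test --image=nginx
kubectl scale deployment scale-test --replicas=5
# wait until all 5 pods are Running, then:
kubectl scale deployment scale-test --replicas=100
kubectl get nodes -w   # watch for nodes flipping to NotReady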

@celamb4

celamb4 commented May 11, 2018

Reproduced #102 with 1.9.6, then scaled up and ran into this one.

@VincentSurelle

Same issue here. 100% reproducibility

@oportingale

Hi, I am also experiencing this on a freshly built cluster Kubernetes version 1.10.5.
Started with 3 nodes and configured autoscale with a maximum of 10 nodes. I increased the replicas of my deployment and the nodes scaled up to 5. I then reduced the replicas, but the scale-down didn't work.
I ended up manually scaling back down, so the VMs have been deleted in the resource group; however, kubectl now lists nodes that don't even exist!
aks-agentpool-10477987-0 Ready agent 21h v1.10.5
aks-agentpool-10477987-1 NotReady agent 19h v1.10.5
aks-agentpool-10477987-2 Ready agent 19h v1.10.5
aks-agentpool-10477987-3 Ready agent 18h v1.10.5
aks-agentpool-10477987-4 NotReady agent 17h v1.10.5
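
For the stale entries whose backing VM is already gone, the leftover Node objects can be removed from the API server by hand; a minimal sketch using the names from the listing above:

kubectl delete node aks-agentpool-10477987-1 aks-agentpool-10477987-4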

Is there any update on this issue?

Thanks

Olly

@dragonsappire

I reproduced it several times when trying to scale up to more pods with the nodes in Ready state. I double-checked the nodes; they were all Ready, and after deleting the failing pod deployment (in the Kubernetes dashboard), the deployment continued to run successfully.

@celamb4

celamb4 commented Nov 13, 2018

Can't believe this is still open. This is a critical issue.

@agolomoodysaada
Contributor

It's unacceptable for a GA product

@encircled

Same here, k8s version 1.11.4

@mschuurmans

Any updates on this? Having the same issue on k8s version 1.11.5.

@sonttran

sonttran commented Mar 5, 2019

Any updates on this? Having the same issue on k8s v1.11.6-gke.2.

@jnoller
Contributor

jnoller commented Mar 6, 2019

@sonttran Please open an Azure technical support ticket - this issue could have one of many causes, and we cannot ask for the needed subscription details / information on GitHub. Nodes going into NotReady can be triggered by so many different things that we will need to investigate the specific cluster(s) you have.

@agolomoodysaada
Contributor

@jnoller but... it's an issue on all clusters!

@DimaMTR

DimaMTR commented Mar 17, 2019

Any updates on the issue? Having the same error during autoscale of a node pool in GKE on version 1.12.5-gke.10.

@Zimmergren
Author

Are there any further plans to investigate this on a wider scale than per-cluster, @jnoller? It looks like we can reproduce it on every cluster from dev to prod, and by the looks of comments in this thread, a lot of folks have these issues.

I can also give huge kudos and 👍 to Azure support, as they've been able to fix this every time for me. However, it usually takes quite some time before we get to the point of having it resolved (time zones, support response-time SLAs, the same e-mail conversations in every ticket before reaching the conclusion that something has to be restarted on the Azure backend side). If there's anything I can do, submit more details, troubleshooting cases and logs; just let me know 💪

@jnoller
Contributor

jnoller commented Mar 18, 2019

@Zimmergren Currently, this issue is a catch-all for similar bad behavior - we do not have widespread reports of this, and based on the support data (which I audit and read) this issue is still cluster specific. Custom VNETs, port blocking on the NSGs, and more can contribute to it (and, as I discovered last week, custom admission controllers).

Please file a support request, link to this issue in the ticket and ask for it to be escalated as needed to on-call engineering so that we can dig into it and identify the actual root cause versus the symptom.
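
For anyone self-diagnosing before filing a ticket, the NSG angle mentioned above can be checked with the Azure CLI; a sketch, with the node resource group and NSG name as placeholders, and only one of the possible causes listed:

az network nsg list --resource-group <node resource group> -o table
az network nsg rule list --resource-group <node resource group> --nsg-name <nsg name> -o table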

@Azure Azure deleted a comment from mooperd Mar 18, 2019
@Zimmergren
Author

Thanks @jnoller , I appreciate your quick responses here 💪
I'll go ahead and submit tickets the next time it happens, and we'll explore it from there - thanks!

@Flodu31

Flodu31 commented Jul 24, 2020

Hello,

I have the same issue, on AKS 1.16.10...

Any update on this?
Thanks.
Florent

@rwerlang

Hello, what is the status on this issue?
Just had the same problem here on AKS 1.16.10.

There are no logs to explain what happened.
This is critical, since it is happening in a production environment.

@palma21
Member

palma21 commented Jul 27, 2020

Hi folks, apologies that this issue has seemingly gone without any response. The problem is that the reports here are not consistent with one problem but likely with several small isolated problems, and the conflation of issues is making them hard to solve/pinpoint.

I've seen a few folks that were actually describing the by-design experience of upgrade: https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster#upgrade-an-aks-cluster
So just to frame the discussion outlining the process:

With a list of available versions for your AKS cluster, use the az aks upgrade command to upgrade. During the upgrade process, AKS adds a new node to the cluster that runs the specified Kubernetes version, then carefully cordons and drains one of the old nodes to minimize disruption to running applications. When the new node is confirmed as running application pods, the old node is deleted. This process repeats until all nodes in the cluster have been upgraded.

This means that during an upgrade you will see a lot of nodes flipping between Ready and NotReady, and that is normal, since we'll always have n+1 nodes in the Ready state for your applications.
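
For reference, that upgrade flow looks roughly like this from the CLI (the resource group, cluster name, and target version are placeholders):

az aks get-upgrades --resource-group <resource group> --name <cluster name> -o table
az aks upgrade --resource-group <resource group> --name <cluster name> --kubernetes-version <version>
kubectl get nodes -w   # nodes cycle through Ready/NotReady as they are replaced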

Nonetheless, I've seen a few folks describing behaviors that are not expected. If you believe what you see is not expected, please open a support ticket. Feel free to paste the ticket number here once it's open and I'm happy to track it to conclusion and pinpoint your issue.

@palma21 palma21 added the SR-Support Request Support Request has been required/made label Jul 27, 2020
@ghost

ghost commented Jul 27, 2020

Hi there 👋 AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.

Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.

Please do mention this issue in the case description so our teams can coordinate to help you.

Thank you!

@ghost

ghost commented Aug 8, 2020

Action required

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Aug 8, 2020
@palma21 palma21 removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Aug 8, 2020
@ghost ghost added the stale Stale issue label Sep 25, 2020
@ghost

ghost commented Sep 25, 2020

Case being worked with Microsoft Support, adding stale label for automatic closure if no other reports are added.

@andig

andig commented Sep 25, 2020

For the record: in my case we were working on this with support until I finally gave up using AKS. If a root cause is ever detected it would be great for MSFT/Azure to let us know here.

@ghost ghost removed the stale Stale issue label Sep 25, 2020
@palma21
Member

palma21 commented Sep 25, 2020

Happy to take a look at your support case (do you have a number?) and a cluster where you see this today. Unfortunately this ticket is quite old (opened with 1.9) and mixes upstream issues with possible older AKS issues.
Nodes not ready can happen for a myriad of reasons, a lot of them actually expected, so it's better to look at this case by case, and I'm happy to assist with that.

@palma21 palma21 added the stale Stale issue label Sep 25, 2020
@andig

andig commented Sep 25, 2020

Thanks @palma21 for stepping in. My ticket is older (2 years) so it's not worth chasing today. The unresolved issue was that the running nodes became unresponsive during cluster scale-up, effectively killing the running workloads. Probably not relevant any longer. Unfortunately I had to give up as the cluster was effectively useless and the ticket remained unresolved after weeks of investigation.

@ghost ghost removed the stale Stale issue label Sep 25, 2020
@palma21 palma21 added the stale Stale issue label Sep 25, 2020
@ghost ghost removed the stale Stale issue label Sep 26, 2020
@ghost ghost added the stale Stale issue label Oct 3, 2020
@ghost

ghost commented Oct 3, 2020

Case being worked with Microsoft Support, adding stale label for automatic closure if no other reports are added.

@Azure Azure deleted a comment Oct 26, 2020
@ghost ghost removed the stale Stale issue label Oct 26, 2020
@Azure Azure deleted a comment Oct 26, 2020
@Azure Azure deleted a comment Oct 26, 2020
@palma21
Member

palma21 commented Oct 26, 2020

Closing as stale.

@palma21 palma21 closed this as completed Oct 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Nov 25, 2020