Scaling AKS cluster puts nodes in NotReady state #274

Closed

Zimmergren opened this issue Mar 27, 2018 · 45 comments
Labels: Needs Information, SR-Support Request

Comments

@Zimmergren

I've got a set of clusters running in AKS. Recently I've seen that triggering a scaling operation on the cluster (on this sample cluster, from 1 to 2 nodes) causes the new node to end up in the NotReady state.

kubectl get nodes

NAME                       STATUS     ROLES     AGE       VERSION
aks-nodepool1-13497206-1   Ready      agent     17d       v1.9.2
aks-nodepool1-13497206-2   NotReady   agent     36m       v1.9.2

Additional info:

  • Cluster node size: A4 (also happening on another one I have with D5 and a few more nodes)
  • Location: West Europe
  • Reproducible: Happens intermittently.

Questions:

  • Are there any timeouts for a failing scaling operation?
  • Are there ways to say "Kill any failing or NotReady nodes, and re-try scaling"?
  • Are there any reliable self-healing mechanisms coupled with these operations?

40 minutes of semi-downtime due to a node scale that fails intermittently is going to make it hard to define any SLA ;)
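
For reference, the scale operation described here is the standard Azure CLI one; a minimal sketch, with the resource group and cluster name as placeholders:

# Scale the node pool from 1 to 2 nodes (placeholder names)
az aks scale --resource-group <resource group> --name <cluster name> --node-count 2

# Watch the new node register (or get stuck in NotReady)
kubectl get nodes --watch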

@mooperd

mooperd commented Mar 27, 2018

@Zimmergren Please could you show the output of kubectl describe node aks-nodepool1-13497206-2?

Maybe its linked to this: kubernetes/kubernetes#55867

You can kill the "not ready" virtual machines using the gui console or CLI. A new node will be provisioned.

@Zimmergren
Author

Hey @mooperd,

Sure thing.

Command:

λ kubectl describe node aks-nodepool1-13497206-2

Output:

Name:               aks-nodepool1-13497206-2
Roles:              agent
Labels:             agentpool=nodepool1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_A4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=westeurope
                    failure-domain.beta.kubernetes.io/zone=1
                    kubernetes.azure.com/cluster=MC_rg-weu-ac-aks2_cloudcluster2_westeurope
                    kubernetes.io/hostname=aks-nodepool1-13497206-2
                    kubernetes.io/role=agent
                    storageprofile=managed
                    storagetier=Standard_LRS
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 27 Mar 2018 09:08:44 +0200
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                     Message
  ----                 ------    -----------------                 ------------------                ------                     -------
  NetworkUnavailable   False     Tue, 27 Mar 2018 09:08:57 +0200   Tue, 27 Mar 2018 09:08:57 +0200   RouteCreated               RouteController created a route
  OutOfDisk            False     Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:08:44 +0200   KubeletHasSufficientDisk   kubelet has sufficient disk space available
  MemoryPressure       Unknown   Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:09:36 +0200   NodeStatusUnknown          Kubelet stopped posting node status.
  DiskPressure         Unknown   Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:09:36 +0200   NodeStatusUnknown          Kubelet stopped posting node status.
  Ready                Unknown   Tue, 27 Mar 2018 09:08:54 +0200   Tue, 27 Mar 2018 09:09:36 +0200   NodeStatusUnknown          Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.240.0.5
  Hostname:    aks-nodepool1-13497206-2
Capacity:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             8
 memory:                          14339648Ki
 pods:                            110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             8
 memory:                          14237248Ki
 pods:                            110
System Info:
 Machine ID:                 40beb5eb909e171860ceee669da56e1d
 System UUID:                3654B5A6-DEC4-4540-8E39-1CDD23AA265C
 Boot ID:                    ad8076b4-fb3e-4738-9f8b-9afa34cbc59d
 Kernel Version:             4.13.0-1007-azure
 OS Image:                   Debian GNU/Linux 9 (stretch)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.13.1
 Kubelet Version:            v1.9.2
 Kube-Proxy Version:         v1.9.2
PodCIDR:                     10.244.12.0/24
ExternalID:                  a6b55436-b4de-4045-8e39-1bdd23aa265b
Non-terminated Pods:         (3 in total)
  Namespace                  Name                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                       ------------  ----------  ---------------  -------------
  default                    omsagent-hwqbv             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-proxy-kbp4d           100m (1%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                kube-svc-redirect-v9tlw    0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  100m (1%)     0 (0%)      0 (0%)           0 (0%)
Events:         <none>

@mooperd

mooperd commented Mar 27, 2018

2/3 nodes in my 1.8.7 AKS cluster are currently broken.

Events:
  Type     Reason         Age                 From                               Message
  ----     ------         ----                ----                               -------
  Warning  ImageGCFailed  3m (x1270 over 4d)  kubelet, aks-nodepool1-34207704-1  failed to get image stats: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Events:
  Type     Reason                            Age                 From                               Message
  ----     ------                            ----                ----                               -------
  Warning  FailedNodeAllocatableEnforcement  5m (x2282 over 5d)  kubelet, aks-nodepool1-34207704-0  Failed to update Node Allocatable Limits "": failed to set supported cgroup subsystems for cgroup : Failed to set config for supported subsystems : failed to write 3585630208 to memory.limit_in_bytes: write /var/lib/docker/overlay2/b09109ad9b444cfd6ea70dba4a375c1548357ee2f1dd45191ea62931e1bf6aee/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: invalid argument

Full output here: https://gist.github.com/mooperd/c0bece0ba011a57d60030612de09a3f1
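
The ImageGCFailed / "connection is unavailable" gRPC errors above typically mean the kubelet can no longer reach the container runtime on that node. A minimal way to confirm, assuming SSH access to the agent VM (the exact commands are an assumption, not taken from the report):

# On the affected node:
systemctl status kubelet docker
journalctl -u kubelet --since "1 hour ago" | tail -n 100

# Restarting the runtime and kubelet sometimes brings the node back to Ready
sudo systemctl restart docker
sudo systemctl restart kubelet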

@mooperd

mooperd commented Mar 27, 2018

I'm using this command to prune dead nodes:

for i in $(kubectl get nodes | grep NotReady | awk '{print $1}'); do az vm delete --name $i --resource-group <resource group> -y & done

@mooperd

mooperd commented Mar 27, 2018

Similar issue: #256

@Zimmergren
Author

Thanks @mooperd for your comments.

I've also done the whole "delete nodes by myself" routine in the past, but I would really hope to avoid that and have it self-heal and deal with any such dead nodes.

Right now I'm automating that with some calls to the Azure Resource Manager APIs: I delete the VM if it's dead, then trigger a new scale-out.
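
A rough sketch of that kind of automation using the Azure CLI rather than raw ARM calls (the resource group names, cluster name, and node count are placeholders):

# Delete the VMs backing NotReady nodes, then scale the pool back out
for node in $(kubectl get nodes | grep NotReady | awk '{print $1}'); do
  az vm delete --name "$node" --resource-group <node resource group> --yes
done
az aks scale --resource-group <cluster resource group> --name <cluster name> --node-count 2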

Let's see what comes out of MS on this topic; I'm seeing lots of people in my community experiencing the same thing.

@rfum

rfum commented Apr 7, 2018

Same issue here. I've looked at the node logs too and nothing seems suspicious; only the SSH connection is slower than before.

  • kubelet seems alive
  • Disk usage seems fine
  • Memory usage also seems fine
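
The three checks above correspond roughly to the following commands run on the node over SSH (a sketch; the exact commands are an assumption, not taken from the report):

systemctl is-active kubelet                 # kubelet seems alive
df -h /var/lib/docker /var/lib/kubelet      # disk usage seems fine
free -m                                     # memory usage also seems fine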

@rfum

rfum commented Apr 7, 2018

Same issue as #102 here; you can find my logs in my comment there.

@gkli

gkli commented May 8, 2018

+1, same issue, can repro 100% of the time: create a simple deployment, set the replica count to a small number (5?), and once it's started and working, change the replica count to a significantly larger number (100+) -- watch the system fail. Kubernetes version: 1.9.6
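
That repro translates roughly into the following kubectl commands (the deployment name and image are placeholders):

kubectl create deployment scale-test --image=nginx
kubectl scale deployment scale-test --replicas=5
# wait until all 5 pods are Running, then:
kubectl scale deployment scale-test --replicas=100
kubectl get nodes -w   # watch for nodes flipping to NotReady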

@celamb4

celamb4 commented May 11, 2018

Reproduced #102 with 1.9.6, then scaled up and ran into this one.

@VincentSurelle

Same issue here. 100% reproducibility

@oportingale

Hi, I am also experiencing this on a freshly built cluster Kubernetes version 1.10.5.
Started with 3 nodes and configured autoscale with a maximum of 10 nodes. I increased the replicas of my deployment and the nodes scaled up to 5. I then reduced the replicas, but the scale-down didn't work.
I ended up manually scaling back down, so the VMs have been deleted in the resource group; however, kubectl now lists nodes that don't even exist!
aks-agentpool-10477987-0 Ready agent 21h v1.10.5
aks-agentpool-10477987-1 NotReady agent 19h v1.10.5
aks-agentpool-10477987-2 Ready agent 19h v1.10.5
aks-agentpool-10477987-3 Ready agent 18h v1.10.5
aks-agentpool-10477987-4 NotReady agent 17h v1.10.5
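
For the stale entries whose backing VM is already gone, the leftover Node objects can be removed from the API server by hand; a minimal sketch using the names from the listing above:

kubectl delete node aks-agentpool-10477987-1 aks-agentpool-10477987-4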

Is there any update on this issue?

Thanks

Olly

@dragonsappire

I reproduced it several times when trying to scale up to more pods with the nodes in Ready state. I double-checked the nodes; they were all Ready, and after deleting the failing pod deployment (in the Kubernetes dashboard), the deployment continued to run successfully.

@celamb4

celamb4 commented Nov 13, 2018

Can't believe this is still open. This is a critical issue.

@agolomoodysaada
Contributor

It's unacceptable for a GA product

@encircled

Same here, k8s version 1.11.4

@mschuurmans

Any updates on this? Having the same issue on k8s version 1.11.5.

@sonttran

sonttran commented Mar 5, 2019

Any updates on this? Having the same issue on k8s v1.11.6-gke.2.

@jnoller
Contributor

jnoller commented Mar 6, 2019

@sonttran Please open an Azure technical support ticket - this issue could have one of many causes, and we cannot ask for the needed subscription details / information on GitHub. Nodes going into NotReady can be triggered by so many different things that we will need to investigate the specific cluster(s) you have.

@agolomoodysaada
Contributor

@jnoller but... it's an issue on all clusters!

@DimaMTR

DimaMTR commented Mar 17, 2019

Any updates on the issue? Having the same error during autoscale of a node pool in GKE on version 1.12.5-gke.10.

@Zimmergren
Author

Are there any further plans to investigate this on a wider scale than per-cluster, @jnoller? It looks like we can reproduce it on every cluster from dev to prod, and by the looks of comments in this thread, a lot of folks have these issues.

I can also give huge kudos and 👍 to Azure support, as they've been able to fix this every time for me. However, it usually takes quite some time before we get to the point of having it resolved (time zones, support response-time SLAs, the same e-mail conversations in every ticket before reaching the conclusion that something has to be restarted on the Azure backend side). If there's anything I can do, submit more details, troubleshooting cases and logs; just let me know 💪

@jnoller
Contributor

jnoller commented Mar 18, 2019

@Zimmergren Currently, this issue is a catch-all for similar bad behavior - we do not have widespread reports of this, and based on the support data (which I audit and read) this issue is still cluster specific. Custom VNETs, port blocking on the NSGs, and more can contribute to it (and, as I discovered last week, custom admission controllers).

Please file a support request, link to this issue in the ticket and ask for it to be escalated as needed to on-call engineering so that we can dig into it and identify the actual root cause versus the symptom.
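
For anyone self-diagnosing before filing a ticket, the NSG angle mentioned above can be checked with the Azure CLI; a sketch, with the node resource group and NSG name as placeholders, and only one of the possible causes listed:

az network nsg list --resource-group <node resource group> -o table
az network nsg rule list --resource-group <node resource group> --nsg-name <nsg name> -o table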

@Azure Azure deleted a comment from mooperd Mar 18, 2019
@Zimmergren
Author

Thanks @jnoller , I appreciate your quick responses here 💪
I'll go ahead and submit tickets the next time it happens, and we'll explore it from there - thanks!

@Flodu31

Flodu31 commented Jul 24, 2020

Hello,

I have the same issue, on AKS 1.16.10...

Any update on this?
Thanks.
Florent

@rwerlang

Hello, what is the status on this issue?
Just had the same problem here on AKS 1.16.10.

There are no logs to explain what happened.
This is critical, since it is happening in a production environment.

@palma21
Member

palma21 commented Jul 27, 2020

Hi folks, apologies that this issue has seemingly gone without any response. The problem is that the reports here are not consistent with one problem but likely with several small isolated problems, and the conflation of issues is making them hard to solve/pinpoint.

I've seen a few folks that were actually describing the by-design experience of upgrade: https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster#upgrade-an-aks-cluster
So just to frame the discussion outlining the process:

With a list of available versions for your AKS cluster, use the az aks upgrade command to upgrade. During the upgrade process, AKS adds a new node to the cluster that runs the specified Kubernetes version, then carefully cordons and drains one of the old nodes to minimize disruption to running applications. When the new node is confirmed as running application pods, the old node is deleted. This process repeats until all nodes in the cluster have been upgraded.

This means that during an upgrade you will see a lot of nodes flipping between Ready and NotReady, and that is normal, since we'll always have n+1 nodes in the Ready state for your applications.
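
For reference, that upgrade flow looks roughly like this from the CLI (the resource group, cluster name, and target version are placeholders):

az aks get-upgrades --resource-group <resource group> --name <cluster name> -o table
az aks upgrade --resource-group <resource group> --name <cluster name> --kubernetes-version <version>
kubectl get nodes -w   # nodes cycle through Ready/NotReady as they are replaced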

Nonetheless, I've seen a few folks describing behaviors that are not expected. If you believe what you see is not expected, please open a support ticket. Feel free to paste the ticket number here once it's open and I'm happy to track it to conclusion and pinpoint your issue.

@palma21 palma21 added the SR-Support Request Support Request has been required/made label Jul 27, 2020
@ghost

ghost commented Jul 27, 2020

Hi there 👋 AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue.

Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.

Please do mention this issue in the case description so our teams can coordinate to help you.

Thank you!

@ghost

ghost commented Aug 8, 2020

Action required

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Aug 8, 2020
@palma21 palma21 removed Needs Attention 👋 Issues needs attention/assignee/owner action-required labels Aug 8, 2020
@ghost ghost added the stale Stale issue label Sep 25, 2020
@ghost

ghost commented Sep 25, 2020

Case being worked with Microsoft Support, adding stale label for automatic closure if no other reports are added.

@andig

andig commented Sep 25, 2020

For the record: in my case we were working on this with support until I finally gave up using AKS. If a root cause is ever detected it would be great for MSFT/Azure to let us know here.

@ghost ghost removed the stale Stale issue label Sep 25, 2020
@palma21
Member

palma21 commented Sep 25, 2020

Happy to take a look at your support case (do you have a number?) and a cluster where you see this today. Unfortunately this ticket is quite old (opened with 1.9) and mixes upstream issues with possible older AKS issues.
Nodes not ready can happen for a myriad of reasons, a lot of them actually expected, so it's better to look at this case by case, and I'm happy to assist with that.

@palma21 palma21 added the stale Stale issue label Sep 25, 2020
@andig

andig commented Sep 25, 2020

Thanks @palma21 for stepping in. My ticket is older (2 years) so it's not worth chasing today. The unresolved issue was that the running nodes became unresponsive during cluster scale-up, effectively killing the running workloads. Probably not relevant any longer. Unfortunately I had to give up as the cluster was effectively useless and the ticket remained unresolved after weeks of investigation.

@ghost ghost removed the stale Stale issue label Sep 25, 2020
@palma21 palma21 added the stale Stale issue label Sep 25, 2020
@ghost ghost removed the stale Stale issue label Sep 26, 2020
@ghost ghost added the stale Stale issue label Oct 3, 2020
@ghost

ghost commented Oct 3, 2020

Case being worked with Microsoft Support, adding stale label for automatic closure if no other reports are added.

@Azure Azure deleted a comment Oct 26, 2020
@ghost ghost removed the stale Stale issue label Oct 26, 2020
@Azure Azure deleted a comment Oct 26, 2020
@Azure Azure deleted a comment Oct 26, 2020
@palma21
Member

palma21 commented Oct 26, 2020

Closing as stale.

@palma21 palma21 closed this as completed Oct 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Nov 25, 2020