Scaling AKS cluster puts nodes in NotReady state #274
Comments
@Zimmergren Could you please share the output you're seeing? Maybe it's linked to this: kubernetes/kubernetes#55867. You can kill the "not ready" virtual machines using the GUI console or the CLI; a new node will be provisioned.
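For what it's worth, a hedged sketch of that manual cleanup, assuming the usual AKS naming (the node/VM name and the MC_* resource group below are placeholders):

```sh
# Remove the stuck node object from Kubernetes (placeholder node name)
kubectl delete node aks-nodepool1-12345678-1

# Delete the backing VM; AKS worker VMs typically live in the managed "MC_*" resource group
az vm delete \
  --resource-group MC_myResourceGroup_myAKSCluster_westeurope \
  --name aks-nodepool1-12345678-1 \
  --yes
```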
Hey @mooperd, sure thing.
Command:
Output:
2/3 nodes in my 1.8.7 AKS cluster are currently broken.
Full output here: https://gist.github.com/mooperd/c0bece0ba011a57d60030612de09a3f1
I'm using this command to prune dead nodes:
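The exact command didn't survive in this thread; a minimal sketch of such a prune one-liner, assuming NotReady shows up in the STATUS column of kubectl get nodes, could look like this:

```sh
# Delete every node currently reporting NotReady (destructive; sketch only)
kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}' | xargs -r kubectl delete node
```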
Similar issue: #256
Thanks @mooperd for your comments. I've also done the whole "delete nodes by myself" routine in the past, but I would really hope to avoid that and have it self-heal and deal with any such dead nodes. Right now I'm automating that using some API calls to the Azure Resource Manager APIs: I delete the VM if it's dead, then trigger a new scale-out. Let's see what comes out of MS on this topic; I'm seeing lots of people in my community experiencing the same thing.
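Roughly the same workflow can be sketched with the Azure CLI instead of raw ARM calls (resource group, cluster, VM name, and node count below are placeholders; the automation described above uses the REST API directly):

```sh
#!/usr/bin/env bash
# Placeholder names; the actual automation in this comment calls the ARM REST API directly.
RG=myResourceGroup
CLUSTER=myAKSCluster
NODE_RG="MC_${RG}_${CLUSTER}_westeurope"

# Delete the VM backing the dead node...
az vm delete --resource-group "$NODE_RG" --name aks-nodepool1-12345678-2 --yes

# ...then trigger a new scale-out to restore the desired node count.
az aks scale --resource-group "$RG" --name "$CLUSTER" --node-count 3
```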
Same issue here. I've looked through the node logs as well and nothing seems suspicious; only the SSH connection is slower than before.
Also the same issue here as in #102; you can find my logs in my comment there.
+1, same issue, can repro 100% of the time: create a simple deployment, set the replica count to a small number (5?), and once it has started and is working, change the replica count to a significantly larger number (100+) -- watch the system fail. Kubernetes version: 1.9.6
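A minimal sketch of that repro (the deployment name and image are placeholders of mine, not the original reporter's):

```sh
# Create a small deployment and let it settle
kubectl create deployment scale-repro --image=nginx
kubectl scale deployment scale-repro --replicas=5

# Once the pods are Running, jump the replica count far higher
kubectl scale deployment scale-repro --replicas=150

# Watch nodes flip to NotReady while the scale-out is in flight
kubectl get nodes -w
```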
Reproduced #102 with 1.9.6, then scaled up and ran into this one.
Same issue here. 100% reproducibility.
Hi, I am also experiencing this on a freshly built cluster, Kubernetes version 1.10.5. Is there any update on this issue? Thanks, Olly
I reproduced it several times when trying to scale up to more pods with the nodes in Ready state. I double-checked the nodes; they were all Ready, and after deleting the broken pod deployment (in the Kubernetes dashboard), the deployment continued to run successfully.
Can't believe this is still open. This is a critical issue.
It's unacceptable for a GA product.
Same here, k8s version 1.11.4.
Any updates on this? Having the same issue on k8s version 1.11.5.
Any updates on this? Having the same issue on k8s.
@sonttran Please open an Azure technical support ticket - this issue could have one of many causes, and we cannot ask for the needed subscription details / information on GitHub. Nodes going into NotReady can be triggered by so many different things that we will need to investigate the specific cluster(s) you have.
@jnoller but... it's an issue on all clusters!
Any updates on the issue? Having the same error during autoscale of a node pool in GKE on version
Are there any further plans to investigate this on a wider scale than per-cluster, @jnoller? It looks like we can reproduce it on every cluster from dev to prod, and by the looks of the comments in this thread, a lot of folks have these issues. I can also give a huge kudos and 👍 to Azure support, as they've been able to fix this every time for me. However, it usually takes quite some time before we get to the point of having it resolved (time zones, support response time SLAs, the same e-mail conversations in every ticket before reaching the conclusion that something has to be restarted on the Azure backend side). If there's anything I can do, submit more details, troubleshooting cases and logs; just let me know 💪
@Zimmergren Currently, this issue is a catch-all for similar bad behavior - we do not have widespread reports of this, and based on the support data (which I audit and read) this issue is still cluster specific. Custom VNETs, port blocking on the NSGs, and more can contribute to it (and, as I discovered last week, custom admission controllers). Please file a support request, link to this issue in the ticket and ask for it to be escalated as needed to on-call engineering so that we can dig into it and identify the actual root cause versus the symptom.
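For anyone checking the NSG angle mentioned above, a hedged sketch for eyeballing the rules on the node resource group's NSG (the resource group and NSG names are placeholders):

```sh
# List the rules on the NSG attached to the agent pool (placeholder names)
az network nsg rule list \
  --resource-group MC_myResourceGroup_myAKSCluster_westeurope \
  --nsg-name aks-agentpool-12345678-nsg \
  --output table
```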
Thanks @jnoller, I appreciate your quick responses here 💪
Hello, I have the same issue on AKS 1.16.10... Any update on this?
Hello, what is the status of this issue? There are no logs to support what happened.
Hi folks, apologies that this issue is seemingly without any response. The problem is that the reports here are not consistent with one problem but likely with several small isolated problems, and the conflation of issues is making them hard to solve/pinpoint. I've seen a few folks who were actually describing the by-design experience of upgrade: https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster#upgrade-an-aks-cluster
This means that during an upgrade you will see a lot of nodes flipping between Ready and NotReady, and that is normal, since we'll always have n+1 nodes in Ready state for your applications. Nonetheless, I've seen a few folks describing behaviors that are not expected. If you believe what you see is not expected, please open a support ticket. Feel free to paste it here once open and I'm happy to track it to conclusion and pinpoint your issue.
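For reference, a way to watch that by-design behavior during an upgrade (resource group, cluster name, and version are placeholders):

```sh
# Kick off an upgrade (placeholder names and version)
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.16.10

# In another terminal, watch the surge node appear and existing nodes cordon/drain;
# brief Ready/NotReady flips here are the expected upgrade behavior
kubectl get nodes -w
```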
Hi there 👋 AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams have a look into this particular cluster/issue. Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue. Please do mention this issue in the case description so our teams can coordinate to help you. Thank you!
Case being worked with Microsoft Support; adding the stale label for automatic closure if no other reports are added.
For the record: in my case we were working on this with support until I finally gave up on using AKS. If a root cause is ever detected, it would be great for MSFT/Azure to let us know here.
Happy to take a look at your support case (do you have a number?) and a cluster where you see this today. Unfortunately, this ticket is quite old (opened with 1.9) and mixes upstream issues with possible older AKS issues.
Thanks @palma21 for stepping in. My ticket is older (2y) so not worth chasing today. The unresolved issue was that the running nodes became unresponsive during cluster scale-up, effectively killing the running workloads. Probably not relevant any longer. Unfortunately, I had to give up as the cluster was effectively useless and the ticket remained unresolved after weeks of investigation.
Closing as stale. |
I've got a set of clusters running in AKS. Recently I've seen that triggering a scaling operation on the cluster (on this sample cluster, from 1 to 2 nodes) causes the new node to end up in the NotReady state.
Output of kubectl get nodes:
Additional info:
Questions:
40 minutes of semi-downtime due to a failed node scale that happens intermittently is going to be hard to define any SLA for ;)
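For anyone triaging a node stuck like this, a hedged diagnostics sketch (the node name is a placeholder):

```sh
# Show the node's conditions (Ready, MemoryPressure, DiskPressure, ...) and recent events
kubectl describe node aks-nodepool1-12345678-1

# Or just dump the condition block
kubectl get node aks-nodepool1-12345678-1 -o jsonpath='{.status.conditions}'

# Quick overview across all nodes, including kubelet version and internal IPs
kubectl get nodes -o wide
```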