Could not get a CSINode object for the node #4811
Comments
Did you find what the issue was?
@80kk did you get around this issue? We started seeing this error recently.
A Cluster Autoscaler update fixed the issue.
I think this happens when the pod requests a PVC on AWS (or others) that is not available in the AZ of the node. The real scheduler sees that this won't work, but the CAS "fake scheduler run" doesn't. After a while CAS marks the node as underutilized, kills it, and scales up again. Eventually a scaled-up node lands in the right AZ and the pod is scheduled. On other providers which support multi-zone storage, this is not a problem. If a CAS update really did fix it, I'm very interested in how; if it's caused by something else, feel free to chime in here. And feel free to chat with your AWS AM about this: aws/containers-roadmap#608 and #724 have some of the most 👍 of anything on the roadmap and aren't particularly hard to fix.
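For anyone hitting the zonal-PVC variant of this, a common mitigation (not confirmed as the fix for this particular issue) is delayed volume binding, so the volume's AZ is only chosen after the pod has been scheduled onto a node. A minimal sketch, assuming the AWS EBS CSI driver as the provisioner; the StorageClass name is illustrative:

```shell
# Sketch only: WaitForFirstConsumer lets the scheduler pick the node first, and
# the EBS volume is then created in that node's AZ. It does not help for volumes
# that are already bound to a zone. The name "gp3-wait" is made up for this example.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
EOF
```

For volumes already pinned to a zone, the one-node-group-per-AZ pattern discussed further down this thread is the usual workaround.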
I had a brand new AWS ASG scaled to 0 and had the same issue at deploy time. It was solved by manually scaling up. Afterwards, the CAS started working as expected.
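One thing that may explain the scale-from-zero case: when an ASG has zero instances, CA has no real node (and no CSINode) to learn labels from and builds a template node from the ASG's tags instead. A sketch of the documented node-template tags, with placeholder ASG name and zone, and not confirmed to address the CSINode error itself:

```shell
# Sketch with placeholder names: advertise zone labels on the empty ASG so the
# node CA simulates for scale-from-zero carries them. Adjust the ASG name and AZ.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone,Value=eu-west-1a" \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone,Value=eu-west-1a"
```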
Which version of CA has the fix?
It's unclear which version of the autoscaler has the fix for this. We are using CA 1.23.1 and just hit this issue after updating k8s to 1.23.
@80kk Can you please let us know in which version this is fixed? We are facing a similar issue with CA 1.21.1, and we are planning our EKS upgrade to 1.24 soon; we will need to update the CA version as well.
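If in doubt about which versions are actually in play, the running image tag and the cluster version can be compared directly. A sketch, assuming the common cluster-autoscaler deployment name in kube-system; the CA release series is expected to match the Kubernetes minor version:

```shell
# Show the CA image that is actually running, then the cluster version.
kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl version
```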
@ricarhincapie It is my understanding that the CAS caches nodes it has seen, so it will be able to scale up from 0 until it restarts. I might be wrong.
It seems this is fixed by #4491 in K8s 1.24+ |
We're still observing this issue on AWS EKS 1.24 with Cluster Autoscaler 1.26.1 |
The originally reported issue was observed on a kOps-provisioned Kubernetes cluster; I am now using EKS with the Amazon EBS CSI Driver.
What was the solution for this error? |
I subscribed to this issue because I had the exact same error, but it was not linked to the CA; it was linked to my lack of knowledge of the AWS/EKS Terraform provider. Configuring the addons correctly did the trick. If it can help someone, I described my configuration in a blog post.
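Not speaking for that blog post, but a quick way to check the same thing outside Terraform is to verify that the EBS CSI driver addon is actually installed and its pods are running. A sketch with a placeholder cluster name:

```shell
# Placeholder cluster name: confirm the managed addon exists and the driver pods run.
aws eks describe-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver
kubectl get pods -n kube-system | grep ebs-csi
```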
@bcouetil what you do in your example is create the node group in only one availability zone. This is the same thing @afirth noted in the comment above.
That way of segregating node pools by zone is a long-standing practice. For as long as I can remember, at least 4 years, I've always done that, because scaling never worked 100% for multi-zone pools.
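For the multi-zone variant, the pattern usually paired with one node group per AZ is CA's balancing flag, so similar zonal groups scale evenly. A sketch, assuming the common cluster-autoscaler deployment in kube-system:

```shell
# Sketch: append the --balance-similar-node-groups flag to the CA args so
# per-AZ node groups of the same shape are treated as one logical pool.
kubectl -n kube-system patch deployment cluster-autoscaler --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--balance-similar-node-groups=true"}
]'
```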
In our case we found that CAS was trying to scale up a node group whose AZ could no longer allocate more of the specified instance type (c5n.metal in our case). The indicator for this kind of issue is that the status of the node group will be "Degraded" and its Health Status tab will show the corresponding capacity error.
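The same health information is visible from the CLI, which can be easier to watch than the console. A sketch; cluster, node group, and ASG names are placeholders:

```shell
# Placeholder names: surface the node group's health issues directly.
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --query 'nodegroup.health.issues'
# The underlying ASG's scaling activities usually show the capacity error itself.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-nodegroup-asg --max-items 5
```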
Which component are you using?:
CA (cluster-autoscaler)

What version of the component are you using?:
Component version: 1.20.1 (k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.1)

What k8s version are you using (kubectl version)?:
1.23.5

What environment is this in?:
AWS

Could someone please tell me what this error is about? I have found that it sometimes takes ages for the cluster to scale up, and I am wondering if this is related somehow:

The thing is that in each node group there is still room for new nodes. At least 5 in each.
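For anyone landing here from the error message itself: the CSINode object it refers to is the per-node object registered by the CSI node plugin, so a quick sanity check is whether those objects exist for the affected nodes. A sketch; the node name is a placeholder:

```shell
# Placeholder node name: confirm CSINode objects exist and list their drivers.
kubectl get csinodes
kubectl describe csinode ip-10-0-1-23.eu-west-1.compute.internal
# Also confirm the CSI node DaemonSet is actually scheduled on every node.
kubectl get daemonset -n kube-system | grep csi
```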