
Could not get a CSINode object for the node #4811

Closed · 80kk opened this issue Apr 12, 2022 · 17 comments

Labels: area/cluster-autoscaler, kind/bug (Categorizes issue or PR as related to a bug.)

Comments


80kk commented Apr 12, 2022

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.20.1 (k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.1)

What k8s version are you using (kubectl version)?:

1.23.5

What environment is this in?:

AWS

Could someone please tell me what this error is about? I've found that it sometimes takes ages for the cluster to scale up, and I am wondering if this is somehow related:

I0412 08:06:16.062769       1 scheduler_binder.go:775] Could not get a CSINode object for the node "template-node-for-nodes-a.domain.net-7982597919630627426-0": csinode.storage.k8s.io "template-node-for-nodes-a.domain.net-7982597919630627426-0" not found
I0412 08:06:16.062801       1 scheduler_binder.go:801] All bound volumes for Pod "namespace/pod-75b64dff96-99vxn" match with Node "template-node-for-nodes-a.domain.net-7982597919630627426-0"
I0412 08:06:16.062828       1 filter_out_schedulable.go:157] Pod namespace.pod-75b64dff96-99vxn marked as unschedulable can be scheduled on node template-node-for-nodes-a.domain.net-7982597919630627426-0. Ignoring in scale up.
I0412 08:06:16.063127       1 scheduler_binder.go:775] Could not get a CSINode object for the node "template-node-for-nodes-c.domain.net-4246696157256546175-0": csinode.storage.k8s.io "template-node-for-nodes-c.domain.net-4246696157256546175-0" not found
I0412 08:06:16.063143       1 scheduler_binder.go:801] All bound volumes for Pod "namespace/pod-64755c698f-ghcdt" match with Node "template-node-for-nodes-c.domain.net-4246696157256546175-0"
I0412 08:06:16.063166       1 filter_out_schedulable.go:157] Pod namespace.pod-64755c698f-ghcdt marked as unschedulable can be scheduled on node template-node-for-nodes-c.domain.net-4246696157256546175-0. Ignoring in scale up.

The thing is that there is still room for new nodes in each node group, at least 5 in each.
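
For context on the message itself: the scheduler code that CAS runs could not find a CSINode object for the simulated "template node" it builds for each node group. On a real node running a CSI driver such as the EBS CSI driver, that object is created automatically; a hypothetical sketch of what it looks like, with made-up node and instance IDs:

```yaml
# Hypothetical CSINode object as registered by the AWS EBS CSI driver on a real
# node; template nodes synthesized by Cluster Autoscaler have no such object,
# hence the "Could not get a CSINode object" log line.
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-0-1.eu-west-1.compute.internal   # hypothetical node name
spec:
  drivers:
    - name: ebs.csi.aws.com
      nodeID: i-0123456789abcdef0                # hypothetical EC2 instance ID
      topologyKeys:
        - topology.ebs.csi.aws.com/zone
```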

80kk added the kind/bug label Apr 12, 2022

JohnMops commented Jun 8, 2022

Did you find what the issue was?

@mohitreddy1996

@80kk did you get around this issue? We started seeing this error recently


80kk commented Jun 28, 2022

Cluster Autoscaler update has fixed the issue.

80kk closed this as completed Jun 28, 2022

afirth commented Jun 29, 2022

I think this happens when the pod requests a PVC on AWS (or others) that is not available in the AZ of the node. The real scheduler sees that this won't work, but the CAS "fake scheduler run" doesn't. After a while CAS marks the node as underutilized, kills it, and scales up again. Eventually the scale-up node lands in the right AZ, and the pod is scheduled. On other providers which support multi-zone storage, this is not a problem.
Solution: make a separate node group for each AZ.
Caveat: scale to/from 0 is broken in default EKS. Workarounds and issue at aws/containers-roadmap#608.

If a CAS update really did fix it, I'm very interested in how. If it's caused by something else, feel free to chime in here. And feel free to chat with your AWS AM about this. aws/containers-roadmap#608 and #724 have some of the most 👍 of all issues in the roadmap and aren't particularly hard to fix.
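
To make the "one node group per AZ" suggestion concrete, here is a minimal, hypothetical sketch in eksctl config terms (cluster name, region, and sizes are made up, not taken from this thread):

```yaml
# Hypothetical sketch: one managed node group per AZ, so that a scale-up for a
# pod bound to an EBS volume can target the group in the volume's zone.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster     # hypothetical
  region: eu-west-1         # hypothetical
managedNodeGroups:
  - name: nodes-eu-west-1a
    availabilityZones: ["eu-west-1a"]   # pin this group to a single AZ
    minSize: 0
    maxSize: 5
  - name: nodes-eu-west-1b
    availabilityZones: ["eu-west-1b"]
    minSize: 0
    maxSize: 5
```

With single-AZ groups (plus --balance-similar-node-groups on the autoscaler if you want them kept even), CAS can pick the group whose zone matches the volume instead of hoping a multi-AZ ASG launches the node in the right place.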

@RicHincapie

I had a brand new AWS ASG scaled to 0 and had the same issue at deploy time. It was solved by manually scaling up. Afterwards, the CAS started working as expected.


decipher27 commented Sep 12, 2022

Which version of CA has the fix? We are still seeing:
Could not get a CSINode object for the node "ip-10.xxx.x.xx..ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-xxx.xx-ap-south-1.compute.internal" not found


laxmanvallandas commented Dec 1, 2022

> Cluster Autoscaler update has fixed the issue.

It's unclear which version of the autoscaler has the fix for this. We are using CA 1.23.1 and just hit this issue after updating k8s to 1.23.
@80kk, can you post the version?

@KiranReddy230

@80kk Can you please let us know in which version this is fixed? We are facing a similar issue with CA 1.21.1, and we are planning our EKS upgrade to 1.24 soon; similarly, we will need to update the CA version as well.


afirth commented Jan 4, 2023

@ricarhincapie

> I had a brand new AWS ASG scaled to 0 and had the same issue at deploy time. It was solved by manually scaling up. Afterwards, the CAS started working as expected.

It is my understanding that CAS caches seen nodes, so it will be able to scale up from 0 until it restarts. I might be wrong.
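
For completeness, the scale-from-0 workaround tracked in aws/containers-roadmap#608 generally amounts to tagging the underlying Auto Scaling group so CAS can build a node template without ever having seen a node from that group. A hedged sketch of such tags (zone value hypothetical; apply them with whatever tool manages the ASG):

```yaml
# Hypothetical ASG tags (key: value) for scale-from-0 with EBS-backed pods.
# They tell Cluster Autoscaler which labels a node from this group would carry.
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: eu-west-1a
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone: eu-west-1a
```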


afirth commented Jan 4, 2023

It seems this is fixed by #4491 in K8s 1.24+
aws/containers-roadmap#724 (comment)

@Chili-Man

We're still observing this issue on AWS EKS 1.24 with Cluster Autoscaler 1.26.1


80kk commented Jan 18, 2023

The originally reported issue was observed on a kOps-provisioned Kubernetes cluster; I am now using EKS with the Amazon EBS CSI Driver.


zentavr commented Jun 15, 2023

What was the solution for this error?

@bcouetil

I subscribed to this issue because I had the exact same error, but it was not linked to the CA; it was linked to my lack of knowledge of the AWS/EKS Terraform provider.

Configuring the addons correctly did the trick.

If it can help someone, I described my configuration in a blog post.
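
The gist of "configuring the addons correctly" is making sure the EBS CSI driver is actually installed on the cluster, for example as an EKS managed addon. A minimal, hypothetical illustration of the same idea in eksctl terms (not the Terraform from the blog post), with IAM for the driver assumed to be handled separately:

```yaml
# Hypothetical sketch: install the EBS CSI driver as an EKS managed addon so
# that nodes register CSINode objects and volume topology information.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster   # hypothetical
  region: eu-west-1       # hypothetical
addons:
  - name: aws-ebs-csi-driver
    # IAM permissions for the driver (e.g. an IRSA role) are assumed to exist
    # and can be attached via serviceAccountRoleARN.
    # serviceAccountRoleARN: arn:aws:iam::123456789012:role/ebs-csi-role   # hypothetical
```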


zentavr commented Jun 25, 2023

@bcouetil what you do in your example is create the node group in only one availability zone.

This is the same as what @afirth noted in the comment above.


bcouetil commented Jun 25, 2023

That way of segregating node pools in zones is way older than the aws-ebs-csi-driver.

For as long as I can remember, at least 4 years, I've always done that, because scaling never worked 100% for multi-zone pools.

@relaxdiego

In our case we found that CAS was trying to scale up a node group whose AZ could no longer allocate more of the specified instance type (c5n.metal in our case). The indicator for this kind of issue is that the Status of the node group will be "Degraded" and its Health Status tab will show something like:

Could not launch On-Demand Instances. InsufficientInstanceCapacity - We currently do not have sufficient c5n.metal capacity in the Availability Zone you requested (eu-central-1a). Our system will be working on provisioning additional capacity. You can currently get c5n.metal capacity by not specifying an Availability Zone in your request or choosing eu-central-1b, eu-central-1c. Launching EC2 instance failed.
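
One possible mitigation for that failure mode, sketched only: give the group more than one acceptable instance type (the instanceTypes list for EKS managed node groups is assumed here, and all names are hypothetical), or add sibling groups in the other AZs the error message suggests.

```yaml
# Hypothetical sketch: allow several similar instance types in the group so a
# capacity shortage for one type in one AZ is less likely to block scale-up.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster   # hypothetical
  region: eu-central-1    # hypothetical
managedNodeGroups:
  - name: metal-eu-central-1a
    availabilityZones: ["eu-central-1a"]
    instanceTypes: ["c5n.metal", "c5.metal", "m5n.metal"]   # assumed field; adjust to taste
    minSize: 0
    maxSize: 5
```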
