v1.30: kube-scheduler crashes with: Observed a panic: "integer divide by zero" #124930
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig scheduling
cc @alculquicondor
@chengjoey @AxeZhan can you take a look? I believe all the latest patch releases are affected too, because of #124559
cc @sanposhiho
/priority critical-urgent
This crash happens when preFilter plugins filter out all nodes ...
If preFilter filtered out some nodes, then nodes here will be a subset of allNodes, so len(nodes) can be smaller than len(allNodes).
I think so. In general, we just want to try a different set of nodes. For the case of DaemonSets, it doesn't really matter, as we will just test one node.
/assign
Successfully caught this with a unit test.
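For readers following along, here is a toy Go sketch of the failure mode discussed above. It is not the actual kube-scheduler code: once the slice of nodes that survived PreFilter is empty, any "index % len(nodes)" style arithmetic hits Go's integer-divide-by-zero runtime panic. The guarded variant only illustrates the kind of length check a fix needs; see #124933 for the real change.

```go
package main

import "fmt"

// nextStartIndex loosely mimics advancing a round-robin start index across
// the nodes that survived PreFilter. When PreFilter removed every node,
// len(nodes) is 0 and the modulo triggers "integer divide by zero".
func nextStartIndex(current, processed int, nodes []string) int {
	return (current + processed) % len(nodes)
}

// safeNextStartIndex adds the kind of length guard that avoids the panic.
func safeNextStartIndex(current, processed int, nodes []string) int {
	if len(nodes) == 0 {
		return current
	}
	return (current + processed) % len(nodes)
}

func main() {
	var empty []string // PreFilter filtered out every candidate node

	fmt.Println(safeNextStartIndex(3, 2, empty)) // prints 3, no panic

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // runtime error: integer divide by zero
		}
	}()
	fmt.Println(nextStartIndex(3, 2, empty)) // panics here
}
```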
I am having the same issue with 1.28.10 on a clean new cluster created with kops 1.28.5:
The pod that it's trying to create is pending with the following messages:
(just for the sake of being indexed, for those who will search for the same error and find this GitHub issue)
The fix is on the way: #124933. Just waiting for reviews from an approver. After this PR gets merged, I'll do cherry-picks for v1.28 (1.27?) through v1.30.
@sara-hann This change will be included in the upcoming patch releases, scheduled for
I'm seeing the issue after upgrading v1.26.11 -> v1.27.14.
Yes, a broken pod could be causing this. You can also try downgrading to v1.27.13.
Phew, manually changing the scheduler version to 1.27.13 fixed the issue and showed the broken pod in the logs. Thanks!
This fixes kubernetes/kubernetes#124930
Change-Id: Ib1f96372acdd1eeef6a0206688bd032aa73ef0a0
Reviewed-on: https://review.monogon.dev/c/monogon/+/3172
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>
Tested-by: Jenkins CI
What happened?
On Kubernetes v1.30.0 (and v1.30.1), kube-scheduler can crash with an "integer divide by zero" panic if a pod is defined in a certain way. The crash happens because len(nodes) is 0 in certain cases.
What did you expect to happen?
kube-scheduler should not crash; on v1.29.4 this doesn't happen. On v1.29.4 the kube-scheduler just prints error logs and doesn't crash.
How can we reproduce it (as minimally and precisely as possible)?
Create a pod like the sketch below. The important part is that the affinity doesn't match a real/valid node.
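As a rough, illustrative reconstruction (the exact manifest from the report may differ; the pod name, image, and node name below are placeholders), a pod whose required node affinity points at a node that no longer exists looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: repro-missing-node        # placeholder name
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - ip-10-0-0-1.ec2.internal   # a node name that no longer exists
```

This is the same shape of affinity the DaemonSet controller adds when it pins a pod to a specific node by name.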
Once this pod is being processed by kube-scheduler, it will crash and continue to do so until the pod is deleted.
Anything else we need to know?
This issue was triggered during the rotation of control-plane nodes. In our setup we update the control plane by scaling from 1 to 2 instances. Once both are ready, the old one is terminated via the EC2 API. At this stage the kube-controller-manager sometimes manages to create a replacement daemonset pod while the old node is being deleted. This results in a pod targeting a no-longer-existing node via affinity, as illustrated in the example above. For some reason, when the kube-scheduler is crashing, the kube-controller-manager doesn't delete the extra/invalid daemonset pod. Not sure if this is another issue or it has always happened in our setup, but only v1.30 makes kube-scheduler crash, which causes an actual issue.
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)