
node.cluster.x-k8s.io/uninitialized cause race condition when creating cluster. #8357

Closed
lubronzhan opened this issue Mar 23, 2023 · 18 comments · Fixed by #8358
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@lubronzhan
Contributor

lubronzhan commented Mar 23, 2023

What steps did you take and what happened?

Hi, this new taint node.cluster.x-k8s.io/uninitialized will cause cluster creation to fail for out-of-tree cloud providers, for example cloud-provider-vsphere, since cloud providers only tolerate the existing Kubernetes taints. Example here: https://github.com/kubernetes/cloud-provider-vsphere/blob/master/releases/v1.26/vsphere-cloud-controller-manager.yaml#L218-L230

CPI is crucial for initializing the node: it sets the providerID and externalIP on the node. Now the node will be stuck in the uninitialized state, because CPI can't be deployed due to the missing toleration. And CAPI needs the providerID on the node to find the specific node. Since it can't find the providerID of the node, it will keep erroring out and won't remove the taint node.cluster.x-k8s.io=uninitialized:NoSchedule.

This is a breaking change that requires all cloud providers to adopt this toleration.
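For reference, the change being asked of cloud providers would be a toleration along these lines in the CPI manifest (a sketch; the key and effect are taken from the taint described above, the surrounding structure is a typical tolerations list):

```yaml
# Sketch: tolerations a cloud provider's controller-manager pod spec
# would need so it can schedule onto nodes that still carry the taint.
tolerations:
  # existing toleration most cloud providers already ship
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
  # new toleration required by the CAPI v1.4 taint discussed in this issue
  - key: node.cluster.x-k8s.io/uninitialized
    effect: NoSchedule
```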

What did you expect to happen?

Cluster creation succeeds

Cluster API version

CAPI v1.4.0-rc1

Kubernetes version

1.25

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 23, 2023
@ykakarap
Contributor

@fabriziopandini @CecileRobertMichon @richardcase @jackfrancis This might be the case with other cloud providers too. Has anyone observed a similar problem with the other providers?

Context on the taint:
The node.cluster.x-k8s.io/uninitialized:NoSchedule taint is added to nodes at creation and is removed after the labels are synced from the Machine to the Node at least once. This prevents workloads from getting scheduled on nodes that do not yet match the labels. Example: if a user wants to schedule a workload only on nodes that do not have the gpu: true label, the workload might initially get scheduled on such a node because the Machine might not have synced the gpu: true label to the Node yet.
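As context, the taint lands on a freshly created Node roughly like this (a sketch; the node name is hypothetical):

```yaml
# Sketch of the relevant part of a Node object right after creation,
# before CAPI has synced the Machine labels to it and removed the taint.
apiVersion: v1
kind: Node
metadata:
  name: worker-abc   # hypothetical node name
spec:
  taints:
    - key: node.cluster.x-k8s.io/uninitialized
      effect: NoSchedule
```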

Potential fixes:

  1. Switch to using node.cluster.x-k8s.io/uninitialized:PreferNoSchedule. This is a softer version and won't prevent CPI pods from getting created on the nodes, so the node won't be stuck in the uninitialized state.
  2. Apply the taint only to worker nodes and not control plane nodes. This way CPI is not blocked from installation (I believe it is only installed on control plane nodes).
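At the taint level, the difference between the two options would look roughly like this (a sketch, not the merged implementation):

```yaml
# Option 1: softer taint effect; the scheduler tries to avoid the node
# but can still place pods (including CPI) on it.
taints:
  - key: node.cluster.x-k8s.io/uninitialized
    effect: PreferNoSchedule

# Option 2: keep the NoSchedule effect unchanged, but only apply the
# taint to worker nodes; control-plane nodes never receive it, so CPI
# (which typically runs on the control plane) can start and initialize
# the cluster.
```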

@ykakarap
Contributor

I would vote for option 2.

Option 1 will not be very effective at preventing the original problem of unwanted workload scheduling (if all nodes have the taint, then the scheduler will effectively ignore it(?)).

In option 2, workloads will actually be blocked from scheduling on the worker nodes. The user will still be able to schedule workloads on the control plane nodes (probably not a common case).

@fabriziopandini
Member

/triage accepted

Thanks, @lubronzhan for reporting!
I also prefer option 2, but let's wait for more feedback from the provider implementers.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 24, 2023
@apricote
Member

The node.cluster.x-k8s.io/uninitialized:NoSchedule is added to the nodes at creation and is removed after the labels are synced from the machine to the nodes at least once.

What preconditions need to be met for CAPI to reconcile the labels? The docs are pretty light on this, only telling me that it happens, not whether any conditions need to be met.

@yastij
Member

yastij commented Mar 24, 2023

I think that as long as we're documenting that there are cases where, if you're using inequality-based selection that depends on some label syncing to CPs, you could still end up with pods landing on CPs, we should be fine going with option 2.

We also might want to broadcast to providers that a change is required for the next CAPI minor release to add the toleration. This should give folks enough soak time to adapt and update their manifests.

@mdbooth
Contributor

mdbooth commented Mar 24, 2023

Option 2 would work for cloud-provider-openstack. Our default deployment has:

    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule

So we run only on the control plane. Note that there is no fundamental reason that the cloud provider should only run on the control plane, but it must also be able to run on the control plane in order to bootstrap. I think option 2 is safe in any case.

@fabriziopandini
Member

there’s cases where if you’re using inequality based selection based on some label syncing to CPs, you could still end up with pods landing on CPs

If I’m not wrong, this requires both an inequality selector and a toleration to node-role.kubernetes.io/control-plane, so we should be fine (it is an intentional choice of the users)
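The combination being described would look something like the following pod spec fragment (a hypothetical sketch; the gpu label is taken from the earlier example in this thread):

```yaml
# Sketch: a workload combining an inequality (NotIn) node selector with
# an explicit control-plane toleration. Only a pod carrying BOTH could
# land on a control-plane node whose labels have not synced yet, which
# is why this is considered an intentional user choice.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu        # hypothetical label from the example above
              operator: NotIn
              values: ["true"]
tolerations:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
```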

@fabriziopandini
Member

Potential fixes:
Option 1: #8359
Option 2: #8358

@ykakarap
Contributor

We are going ahead with option 2.
This fix will be cherry-picked to release-1.4 and will be part of v1.4.0.

@furkatgofurov7
Member

We have uplifted to v1.4.0-rc.0 in the CAPM3 provider, and we have not seen any issues with cluster creation, although we have the same tolerations and nothing extra compared to other providers: https://github.com/metal3-io/cluster-api-provider-metal3/blob/bf9a58b393025aaa4a0ecf10088d31b352b159c5/config/manager/manager.yaml#L63-L67 🤔

@CecileRobertMichon
Contributor

We are running into this in the CAPZ PR to bump CAPI to v1.4.0-rc-0: https://kubernetes.slack.com/archives/CEX9HENG7/p1679689897005289?thread_ts=1679521084.692349&cid=CEX9HENG7

The symptom: Calico CNI pods are failing to schedule with Warning FailedScheduling 2m36s default-scheduler 0/4 nodes are available: 4 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/3298/pull-cluster-api-provider-azure-e2e/1639072622090129408/artifacts/clusters/capz-e2e-4cl6mc-ipv6/calico-system/calico-kube-controllers-f7574cc46-cvvkp/pod-describe.txt

cc @willie-yao

@fabriziopandini
Member

@CecileRobertMichon @willie-yao @lubronzhan
It would be great if you could validate the fix that we merged on Friday...

@lubronzhan
Contributor Author

@srm09 already verified with CAPV. Thanks

@srm09
Contributor

srm09 commented Mar 28, 2023

Running the e2e test suite for the CAPV PR which is using the v1.4.0 release.
kubernetes-sigs/cluster-api-provider-vsphere#1833

@willie-yao
Contributor

willie-yao commented Mar 28, 2023

We are testing CAPZ with the v1.4.0 release and are still running into issues with ReplicaSet has timed out progressing. The ReplicaSet describe shows 1 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }

@lubronzhan
Contributor Author

Looks like it's a web server. Your test just creates a cluster and deploys it:

        "spec": {
          "containers": [
            {
              "name": "webb80ju6",
              "image": "httpd",

You need to check your CAPI logs to see why it doesn't remove the taint.

@willie-yao
Contributor

@lubronzhan We have discovered that this is an issue with SSA not being able to apply a patch to labels when there is a duplicate field. This issue is tracked here: #8417
