
node.cluster.x-k8s.io/uninitialized cause race condition when creating cluster. #8357

Closed
lubronzhan opened this issue Mar 23, 2023 · 18 comments · Fixed by #8358
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@lubronzhan
Contributor

lubronzhan commented Mar 23, 2023

What steps did you take and what happened?

Hi, this new taint node.cluster.x-k8s.io/uninitialized will cause cluster creation to fail for out-of-tree cloud providers, for example cloud-provider-vsphere, since cloud providers only tolerate the existing Kubernetes taints. Example here: https://github.com/kubernetes/cloud-provider-vsphere/blob/master/releases/v1.26/vsphere-cloud-controller-manager.yaml#L218-L230

CPI is crucial for initializing the node: it sets the providerID and externalIP on the node. Now the node will be stuck in the uninitialized state, because CPI can't be deployed due to the missing toleration. And CAPI needs the providerID on the node to find the specific node. Since it can't find the providerID of the node, it will keep erroring out and won't remove the taint node.cluster.x-k8s.io=uninitialized:NoSchedule.

This is a breaking change that requires all cloud providers to adopt this toleration.
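For reference, the change being asked of cloud providers would be a toleration along these lines in the CPI manifest (a sketch; the key and effect are taken from the taint described above, the surrounding structure is a typical tolerations list):

```yaml
# Sketch: tolerations a cloud provider's controller-manager pod spec
# would need so it can schedule onto nodes that still carry the taint.
tolerations:
  # existing toleration most cloud providers already ship
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
  # new toleration required by the CAPI v1.4 taint discussed in this issue
  - key: node.cluster.x-k8s.io/uninitialized
    effect: NoSchedule
```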

What did you expect to happen?

Cluster creation succeeds

Cluster API version

CAPI v1.4.0-rc1

Kubernetes version

1.25

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 23, 2023
@ykakarap
Contributor

@fabriziopandini @CecileRobertMichon @richardcase @jackfrancis This might be the case with other cloud providers too. Has anyone observed a similar problem with the other providers?

Context on the taint:
The node.cluster.x-k8s.io/uninitialized:NoSchedule taint is added to nodes at creation and is removed after the labels are synced from the Machine to the Node at least once. This prevents workloads from getting scheduled on nodes that do not yet match the labels. Example: if a user wants to schedule a workload only on nodes that do not have the gpu: true label, the workload might initially get scheduled on such a node because the Machine might not have synced the gpu: true label to the Node yet.
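As context, the taint lands on a freshly created Node roughly like this (a sketch; the node name is hypothetical):

```yaml
# Sketch of the relevant part of a Node object right after creation,
# before CAPI has synced the Machine labels to it and removed the taint.
apiVersion: v1
kind: Node
metadata:
  name: worker-abc   # hypothetical node name
spec:
  taints:
    - key: node.cluster.x-k8s.io/uninitialized
      effect: NoSchedule
```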

Potential fixes:

  1. Switch to using node.cluster.x-k8s.io/uninitialized:PreferNoSchedule. This is a softer version and won't prevent CPI pods from getting created on the nodes, so the node won't be stuck in the uninitialized state.
  2. Apply the taint only to worker nodes and not control plane nodes. This way CPI is not blocked from installation (I believe it is only installed on control plane nodes).
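At the taint level, the difference between the two options would look roughly like this (a sketch, not the merged implementation):

```yaml
# Option 1: softer taint effect; the scheduler tries to avoid the node
# but can still place pods (including CPI) on it.
taints:
  - key: node.cluster.x-k8s.io/uninitialized
    effect: PreferNoSchedule

# Option 2: keep the NoSchedule effect unchanged, but only apply the
# taint to worker nodes; control-plane nodes never receive it, so CPI
# (which typically runs on the control plane) can start and initialize
# the cluster.
```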

@ykakarap
Contributor

I would vote for option 2.

Option 1 will not be very effective at preventing the original problem of unwanted workload scheduling (if all nodes have the taint, then the scheduler will effectively ignore it(?)).

In option 2, workloads will actually be blocked from scheduling on the worker nodes. The user will still be able to schedule workloads on the control plane nodes (probably not a common case).

@fabriziopandini
Member

/triage accepted

Thanks, @lubronzhan for reporting!
I also prefer option 2, but let's wait for more feedback from the provider implementers.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 24, 2023
@apricote
Member

The node.cluster.x-k8s.io/uninitialized:NoSchedule is added to the nodes at creation and is removed after the labels are synced from the machine to the nodes at least once.

What preconditions need to be met for CAPI to reconcile the labels? The docs are pretty light on this, only telling me that it happens, not whether any conditions need to be met.

@yastij
Member

yastij commented Mar 24, 2023

I think that as long as we're documenting that there are cases where, if you're using inequality-based selection that depends on some label syncing to CPs, you could still end up with pods landing on CPs, we should be fine going with option 2.

We also might want to broadcast to providers that a change is required for the next CAPI minor release to add the toleration. This should give folks enough soak time to adapt and update their manifests.

@mdbooth
Contributor

mdbooth commented Mar 24, 2023

Option 2 would work for cloud-provider-openstack. Our default deployment has:

    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule

So we run only on the control plane. Note that there is no fundamental reason that the cloud provider should only run on the control plane, but it must also be able to run on the control plane in order to bootstrap. I think option 2 is safe in any case.

@fabriziopandini
Member

there’s cases where if you’re using inequality based selection based on some label syncing to CPs, you could still end up with pods landing on CPs

If I’m not wrong, this requires both an inequality selector and a toleration to node-role.kubernetes.io/control-plane, so we should be fine (it is an intentional choice of the users)
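The combination being described would look something like the following pod spec fragment (a hypothetical sketch; the gpu label is taken from the earlier example in this thread):

```yaml
# Sketch: a workload combining an inequality (NotIn) node selector with
# an explicit control-plane toleration. Only a pod carrying BOTH could
# land on a control-plane node whose labels have not synced yet, which
# is why this is considered an intentional user choice.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu        # hypothetical label from the example above
              operator: NotIn
              values: ["true"]
tolerations:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
```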

@fabriziopandini
Member

Potential fixes:
Option 1: #8359
Option 2: #8358

@ykakarap
Contributor

We are going ahead with option 2.
This fix will be cherry-picked to release-1.4 and will be part of v1.4.0.

@furkatgofurov7
Member

We have uplifted to v1.4.0-rc.0 in the CAPM3 provider, and we have not seen any issues with cluster creation, although we have the same tolerations and nothing extra compared to other providers: https://github.com/metal3-io/cluster-api-provider-metal3/blob/bf9a58b393025aaa4a0ecf10088d31b352b159c5/config/manager/manager.yaml#L63-L67 🤔

@CecileRobertMichon
Contributor

We are running into this in the CAPZ PR to bump CAPI to v1.4.0-rc-0: https://kubernetes.slack.com/archives/CEX9HENG7/p1679689897005289?thread_ts=1679521084.692349&cid=CEX9HENG7

The symptom: Calico CNI pods are failing to schedule with Warning FailedScheduling 2m36s default-scheduler 0/4 nodes are available: 4 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/3298/pull-cluster-api-provider-azure-e2e/1639072622090129408/artifacts/clusters/capz-e2e-4cl6mc-ipv6/calico-system/calico-kube-controllers-f7574cc46-cvvkp/pod-describe.txt

cc @willie-yao

@fabriziopandini
Member

@CecileRobertMichon @willie-yao @lubronzhan
It would be great if you could validate the fix that we merged on Friday...

@lubronzhan
Contributor Author

@srm09 already verified with CAPV. Thanks

@srm09
Contributor

srm09 commented Mar 28, 2023

Running the e2e test suite for the CAPV PR which is using the v1.4.0 release.
kubernetes-sigs/cluster-api-provider-vsphere#1833

@willie-yao
Contributor

willie-yao commented Mar 28, 2023

We are testing CAPZ with the v1.4.0 release and are still running into issues with ReplicaSet has timed out progressing. The ReplicaSet describe shows 1 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }

@lubronzhan
Contributor Author

Looks like it's a web server. Your test just creates a cluster and deploys it:

        "spec": {
          "containers": [
            {
              "name": "webb80ju6",
              "image": "httpd",

You need to check your CAPI logs to see why it doesn't remove the taint.

@willie-yao
Contributor

@lubronzhan We have discovered that this is an issue with SSA not being able to apply a patch to labels when there is a duplicate field. This issue is tracked here: #8417
