Fix control plane node join logic #745

Merged

ashish-amarnath
Contributor

What this PR does / why we need it:
Fixes the control plane node join logic
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #740

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 25, 2019
@ashish-amarnath ashish-amarnath changed the title Fix control plane node join logic [WIP} Fix control plane node join logic Apr 25, 2019
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 25, 2019
@ashish-amarnath ashish-amarnath changed the title [WIP} Fix control plane node join logic [WIP] Fix control plane node join logic Apr 25, 2019
Contributor

@chuckha chuckha left a comment

I'm having a little trouble with the isNodeJoin function. I think it has grown in scope beyond what the name implies. Could you please add a godoc describing what the function is trying to do?

pkg/cloud/aws/actuators/machine/actuator.go (outdated, resolved)
@ashish-amarnath
Contributor Author

This PR is untested. I am working on getting myself an AWS account to test this. 🤞

@ashish-amarnath
Contributor Author

I need some help testing this PR. I haven't managed to get myself an AWS account yet :(

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 28, 2019
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 28, 2019
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2019
@ashish-amarnath
Contributor Author

/retest

@sethp-nr
Contributor

Well, the code here seems to do the right thing in that my first controlplane instance does get created with a JoinConfiguration and tries to run kubeadm join.

Unfortunately, that didn't work because of kubernetes/kubeadm#1432. I had to manually remove the etcd member with

sudo ETCDCTL_API=3 etcdctl --endpoints <FUNCTIONING_CONTROL_PLANE>:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key member remove <MEMBER_ID>

and then edit the kubeadm-config configmap to remove the old controlplane from the ClusterStatus.

@sethp-nr
Contributor

Also, I know this is a bigger question, but I did get to wondering if "machine existence" is the right way to decide whether to init or join. Do you think there's a way to depend on something in the cluster's status?

@ashish-amarnath
Contributor Author

ashish-amarnath commented Apr 30, 2019

@sethp-nr Thanks for testing this out for me. 🙏

Yeah, I agree that we should make that decision on cluster status and not merely MachineExists. I don't think we have a status field for that today.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1alpha1/awsclusterproviderstatus_types.go
https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/apis/cluster/v1alpha1/cluster_types.go

This is something the new data model will have. @vincepri can confirm.

@ashish-amarnath ashish-amarnath changed the title [WIP] Fix control plane node join logic Fix control plane node join logic Apr 30, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2019
@sethp-nr
Contributor

Oh, good catch – I see the new phase field in the cluster status.

It sounds like whoever is responsible for that first init would also be responsible for setting the cluster phase to something that tells new control plane machines to join. Do I have that right? Would it be hasty to create a provider-specific field for now that we update when we reconcile the cluster, and then drive this logic from that data, maybe as a follow-up?
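
To make the proposal concrete, here is a minimal sketch of driving the init-or-join decision from recorded cluster status instead of from machine existence. Everything in it is an assumption for illustration: `ClusterProviderStatus`, its `ControlPlaneInitialized` field, `BootstrapAction`, and `controlPlaneAction` are hypothetical names and do not exist in the provider today.

```go
package main

import "fmt"

// ClusterProviderStatus is a hypothetical, simplified stand-in for a
// provider-specific cluster status. The field below is an assumption.
type ClusterProviderStatus struct {
	// ControlPlaneInitialized would be set by cluster reconciliation once
	// the first control plane machine has been initialized.
	ControlPlaneInitialized bool
}

// BootstrapAction names what a new control plane machine should run.
type BootstrapAction string

const (
	ActionInit BootstrapAction = "init"
	ActionJoin BootstrapAction = "join"
)

// controlPlaneAction picks init or join from the recorded cluster status
// rather than from which Machine objects currently exist.
func controlPlaneAction(status ClusterProviderStatus) BootstrapAction {
	if status.ControlPlaneInitialized {
		return ActionJoin
	}
	return ActionInit
}

func main() {
	fresh := ClusterProviderStatus{}
	running := ClusterProviderStatus{ControlPlaneInitialized: true}
	fmt.Println("fresh cluster:", controlPlaneAction(fresh))
	fmt.Println("running cluster:", controlPlaneAction(running))
}
```

With a field like this, replacing a terminated control plane machine would still resolve to join, because the status records that the cluster was already initialized.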

@sethp-nr
Contributor

And it was my pleasure to test this out – thank you for a quick turnaround!

@@ -96,37 +96,50 @@ func machinesEqual(m1 *clusterv1.Machine, m2 *clusterv1.Machine) bool {
	return m1.Name == m2.Name && m1.Namespace == m2.Namespace
}

// isNodeJoin determines whether the machine in scope should join the cluster.
func (a *Actuator) isNodeJoin(scope *actuators.MachineScope, controlPlaneMachines []*clusterv1.Machine) (bool, error) {
	switch set := scope.Machine.ObjectMeta.Labels["set"]; set {
Member

Not directly related to this PR, but instead of switching on the set label, this could just check if scope.Machine.Spec.Versions.ControlPlane != ""

Contributor Author

The idea is to determine whether the machine in scope should join the cluster, and also to be able to handle all "kinds" of machines: controlplane, worker, and any others we might add later.

scope.Machine.Spec.Versions.ControlPlane != "" will just tell us whether this is a controlplane machine or not.
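
For illustration, the decision described here could be sketched as follows. This is not the merged implementation: `Machine` is a simplified stand-in for `clusterv1.Machine`, the `"node"` label value for workers is an assumption, and `exists` is a hypothetical callback that reports whether a machine's backing instance exists.

```go
package main

import "fmt"

// Machine is a simplified stand-in for clusterv1.Machine (an assumption).
type Machine struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

func machinesEqual(m1, m2 *Machine) bool {
	return m1.Name == m2.Name && m1.Namespace == m2.Namespace
}

// shouldJoin reports whether m should join an existing cluster.
// Workers always join. A control plane machine joins only if some other
// control plane machine already exists; otherwise it initializes the cluster.
func shouldJoin(m *Machine, controlPlaneMachines []*Machine, exists func(*Machine) (bool, error)) (bool, error) {
	switch set := m.Labels["set"]; set {
	case "node": // assumed label value for worker machines
		return true, nil
	case "controlplane":
		for _, cm := range controlPlaneMachines {
			if machinesEqual(m, cm) {
				continue // never count the machine being created
			}
			ok, err := exists(cm)
			if err != nil {
				return false, fmt.Errorf("checking control plane machine %q: %w", cm.Name, err)
			}
			if ok {
				return true, nil
			}
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown value %q for label \"set\"", set)
	}
}

func main() {
	cp0 := &Machine{Name: "controlplane-0", Namespace: "default", Labels: map[string]string{"set": "controlplane"}}
	cp1 := &Machine{Name: "controlplane-1", Namespace: "default", Labels: map[string]string{"set": "controlplane"}}
	alwaysExists := func(*Machine) (bool, error) { return true, nil }

	first, _ := shouldJoin(cp0, []*Machine{cp0}, alwaysExists)
	second, _ := shouldJoin(cp1, []*Machine{cp0, cp1}, alwaysExists)
	fmt.Println("controlplane-0 joins:", first)
	fmt.Println("controlplane-1 joins:", second)
}
```

The `default` branch returns an error so that a new kind of machine has to be handled explicitly rather than silently treated as a worker.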

Member

Yes, I do think that we should be able to potentially special-case other solutions as well, but currently we have two different ways of determining a control-plane machine: one within cluster-api upstream (for the purposes of clusterctl) and one per provider. If we can standardize on a single way of deciding whether a machine is control plane or not, then there is less cognitive overhead for the end user.

Removing the requirement of the set label for control plane instances would also reduce user confusion when trying to deploy a MachineSet or MachineDeployment, where it isn't overly clear that they need to include those in the labels for the template.

Contributor Author

I agree on not having two ways of doing the same thing. As it is not directly related to this PR, I'll follow up with a separate PR for this.
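
As a rough sketch of the alternative suggested in this thread, a control plane check based on the version field rather than the set label might look like this. The types are simplified local stand-ins for the v1alpha1 Machine types, and `isControlPlane` is a hypothetical helper name.

```go
package main

import "fmt"

// MachineVersionInfo, MachineSpec and Machine are simplified stand-ins
// for the cluster-api v1alpha1 types (assumptions for this sketch).
type MachineVersionInfo struct {
	Kubelet      string
	ControlPlane string
}

type MachineSpec struct {
	Versions MachineVersionInfo
}

type Machine struct {
	Name string
	Spec MachineSpec
}

// isControlPlane treats a machine as control plane when it declares a
// control plane version, so no "set" label is needed in the template.
func isControlPlane(m *Machine) bool {
	return m.Spec.Versions.ControlPlane != ""
}

func main() {
	cp := &Machine{Name: "controlplane-0", Spec: MachineSpec{Versions: MachineVersionInfo{Kubelet: "v1.14.1", ControlPlane: "v1.14.1"}}}
	worker := &Machine{Name: "node-0", Spec: MachineSpec{Versions: MachineVersionInfo{Kubelet: "v1.14.1"}}}
	fmt.Println(cp.Name, "is control plane:", isControlPlane(cp))
	fmt.Println(worker.Name, "is control plane:", isControlPlane(worker))
}
```

Note that this only answers whether a machine is a control plane machine; whether it should init or join still needs a separate decision.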

pkg/cloud/aws/actuators/machine/actuator.go (outdated, resolved)
		return true, nil
	case "controlplane":
		// Controlplane machines will join the cluster if the cluster has an existing control plane.
		controlplaneExists := false
		var err error
		for _, cm := range controlPlaneMachines {
Member

Should there be some type of locking here to avoid parallel initialization between multiple control plane hosts?

Contributor Author

Yes! But the upstream cluster deployer doesn't support multiple controlplane nodes in the namespace.
https://github.com/kubernetes-sigs/cluster-api/blob/master/cmd/clusterctl/clusterdeployer/clusterclient/clusterclient.go#L1045
So there will be no racing to be the first controlplane machine.
Once upstream master starts supporting multiple controlplane machines in the cluster, we'll have to make isNodeJoin thread safe.

Maybe worth adding comments about this.

Contributor Author

I created this issue kubernetes-sigs/cluster-api#925 and I'll link it in the code.

Member

So, multiple control-plane nodes should work today: if you use clusterctl, it will instantiate them serially. We even have a make target for easy testing: make create-cluster-ha

Member

The serial instantiation is more for the benefit of kubeadm which will not support parallel init and controlplane join until v1.15.

Contributor Author

@ashish-amarnath ashish-amarnath May 2, 2019

Correct. I had forgotten about that. Thanks for the reminder. I am wondering whether we want something in the cluster spec to synchronize on. Also, implementing that would expand the scope of this fix. Do you think we can address that in a follow-up PR?

Contributor Author

@detiber This PR is a bug fix. We can make isNodeJoin thread safe as an enhancement, as a fix for issue #763.
I don't think it should block this PR. Do you disagree?

pkg/cloud/aws/actuators/machine/actuator.go (outdated, resolved)
add comments about thread safety of isNodeJoin
use shared scope instead of creating one for each controlplane verify
@detiber
Member

detiber commented May 6, 2019

/assign @vincepri

@vincepri
Member

vincepri commented May 6, 2019

Generally LGTM, leaving the final review/lgtm to @chuckha and @sethp-nr

@sethp-nr
Contributor

sethp-nr commented May 7, 2019

lgtm!

@chuckha
Contributor

chuckha commented May 8, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 8, 2019
@ashish-amarnath
Contributor Author

@chuckha needs approve too

@chuckha
Contributor

chuckha commented May 8, 2019

/approve

headdesk.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashish-amarnath, chuckha

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2019
@k8s-ci-robot k8s-ci-robot merged commit 97c7d7f into kubernetes-sigs:master May 8, 2019
@ashish-amarnath ashish-amarnath deleted the fix-controlplane-join branch May 8, 2019 21:40
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Terminating controlplane-0 results in a new cluster
7 participants