kubeadm join is not fault tolerant to etcd endpoint failures #1432

Closed
danbeaulieu opened this issue Mar 1, 2019 · 29 comments

@danbeaulieu

What keywords did you search in kubeadm issues before filing this one?

etcd kubeadm join clusterstatus

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:35:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version): 1.13.4
  • Cloud provider or hardware configuration: Self managed AWS
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2
  • Kernel (e.g. uname -a): 4.15.0-1032-aws
  • Others:

What happened?

kubeadm join --experimental-control-plane sporadically fails when adding a new node to the control plane cluster after a node is removed.

What you expected to happen?

For the join to succeed.

How to reproduce it (as minimally and precisely as possible)?

Create an HA stacked control plane cluster. Terminate one of the control plane nodes.
Start another node, remove the failed etcd member, delete the failed node (kubectl delete node ...) and run kubeadm join --experimental-control-plane on the new node.

Anything else we need to know?

This is due to a few things:

  1. The ClusterStatus in the config map still lists the node that has been terminated/removed.
  2. There is a bug in go-grpc that manifests in the etcd v3 client where if the first endpoint used in the client constructor is not responsive, the other endpoints provided to the constructor aren't tried. [gRPC] clientconn: set authority with the latest dial target etcd-io/etcd#10489
  3. golang does not guarantee order when using range on maps, such as the ClusterStatus.apiEndpoints map. If there is 1 "bad" endpoint in the ClusterStatus, it may or may not end up first in the endpoints list. If a "healthy" endpoint happens to be first, kubeadm does the right thing; if the "bad" one is first, the etcd client fails and so does kubeadm.
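
The stale endpoint only breaks the join when map iteration happens to put it first. A minimal Go sketch (node names and addresses are illustrative, not the actual kubeadm code):

package main

import "fmt"

func main() {
	// Hypothetical ClusterStatus.apiEndpoints contents; 10.0.0.3 stands in
	// for the terminated control-plane node whose entry is now stale.
	apiEndpoints := map[string]string{
		"node-a": "https://10.0.0.1:2379",
		"node-b": "https://10.0.0.2:2379",
		"node-c": "https://10.0.0.3:2379", // stale entry
	}

	// Range order over a Go map is unspecified, so this slice is built in a
	// different order from run to run.
	var endpoints []string
	for _, ep := range apiEndpoints {
		endpoints = append(endpoints, ep)
	}

	// With the affected gRPC/etcd client versions, effectively only
	// endpoints[0] gets dialed, so the join fails whenever the stale
	// endpoint happens to land first.
	fmt.Println(endpoints)
}
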
@neolit123 neolit123 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. area/HA priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 1, 2019
@neolit123 neolit123 added this to the Next milestone Mar 1, 2019
@fabriziopandini
Member

fabriziopandini commented Mar 1, 2019

I would not say this is a problem of join; instead, it is a problem of reset, which does not clean up properly, and there is already work ongoing to fix that.

For the etcd problem I'm not sure what kubeadm can really do (apart from implementing a hacky workaround). IMO this should be fixed in etcd, but I'm open to suggestions.

@danbeaulieu
Author

danbeaulieu commented Mar 1, 2019

@fabriziopandini in the reproduction steps kubeadm reset is not used, which is a real-world scenario outside of intentional chaos engineering. When a node goes away there is no guarantee that kubeadm reset will be called, or, if it is called, that it will complete successfully (network partition, etc.). So even if kubeadm reset does the right thing, it can't be relied on in 100% of cases.

Kubeadm should be resilient to those scenarios. This issue may be a good place to track either the hacky workaround (order the map? retry etcd client explicitly?) or the integration of the patched etcd client when it is available.

@pytimer

pytimer commented Mar 2, 2019

When a node goes away there is no guarantee that kubeadm reset will be called or if it is called that it will complete successfully (network partition etc).

If a node is terminated or shut down, I use etcdctl to remove its etcd member; that cannot be done with kubeadm. And if the etcd cluster has only two members and one of the nodes goes away, the cluster is not healthy and may need human intervention.

Am I understanding you correctly?

@rosti

rosti commented Mar 5, 2019

We need to:

  1. Document how to remove a node on which kubeadm reset cannot be executed for some reason. (Hardware breakage is a valid reason)

  2. Probably introduce kubeadm cleanup, which would act like reset, but for another node.

WDYT?

@pytimer

pytimer commented Mar 6, 2019

Probably introduce kubeadm cleanup, which would act like reset, but for another node.

AFAIK, if a node is terminated, we need to run a command on one of the masters to remove the etcd member from the cluster and update the ClusterStatus in kubeadm-config.

If these steps should go through kubeadm, could reset phases be added to kubeadm?

@danbeaulieu
Author

@pytimer I am ok with kubeadm leaving it up to the user to do the etcd member management. i.e. if I have a cluster with 3 nodes A, B, C and I terminate C (and don't run kubeadm reset on C), I am left with an unhealthy etcd member, a node (kubectl get nodes) that is "not ready", and a ClusterStatus.apiEndpoints with the 3 nodes listed. At this point the cluster is still healthy since we only lost 1 node out of 3.

When adding a new node D, before joining the cluster it first:

  • Removes the unhealthy etcd member C
  • kubectl delete node C
  • Updates the ClusterStatus in the configmap to remove C

At this point we have a 2-node cluster, all nodes healthy, and the kubeadm join ... can proceed without issues.

The goal of this issue is to not have to perform the explicit ClusterStatus update. If there is a way forward where we don't have to do any of the cleanup prework, all the better, but it may be better to start with a smaller scope.
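
For reference, a sketch of that pre-join cleanup, run from a surviving control-plane node (certificate paths assume a stacked-etcd kubeadm layout; the member ID and node name are placeholders):

# find the member ID of the unhealthy member C
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list

# remove the unhealthy etcd member
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove <member-id-of-C>

# delete the Node object for C
kubectl delete node <node-C>

# drop C's entry from ClusterStatus.apiEndpoints in the kubeadm-config configmap
kubectl -n kube-system edit configmap kubeadm-config
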

@fabriziopandini
Member

@danbeaulieu thanks for the explanation
@rosti let's raise the point at the next planning session; I see two problems to be addressed:

  • cleanup of a lost control-plane node
  • making control-plane join resilient to lost control-plane nodes

@anitgandhi

I was following the discussion in #1300 and now here, and just wanted to throw in my support on this - everything @danbeaulieu pointed out is something we've noticed in our cluster management. Namely, we had a situation where we had a 3-node cluster, and for whatever reason one of the nodes died. The AWS ASG brought a new one up, but it got hung up on the situation described above.

Our fix (in our provisioner) was the same as what Dan described, when a new node joins, reconcile the membership with etcd and the ClusterStatus.

@fabriziopandini
Member

@danbeaulieu @anitgandhi thanks for your support!
After the last changes landed in v1.14, kubeadm hopefully handles the controlled removal of a control-plane node properly (removal of the etcd member and cleanup of the ClusterConfiguration; node deletion is still off the radar, but AFAIK that was a nice-to-have, not blocking).

It was also agreed to try to direct some of our bandwidth in v1.15 to implementing reset phases, so we can offer users some tools to recover from an uncontrolled loss of a control-plane node.

What is still missing from the above picture is defining how to detect that we are in this situation, and how to change the join workflow in that case.

Any help, suggestions, or contributions in this regard will be really appreciated.

@rbtr

rbtr commented Mar 22, 2019

@fabriziopandini I think this bug still exists in the etcd preflight:
to reiterate, if the endpoints passed to the etcd client are stale and contain some members that no longer exist/no longer have etcd running, there is a chance, due to how the etcd client works, that the call to cli.Sync() will try to connect to the nonexistent endpoint and will error out. That error propagates up and halts the kubeadm process.
I have encountered this in some etcd client code of my own and am seeing this happen with kubeadm intermittently too. The simple fix that I have used is:

  • iterate through the list of potential etcd cluster endpoints
  • create a client for just that endpoint, and check its health
  • IFF it is healthy, call cli.Sync() to ask that endpoint what the other cluster endpoints are

This prevents the edge case where we try to cli.Sync() with a node that doesn't exist/isn't healthy.
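
A rough Go sketch of that approach (not the actual kubeadm patch; the etcd 3.4 import path is assumed and TLS setup is left to the caller):

package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// syncedEndpoints probes each candidate endpoint on its own and only calls
// Sync() against the first one that answers, returning the endpoints the
// cluster itself reports.
func syncedEndpoints(candidates []string, tlsCfg *tls.Config) ([]string, error) {
	var lastErr error
	for _, ep := range candidates {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep}, // a client for just this endpoint
			DialTimeout: 5 * time.Second,
			TLS:         tlsCfg,
		})
		if err != nil {
			lastErr = err
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err = cli.Status(ctx, ep) // health check: does this member answer?
		if err == nil {
			err = cli.Sync(ctx) // ask the healthy member for the actual member list
		}
		cancel()
		if err == nil {
			eps := cli.Endpoints()
			cli.Close()
			return eps, nil
		}
		lastErr = err
		cli.Close()
	}
	return nil, lastErr
}

func main() {
	eps, err := syncedEndpoints([]string{"https://10.0.0.1:2379", "https://10.0.0.2:2379"}, nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("cluster endpoints:", eps)
}
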

I will PR this unless anyone sees a problem with it, as it is safer in the long run and doesn't preempt the other node reset work being done.

@fabriziopandini
Member

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Mar 22, 2019
@bluebeach

When a node goes away there is no guarantee that kubeadm reset will be called or if it is called that it will complete successfully (network partition etc).

If a node is terminated or shut down, I use etcdctl to remove its etcd member; that cannot be done with kubeadm. And if the etcd cluster has only two members and one of the nodes goes away, the cluster is not healthy and may need human intervention.

Am I understanding you correctly?

Thanks, it works after removing the etcd member manually

@timothysc timothysc added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels May 3, 2019
@timothysc timothysc modified the milestones: Next, v1.15 May 3, 2019
@neolit123 neolit123 modified the milestones: v1.15, v1.16 Jun 10, 2019
@neolit123 neolit123 added kind/feature Categorizes issue or PR as related to a new feature. and removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Jun 10, 2019
@neolit123 neolit123 self-assigned this Jun 13, 2019
@neolit123
Member

this is hopefully going to be fixed in 1.16
xref kubernetes/kubernetes#72102

@mauilion

For folks that happen across this issue. I've created a static pod at git.io/etcdclient.yaml that can be used to interact with etcd once you have deployed it.
Instructions and details here:

https://mauilion.dev/posts/etcdclient/

@neolit123
Member

@mauilion

    - name: ETCDCTL_CERT
      value: /etc/kubernetes/pki/etcd/healthcheck-client.crt
    - name: ETCDCTL_KEY
      value: /etc/kubernetes/pki/etcd/healthcheck-client.key
    - name: ETCDCTL_ENDPOINTS

i've just sent a PR that may be the first step in deprecating the /etcd/healthcheck-client*, btw.
kubernetes/kubernetes#81385

PTAL and do tell if you object to serving an HTTP probe on localhost.
i do not see a problem with that, then again the security folks might come at us with "hey, Eve can now see the etcd metrics of Alice". but well, if Eve gains access to Alice's computer, she can do worse.

@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Aug 29, 2019
@neolit123
Member

this should be fixed in 1.16 when kubeadm will use etcd 3.3.15.
please re-open if needed.

@danbeaulieu
Author

danbeaulieu commented Nov 4, 2019

@neolit123 I don't believe this is fixed in 1.16.2.

# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:15:39Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
I1104 20:33:18.728787   21308 local.go:75] [etcd] Checking etcd cluster health
I1104 20:33:18.728803   21308 local.go:78] creating etcd client that connects to etcd pods
I1104 20:33:18.742041   21308 etcd.go:107] etcd endpoints read from pods: https://10.97.129.1:2379,https://10.97.129.2:2379,https://10.97.129.3:2379,https://10.97.131.111:2379,https://10.97.128.146:2379
context deadline exceeded
error syncing endpoints with etc
k8s.io/kubernetes/cmd/kubeadm/app/util/etcd.NewFromCluster
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/util/etcd/etcd.go:123
k8s.io/kubernetes/cmd/kubeadm/app/phases/etcd.CheckLocalEtcdClusterStatus
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/etcd/local.go:79
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runCheckEtcdPhase
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/checketcd.go:68
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:236
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:424
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:209
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdJoin.func1
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:169
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:830
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
	/workspace/anago-v1.16.2-beta.0.19+c97fe5036ef3df/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
	_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:200
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1337
error execution phase check-etcd

To test, I created a 2-node HA cluster, then modified the ClusterStatus in the kubeadm-config configmap to add 3 bogus nodes https://10.97.129.1:2379,https://10.97.129.2:2379,https://10.97.129.3:2379 and then ran kubeadm join.

After the failure I removed the bad nodes from the ClusterStatus and reran kubeadm join and the node joined successfully.

The current workaround is to edit the kubeadm-config configmap to remove any "bad" nodes before running join ... which is the same as the workaround pre-1.16.

Also, the log message etcd endpoints read from pods... is a little misleading, as the endpoints are actually being read from the ClusterStatus in the configmap.

@neolit123
Member

neolit123 commented Nov 4, 2019

@danbeaulieu
what happens if you deploy kubeadm 1.16 with etcd 3.3.17-0 as the image tag? (this is possible using the kubeadm config API: https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2)

as you can see we already switched 1.17 to this version, but such backports (e.g. for 1.16) are tricky:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants.go#L427
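
For anyone who wants to try that, a sketch of the override (the exact ClusterConfiguration contents are cluster-specific; the kubernetesVersion shown is just an example), typically passed to kubeadm init via --config:

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.16.2
etcd:
  local:
    imageTag: "3.3.17-0"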

@danbeaulieu
Author

I don't believe this is an etcd server version issue; I think it is an issue with the etcd client version used by kubeadm. The client is not fault tolerant to bad endpoints in the endpoint list if the first endpoint is bad.

There are a lot of k/k and etcd PRs and issues related to this bug, but it isn't clear whether it was ever actually fixed in kubeadm 1.16.

@neolit123
Member

i don't think that folks outside of the etcd maintainers understand this.

we updated the client in master / 1.17 too.
you could build kubeadm from source and try again?

cd kubernetes
make all WHAT=cmd/kubeadm

@rbtr

rbtr commented Nov 4, 2019

@danbeaulieu this should be fixed by using the client in etcd versions > 3.3.14
@neolit123 it's unclear to me which version of the etcd client kubeadm is actually using, as k/k has 3.3.10 and 3.4 in the go.mod. I'm going to see if I can pin that down and might open another issue/PR once I have figured it out.

@neolit123
Member

the etcd client version used in the latest 1.16 is v3.3.17:
https://github.com/kubernetes/kubernetes/blob/release-1.16/go.mod#L41

but latest 1.16 does not mean latest stable 1.16.

the SHA for the v1.16.2 tag tells me that the etcd client is at v3.3.15 there:
https://github.com/kubernetes/kubernetes/blob/c97fe5036ef3df2967d086711e6c0c405941e14b/go.mod#L41

so if this is supposed to be fixed by > 3.3.14, it seems it's not.

@rbtr

rbtr commented Nov 4, 2019

etcd v3.3.15 release notes specifically reference that it fixes this issue...
regardless, as of this commit kubeadm should be using the same 3.4 version as the rest of k/k, so it should actually be fixed in master regardless of which 3.3.x revision the fixes went into

@neolit123
Member

etcd v3.3.15 release notes specifically reference that it fixes this issue...

^ @danbeaulieu

regardless, as of this commit kubeadm should be using the same 3.4 version as the rest of k/k, so it should actually be fixed in master regardless of which 3.3.x revision the fixes went into

yes, for 1.17 and master, kubeadm and k8s are now using the 3.4 client.
the go.mod and import changes made this more difficult to see because the new etcd repository has no tags.

@danbeaulieu
Author

@neolit123 do you see anything that could explain why this doesn't seem to be fixed in 1.16.2 (v3.3.15 client), according to my reproduction steps? Is anyone able to reproduce?

@neolit123
Member

nothing on the kubeadm side at least. we are planning to add e2e tests for removing and re-adding nodes eventually, but everyone is out of bandwidth ATM:
#1661

please try 1.17 as well - there are already pre-releases with images out. as mentioned above, it bundles etcd 3.4 server and a newer client.

@cgebe

cgebe commented Dec 29, 2019

Hitting the same issue with 1.17.0, deleted a master node + etcd without running kubeadm reset on the node first.

[check-etcd] Checking that the etcd cluster is healthy
I1229 20:40:28.493126    1674 local.go:75] [etcd] Checking etcd cluster health
I1229 20:40:28.493152    1674 local.go:78] creating etcd client that connects to etcd pods
I1229 20:40:28.511562    1674 etcd.go:107] etcd endpoints read from pods: https://116.203.251.18:2379,https://116.203.251.14:2379,https://116.203.251.13:2379
I1229 20:40:28.524072    1674 etcd.go:166] etcd endpoints read from etcd: https://116.203.251.13:2379,https://116.203.251.18:2379,https://116.203.251.14:2379
I1229 20:40:28.524128    1674 etcd.go:125] update etcd endpoints: https://116.203.251.13:2379,https://116.203.251.18:2379,https://116.203.251.14:2379
I1229 20:40:48.574818    1674 etcd.go:388] Failed to get etcd status for https://116.203.251.14:2379: failed to dial endpoint https://116.203.251.14:2379 with maintenance client: context deadline exceeded

116.203.251.14 was deleted.

What is the current workaround for this issue? Editing kubeadm-config accordingly and restarting etcd + kube-api-server pods does not solve the issue.

@fabriziopandini
Member

@cgebe if I'm not wrong, you have to delete the member using etcdctl.

Btw: kubernetes/enhancements#1380 is going to remove problems related to the kubeadm ClusterStatus getting stale

@cgebe

cgebe commented Jan 1, 2020

@fabriziopandini Thanks, removed the dead etcd node manually!

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove <node_id>

Join thereafter worked as expected!
