
Scale down control plane not working (2 CP to 1 CP) #96

Closed
nasusoba opened this issue Mar 20, 2024 · 2 comments · Fixed by #103

Comments

@nasusoba (Contributor)

Creating a cluster with 3 control planes and then scaling down to 1 is buggy; it needs more investigation.

@mogliang (Collaborator)

Wonderful, @nasusoba! So we've begun to benefit from the e2e tests!

@nasusoba (Contributor, Author) commented Mar 27, 2024

Setup: Create a k3s cluster with 3 CPs and 1 worker node, then scale down to 1 CP.
Behavior: After scaling down to 1 CP, k3s on the remaining CP keeps crashing and the API server is not reachable.

Log from the remaining CP:

Mar 25 06:50:53 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: time="2024-03-25T06:50:53Z" level=info msg="Starting k3s v1.28.6+k3s2 (c9f49a3b)"

....

Mar 25 07:30:12 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:12.108733Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"b61dd090a9d8a70a","local-member-id":"f460ff1f4322ce47","removed-remote-peer-id":"c627ba7d600c404b","removed-remote-peer-urls":["https://172.18.0.5:2380"]}

....

Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.608487Z","caller":"membership/cluster.go:421","msg":"added member","cluster-id":"b61dd090a9d8a70a","local-member-id":"f460ff1f4322ce47","added-peer-id":"39056615cf9e3fe5","added-peer-peer-urls":["https://172.18.0.5:2380"]}
Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.608519Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"39056615cf9e3fe5"}
Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.60855Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"f460ff1f4322ce47","remote-peer-id":"39056615cf9e3fe5"}
Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.608734Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f460ff1f4322ce47","remote-peer-id":"39056615cf9e3fe5"}
Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.609126Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f460ff1f4322ce47","remote-peer-id":"39056615cf9e3fe5"}
Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.609384Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"39056615cf9e3fe5"}
Mar 25 07:30:30 capik3s-create-9j5sww-control-plane-mbvhz k3s[394]: {"level":"info","ts":"2024-03-25T07:30:30.609419Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"f460ff1f4322ce47","remote-peer-id":"39056615cf9e3fe5","remote-peer-urls":["https://172.18.0.5:2380"]}

We call etcd RemoveMember from CAPI, but after 10s the removed member gets added back (with a different ID)!
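
For reference, the removal side looks roughly like the following. This is a minimal sketch using the upstream etcd clientv3 API, not the actual CAPI code path; the endpoint, peer URL, and TLS handling are placeholders taken from the log above.

```go
// Sketch only: remove the etcd member that backs the scaled-down control-plane node,
// identified by its peer URL. Endpoint/peer URL are placeholders from the log above.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://172.18.0.5:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		// TLS: ... // k3s etcd requires client certs; omitted in this sketch
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Find the member that corresponds to the Machine being scaled down.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		for _, u := range m.PeerURLs {
			if u == "https://172.18.0.5:2380" { // peer URL of the node to remove
				if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
					log.Fatalf("failed to remove member %x: %v", m.ID, err)
				}
				log.Printf("removed member %x", m.ID)
			}
		}
	}
}
```

In the failing case above, the same peer URL reappears under a new member ID about 10 seconds after this removal succeeds.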

It seems that adding the etcd.k3s.cattle.io/remove="true" annotation to the workload cluster node before deleting the Machine in the management cluster prevents the node from being re-added as an etcd member, so the control plane scale-down can complete. Refer to [master] Add etcd-member-management controller to K3s by Oats87 · Pull Request #4001 · k3s-io/k3s (github.com).
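
For illustration, a minimal sketch of applying that annotation with client-go before deleting the Machine. It assumes kubeconfig access to the workload cluster; the kubeconfig path and node name are placeholders.

```go
// Sketch only: mark the workload-cluster node for etcd member removal before
// the Machine is deleted, per the k3s etcd-member-management controller
// referenced above (k3s-io/k3s PR #4001).
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/workload-kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodeName := "capik3s-create-9j5sww-control-plane-mbvhz" // node backing the Machine being deleted

	// Merge-patch the annotation so k3s removes the member and does not re-add it.
	patch := []byte(`{"metadata":{"annotations":{"etcd.k3s.cattle.io/remove":"true"}}}`)
	if _, err := cs.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("annotated node %s for etcd member removal", nodeName)
}
```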

But the node drain never finishes (the Machine gets force-deleted after 10 min). We need more investigation into why draining fails, and into whether adding the annotation is the official way to remove a node.
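
To help with that investigation, here is a small debugging sketch (kubeconfig path and node name are placeholders) that lists the pods still scheduled on the draining node; that usually points at what blocks eviction, e.g. a PodDisruptionBudget or an unmanaged pod.

```go
// Sketch only: list pods still scheduled on the node whose drain never completes.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/workload-kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodeName := "capik3s-create-9j5sww-control-plane-mbvhz" // node being drained
	pods, err := cs.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		owner := "none"
		if refs := p.GetOwnerReferences(); len(refs) > 0 {
			owner = refs[0].Kind
		}
		fmt.Printf("%s/%s phase=%s owner=%s\n", p.Namespace, p.Name, p.Status.Phase, owner)
	}
}
```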

@nasusoba nasusoba changed the title Scale down control plane not working Scale down control plane not working (2 CP to 1 CP) Apr 2, 2024