
KCP: upgrades are disruptive to api server clients #2652

Closed
sethp-nr opened this issue Mar 12, 2020 · 13 comments

Comments

@sethp-nr (Contributor)

What happened

Continuing from #2651: while running a KCP upgrade on a self-hosted cluster, we noticed that the process was fairly disruptive.

To be clear, the process would be disruptive even when run from an outboard management cluster, but with cluster-api being a client of itself we got an up-close look at what other clients experience during an upgrade.

The summary of the issue is that the flow looks roughly like this:

  1. Pick a control plane machine for replacement, cp
  2. Remove cp's etcd membership
  3. Remove the cp Machine, which cascades to the infrastructure provider
  4. The infrastructure provider shuts down the underlying resource, stopping static pods

Between step 3 and sometime during step 4, the apiserver pod is still running, but the local etcd pod is crashing repeatedly.

What I haven't dug into is what the API server running on that host is doing, but from the behavior we observed client-side it seemed to be up and accepting connections while being unable to process them.

We're also not sure whether a single API server was borked or whether all the API servers were each experiencing some kind of partial failure at the time, but given our 100% failure rate it seems likely that there's another layer of disruption happening here.
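
For anyone trying to observe the same window, this is roughly the kind of thing we were watching from the management side (a rough sketch: it assumes kubeadm's standard static-pod labels, and the kubeconfig path is a placeholder):

    # Watch the control-plane static pods while the Machine is replaced
    # (kubeadm labels its static pods with component=etcd / component=kube-apiserver):
    kubectl --kubeconfig=workload.kubeconfig -n kube-system get pods \
      -l 'component in (etcd, kube-apiserver)' -o wide --watch

    # In parallel, issue requests the way a controller would and log failures
    # (the 5s timeout is arbitrary):
    while true; do
      kubectl --kubeconfig=workload.kubeconfig --request-timeout=5s get nodes >/dev/null \
        || echo "$(date +%T) request failed"
      sleep 1
    done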

Ideas

  • Is it wise to teach the apiserver pod to talk to non-local etcd processes, so each apiserver can continue functioning normally as long as some etcd members are healthy? (We'd have to update the member list after later joins; sketched below.)
  • Can we shut down the API server pod independently from stopping the infrastructure?
  • Should we reverse the order so that the etcd member remove happens after the Machine is gone?

Vince suggested that #2525 would probably help, and I'm inclined to agree. It looks like, at least in our default configuration (I haven't checked whether we can override it), the API server is configured with this flag:

    - --etcd-servers=https://127.0.0.1:2379

So even if we sort out the etcd leadership more gracefully, removing the cp node from the etcd cluster will cause the apiserver on that machine to have trouble.
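
For illustration, the first idea above would amount to something like the following (a sketch only: the addresses are placeholders, and kubeadm/KCP would still have to keep the list in sync as members join and leave):

    # Today (stacked etcd, default): each apiserver only knows about its local member
    - --etcd-servers=https://127.0.0.1:2379

    # Hypothetical: list every member, so losing the local member doesn't
    # immediately break the apiserver on that machine
    - --etcd-servers=https://10.0.0.10:2379,https://10.0.0.11:2379,https://10.0.0.12:2379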

@ncdc added this to the v0.3.x milestone on Mar 18, 2020
@ncdc added the area/control-plane and priority/awaiting-more-evidence labels on Mar 18, 2020
@ncdc (Contributor) commented Mar 18, 2020

Now that #2525 is in, we should retest and see if this is still an issue.

@ncdc added the kind/bug label on Mar 18, 2020
@sethp-nr (Contributor, Author)

When I was testing Chuck's PR yesterday (on top of #2525), I was still seeing a couple of cases where the controllers running in the management cluster (in kind) had broken connections or timeouts trying to talk to the workload cluster (in AWS).

From what I can see, the apiserver pod is still showing as Ready on the node where etcd has entered CrashLoopBackOff, which I interpret to mean that 1) requests are still being routed to it, and 2) those requests are doomed to some kind of failure.

I think we need to coordinate the shutdown of the apiserver pod differently relative to the etcd member removal; I'm not sure we're getting the full benefit of the leadership stability work yet.
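
For the record, this is roughly how I've been checking (the pod name is a placeholder, and /readyz assumes a recent enough apiserver):

    # Is the apiserver static pod on the affected node still reporting Ready?
    kubectl -n kube-system get pod kube-apiserver-<node-name> -o wide

    # What do the apiserver's own readiness checks say? (/readyz includes an etcd
    # check; note this request may land on any apiserver behind the load balancer,
    # not necessarily the one on the affected node)
    kubectl get --raw='/readyz?verbose'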

@vincepri removed the priority/awaiting-more-evidence label on Jun 10, 2020
@vincepri (Member)

/priority important-longterm

@k8s-ci-robot added the priority/important-longterm label on Jun 10, 2020
@vincepri (Member)

/assign @CecileRobertMichon
/milestone v0.3.9

To re-triage for v0.3.9 and see if we have any other action items for v0.3 or the next release.

@CecileRobertMichon (Contributor)

@sethp-nr I am seeing similar behavior when updating a single-control-plane cluster: for a brief moment, when the replacement control plane joins and the number of control plane nodes goes from 1 to 2, clients are disrupted. I believe this is expected, as there is a brief loss of etcd quorum while the second member joins.

However, I am not reproducing this behavior with a larger number of control plane machines (tried with 7). I do see the etcd pod of the node marked for deletion going into CrashLoopBackOff before it goes to Terminating, while the Machine is in the Deleting phase, but apiserver availability seems unaffected:

    kube-system   etcd-capi-quickstart-control-plane-mhnpk   0/1   CrashLoopBackOff   4   64m

Is that consistent with your experience? If not, do you have any specific steps you could share to repro the issue?
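
For reference, the rough repro I've been using looks something like this (the KCP name is from the quickstart, the version is arbitrary, and any spec change that triggers a rollout should behave the same):

    # Trigger a KCP rollout by bumping the control plane Kubernetes version:
    kubectl patch kubeadmcontrolplane capi-quickstart-control-plane --type merge \
      -p '{"spec":{"version":"v1.18.3"}}'

    # Then watch Machines and the etcd static pods side by side:
    kubectl get machines --watch
    kubectl -n kube-system get pods -l component=etcd --watch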

@vincepri (Member)

/milestone Next

@k8s-ci-robot changed the milestone from v0.3.9 to Next on Aug 24, 2020
@vincepri (Member)

@rudoi Do you know if this is still happening?

/assign @yastij
To re-triage

@vincepri removed the priority/important-longterm label on Oct 22, 2020
@vincepri (Member)

/priority awaiting-more-evidence

@k8s-ci-robot added the priority/awaiting-more-evidence label on Oct 22, 2020
@rudoi (Contributor) commented Oct 22, 2020

> @rudoi Do you know if this is still happening?
>
> /assign @yastij
> To re-triage

We haven't seen this issue in quite some time, but we 1. haven't done many KCP upgrades recently and 2. have migrated primarily to EKS control planes, so our sample size is quite a bit smaller.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 20, 2021
@fabriziopandini (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 20, 2021
@yastij (Member) commented Jan 20, 2021

I've done some testing on VMC and didn't see this happening with the latest CAPI. Let's close this and reopen if someone sees it again.

/close

@k8s-ci-robot (Contributor)

@yastij: Closing this issue.

In response to this:

> I've done some testing on VMC and didn't see this happening with the latest CAPI. Let's close this and reopen if someone sees it again.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
