
KCP: upgrades are disruptive to api server clients #2652

Closed
sethp-nr opened this issue Mar 12, 2020 · 13 comments

Comments

@sethp-nr (Contributor)

What happened

Continuing from #2651: while running a KCP upgrade on a self-hosted cluster, we noticed that the process was fairly disruptive.

To be clear, the process would be disruptive even when run from an outboard management cluster, but with cluster-api being a client of itself we got an up-close look at what other clients experience during an upgrade.

The summary of the issue is that the flow looks roughly like this:

  1. Pick a control plane machine for replacement, cp
  2. Remove cp's etcd membership
  3. Remove the cp Machine, which cascades to the infrastructure provider
  4. The infrastructure provider shuts down the underlying resource, stopping static pods

Between step 3 and sometime during step 4, the apiserver pod is still running, but the local etcd pod is crashing repeatedly.

What I haven't dug into is what the API server running on that host is doing, but from the behavior we observed client-side it seemed to be up and accepting connections while being unable to process them.

We're also not sure whether a single API server was borked or whether all the API servers were each experiencing some kind of partial failure at the time, but given our 100% failure rate it seems likely that there's another layer of disruption happening here.
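
For anyone trying to observe the same window, this is roughly the kind of thing we were watching from the management side (a rough sketch: it assumes kubeadm's standard static-pod labels, and the kubeconfig path is a placeholder):

    # Watch the control-plane static pods while the Machine is replaced
    # (kubeadm labels its static pods with component=etcd / component=kube-apiserver):
    kubectl --kubeconfig=workload.kubeconfig -n kube-system get pods \
      -l 'component in (etcd, kube-apiserver)' -o wide --watch

    # In parallel, issue requests the way a controller would and log failures
    # (the 5s timeout is arbitrary):
    while true; do
      kubectl --kubeconfig=workload.kubeconfig --request-timeout=5s get nodes >/dev/null \
        || echo "$(date +%T) request failed"
      sleep 1
    done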

Ideas

  • Is it wise to teach the apiserver pod to talk to non-local etcd processes, so each apiserver can continue functioning normally as long as some etcd members are healthy? (We'd have to update the member list after later joins; sketched below.)
  • Can we shut down the API server pod independently from stopping the infrastructure?
  • Should we reverse the order so that the etcd member remove happens after the Machine is gone?

Vince suggested that #2525 would probably help, and I'm inclined to agree. It looks like, at least in our default configuration (I haven't checked whether we can override it), the API server is configured with this flag:

    - --etcd-servers=https://127.0.0.1:2379

So even if we sort out the etcd leadership more gracefully, removing the cp node from the etcd cluster will cause the apiserver on that machine to have trouble.
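
For illustration, the first idea above would amount to something like the following (a sketch only: the addresses are placeholders, and kubeadm/KCP would still have to keep the list in sync as members join and leave):

    # Today (stacked etcd, default): each apiserver only knows about its local member
    - --etcd-servers=https://127.0.0.1:2379

    # Hypothetical: list every member, so losing the local member doesn't
    # immediately break the apiserver on that machine
    - --etcd-servers=https://10.0.0.10:2379,https://10.0.0.11:2379,https://10.0.0.12:2379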

@ncdc added this to the v0.3.x milestone on Mar 18, 2020
@ncdc added the area/control-plane and priority/awaiting-more-evidence labels on Mar 18, 2020
@ncdc (Contributor) commented Mar 18, 2020

Now that #2525 is in, we should retest and see if this is still an issue.

@ncdc added the kind/bug label on Mar 18, 2020
@sethp-nr (Contributor, Author)

When I was testing Chuck's PR yesterday (on top of #2525), I was still seeing a couple of cases where the controllers running in the management cluster (in kind) had broken connections or timeouts trying to talk to the workload cluster (in AWS).

From what I can see, the apiserver pod is still showing as Ready on the node where etcd has entered CrashLoopBackOff, which I interpret to mean that 1) requests are still being routed to it, and 2) those requests are doomed to some kind of failure.

I think we need to coordinate the shutdown of the apiserver pod differently relative to the etcd member removal; I'm not sure we're getting the full benefit of the leadership stability work yet.
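
For the record, this is roughly how I've been checking (the pod name is a placeholder, and /readyz assumes a recent enough apiserver):

    # Is the apiserver static pod on the affected node still reporting Ready?
    kubectl -n kube-system get pod kube-apiserver-<node-name> -o wide

    # What do the apiserver's own readiness checks say? (/readyz includes an etcd
    # check; note this request may land on any apiserver behind the load balancer,
    # not necessarily the one on the affected node)
    kubectl get --raw='/readyz?verbose'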

@vincepri removed the priority/awaiting-more-evidence label on Jun 10, 2020
@vincepri (Member)

/priority important-longterm

@k8s-ci-robot added the priority/important-longterm label on Jun 10, 2020
@vincepri (Member)

/assign @CecileRobertMichon
/milestone v0.3.9

To re-triage for v0.3.9 and see if we have any other action items for v0.3 or the next release.

@CecileRobertMichon (Contributor)

@sethp-nr I am seeing similar behavior when updating a single-control-plane cluster: for a brief moment, when the replacement control plane joins and the number of control plane nodes goes from 1 to 2, clients are disrupted. I believe this is expected, as there is a brief loss of etcd quorum while the second member joins.

However, I am not reproducing this behavior with a larger number of control plane machines (tried with 7). I do see the etcd pod of the node marked for deletion going into CrashLoopBackOff before it goes to Terminating, while the Machine is in the Deleting phase, but apiserver availability seems unaffected:

    kube-system   etcd-capi-quickstart-control-plane-mhnpk   0/1   CrashLoopBackOff   4   64m

Is that consistent with your experience? If not, do you have any specific steps you could share to repro the issue?
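
For reference, the rough repro I've been using looks something like this (the KCP name is from the quickstart, the version is arbitrary, and any spec change that triggers a rollout should behave the same):

    # Trigger a KCP rollout by bumping the control plane Kubernetes version:
    kubectl patch kubeadmcontrolplane capi-quickstart-control-plane --type merge \
      -p '{"spec":{"version":"v1.18.3"}}'

    # Then watch Machines and the etcd static pods side by side:
    kubectl get machines --watch
    kubectl -n kube-system get pods -l component=etcd --watch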

@vincepri (Member)

/milestone Next

@k8s-ci-robot changed the milestone from v0.3.9 to Next on Aug 24, 2020
@vincepri (Member)

@rudoi Do you know if this is still happening?

/assign @yastij
To re-triage

@vincepri removed the priority/important-longterm label on Oct 22, 2020
@vincepri (Member)

/priority awaiting-more-evidence

@k8s-ci-robot added the priority/awaiting-more-evidence label on Oct 22, 2020
@rudoi (Contributor) commented Oct 22, 2020

> @rudoi Do you know if this is still happening?
>
> /assign @yastij
> To re-triage

We haven't seen this issue in quite some time, but we 1. haven't done many KCP upgrades recently and 2. have migrated primarily to EKS control planes, so our sample size is quite a bit smaller.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 20, 2021
@fabriziopandini (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 20, 2021
@yastij (Member) commented Jan 20, 2021

I've done some testing on VMC and didn't see this happening with the latest CAPI. Let's close this and reopen if someone sees it again.

/close

@k8s-ci-robot (Contributor)

@yastij: Closing this issue.

In response to this:

> I've done some testing on VMC and didn't see this happening with the latest CAPI. Let's close this and reopen if someone sees it again.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
