KCP: upgrades are disruptive to api server clients #2652
Comments
Now that #2525 is in, we should retest and see if this is still an issue.
When I was testing Chuck's PR yesterday it was on top of #2525, and I was still seeing a couple of cases where the controllers running in the management cluster (in kind) had broken connections or timeouts trying to talk to the workload cluster (in AWS). From what I can see, the apiserver pod is still showing as running. I think we need to coordinate the shutdown of the apiserver pod differently relative to the etcd member removal; I'm not sure we're getting the full benefit of the leadership stability work yet.
/priority important-longterm
/assign @CecileRobertMichon

To re-triage for v0.3.9 and see if we have any other action items for v0.3 or next release.
@sethp-nr I am seeing similar behaviors when updating a single control plane cluster, for a brief moment when the second (replacement) control plane joins and the number of control plane nodes goes from 1 to 2. This is expected, I believe, as there is a loss of etcd quorum. However, I am not reproducing such behavior with a larger number of control planes (tried with 7). I do see the etcd pod of the node that is marked as getting deleted going into CrashLoopBackOff.

Is that consistent with your experience? If not, do you have any specific steps you could share to repro the issue?
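To spell out the quorum arithmetic behind that (nothing cluster-specific here, just etcd's majority rule): a 2-member cluster needs both members healthy to accept writes, so losing either one blocks the cluster, while a 7-member cluster tolerates 3 failures.

```go
package main

import "fmt"

// etcd accepts writes only while a majority (quorum) of members is healthy.
func main() {
	for _, members := range []int{1, 2, 3, 5, 7} {
		quorum := members/2 + 1
		tolerated := members - quorum
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", members, quorum, tolerated)
	}
}
```

Which is why the blip shows up exactly at the 1→2 transition and not with 7 control plane nodes.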
/milestone Next
/priority awaiting-more-evidence
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
I've done some testing on VMC and I didn't see this happening with the latest CAPI. Let's close this and reopen if someone sees it again.

/close
@yastij: Closing this issue.
What happened
Continuing from #2651, when we were running a KCP upgrade on a self-hosted cluster we noticed that the process was fairly disruptive.
To be clear, the process would be disruptive even being run from an outboard management cluster, but with cluster-api being a client of itself we got an up-close look at what other clients would experience during an upgrade.
The summary of the issue is that the flow looks roughly like this:

1. …
2. … `cp`
3. … removes `cp`'s etcd membership
4. … deletes the `cp` Machine, which cascades to the infrastructure provider

Between step 3 and sometime during step 4, the apiserver pod is still running, but the local etcd pod is crashing repeatedly.
What I haven't dug into is what the API server running on that host is doing, but from the behavior we observed client-side it seemed like it might be up and accepting connections, but unable to process them.
We're also not sure whether it was a single API server that was borked or if all the api servers were each experiencing some kind of partial failure at the time, but given our 100% failure rate it seems likely that there's another layer of disruption happening here.
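One way to narrow that down next time would be to poll each control plane endpoint's /readyz during the upgrade; the verbose output includes a per-check breakdown (there's an etcd check), so it should show whether a given apiserver still thinks it's healthy while clients are failing. A rough sketch, with a made-up host address and TLS verification skipped for brevity:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Made-up control plane address; point this at each control plane machine in turn.
	url := "https://10.0.0.10:6443/readyz?verbose"

	client := &http.Client{
		Timeout: 5 * time.Second,
		// Skipping cert verification keeps the sketch short; use the cluster CA instead.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// A non-200 with the etcd check failing would point at the local member.
			fmt.Printf("%s status=%d\n%s\n", time.Now().Format(time.RFC3339), resp.StatusCode, body)
		}
		time.Sleep(2 * time.Second)
	}
}
```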
Ideas
Vince suggested that #2525 would probably help, and I'm inclined to agree. It looks like at least in our default mode (I haven't checked to see if we can override) the api server is configured with this flag:
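Guessing at the flag from kubeadm's stacked-etcd defaults (I haven't checked this cluster's manifests): each control plane node's kube-apiserver is normally started with --etcd-servers=https://127.0.0.1:2379, i.e. it only ever talks to the etcd member on the same machine, with no other endpoint to fall back to. For what it's worth, a minimal way to poke at that local member directly, assuming the usual kubeadm certificate paths, would look something like:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/client/pkg/v3/transport"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// In a stacked-etcd layout this is the only endpoint the co-located apiserver knows about.
	endpoint := "https://127.0.0.1:2379"

	// Client certs as laid down by kubeadm; adjust the paths if your layout differs.
	tlsInfo := transport.TLSInfo{
		CertFile:      "/etc/kubernetes/pki/apiserver-etcd-client.crt",
		KeyFile:       "/etc/kubernetes/pki/apiserver-etcd-client.key",
		TrustedCAFile: "/etc/kubernetes/pki/etcd/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		fmt.Println("tls config failed:", err)
		return
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 3 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Once this member has been removed from the cluster, this starts failing,
	// and with it every request the co-located apiserver tries to serve.
	status, err := cli.Status(ctx, endpoint)
	if err != nil {
		fmt.Println("local etcd member unhealthy:", err)
		return
	}
	fmt.Printf("member=%x leader=%x dbSize=%d\n", status.Header.MemberId, status.Leader, status.DbSize)
}
```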
So even if we sort out the etcd leadership more gracefully, removing the `cp` node from the etcd cluster will cause the apiserver on that machine to have trouble.