kube-apiserver: failover on multi-member etcd cluster fails certificate check on DNS mismatch #83028
Comments
/sig api-machinery |
do you have your full apiserver invocation (specifically the /cc @jpbetz |
Hello @liggitt here it is:
Please note that I run the same command (on amd64, in production) with Kubernetes 1.15.4 without any problem. There seems to be a regression in the 1.16.0 code. |
this seems like the same issue as #72102 (comment) which was supposed to be resolved by #81434 in 1.16 |
@liggitt note that the API server works, but if kube-control-plane-baeg4ahr.k8s.lan (the node whose certificate is currently being looked for) is broken, /healthz returns 500 while the apiserver keeps working anyway in degraded etcd mode.
It seems the etcd client expects the first node's certificate on every node. |
Agree with @liggitt that #72102 (comment) is the most likely cause. I'd check if that resolves the problem first. |
@jpbetz doesn't the error message seem strange to you?
The error message says the certificate is valid for the hostname we're trying to connect to, and is not valid for the first hostname listed in |
Oh wait. The problem cluster is already on 1.16. Looking |
@jpbetz I don't think our fix handles DNS names in failover, since we can only get target IP from remote connection ref. etcd-io/etcd@db61ee1 |
matching priority in original bug (#72102) |
We need to verify whether this was actually a regression in 1.16, or if the same issue existed in 1.15.4 and just never got hit because the first server specified in |
k8s 1.15 uses etcd 3.3.13 and k8s 1.16 uses etcd 3.3.15. In etcd 3.3.14 we switched over to the new gRPC-based client-side load balancer implementation. The commit etcd-io/etcd@db61ee1 that @gyuho mentioned fixed one failover issue introduced by the new balancer, but this one is different. It's quite probable this is a regression, but I agree we should verify. |
cc @dims |
ack @jpbetz will follow along :) |
I've created a reproduction of the issue; it appears to affect both 1.16 and 1.15: https://github.com/jpbetz/etcd/blob/etcd-lb-dnsname-failover/reproduction.md I'm not 100% certain I've gotten the reproduction correct, so extra eyes on it are welcome. |
This was reproduced on 1.15.x as well, so it doesn't appear to be a 1.16 regression. The fix in 1.16 resolved IP TLS validation, but not hostname/DNS validation. |
Removing milestone, but leaving at critical. If a contained fix is developed, I'd recommend it be picked to 1.16.x if possible |
If the 1st etcd member in the
If the etcd member becomes unavailable after the kube-apiserver is started, the kube-apiserver will continue to run but will report the issue in the logs repeatedly, e.g.:
|
Created grpc/grpc-go#3038 to discuss the issue with the gRPC team |
etcd backports of fix: |
Thanks for your time. Can we have a backport to 1.15 too, please? |
unfortunately the transitive dependencies make a backport to 1.15 prohibitive. see #72102 (comment) and #72102 (comment) |
Given that we missed the cut for Kubernetes version 1.16.2, I decided to find a workaround for this problem, to allow my API servers to talk to etcd servers (running version 3.3.17). My etcd server certificates include SANs both for the machine's DNS name and for the subdomain within which they sit for DNS discovery. My API servers start with a set of URLs that mention those per-machine DNS names. Here's what turned out to work well enough for now: use a wildcard SAN in the etcd server certificates in place of the per-machine SAN. Given a subdomain for these machines like cluster-1.kubernetes.local and etcd DNS names like etcd0.cluster-1.kubernetes.local, the certificates normally have DNS name SANs as follows:
I instead created certificates with the wildcard:
Restarting the etcd servers with these temporary certificates satisfied the Kubernetes API servers—tested at both version 1.16.1 and 1.16.2. |
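To illustrate why the wildcard workaround above satisfies a hostname check regardless of which name the client validates against, here is a minimal Go sketch. It is not the commenter's actual certificate setup: the SAN list (`*.cluster-1.kubernetes.local` plus the bare subdomain) is an assumption derived from the names mentioned above, `wildcardCert` is a hypothetical helper, and the certificate value carries only SANs, so no chain or signature validation is performed.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// wildcardCert is a hypothetical helper: it builds a certificate value whose
// DNS SANs are a wildcard for the given subdomain plus the subdomain itself.
// Assumption: this mirrors the workaround only in spirit; the real SAN list
// is not shown in the thread.
func wildcardCert(subdomain string) *x509.Certificate {
	return &x509.Certificate{
		DNSNames: []string{"*." + subdomain, subdomain},
	}
}

func main() {
	cert := wildcardCert("cluster-1.kubernetes.local")

	// Every member name under the subdomain matches the single wildcard SAN,
	// so it no longer matters which member's name the client checks against.
	for _, host := range []string{
		"etcd0.cluster-1.kubernetes.local",
		"etcd1.cluster-1.kubernetes.local",
		"etcd2.cluster-1.kubernetes.local",
		"etcd0.cluster-2.kubernetes.local", // different subdomain: rejected
	} {
		fmt.Printf("%s: %v\n", host, cert.VerifyHostname(host))
	}
}
```

Note that a wildcard SAN covers exactly one DNS label, so it trades per-machine identity for tolerance of the name mismatch; that is presumably why the commenter calls these certificates temporary.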
I was under the impression that just by upgrading etcd to version 3.3.17 the issue was fixed. I don't see where it says that we need to upgrade to 1.16.2. Can someone give me a hint of the k8s and etcd versions needed to fix this issue? |
I'm not saying that upgrading Kubernetes is necessary. I had already upgraded to version 1.16.1 when I first noticed this problem. The question was whether to roll back to our previous version of 1.15.1, which predates this problem. Since I was already running version 1.16.1 and had been preparing to upgrade to 1.16.2 (falsely assuming that what became #83968 would be included), I figured I'd test both of these versions and share my findings with others contemplating using either of them today. |
OK, got it. The explanation is in #72102 (comment): we still need to wait for #83968. Targeting 1.16.3 for the fix. |
The DNS certificate check issue existed in 1.15.x as well. The handling of IPv6 addresses was what regressed in 1.16 (#83550) |
I don't see our API servers running version 1.15.1 complaining like this, so perhaps it wasn't until later in the 1.15 patch sequence. |
The cert verification issue is #72102 and has existed for many releases. It only appears when the API server fails over to a server other than the first one passed to --etcd-servers, so if the first one is available, no error is observed. |
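The failure mode described above can be sketched in Go using only the standard library. This is an illustration under stated assumptions, not the etcd client's or gRPC balancer's actual code: the endpoint names are made up, `memberCert`, `verifyPinnedToFirst`, and `verifyDialed` are hypothetical helpers, and the certificate value carries only DNS SANs (no chain validation happens here).

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// memberCert is a hypothetical helper: it builds a certificate value that
// carries only the given DNS SANs, which is all VerifyHostname inspects.
func memberCert(dnsNames ...string) *x509.Certificate {
	return &x509.Certificate{DNSNames: dnsNames}
}

// verifyPinnedToFirst mimics the behaviour reported in this issue: the
// presented certificate is always checked against the first endpoint's
// name, no matter which member was actually dialed.
func verifyPinnedToFirst(endpoints []string, dialed int, cert *x509.Certificate) error {
	return cert.VerifyHostname(endpoints[0])
}

// verifyDialed checks the certificate against the member actually dialed.
func verifyDialed(endpoints []string, dialed int, cert *x509.Certificate) error {
	return cert.VerifyHostname(endpoints[dialed])
}

func main() {
	// Assumed endpoint names, standing in for the hosts in --etcd-servers.
	endpoints := []string{"etcd0.example.test", "etcd1.example.test"}

	// Fail over to the second member, which presents its own certificate.
	dialed := 1
	cert := memberCert(endpoints[dialed])

	fmt.Println("pinned to first endpoint:", verifyPinnedToFirst(endpoints, dialed, cert))
	fmt.Println("checked against dialed endpoint:", verifyDialed(endpoints, dialed, cert))
}
```

With `dialed` set to 0 both checks pass, which matches the observation that no error is seen while the first listed member is available.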
Well, we must be very lucky around here! |
What happened: kube-apiserver connects to etcd over HTTPS, but the certificate check fails.
What you expected to happen: When kube-apiserver connects to kube-control-plane-mo2phooj, which presents the correct certificate, the check should not fail because the client is looking for another etcd node's certificate.
How to reproduce it (as minimally and precisely as possible): Set up an etcd 3.4 cluster with 3 HTTPS nodes, each node with its own SSL certificate.
Anything else we need to know?:
Environment:
Kubernetes version (kubectl version): 1.16.0
OS (cat /etc/os-release): Debian 9/arm64
Kernel (uname -a): 4.4.167-1213-rockchip-ayufan-g34ae07687fce