-
Notifications
You must be signed in to change notification settings - Fork 858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cert-manager cannot renew k8s-io-prod certificate due to second IPv6 ingress #1476
Comments
@munnerz: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/area infra/cert-manager |
@munnerz A possible short-term workaround for (2) : kubernetes/ingress-gce#43 (comment) |
Nice find 😄 here is a list of in-use ports after running
I can go ahead and update the named ports for each instance group to this list. What is the best way to coordinate running a potentially risky command like this? :) |
/assign |
@munnerz What services would be potentially impacted (documenting a list would be great)? That way we can determine risk and who we'd have to notify. |
An update - @thockin and I worked together to 1) clean up old named ports and 2) deal with the "second Ingress" issues (caused by the addition of the IPv6 ingress) by copying across rules temporarily for this renewal only to get us over the 'hump' The certificate now has In the meantime, there's a number of issues we should work to resolve (copied from Slack):
@cblecker you can see the list here, for future reference 😄: k8s.io/k8s.io/certificate-prod.yaml Lines 14 to 66 in 4260fd2
|
/priority important-soon |
awesome! thank you! |
GCP folks are looking at (3) and then (2) longer term |
/assign I'm going to work on 1(c) above today (controller to sync |
Update - certificate has been renewed:
|
It seems strange to me that more people are not hitting this issue. Is there some mitigating difference in how people are deploying services that I haven't thought of? |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
Do we need this now?
…On Mon, Jan 3, 2022 at 1:32 AM Arnaud M. ***@***.***> wrote:
/remove-lifecycle stale
/lifecycle frozen
—
Reply to this email directly, view it on GitHub
<#1476 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABKWAVDIC5QDBA2PGEQGND3UUFUKXANCNFSM4UVDE5QA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were assigned.Message ID:
***@***.***>
|
@thockin I don't think we need this again. I just don't want to close it until cert-manager is fully removed. |
@ameukam should this be on the radar for 2023? |
Yes we should be removing cert-manager #4160 |
Issue addressed and GKE Networking now offers /close |
@ameukam: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
As initially reported and discussed on Slack, the
k8s-io-prod
certificate (used for the redirector service) is failing to renew.After some debugging, there are two issues at play here:
Hello IPv6 I'm kubernetes #1374 adding a second IPv6 only Ingress resource and the AAAA record configured to point to this - cert-manager only knows to update a single Ingress (named via the
edit-in-place
annotation) to injectpath
entries for the HTTP01 challenge solvers. As per https://letsencrypt.org/docs/ipv6-support/, if an AAAA record is returned then Let's Encrypt will prefer that and utilise it first. If we update cert-manager to only update the IPv6 Ingress resource, Let's Encrypt will quite likely pass validation (as it won't check IPv4) however, because cert-manager performs a 'self check' to ensure all routes are serving traffic correctly, and because our Pods do not utilise IPv6, the self check performed by cert-manager will never pass either. Ultimately, we need to ensure both Ingress resources contain the challenge path entries (which is not something that cert-manager supports today).When running
kubectl describe
on our Ingress resources, the follow error is shown:It appears that ingress-gce does not clean up 'unused named ports' - it was originally added in kubernetes/ingress-gce#430, but was later reverted in kubernetes/ingress-gce#585.
We can see that there are a lot of named ports associated with the 3 'unmanaged instance groups' that ingress-gce creates:
As you can see here, there are far fewer than 1000 nodePorts in our aaa cluster:
(note that some of these nodePort entries are for the 'challenge solvers' for the currently on-going/blocked renewal, and so they are not present in the
get-named-ports
output).Short term solutions/moving forward
The current certificate expires on December 19th (so in ~9 days). We need to resolve both of these issues to get a renewal now.
For (1), I propose we take the simplest approach of manually copying the
path
entries that cert-manager injects into the second Ingress resource. We will then manually remove them again afterwards. This will allow both the v4 and v6 front end IPs to respond to HTTP01 challenge requests.For (2), we need to manually clean up some of these named ports. We have a list of all the
nodePort
allocations fromkubectl get svc -A
above, so we can write a script to calculate which ports are not actually used and set the full list appropriately for each of the instance groups that ingress-gce manages. If we make a mistake here or miss a port, I am not sure whether GCP will actually just reject because it'd break a load balancer, or if it'll make the associated service be unavailable until that port is added back.Longer term solutions
For (1), I see a few avenues:
a) cert-manager is modified to be able to update multiple Ingress resources to inject routes for solving. This is a little out of the ordinary, but isn't the worst thing, especially given how many other awkward hoops we have to jump through to make HTTP01 solving work with that wide variety of ingress controller implementations.
b) write a controller that can be used by ingress-gce users to 'sync' the
path
entries on Ingress resources, treating one as authoritative. This would mean we could configure either the v4 or v6 Ingress to mirror the routes specified on the other one (that cert-manager updates). This feels cleaner than cert-manager updating multiple resources, but it'd be good to get feedback herec) improve cert-manager's extensibility story to allow for "out of tree" HTTP01 solvers, which would in turn mean we could have a standalone 'ingress-gce-solver' which would understand ingress-gce's nuances. (this would also allow for e.g. an out of tree IngressRoute/VirtualService/Ingress v2 solver too). This is certainly something that the cert-manager project should do anyway, though may be a bit more involved as a resolution given the scope of the issue here.
d) not use IPv6, or use something like ingress-nginx running in-cluster (exposed with a TCP load balancer) to handle ingresses
For (2), we are likely to not be affected for a while again, but:
a) patching ingress-gce to allow it to clean up unused ports (may take a while to have this change in effect)
b) write automation to make it easy for us to clean up unused ports (after learning how to do this safely whilst resolving the issue we face today)
I'm going to mark this as
priority/critical-urgent
as we have a count-down timer on this before it becomes very visible/public 😅 - if people are available for a call today/over the coming days, we should try and move swiftly to get agreement on the short term solution so we can all go home for the holidays 🎅 🎄/priority critical-urgent
/area cert-manager
/cc @thockin @dims @aojea @BenTheElder @bartsmykla
The text was updated successfully, but these errors were encountered: