Migrate mybinder.org from kube-lego to cert-manager for LetsEncrypt #1148
Ok, I have a new plan of attack for this issue. Having just (re-)installed cert-manager on Hub23, I relied a lot on the CI pipeline to make changes. So I think I'm going to do the same on this cluster and follow the docs @consideRatio and I wrote in Oslo here. Each time a "perform helm upgrade" instruction appears in those docs, it will correspond to a PR merge. I'll do this first for the staging cluster, then prod if all goes well. |
Travis really does not like the cert-manager helm chart!!
We will also need to think about how to include the CRDs:
I'm not sure Travis can dial out to install the CRDs? |
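For reference, the manual CRD install is a single kubectl apply against the release manifest; a sketch, where the version pin is illustrative and should match the chart version actually being deployed:

```bash
# Install the cert-manager CRDs ahead of the helm upgrade (the release-0.11
# pin is illustrative; match it to the chart version being deployed).
kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.11/deploy/manifests/00-crds.yaml
```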
Ugh. A lot of the above is because the list of requirements in |
Got so close to having this work, but then ran into this error when deploying onto staging:
|
Ok, so Travis doesn't have permission to install the Custom Resource Definitions onto the cluster:
Truncated output, full output can be found here. |
Some new documentation has appeared about installing cert-manager with BinderHub: https://binderhub.readthedocs.io/en/latest/https.html. It would be nice to hear what you think, and it may have some hints in it (though I think we've done all of that already). Should we come up with a plan for how to get this implemented? This is how I'd do it (I think):
What I am unsure about is how/when the switch from the certificates we currently have to those obtained by cert-manager will happen. |
Cool, I will check this out!
Yep, this sounds good to me.
I wasn't going to change the name of the k8s secrets the certificates are stored in, so I believe cert-manager will just watch those and renew them when they expire. @consideRatio may be able to verify this hunch. |
Without reading everything, I'll quickly say this over lunch: cert-manager does the following:
If cert-manager is configured to get a certificate for an ingress resource, but there is already a k8s secret resource with a TLS certificate available in it, cert-manager will not do anything until the certificate is about to expire, at which point it will update it. |
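To sanity-check that behaviour, one can inspect the expiry date of the certificate already sitting in the secret; a sketch, with the secret name and namespace as placeholders:

```bash
# Decode the TLS certificate from the existing secret and print when it
# expires (secret name and namespace are placeholders).
kubectl get secret binder-tls --namespace prod -o jsonpath='{.data.tls\.crt}' \
  | base64 --decode \
  | openssl x509 -noout -enddate
```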
This was relevant for me to read: https://docs.cert-manager.io/en/latest/tasks/upgrading/upgrading-0.10-0.11.html#additional-annotation-changes |
Ok, so I tried running
I'm guessing that my local helm version is way ahead of the version installed on the GKE clusters. |
I locally installed helm version 2.11.0, which is the version Travis installs, and I'm still getting the following error caused by the
|
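A quick way to confirm a client/Tiller mismatch with Helm 2 (a sketch):

```bash
# Print both the local client version and the in-cluster Tiller version;
# upgrades commonly fail when the client is newer than Tiller.
helm version --short
```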
This only seems to happen when I try to talk to the GKE cluster from my MacBook. Update: sorted the secrets out in Cloud Shell; running deploy.py for staging there. |
It may be that the humans have permission but running |
I'm not sure that I even have permission to install the CRDs now... Edit: So I tried to set up a new service account with the "Kubernetes Engine Viewer" role, which is suggested (from my digging) to grant the |
Does anyone know how I can fix this?
|
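One thing worth checking is what the active credentials are actually allowed to do; a sketch:

```bash
# Ask the k8s API server whether the current identity may create CRDs
# (CRDs are cluster-scoped); prints "yes" or "no".
kubectl auth can-i create customresourcedefinitions
```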
@sgibson91, I'm a bit confused about things here, specifically the distinction between a KSA (Kubernetes service account) and a GSA (Google Cloud service account) and their permissions on GCP and k8s. I think you are trying to interact with the k8s API server as a GSA, and I think that GSAs can be granted a k8s (cluster)role by a (cluster)rolebinding. I think the GSA has not been coupled with a high enough clusterrole yet, and you need to increase the permissions the GSA has when interacting with the k8s API server. Consider step 6 in this documentation: http://z2jh.jupyter.org/en/latest/google/step-zero-gcp.html; I think you need to do something very much like this. Before just giving cluster-admin rights to the GSA, I would also be curious to see which clusterrolebindings currently couple clusterroles with the GSA. You could do |
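The step referenced above amounts to coupling the account with a sufficiently powerful clusterrole; a sketch, with the binding name and account email as placeholders:

```bash
# Inspect which clusterroles are already coupled to the GSA.
kubectl get clusterrolebindings -o wide

# Grant cluster-admin so the account can create cluster-scoped resources
# such as CRDs (binding name and email address are placeholders).
kubectl create clusterrolebinding deployer-cluster-admin \
  --clusterrole=cluster-admin \
  --user=deployer@my-project.iam.gserviceaccount.com
```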
Thanks @consideRatio. |
I just got an email from Let's Encrypt about the need to switch away from kube-lego:
|
I've never managed to get cert-manager to work anywhere ever :D @sgibson91 have you had better luck? jupyterhub/zero-to-jupyterhub-k8s#1539 now lets you have automatic HTTPS from z2jh. We could re-use the same code for binder. I'm happy to put some effort into that later this month. This would also make it much easier for other deployers to use HTTPS... |
@yuvipanda Yes! I have cert-manager running on the Turing mybinder cluster and Hub23. My blocker with the staging/prod GKE clusters is giving a service account the right permissions to install the CRDs (that, and my lack of time!). I'm just not as familiar with gcloud as I am with Azure. My new plan for this issue was to give myself the cluster role binding rather than Travis. We'd only need to change the CRDs if we switched cert-manager versions. I can have another look this weekend. |
I think the turing cluster now runs on cert-manager. I think switching over requires a manual install of the CRDs on the cluster, switching which dependency we use and updating the annotations. I've been using cert-manager for about half a year for deployments and it "just works". I think I essentially do what is in https://binderhub.readthedocs.io/en/latest/https.html. A pitfall is if you installed the CRDs in an older version than the current one (some kind of shadowing happens with no error message or warning). |
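One way to spot that pitfall is to check which CRD API groups are actually installed before upgrading; a sketch:

```bash
# Pre-0.11 cert-manager CRDs live in the certmanager.k8s.io API group;
# 0.11+ uses cert-manager.io. A stale group hints at leftover old CRDs.
kubectl get crds | grep -E 'certmanager\.k8s\.io|cert-manager\.io'
```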
I don't think we need to add extra rights to the service account. I'd install the CRDs manually because it is a one-time step and requires privileges that aren't usually needed (trying to keep the permissions the service accounts have as low as possible). |
That's awesome, @sgibson91 @betatim! Glad to hear it's worked out :) We should also probably change the let's encrypt account email from yuvipanda@gmail.com to something more general :D |
For the email I'd suggest we use https://groups.google.com/forum/#!forum/binder-team (binder-team@googlegroups.com) |
@sgibson91 @betatim regarding updating the annotations, I suggest you don't do that. Instead, keep the old annotations and set defaults for cert-manager's ingress-shim. This is how I've configured my defaults, for example:

```yaml
cert-manager:
  ingressShim:
    defaultIssuerName: "letsencrypt-prod"
    defaultIssuerKind: "ClusterIssuer"
    defaultACMEChallengeType: "http01"
```
|
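With those ingress-shim defaults set, an Ingress should only need the generic kube-lego-era annotation to get a certificate; a sketch, with the ingress name as a placeholder:

```bash
# The generic tls-acme annotation is enough once ingress-shim defaults are
# configured; no issuer-specific cert-manager annotations are needed.
kubectl annotate ingress binderhub kubernetes.io/tls-acme="true"
```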
@betatim absolutely, I'll give it another go this weekend and let you know if I come across any roadblocks. +1 on the email address, do we have a binder team Gmail or something? |
If installing the CRDs manually works for a while without much issue, that is a suitable option until Helm 3 is used and cert-manager provides its CRDs in a Helm 3 manner that makes them easy to install with Helm as well, I think. Perhaps this was even possible with Helm 2, assuming the cert-manager helm chart was updated. I have investigated this in the past, and they are working on making it easier, but for now it isn't so easy, so delaying adaptations could be suitable. |
Don't know what I did but I just successfully installed the CRDs onto the staging cluster! Steps I took are here: https://hackmd.io/j0NflItbRUO_9Z0dI4EUVQ |
Example of things that confuse me:
I just activated my service account, so why is it using travis-deployer? Answer: it's not rewriting my kubeconfig file. |
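Checking which identity each tool is actually using makes this less confusing, since activating a GCP service account does not rewrite the kubeconfig by itself; a sketch:

```bash
# Identity used for gcloud / GCP API calls.
gcloud config get-value account

# Context (and therefore credentials) kubectl will use; this only changes
# when something rewrites the kubeconfig.
kubectl config current-context
```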
Here's the most recent thing that I don't understand, but the Turing Way Book Dash is about to kick off so I should probably start paying attention. Error from helm upgrade:
But!
|
@sgibson91 authenticating like that with gcloud makes future calls to the GCP API be made as that GCP service account, but that doesn't mean you act as that user in Kubernetes when using kubectl; that isn't the GCP API. With k8s, you act with the credentials in your kubeconfig. So, to update your kubeconfig, perhaps you need to do: / Erik, from mobile on a train, somewhat limited capacity |
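The elided command is presumably something along these lines (a guess; cluster name, zone, and project are placeholders):

```bash
# Rewrite the kubeconfig entry for the cluster so kubectl picks up
# credentials for the currently active gcloud identity (names are
# placeholders).
gcloud container clusters get-credentials staging-cluster \
  --zone us-central1-a --project my-binder-project
```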
Hmmm, what makes helm understand it should look for a serviceaccount in the staging namespace specifically? I think it may be looking in the wrong namespace. When writing rolebindings etc. you can specify the namespace with something like :staging, I think; I don't remember fully and I'm just dropping wild guesses at what may be relevant. Add --debug and --namespace staging to that helm upgrade command, perhaps. |
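Concretely, that suggestion would look roughly like this (release and chart names are assumptions, not taken from the repo):

```bash
# Pin the namespace and turn on verbose output to see which namespace and
# serviceaccount helm actually resolves (names are assumptions).
helm upgrade staging ./mybinder --namespace staging --debug
```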
Unfortunately, I find
|
If I run with
|
Yes :/ This is bonkers. |
OH BTW! Disable the cert-manager webhook! It adds complexity that breaks stuff, only to verify that you have valid yaml in your cert-manager resources. See: https://gitlab.com/gitlab-org/charts/gitlab/issues/1809. So that this doesn't keep bugging us, I'll make a PR and close that issue along with it. Note though that they alias cert-manager as certmanager in their requirements.yaml file, so they use that name in their helm values yaml, but we should use a dash so it is cert-manager. In short, that webhook is only a way for cert-manager to provide info on configuration errors in the cert-manager k8s resources, and it is called a webhook because it is registered with the k8s API server to verify resources via a webhook mechanism: "hello k8s api-server, whenever resources I care about are created or modified, please let me verify them first by asking me at this URL!" |
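Disabling the webhook through the parent chart would look roughly like this (a sketch; the key path assumes the dependency is aliased cert-manager, with the dash, in our requirements.yaml):

```bash
# Turn off the validating webhook via the parent chart's values
# (key path assumed from a "cert-manager" dependency alias).
helm upgrade staging ./mybinder --namespace staging \
  --set cert-manager.webhook.enabled=false
```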
Yeah, @betatim thinks there may be something left over from previous attempts to install cert-manager which is blocking the upgrade. I may try on a fresh GKE cluster. |
I wrote a long-ish comment in #1362 (comment). My attempts at getting this working are in #1368 and #1369, with reverts in #1370. |
I think we should switch to a mode where we propose a plan of action/commands to run, then think about it for a bit and let others comment, then do it. My proposal would be to take the changes in #1368 and #1369 (so as minimal as possible), install only the CRDs manually, and attempt a deploy from a local machine instead of Travis. This is to cover the possibility that running |
Hi all. Just ran into this trying to redeploy a fresh BinderHub without kube-lego (linked issue above). The current documentation works well for a fresh install (https://binderhub.readthedocs.io/en/latest/https.html), but it does add a lot of manual config and extra cert-manager kubernetes pieces running on a cluster! Per @yuvipanda's earlier comment:
So far the change to traefik on jupyterhub has been working really well for us! It would be fantastic if enabling https were as simple as:
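Something along the lines of the z2jh automatic-HTTPS settings, roughly (a sketch; hostname and contact email are placeholders):

```bash
# Minimal z2jh-style HTTPS config written to a values file
# (hostname and email are placeholders).
cat > config.yaml <<'EOF'
proxy:
  https:
    enabled: true
    hosts:
      - binder.example.org
    letsencrypt:
      contactEmail: admin@example.org
EOF
```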
Anyone willing to take a stab at unifying the default https config between z2jh and z2bhub? |
@scottyhq ooooh perhaps I can do that, starting next week? |
It is crazy that this is turning out to be so tricky. Investigating an option based on traefik as the ingress would be nice. For mybinder.org we'd have to make it flexible enough (or continue down the nginx + cert-manager road) to handle the other Ingress objects we have that aren't related to BinderHub (for example the federation proxy and such). |
cert-manager is up and running on prod and staging. Closing this now. |
Hi all,
I've been chatting with @consideRatio and @minrk at the workshop about enabling LetsEncrypt/HTTPS for the Turing private BinderHub. The common theme seems to be that kube-lego is deprecated and cert-manager is the new way forward. Min invited me to upgrade mybinder.org to cert-manager as a learning exercise and produce some documentation of the process. Erik's NeurIPS deployment uses cert-manager so I'll look into that too.
Which clusters have been migrated: