---
title: automatic-cert-recovery-for-kube-apiserver-kube-controller-manager
authors:
- "@deads2k"
reviewers:
- "@tnozicka"
- "@sttts"
approvers:
- "@sttts"
creation-date: 2019-08-22
last-updated: 2019-10-11
status: implementable
see-also:
replaces:
superseded-by:
---

# Automatic Cert Recovery for kube-apiserver and kube-controller-manager

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Summary

Fully automate the recovery of kube-apiserver and kube-controller-manager certificates that is currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
Today, helper commands make the effort more practical, but we think we can fully automate the process to avoid human
error and intervention.


## Motivation

If the kube-apiserver and kube-controller-manager operators are offline for an extended period of time, the cluster
cannot automatically restart itself because its certificates have expired. This comes up in training clusters that are
suspended frequently. It also comes up for products like code-ready-containers, which create and suspend VMs for later
restart. It is possible to automate the recovery steps, which today are slow and error prone when performed by hand.

### Goals

1. Provide zero-touch, always-on certificate recovery for the kube-apiserver and kube-controller-manager

### Non-Goals

1. Provide automation for any other part of disaster recovery.
2. Provide mechanisms to keep certificates up to date for any other component (kubelet for instance).
3. Provide mechanisms to approve CSRs. That is still the domain of the cloud team.

## Proposal

We will take our existing `cluster-kube-apiserver-operator regenerate-certificates` command and turn it into a simple,
non-leader-elected controller that watches for expired certificates and regenerates them. It will connect to the
kube-apiserver over localhost, using an SNI name wired to a 10-year cert. When there is no work to do, this controller
will do nothing. It will run as another container in our existing static pods.
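
A minimal sketch of that localhost connection, assuming a client-go `rest.Config` and illustrative function and
parameter names (`newLocalhostRecoveryClient`, `caBundleFile`, `credentialFile`); the real wiring lives in the
operators and their static pod manifests:

```go
package certrecovery

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newLocalhostRecoveryClient is an illustrative helper: it builds a client that
// talks to this node's kube-apiserver directly and requests the long-lived
// "localhost-recovery" SNI serving cert, so the connection still validates even
// when the regular serving certs have expired.
func newLocalhostRecoveryClient(caBundleFile, credentialFile string) (*kubernetes.Clientset, error) {
	cfg := &rest.Config{
		// Bypass the load balancer and talk to the local kube-apiserver instance.
		Host: "https://localhost:6443",
		TLSClientConfig: rest.TLSClientConfig{
			// ServerName selects which SNI certificate the kube-apiserver presents;
			// localhost-recovery is backed by the long-lived serving cert/key pair.
			ServerName: "localhost-recovery",
			CAFile:     caBundleFile,
		},
		// Assumed credential; the actual authentication mechanism is an
		// implementation detail of the operators.
		BearerTokenFile: credentialFile,
	}
	return kubernetes.NewForConfig(cfg)
}
```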
The recovery flow will look like this:

1. kas-static-pod/kube-apiserver starts with expired certificates
2. kas-static-pod/cert-syncer connects to the localhost kube-apiserver using the long-lived SNI cert (localhost-recovery). It sees expired certs.
3. kas-static-pod/cert-regenerator connects to the localhost kube-apiserver using the long-lived SNI cert (localhost-recovery).
It sees expired certs and refreshes them as appropriate. Being in the same repo, it uses the same logic.
We will add an overall option to the library-go cert rotation to say, "only refresh on expired" (sketched after this list)
so that it never collides with the operator during normal operation. The library-go cert rotation impl is already
resilient to multiple actors.
4. kas-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
5. kas-static-pod/kube-apiserver starts serving with new certs. (this already works)
6. kcm-static-pod/kube-controller-manager starts with expired certificates
7. kcm-static-pod/cert-syncer connects to the localhost kube-apiserver using the long-lived SNI cert (localhost-recovery). It sees expired certs.
8. kcm-static-pod/cert-regenerator connects to the localhost kube-apiserver using the long-lived SNI cert (localhost-recovery).
It sees expired certs and refreshes them as appropriate. Being in the same repo, it uses the same logic and the same
"only refresh on expired" option described in step 3, so it never collides with the operator during normal operation.
9. kcm-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
10. kcm-static-pod/kube-controller-manager wires up a library-go/pkg/controller/fileobserver to the CSR signer and exits on update so it restarts with the new signer
11. At this point, kas and kcm are both up and running with valid serving certs and valid CSR signers.
12. Kubelets will start creating CSRs, but the machine approver is down.
**A cluster-admin must manually approve the client CSRs for the master kubelets** (for example, with `oc adm certificate approve`).
13. Master kubelets are able to communicate with the kas and get listings of pods:
1. kcm creates pods for operators including sdn-o and sdn,
2. kube-scheduler sees a valid kas serving cert and schedules those pods to masters,
3. master kubelets run the sdn, sdn-o, kas-o, kcm-o, ks-o and the system re-bootstraps.
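
A minimal sketch of the "only refresh on expired" gate from steps 3 and 8, assuming a hypothetical helper
(`needsRecovery`) rather than the actual library-go cert rotation API:

```go
package certrecovery

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"time"
)

// needsRecovery returns true only when the PEM-encoded certificate is already
// expired (or unusable), which is the only case the recovery controller acts on;
// anything still valid is left to the operator's normal rotation.
func needsRecovery(certPEM []byte, now time.Time) (bool, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return false, fmt.Errorf("no PEM data found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		// An unparseable cert cannot be served; treat it as needing recovery.
		return true, nil
	}
	// Normal rotation refreshes well before NotAfter; the recovery path only
	// fires once the cert has actually expired.
	return now.After(cert.NotAfter), nil
}
```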

### Implementation Details/Notes/Constraints

This requires the following significant pieces:

- [ ] kcm fileobserver (sketched below)
- [ ] kcm-o to rewire configuration to auto-refresh CSR signer
- [ ] kcm-o to provide a cert regeneration controller
- [ ] kas-o to provide a cert regeneration controller
- [ ] kas-o to create and wire a long-lived serving cert/key pair for localhost-recovery
- [ ] library-go cert rotation library to support an override for only rotating when certs are expired
- [ ] remove old manual recovery commands
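
A minimal sketch of the kcm fileobserver piece, assuming a simple polling watcher instead of the actual
library-go/pkg/controller/fileobserver API: when the CSR signer file changes on disk, the kube-controller-manager
process exits so the static pod restarts it with the new signer loaded.

```go
package certrecovery

import (
	"bytes"
	"log"
	"os"
	"time"
)

// exitOnChange polls signerPath and terminates the process when its content
// changes; the kubelet then restarts the static-pod container, which picks up
// the refreshed CSR signer.
func exitOnChange(signerPath string, interval time.Duration) {
	initial, err := os.ReadFile(signerPath)
	if err != nil {
		log.Fatalf("reading CSR signer %q: %v", signerPath, err)
	}
	for range time.Tick(interval) {
		current, err := os.ReadFile(signerPath)
		if err != nil {
			continue // transient read errors are ignored; the next tick retries
		}
		if !bytes.Equal(initial, current) {
			log.Printf("CSR signer %q changed on disk, exiting for restart", signerPath)
			os.Exit(0)
		}
	}
}
```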

### Risks and Mitigations

1. If we wire the communication unsafely we can get a CVE.
2. If we don't delay past "normal" rotation, the kas-o logs will be hard to interpret.
3. If something goes wrong, manual recovery may be harder.

## Design Details

### Test Plan

Disaster recovery tests are still outstanding, tracked by an epic that may not be approved. The lack of testing here
doesn't introduce additional risk beyond what has already been accepted.

This will be tested as part of the normal disaster recovery tests. It is built on libraries that are already unit
tested, and it affects destination files that are already checked with unit and e2e tests.

### Graduation Criteria

This will start as GA.

### Upgrade / Downgrade Strategy

Because the new containers are attached to the existing static pods, upgrades and downgrades will produce matching
containers, so our producers and consumers are guaranteed to match.

### Version Skew Strategy

Because each deployment in the payload is atomic, it will not skew. There are no external changes.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Alternatives

This recovery can instead be performed through the existing laborious and error-prone manual process, which three teams have already had trouble with.

