-
Notifications
You must be signed in to change notification settings - Fork 479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: ETCD-295: Automated Backups of Etcd #1370
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,284 @@ | ||
--- | ||
title: automated-backups-of-etcd | ||
authors: | ||
- "@hasbro17" | ||
reviewers: | ||
- "@deads2k" | ||
- "@dusk125" | ||
- "@Elbehery" | ||
- "@tjungblu" | ||
- "@williamcaban" | ||
approvers: | ||
- "@deads2k" | ||
- "@williamcaban" | ||
api-approvers: | ||
- "@deads2k" | ||
creation-date: 2023-03-17 | ||
last-updated: 2023-03-17 | ||
tracking-link: | ||
- https://issues.redhat.com/browse/ETCD-81 | ||
see-also: | ||
- "https://issues.redhat.com/browse/OCPBU-252" | ||
- "https://issues.redhat.com/browse/OCPBU-160" | ||
--- | ||
|
||
|
||
# Automated Backups of etcd | ||
|
||
## Summary | ||
|
||
Enable the automated backups of etcd snapshots and other metadata necessary to restore an openshift cluster from a quorum loss scenario. | ||
|
||
## Motivation | ||
|
||
The [current documented procedure](https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html) for performing a backup of an OpenShift | ||
cluster is manually initiated. This reduces the likelihood of a timely backup being available during a disaster recovery scenario. | ||
|
||
The procedure also requires gaining a root shell on a control plane | ||
node. Shell access to OpenShift control plane nodes access is generally | ||
discouraged due to the potential for affecting the reliability of the | ||
node. | ||
|
||
### Goals | ||
|
||
- One-time cluster backup can be triggered without requiring a root shell | ||
- Scheduled backups can be configured after cluster installation | ||
- Backups are saved locally on the host filesystem of control-plane nodes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How does this work for hosted control planes?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expect this enhancement to cover only self-hosted. Hosted control planes (hypershift), took separate ownership of etcd by developing their own operator. (see https://github.com/openshift/hypershift/blob/a0377ae793b20fb565ac8852ee4db1e5bb4780da/control-plane-operator/controllers/hostedcontrolplane/etcd/reconcile.go#L1). Hypershift will have to developer their own distinct backup and restore mechanism. The problem-space for a self-hosted cluster without external management is very different than a hosted cluster with external management. Both in terms of observability of the situation and in terms of cluster state reconciliation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "Separate ownership" today, doesn't mean it needs to be separate forever, does it? It seems like it would be possible to move towards a unified approach. For example, in a hosted control plane world, it's not clear to me which persona(s) would have which access to backups. Possibly the Cluster Service Consumer, if we lump backup restoration in as similar to "drive upgrades". Or possibly the Cluster Service Provider, if we lump backup restoration in under "Has cluster admin management". But either way, I'd expect etcd backup/restore tooling to live on the management cluster and not leak out to the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. interesting take, I would rather see more separation instead of more unification. For example, we rebased etcd earlier this year and discovered it broke a customer cluster. checkout etcd-io/etcd#15062 We didn't even know that Hypershift was using openshift/etcd as their etcd. They are certainly not using CEO, which makes reusing the backup CRD moot as they would have to implement their own business logic for backups anyway. CEO is also currently a singleton in the cluster (don't think we want to change that either), so namespacing that CRD would make little sense in OCP. |
||
- This feature is validated with an e2e restore test that ensures the backups saved can be used to recover the cluster from a quorum loss scenario | ||
|
||
|
||
### Non-Goals | ||
- Have automated backups enabled by default with cluster installation | ||
- Save cluster backups to PersistentVolumes or cloud storage | ||
- This would be a future enhancement or extension to the API | ||
Comment on lines
+52
to
+53
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This may be required for adoption in managed openshift and hosted control planes to mitigate possible risks. For automated backups to be useful for managed it should be usable by the end customer. But the end customer is not managing the control plane and do not have control on the configuration of the host volume, so there is a risk we may not be able to accept depending on how much impact those backups have. |
||
- Automate cluster restoration | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this expected as a future enhancement? Not stated here as in the prior bullet. |
||
|
||
|
||
### User Stories | ||
|
||
- As a cluster administrator I want to initiate a cluster backup without requiring a root shell on a control plane node so as to minimize the risk involved | ||
- As a cluster administrator I want to schedule recurring cluster backups so that I have a recent cluster state to recover from in the event of quorum loss (i.e. losing a majority of control-plane nodes) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How many backups are retained when there is this schedule? |
||
- As a cluster administrator I want to have failure to take cluster backups for more than a configurable period to be reported to me via critical alerts | ||
|
||
|
||
## Proposal | ||
|
||
### Workflow Description | ||
|
||
To enable automated backups a new singleton CRD can be used to specify the backup schedule and other configuration as well as monitor the status of the backup operations. | ||
|
||
A new controller in the cluster-etcd-operator would then reconcile this CRD to save backups locally per the specified configuration. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. super nitpick, I think we should make the existing edit: that's also covered below - thanks! |
||
|
||
### API Extensions | ||
|
||
A new CRD `EtcdBackup` will be introduced to the API group `config.openshift.io` with the version `v1alpha1`. The CRD would be feature-gated until the prerequisite e2e test has an acceptable pass rate. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. glad to see alpha here. |
||
|
||
The spec and status structures will be as listed below: | ||
|
||
|
||
```Go | ||
// EtcdBackupSpec represents the configuration of the EtcdBackup CRD. | ||
type EtcdBackupSpec struct { | ||
|
||
// Reason defines the reason for the most recent backup. | ||
// Setting this to a value different from the most recent reason will trigger a one-time backup | ||
// The default "" means no backup is triggered | ||
// +kubebuilder:default:="" | ||
// +optional | ||
Reason string `json:"reason"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In our existing operators, we have a |
||
|
||
// Schedule defines the recurring backup schedule in Cron format | ||
// every 2 hours: 0 */2 * * * | ||
// every day at 3am: 0 3 * * * | ||
// Setting to an empty string "" means disabling scheduled backups | ||
// Default: "" | ||
// TODO: Define how we do validation for the format (e.g validation webhooks) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. try CEL first |
||
// and the limits on the frequency to disallow unrealistic schedules (e.g */2 * * * * every 2 mins) | ||
// TODO: What is the behavior if the backup doesn't complete in the specified interval | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. having an object like Insights collection helps here too, since it gives an "obvious" item to delete. Versus finding a specific instance of a job in a managed namespace. |
||
// e.g: every 1hr but the backup takes 2hrs to complete | ||
// Wait for the current one to complete? | ||
// And do we generate an alert even if MaxDurationToCompleteBackup is not set, since the scheduled interval has passed? | ||
Schedule string `json:"schedule"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. most of the fields you can set reasonable, but timezeone is probably required. |
||
|
||
// MaxDurationToCompleteBackup sets the max duration after which if | ||
// an initiated backup has not successfully completed a critical alert will be generated. | ||
// The format is as accepted by time.ParseDuration(), e.g "1h10m" | ||
// See https://pkg.go.dev/time#ParseDuration | ||
// Setting to "0" disables the alert. | ||
// Default: "1h" | ||
// TODO: Validation on min and max durations. | ||
// The max should be smaller than the scheduled interval | ||
// for the next backup else the alert will never be generated | ||
MaxDurationToCompleteBackup string `json: "maxDurationWithoutBackup"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
do you think that's easy to get out of the cron syntax? I would maybe leave this out for now and just degrade the operator like we're currently doing in UpgradeBackupController. Taking a snapshot shouldn't take longer than a minute. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have no preference for a timeout as long as the operator reports degraded. |
||
|
||
// RetentionCount defines the maximum number of backups to retain. | ||
// If the number of successful backups matches retentionCount | ||
// the oldest backup will be removed before a new backup is initiated. | ||
// The count here is for the total number of backups | ||
// TODO: Validation on min and max counts. | ||
// TODO: Retention could also be based on the total size, or time period | ||
// e.g retain 500mb of backups, or discard backups older than a month | ||
// If we want to extend this API in the future we should make retention | ||
// a type so we can add more optional fields later | ||
RetentionCount int `json: "retentionCount"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if we already have good ideas on other types of retention, maybe add a struct for this? I know it's TP MVP, but let's not introduce something we have to completely change in v1alpha2 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agreed, you want to get close on your first attempt and this looks ripe for discriminated union. |
||
|
||
} | ||
``` | ||
|
||
**TODO:** How do we feature gate the entire type? | ||
For a field within a non-gated type it seems to be the tag [`// +openshift:enable:FeatureSets=TechPreviewNoUpgrade`](https://github.com/openshift/api/blob/master/example/v1/types_stable.go#L33). No such tag visible on the [whole type gated example](https://github.com/openshift/api/blob/1957a8d7445bf2332f027f93a24d7573f77a0dc0/example/v1alpha1/types_notstable.go#L19). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. example of a gate on an entire type here: https://github.com/openshift/api/blob/master/example/v1alpha1/0000_50_notstabletype-techpreview.crd.yaml |
||
|
||
|
||
```Go | ||
// EtcdBackupStatus represents the status of the EtcdBackup CRD. | ||
type EtcdBackupStatus struct { | ||
|
||
// Conditions represents the observations of the EtcdBackup's current state. | ||
// TODO: Identify different condition types/reasons that will be needed. | ||
// +optional | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in the actual PR, you'll need the SSA tags. |
||
Conditions []metav1.Condition `json:"conditions,omitempty"` | ||
deads2k marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
// ObservedGeneration is the most recent generation observed for this | ||
// EtcdBackup. It corresponds to the EtcdBackup's .metadata.generation that the condition was set based upon. | ||
// This lets us know whether the latest condition is stale. | ||
// +optional | ||
ObservedGeneration int64 `json:"observedGeneration,omitempty"` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This hasn't been done in our config API before. I suggest not starting now. |
||
|
||
// TODO: How do we track the state of the current/last backup? | ||
// Would condition states be enough? e.g Created/In-progress/Complete/Failed/Unknown | ||
// Track metadata on last successful backup? | ||
// node name, path, size, timestamp | ||
// With backup retention do we also track the last backup removed? | ||
} | ||
``` | ||
|
||
|
||
### Implementation Details/Notes/Constraints [optional] | ||
|
||
#### Backup controller | ||
|
||
A backup controller in the cluster-etcd-operator will reconcile the CRD to execute the backup schedule and report the status of ongoing backups and failures to the `EtcdBackup` status. | ||
|
||
There are different options to explore on how we want to execute saving the backup snapshot and any other required metadata. As well as how we enforce the schedule. | ||
|
||
#### Execution method | ||
|
||
Create a backup pod that runs the backup script on a designated master node to save the backup file on the node's host filesystem: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the job construct seems like a good choice here. I don't object to a pod, but are you sure you don't want a job instead? |
||
- The CEO already has an existing [upgradebackupcontroller][upgradebackupcontroller] that deploys a [backup pod][backup pod] that runs the [backup script][backup script]. | ||
- It may be simpler to reuse that but we may need to modify the backup script to save or append additional metadata. | ||
- Making changes to a bash script without unit tests would make it harder to maintain. | ||
|
||
As an alternative we could implement an equivalent Go cmd for the backup pod to execute a variant of the backup script. | ||
- The etcd Go client provides a [`snapshot save()`](https://pkg.go.dev/go.etcd.io/etcd/client/v3@v3.5.7/snapshot) util that can be used for this method. | ||
- We would be maintaining two potentially different backup procedures this way. The backup script won't be replaced by the `EtcdBackup` API since it can still be used in a quorum-loss scenario where the API server is down. | ||
|
||
Since it is an implementation detail with no API impact we can start with utilizing the backup script for simplicity and codify it later. | ||
|
||
#### Backup schedule | ||
|
||
To enforce the specified schedule we can utilize [CronJobs](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/) to run backup pods. | ||
|
||
**TODO:** If load balancing between control-plane nodes is a requirement then it needs to be determined how we can achieve that with the Job or Pod spec. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. actually, that's an interesting issue we wouldn't have with a single mounted PVC. |
||
- The [`nodeSelector`](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) field would let us choose the group of control-plane/master nodes but not round-robin between them. | ||
- The [`nodeName`](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename) field would let us choose the node | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. since the operator creates a pod, I encourage directly assigning nodeNames. |
||
- More brittle as node names can change and would require writing custom scheduling logic to load balance across nodes | ||
|
||
See the Open Questions section for load balancing concerns. | ||
|
||
#### Saving the backup | ||
|
||
The backup pod needs to be able to save the backup data onto the host filesystem. To achieve this the backup pod can be mounted with a [hostPath Volume](https://kubernetes.io/docs/concepts/storage/volumes/#hostpath). | ||
|
||
**TODO:** Does anything prevent us from doing this e.g authorization or security issues. | ||
|
||
#### Alerting | ||
|
||
TBD | ||
|
||
### Risks and Mitigations | ||
|
||
TBD | ||
|
||
## Design Details | ||
|
||
### Open Questions | ||
|
||
### Load balancing and IO impact | ||
- What would be the IO impact or additional load caused by a recurring execution of the backup pod and taking the snapshot? | ||
- If significant, should we avoid selecting the leader to reduce the risk of overloading the leader and triggering a leader election? | ||
- We can find the leader via the [endpoint status](https://etcd.io/docs/v3.5/dev-guide/api_reference_v3/#message-statusresponse-apietcdserverpbrpcproto) but the leader can change during the course of the backup or we may not have the option of filtering the leader out if the backup pod is unschedulable on non-leader nodes (e.g low disk space). | ||
- Other than the leader IO considerations, when picking where to run and save the backup, do we need to load balance or round-robin across all control-plane nodes? | ||
- Leaving it to the scheduler to schedule the backup pod (or having the operator save on the node it's running on) is simpler, but we could end up with all/most backups being saved on a single node. In a disaster recovery scenario we could lose that node and the backups saved. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a force assignment policy of some kind is a good choice. |
||
- Could be a future improvement if we start with the simpler choice of leaving it to the scheduler. | ||
|
||
### Test Plan | ||
|
||
Prior to merging the backup controller implementation, an e2e test would be added to openshift/origin to test the validity of the backups generated from the automated backups feature. The pass rate on this test will feature gate the API until we pass an acceptable threshold to make it available by default. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gating by techpreview means you can merge your impl at will. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. promoting to "accessible-by-default" will be allowed once there is an acceptable history of passing restore |
||
|
||
This test would effectively run through the backup and restore procedure as follows: | ||
- Start with a cluster that has the `EtcdBackups` API enabled | ||
- Save a backup of the cluster by triggering a one time backup using the `EtcdBackups` API, or create a schedule to do that | ||
- Modify the cluster state post backup (e.g create pods, endpoints etc) so we can later ensure that restoring from backup does not include this additional state | ||
- Induce a disaster recovery scenario of 2/3 nodes lost with etcd quorum loss | ||
- Step through the recovery procedure from the saved backup | ||
- Ensure that the control-plane has recovered and is stable | ||
- Validate that we have restored the cluster to the previous state so that it does not have the post-backup changes | ||
|
||
See the [restore test design doc](https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing) for a more detailed breakdown on the restore test and the validation requirements. | ||
|
||
Along with the e2e restore test, comprehensive unit testing of the backup controller will also be added to ensure the correct reconciliation and updates of the: | ||
- Backup schedule | ||
- Backup retention | ||
- Backup status | ||
- Alerting rules | ||
|
||
### Graduation Criteria | ||
|
||
**TODO:** Need to clarify how feature gating relates to dev preview -> tech preview -> GA. | ||
|
||
#### Dev Preview -> Tech Preview | ||
|
||
- TBD | ||
|
||
#### Tech Preview -> GA | ||
|
||
- TBD | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is known. Track record of reliable restore. I propose 99% on all platforms as a starting point since 99% of our tests currently pass at this threshold and consequences of failed restore is extremely high. |
||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
- TBD | ||
|
||
### Version Skew Strategy | ||
|
||
TBD | ||
|
||
### Operational Aspects of API Extensions | ||
|
||
TBD | ||
|
||
**TODO:** Mention where the conditions for the `EtcdBackups` controller being degraded would be set. Most likely on the `etcd.operator.openshift.io` `Etcd/cluster` CRD where all the other CEO controllers set their degraded status. | ||
|
||
#### Failure Modes | ||
|
||
TBD | ||
|
||
#### Support Procedures | ||
|
||
TBD | ||
|
||
## Implementation History | ||
|
||
TBD | ||
|
||
## Alternatives | ||
|
||
TBD | ||
|
||
## Infrastructure Needed | ||
|
||
The new API will be introduced in the `openshift/api` repo and the controller will be added to the existing cluster-etcd-operator in the `openshift/cluster-etcd-operator` repo. | ||
|
||
|
||
[upgradebackupcontroller]: https://github.com/openshift/cluster-etcd-operator/blob/0584b0d1c8868535baf889d8c199f605aef4a3ae/pkg/operator/upgradebackupcontroller/upgradebackupcontroller.go#L284-L298 | ||
[backup pod]: https://github.com/openshift/cluster-etcd-operator/blob/0584b0d1c8868535baf889d8c199f605aef4a3ae/bindata/etcd/cluster-backup-pod.yaml#L31-L38 | ||
[backup script]: https://github.com/openshift/cluster-etcd-operator/blob/0584b0d1c8868535baf889d8c199f605aef4a3ae/bindata/etcd/cluster-backup.sh#L121-L129 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beyond discouraged, it is not possible for hosted control planes in a managed environment where hosted cluster admin does not administer the env in which the hosted control plane (and etcd) is running.