Backup and Restore implementation #779

tylerpotts · 2021-08-19T18:05:10Z

Summary

QHub currently lacks a backup and restore solution. Initially this problem was simple, since all state was stored on a single NFS filestore. We talked about having a Kubernetes cron job run restic daily to back up the filesystem to a single S3 bucket. However, databases and other state are now being stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We are proposing Kubernetes backups using velero, which looks to be a well-adopted open source solution for backup and restore.

Proposed implementation

We realize this is a large issue, and it will most likely be easiest to approach it in steps.

The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs, the most similar one being https://github.com//pull/733. This step will only deploy the velero agent on the Kubernetes cluster. It should be configured via a setting in qhub-config.yaml; the PR above gives an example of adding such a setting. There will additionally be a key, credentials, that takes an arbitrary dict of credentials to pass on to the helm chart. See https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands. These credentials will be used to set up file backups and block storage backups. The schedule key will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.

velero:
  enabled: true/false
  schedule: "0 0 * * *"
  credentials:
     ...
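
As a rough illustration of how the section above might be read, here is a minimal Python sketch. The `VeleroConfig` class and `parse_velero_config` function are hypothetical names, not existing QHub code; the only things taken from the proposal are the three keys and the requirement that backups are disabled by default. The schedule check only counts cron fields and is not a full cron validator.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

DEFAULT_SCHEDULE = "0 0 * * *"  # cron expression from the proposal: daily at midnight


@dataclass
class VeleroConfig:
    """Hypothetical in-memory form of the proposed `velero` config section."""

    enabled: bool = False  # backups are opt-in, as the proposal requires
    schedule: str = DEFAULT_SCHEDULE
    credentials: Dict[str, Any] = field(default_factory=dict)


def parse_velero_config(qhub_config: Dict[str, Any]) -> VeleroConfig:
    """Read the `velero` section of an already-loaded qhub-config.yaml dict."""
    section = qhub_config.get("velero") or {}
    config = VeleroConfig(
        enabled=bool(section.get("enabled", False)),
        schedule=str(section.get("schedule", DEFAULT_SCHEDULE)),
        credentials=dict(section.get("credentials") or {}),
    )
    # Shallow sanity check: a standard cron expression has five fields.
    if len(config.schedule.split()) != 5:
        raise ValueError(f"expected a 5-field cron schedule, got {config.schedule!r}")
    return config
```

A config with no `velero` section then yields a disabled configuration, so existing deployments would be unaffected.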

Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually. Since velero is a Go binary, it should be possible to transparently download the velero binary https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1, similar to how we handle terraform https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23, and expose it in the CLI behind a qhub backup and a qhub restore command. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the QHub storage.
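
A minimal sketch of what such a velero provider could look like, assuming the release asset naming used on the velero GitHub releases page (`velero-<version>-<os>-<arch>.tar.gz`). The function names here are hypothetical and do not exist in QHub; the velero subcommands used (`backup create`, `restore create --from-backup`, and the `--wait` flag) are part of the velero CLI. Downloading and unpacking the archive is left out.

```python
import subprocess
from typing import List

VELERO_VERSION = "v1.6.1"  # version referenced in this issue


def velero_download_url(
    version: str = VELERO_VERSION, system: str = "linux", arch: str = "amd64"
) -> str:
    """Build the release asset URL for a given platform (assumed naming scheme)."""
    filename = f"velero-{version}-{system}-{arch}.tar.gz"
    return (
        "https://github.com/vmware-tanzu/velero/releases/download/"
        f"{version}/{filename}"
    )


def backup_command(name: str, velero_path: str = "velero") -> List[str]:
    """Command line that creates a backup and waits for it to finish."""
    return [velero_path, "backup", "create", name, "--wait"]


def restore_command(backup_name: str, velero_path: str = "velero") -> List[str]:
    """Command line that restores from an existing backup and waits."""
    return [velero_path, "restore", "create", "--from-backup", backup_name, "--wait"]


def run(command: List[str]) -> None:
    """Run a velero command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the provider easy to unit test without a cluster; `run(backup_command("manual-backup"))` would be the call behind a manual backup.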

Initially we would like a simple qhub backup and qhub restore command. Eventually we could imagine these commands growing to support more complicated backups, but we realize this problem is complicated enough as currently scoped.
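
One possible shape for the two CLI entry points, sketched with `argparse`. The `--config` and `--from-backup` options are assumptions made for illustration; the issue does not specify any flags, and the real QHub CLI may wire its subcommands differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `qhub` parser with backup and restore subcommands."""
    parser = argparse.ArgumentParser(prog="qhub")
    subparsers = parser.add_subparsers(dest="command", required=True)

    backup = subparsers.add_parser("backup", help="trigger a manual cluster backup")
    backup.add_argument("-c", "--config", default="qhub-config.yaml")

    restore = subparsers.add_parser("restore", help="restore cluster storage")
    restore.add_argument("-c", "--config", default="qhub-config.yaml")
    # Hypothetical option: name of the backup to restore from.
    restore.add_argument("--from-backup", required=True)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"would run {args.command} using {args.config}")
```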

Additionally, documentation should be added to the admin and dev guides.

Acceptance Criteria

- Upon initial deployment of a QHub cluster with the backups configuration setting enabled, the cluster should be backed up every 24h to an S3 bucket.
- qhub backup should trigger a manual backup of the cluster, with files being backed up to an S3 bucket.
- qhub restore should trigger a restore action that will refresh the contents of PVCs within the cluster (this is less well understood at the moment and may not be possible).

Tasks to complete

- Deploy velero helm chart #773 (work with @tarundmsharma to complete deployment of the helm chart using terraform)
- QHub backup cli command to trigger backup #777
- QHub restore cli command to trigger restore #778

Related to

For history, see PV backups / snapshots #99

Expectations

We see this as a critical part of the QHub story, since we are finding that many users want assurance that their data will not be lost, along with an easy option for completely tearing down their cluster and creating a new one. This will also be an important part of our support story for the open source and enterprise versions of QHub.

We will be looking at several things:
- quality of the PRs
- how we coordinate and communicate over issues/PRs

From @viniciusdc https://blog.kubernauts.io/backup-and-restore-of-kubernetes-applications-using-heptios-velero-with-restic-and-rook-ceph-as-2e8df15b1487

tylerpotts · 2021-08-26T15:16:57Z

duplicate of #743

closing out

tylerpotts added type: enhancement 💅🏼 New feature or request epic labels Aug 19, 2021

tylerpotts closed this as completed Aug 26, 2021
