Backup and Restore implementation #779

Closed
3 tasks
tylerpotts opened this issue Aug 19, 2021 · 1 comment
Labels
type: enhancement 💅🏼 New feature or request

Comments

@tylerpotts
Contributor

Summary

QHub currently lacks a backup and restore solution. Initially this was a simple problem, since all state was stored on a single NFS filestore; we talked about having a Kubernetes cron job run restic daily to back up the filesystem to a single S3 bucket. However, databases and other state are now starting to be stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We are proposing Kubernetes backups using velero, which looks to be a well-adopted open source solution for backup and restore.

Proposed implementation

We realize this is a large issue, and it will most likely be easiest to approach it in steps.

The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs; the most similar one is https://github.com//pull/733. This will only deploy the velero agent on the Kubernetes cluster. It should be configured via a qhub-config.yaml configuration setting; the PR above gives an example of adding such a setting. There will additionally be a credentials key that takes an arbitrary dict of credentials to pass on to the helm chart (see https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands). These credentials will be used to set up file backups and block storage backups. The schedule key will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.

velero:
  enabled: false  # set to true to turn backups on; disabled by default
  schedule: "0 0 * * *"
  credentials:
     ...
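
As a rough illustration of how the setting above might be read on the QHub side, here is a minimal sketch. It assumes the configuration has already been loaded into a plain dict; the names VeleroConfig and parse_velero_config are hypothetical and not part of QHub, and the defaults simply mirror the "disabled by default" and daily-schedule choices described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class VeleroConfig:
    """Hypothetical in-memory form of the velero section of qhub-config.yaml."""

    enabled: bool = False        # backups are opt-in
    schedule: str = "0 0 * * *"  # cron expression: every day at midnight
    credentials: Dict[str, Any] = field(default_factory=dict)


def parse_velero_config(qhub_config: Dict[str, Any]) -> VeleroConfig:
    """Read the optional velero section and fill in defaults."""
    section = qhub_config.get("velero") or {}

    # Reject misspelled keys early instead of silently ignoring them.
    unknown = set(section) - {"enabled", "schedule", "credentials"}
    if unknown:
        raise ValueError(f"unknown velero settings: {sorted(unknown)}")

    config = VeleroConfig(
        enabled=bool(section.get("enabled", False)),
        schedule=str(section.get("schedule", "0 0 * * *")),
        credentials=dict(section.get("credentials") or {}),
    )

    # A standard cron expression has five whitespace-separated fields.
    if len(config.schedule.split()) != 5:
        raise ValueError(
            f"schedule is not a 5-field cron expression: {config.schedule!r}"
        )
    return config
```

The credentials dict is passed through untouched here, since the issue leaves its contents open; how it maps onto the helm chart's values would need to be checked against the chart README linked above.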

Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually. Since velero is a Go binary, it should be possible to transparently download the velero binary (https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1), similar to how we handle terraform (https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23), and expose it in the CLI behind qhub backup and qhub restore commands. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the QHub storage.
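
A velero provider along these lines could start out as a thin wrapper that builds the download URL and the CLI invocations. The sketch below is only an outline under stated assumptions: the function names are hypothetical, the release asset naming pattern (velero-<version>-<os>-<arch>.tar.gz) is assumed rather than confirmed by this issue, and downloading and unpacking the tarball is left out. It relies on the velero CLI's backup create and restore create --from-backup subcommands.

```python
import platform
import subprocess
from typing import List

VELERO_VERSION = "v1.6.1"  # the release linked in this issue


def velero_download_url(version: str = VELERO_VERSION) -> str:
    """Build a release tarball URL for the current OS and CPU architecture.

    Assumption: assets are named velero-<version>-<os>-<arch>.tar.gz.
    """
    system = platform.system().lower()  # e.g. "linux", "darwin"
    machine = platform.machine().lower()
    # Map common machine names onto the Go-style architecture names.
    arch = {
        "x86_64": "amd64",
        "amd64": "amd64",
        "aarch64": "arm64",
        "arm64": "arm64",
    }.get(machine, machine)
    return (
        "https://github.com/vmware-tanzu/velero/releases/download/"
        f"{version}/velero-{version}-{system}-{arch}.tar.gz"
    )


def backup_command(name: str, velero: str = "velero") -> List[str]:
    """Command line that creates a backup and waits for it to finish."""
    return [velero, "backup", "create", name, "--wait"]


def restore_command(backup_name: str, velero: str = "velero") -> List[str]:
    """Command line that restores from an existing backup and waits."""
    return [velero, "restore", "create", "--from-backup", backup_name, "--wait"]


def run(command: List[str]) -> None:
    """Run a velero command, raising CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the provider easy to unit test without a cluster; run(backup_command("manual-backup")) would then be the only part that needs a working velero binary and kubeconfig.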

Initially we would like simple qhub backup and qhub restore commands. Eventually we could imagine these commands growing to support more complicated backups, but we realize this problem is complicated enough as currently scoped.
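
To make the intended command surface concrete, here is a small argparse sketch of the backup and restore subcommands named in the acceptance criteria. It is not how the qhub CLI is actually wired; the -c/--config and --from-backup options are hypothetical and only illustrate one possible shape.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in qhub parser with backup and restore subcommands."""
    parser = argparse.ArgumentParser(prog="qhub")
    subparsers = parser.add_subparsers(dest="command", required=True)

    backup = subparsers.add_parser("backup", help="trigger a manual backup")
    backup.add_argument("-c", "--config", default="qhub-config.yaml")

    restore = subparsers.add_parser("restore", help="restore storage from a backup")
    restore.add_argument("-c", "--config", default="qhub-config.yaml")
    # Hypothetical option: which existing backup to restore from.
    restore.add_argument("--from-backup", required=True)
    return parser
```

For example, build_parser().parse_args(["restore", "--from-backup", "nightly"]) yields a namespace with command set to "restore" and from_backup set to "nightly", which a handler could pass on to the velero provider.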

Additionally, documentation should be added to the admin and dev guides.

Acceptance Criteria

  • upon initial deployment of a QHub cluster with backups enabled in the configuration, the cluster should be backed up every 24h to an S3 bucket
  • qhub backup should trigger a manual backup of the cluster, with files being backed up to an S3 bucket
  • qhub restore should trigger a restore action that will refresh the contents of the PVCs within the cluster (this is less well understood at the moment and may not be possible).

Tasks to complete

Related to

Expectations

We see this as a critical part of the QHub story, since we are finding that many users want assurance that their data is not lost, along with an easy option for completely tearing down their cluster and creating a new one. This will also be an important part of our support story for the open source and enterprise versions of QHub.

We will be looking at several things:
- quality of the PRs
- how we coordinate and communicate over issues/PRs

From @viniciusdc https://blog.kubernauts.io/backup-and-restore-of-kubernetes-applications-using-heptios-velero-with-restic-and-rook-ceph-as-2e8df15b1487

@tylerpotts
Contributor Author

duplicate of #743

closing out
