Summary
QHub currently lacks a backup and restore solution. Initially this was not an especially complex problem, since all state was stored on a single NFS filestore, and we talked about having a Kubernetes cron job run restic daily to back up the filesystem to a single S3 bucket. However, databases and other state are now being stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We propose Kubernetes backups using Velero, which looks to be a well-adopted open source solution for backup and restore.
Proposed implementation
We realize this is a large issue, and it will most likely be easiest to approach it in steps.
The first step would be to deploy the Velero Helm chart within QHub. There are other examples of deploying a Helm chart within QHub in PRs, the most similar being https://github.com//pull/733. This will only deploy the Velero agent on the Kubernetes cluster. It should be configured via a `qhub-config.yaml` configuration setting; the PR above gives an example of adding such a setting. There will additionally be a key `credentials` that takes an arbitrary dict of credentials to pass on to the Helm chart. See https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands. These credentials will be used to set up file backups and block storage backups. `schedule` will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.
Next, once Velero is deployed on the cluster, there should be the ability to trigger a backup manually. This can be handled similarly to how we handle Terraform (https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23): since Velero is a Go binary, it should be possible to transparently download the Velero binary (https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1) and expose it in the CLI behind `qhub backup` and `qhub restore` commands. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the QHub storage.
Initially we would like simple `qhub backup` and `qhub restore` commands. Eventually we could imagine these commands growing to support more complicated backups, but we realize this problem is complicated enough as it is scoped.
Additionally, documentation should be added to the admin and dev guides.
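
The velero provider described above might look roughly like the following minimal Python sketch. It is not existing QHub code: the `backups` key and its `enabled` field, the `BackupSettings` class, and all function names are hypothetical, and the release asset naming pattern is assumed from the Velero releases page. Only the `velero backup create` and `velero restore create --from-backup` subcommands come from the Velero CLI itself.

```python
import platform
import subprocess
from dataclasses import dataclass, field

VELERO_VERSION = "v1.6.1"  # the release linked in this issue


@dataclass
class BackupSettings:
    """Hypothetical `backups` section of qhub-config.yaml (key names are assumptions)."""

    enabled: bool = False  # backups are optional and disabled by default
    schedule: str = "0 0 * * *"  # cron expression for one backup every 24h
    credentials: dict = field(default_factory=dict)  # passed through to the Helm chart

    @classmethod
    def from_config(cls, config: dict) -> "BackupSettings":
        # A missing or empty section falls back to the defaults above.
        section = config.get("backups") or {}
        return cls(
            enabled=bool(section.get("enabled", False)),
            schedule=section.get("schedule", "0 0 * * *"),
            credentials=dict(section.get("credentials", {})),
        )


def velero_download_url(version: str = VELERO_VERSION, system: str = "", arch: str = "") -> str:
    """Build the release tarball URL (asset naming pattern is an assumption)."""
    system = system or platform.system().lower()  # e.g. "linux", "darwin"
    machine = arch or platform.machine().lower()
    # Map common machine names to the architecture names used by Go releases.
    arch_names = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}
    arch = arch_names.get(machine, machine)
    return (
        "https://github.com/vmware-tanzu/velero/releases/download/"
        f"{version}/velero-{version}-{system}-{arch}.tar.gz"
    )


def backup_command(name: str, velero_path: str = "velero") -> list:
    """Command line for a manual backup; --wait blocks until it finishes."""
    return [velero_path, "backup", "create", name, "--wait"]


def restore_command(backup_name: str, velero_path: str = "velero") -> list:
    """Command line for restoring the cluster from a named backup."""
    return [velero_path, "restore", "create", "--from-backup", backup_name, "--wait"]


def run(command: list) -> None:
    """Run a velero command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

With this shape, `qhub backup` could reduce to `run(backup_command(name))` once the binary has been downloaded, while the regular scheduled backups would presumably be left to the Helm chart rather than the CLI.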
Acceptance Criteria
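- Upon initial deployment of a QHub cluster with the backups configuration setting enabled, the cluster should be backed up every 24h to an S3 bucket
- `qhub backup` should trigger a manual backup of the cluster, with files being backed up to an S3 bucket
- `qhub restore` should trigger a restore action that will refresh the contents of the PVCs within the cluster (this is less well understood at the moment and may not be possible)
Tasks to complete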
Related to
Expectations
We see this as a critical part of the QHub story, since we are finding that many users want assurance that their data will not be lost, along with an easy option for completely tearing down their cluster and creating a new one. It will also be an important part of our support story for the open source and enterprise versions of QHub.
We will be looking at several things:
- quality of the PRs
- how we coordinate and communicate over issues/PRs
From @viniciusdc https://blog.kubernauts.io/backup-and-restore-of-kubernetes-applications-using-heptios-velero-with-restic-and-rook-ceph-as-2e8df15b1487