diff --git a/rfcs/XXXX-gating/README.md b/rfcs/XXXX-gating/README.md new file mode 100644 index 0000000000..fcd6c202fd --- /dev/null +++ b/rfcs/XXXX-gating/README.md @@ -0,0 +1,301 @@ +# RFC-XXXX Gating Flux reconciliation + +**Status:** provisional + +**Creation date:** 2022-09-28 + +**Last update:** 2022-10-04 + +## Summary + +Flux should offer a mechanism for cluster admins and other teams involved in the release process +to manually approve the rollout of changes onto clusters. In addition, Flux should offer +a way to define maintenance time windows and other time-based gates, to allow a better control +of applications and infrastructure changes to critical system. + +## Motivation + +Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and +automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases. +The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered +to production by reviewing and approving the proposed changes in a collaborative manner with pull request. +Once a pull request is merged onto a branch that defines the desired state of the production system, +Flux kicks off the reconciliation process. + +There are situations when users want to have a gating mechanism after the desired state changes are merged in Git: + +- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870) +- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959) +- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004) +- Planned releases +- No Deploy Friday + +### Goals + +- Offer a dedicated API for defining time-based gates in a declarative manner. +- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects. +- Extend the current Flux APIs and controllers to support gating. + +### Non-Goals + + + +## Proposal + +In order to support manual gating, Flux could be extended with a dedicated API and controller +that would allow users to define `Gate` objects and perform operations like `open` and `close`. + +A `Gate` object could be referenced in sources (Buckets, Git, Helm, OCI Repositories) +and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation) +to block the reconciliation until the gate is opened. + +A `Gate` can be opened or closed by annotating the object with a timestamp or by +calling a specific webhook receiver exposed by notification-controller. + +A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec. + +The `Gate` API would replace Flagger's current +[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating). + +### User Stories + +#### Story 1 + +> As a member of the SRE team, I want to allow deployments to happen only +> in a particular time frame of my own choosing. + +Define a gate that automatically closes after 1h from the time it has been opened: + +```yaml +apiVersion: gating.toolkit.fluxcd.io/v1alpha1 +kind: Gate +metadata: + name: sre-approval + namespace: flux-system +spec: + interval: 30s + default: closed + window: 1h +``` + +When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition: + +```yaml +apiVersion: gating.toolkit.fluxcd.io/v1alpha1 +kind: Gate +metadata: + name: sre-approval + namespace: flux-system +status: + conditions: + - lastTransitionTime: "2021-03-26T10:09:26Z" + message: "Gate closed by default" + reason: ReconciliationSucceeded + status: "False" + type: Opened +``` + +While the gate is closed, all the objects that reference it will wait for an approval: + +```yaml +apiVersion: kustomize.toolkit.fluxcd.io/v1beta1 +kind: Kustomization +metadata: + name: my-app + namespace: flux-system +spec: + gates: + - name: sre-approval + - name: qa-approval +status: + conditions: + - lastTransitionTime: "2021-03-26T10:09:26Z" + message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed." + reason: GateClosed + status: "False" + type: Approved +``` + +The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook: + +```sh +kubectl -n flux-system annotate --overwrite gate/sre-approval \ +open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")" +``` + +The `gating-controller` extracts the ISO8601 date from the `open.gate` annotation value, +sets the `requestedAt` & `resetToDefaultAt`, and opens the gate for the specified window: + +```yaml +apiVersion: gating.toolkit.fluxcd.io/v1alpha1 +kind: Gate +metadata: + name: sre-approval + namespace: flux-system +status: + requestedAt: "2021-03-26T10:00:00Z" + resetToDefaultAt: "2021-03-26T11:00:00Z" + conditions: + - lastTransitionTime: "2021-03-26T10:00:00Z" + message: "Gate scheduled for closing at 2021-03-26T11:00:00Z" + reason: ReconciliationSucceeded + status: "True" + type: Opened +``` + +While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval. + +The SRE can decide to close the gate ahead of its schedule with: + +```sh +kubectl -n flux-system annotate --overwrite gate/sre-approval \ +close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")" +``` + +The `gating-controller` extracts the ISO8601 date from the `close.gate` annotation value, +compares it with the `open.gate` & `requestedAt` date and closes the gate: + +```yaml +apiVersion: gating.toolkit.fluxcd.io/v1alpha1 +kind: Gate +metadata: + name: sre-approval + namespace: flux-system +status: + requestedAt: "2021-03-26T10:10:00Z" + resetToDefaultAt: "2021-03-26T10:10:00Z" + conditions: + - lastTransitionTime: "2021-03-26T10:10:00Z" + message: "Gate close requested" + reason: ReconciliationSucceeded + status: "False" + type: Opened +``` + +The objects that are referencing this gate, will finish their ongoing reconciliation (if any) then pause. + +> As a member of the SRE team, I want to block deployments in a particular time window. + +To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default: + +```yaml +apiVersion: gating.toolkit.fluxcd.io/v1alpha1 +kind: Gate +metadata: + name: maintenance + namespace: flux-system +spec: + interval: 30s + default: opened + window: 24h +``` + +To start the maintenance window you can annotate the gate with: + +```sh +kubectl -n flux-system annotate --overwrite gate/maintenance \ +close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")" +``` + +The `gating-controller` extracts the ISO8601 date from the `close.gate` +annotation value and closes the gate for the specified window: + +```yaml +apiVersion: gating.toolkit.fluxcd.io/v1alpha1 +kind: Gate +metadata: + name: maintenance + namespace: flux-system +status: + requestedAt: "2021-03-26T10:00:00Z" + resetToDefaultAt: "2021-03-27T10:00:00Z" + conditions: + - lastTransitionTime: "2021-03-26T10:00:00Z" + message: "Gate scheduled for opening at 2021-03-27T11:00:00Z" + reason: ReconciliationSucceeded + status: "False" + type: Opened +``` + +You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`. + +#### Story 2 + +> As a member of the SRE team, I want existing deployments to still be +> reconciled during a change freeze. + +Gates can be used to block Flux sources from being refreshed, resulting in Flux +to continue to reconcile existing approved desired states, whislt new changes +are held at a Flux source gate. + +Example: + +```yaml +apiVersion: kustomize.toolkit.fluxcd.io/v1beta1 +kind: GitRepository +metadata: + name: flux-system + namespace: flux-system +spec: + gates: + - name: change-freeze # gate that enforces a change freeze time window +status: + conditions: + - lastTransitionTime: "2022-05-26T01:12:22Z" + message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed." + reason: GateClosed + status: "True" + type: Blocked +``` + +This would ensure that Gate changes would not impact the eventual consistency of +mid-flight reconciliations that were already deployed in the cluster. Flux would also +continue to re-create Flux managed objects that were manually deleted from the cluster. + +### Alternatives + +#### Users to implement gating outside of Flux + +##### Before Flux source + +Users could implement their own gating mechanisms as part of their development processes +ensuring that their custom rules are applied before the changes reach their Flux sources +(i.e. the target Git repository). For example, if deployments are not allowed on Fridays, +no PRs would be merged on those days. + +The disadvantage is that some source types may not provide easy ways for users to enforce +such rules. When using different source types (e.g. Git, OCI, Helm), multiple implementations +may be required. + +##### CronJobs and Flux Suspend + +Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and using +the built-in suspend feature in Flux that allows for a Flux object to stop being reconciled +until it is resumed. This alternative does not scale well when considering hundreds of Flux +objects. + +## Design Details + + + +## Implementation History + +