[RFC] Gating Flux reconciliation #3158

Draft: wants to merge 2 commits into `main`
301 changes: 301 additions & 0 deletions rfcs/XXXX-gating/README.md
@@ -0,0 +1,301 @@
# RFC-XXXX Gating Flux reconciliation

**Status:** provisional

**Creation date:** 2022-09-28

**Last update:** 2022-10-04

## Summary

Flux should offer a mechanism for cluster admins and other teams involved in the release process
to manually approve the rollout of changes onto clusters. In addition, Flux should offer
a way to define maintenance time windows and other time-based gates, to allow better control
of application and infrastructure changes to critical systems.

## Motivation

Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and
automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases.
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered
to production by reviewing and approving the proposed changes collaboratively through pull requests.
Once a pull request is merged onto a branch that defines the desired state of the production system,
Flux kicks off the reconciliation process.

There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:

- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
- Planned releases
- No Deploy Friday

### Goals

- Offer a dedicated API for defining time-based gates in a declarative manner.

Review comment:

I would suggest that the gates should not be exclusively time based, but also support specific revisions

Review comment (Member):

Suggested change
- Offer a dedicated API for defining time-based gates in a declarative manner.
- Offer a dedicated API for defining gates in a declarative manner.

- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects.
- Extend the current Flux APIs and controllers to support gating.

### Non-Goals

<!--
What is out of scope for this RFC? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

In order to support manual gating, Flux could be extended with a dedicated API and controller
that would allow users to define `Gate` objects and perform operations like `open` and `close`.

A `Gate` object could be referenced in sources (Buckets, Git, Helm, OCI Repositories)
and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation)
to block the reconciliation until the gate is opened.

A `Gate` can be opened or closed by annotating the object with a timestamp or by
calling a specific webhook receiver exposed by notification-controller.

A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec.

The `Gate` API would replace Flagger's current
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).

Review comment (Member):

Additional audit trail for this feature:

  • Enrich existing logs so that actions that can be gated log:
    • What gate(s) were checked (if any).
    • What gate(s) caused the final result of (not) going ahead with the action.
    • If multiple gates can be defined, changes to which gates are being observed and what logical operators (and/or) are being used, should be descriptively logged.
  • Logs for the new Gate controller:
    • Descriptive summary of the gate state at creation time.
    • Any change of Gate state should log a descriptive summary.
  • The additional logs above would be an "opt-in" feature, so that users that do not require this level of governance don't have to process the extra data.

### User Stories

#### Story 1

> As a member of the SRE team, I want to allow deployments to happen only
> in a particular time frame of my own choosing.

Define a gate that automatically closes 1h after it has been opened:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h
```

When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

While the gate is closed, all the objects that reference it will wait for an approval:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved
```

Review comment (Member), on `kind: Kustomization`:

I propose that gates be applied at sources only, instead of at Kustomization and HelmRelease objects, as otherwise we may face the non-trivial challenges below:

  • Incompatible desired state when some Kustomization / HelmRelease objects are gated whilst others are not, especially when they depend on the same Flux source that is not gated.
    • Example 1: Kustomization objects app2 and app1 depend on GitRepository team-source. A change is merged into the repository and reconciled for object team-source, however app2 has a closed Gate and app1 does not. In this case the desired state at both the Git repository and the Flux source in the cluster is valid, but only one Kustomization is being applied, leading to an invalid observed state in the cluster. The same applies if both Kustomizations are behind a Gate whose state changes before the second object is reconciled (e.g. example 2 below).
  • Enforcing the desired state which was allowed before the Gate closed (User Story 2) will be difficult to manage from any point beyond the source, which could leave the cluster in an invalid state.
    • Example 1: a user manually deletes a managed Kubernetes object in the hope that Flux will re-create it.
    • Example 2: two Kustomizations (parent and child) have the same Gate, and the latter has a dependsOn pointing to the former. If the Gate closes after parent but before child is applied, the cluster observed state may be invalid (considering the desired state at source).

The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO8601 date from the `open.gate` annotation value,
sets the `requestedAt` & `resetToDefaultAt`, and opens the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened
```

While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.
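
The relationship between `requestedAt`, `spec.window` and `resetToDefaultAt` in the status above can be sketched in a few lines. This is only an illustration, not the controller's implementation: the helper names (`parse_window`, `reset_to_default_at`) are hypothetical, and the sketch assumes the window is a single-unit duration such as `30s`, `1h` or `24h` and that timestamps use the format produced by the `date -u` command above.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: only single-unit durations (seconds, minutes, hours) are handled.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}
_TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_window(window: str) -> timedelta:
    """Parse a duration such as '30s', '1h' or '24h'."""
    match = re.fullmatch(r"(\d+)([smh])", window)
    if match is None:
        raise ValueError(f"unsupported window: {window!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def reset_to_default_at(requested_at: str, window: str) -> str:
    """Return the UTC timestamp at which the gate falls back to spec.default."""
    requested = datetime.strptime(requested_at, _TIMESTAMP_FORMAT)
    requested = requested.replace(tzinfo=timezone.utc)
    return (requested + parse_window(window)).strftime(_TIMESTAMP_FORMAT)
```

With the values from the example, `reset_to_default_at("2021-03-26T10:00:00Z", "1h")` returns `"2021-03-26T11:00:00Z"`.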

The SRE can decide to close the gate ahead of its schedule with:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO8601 date from the `close.gate` annotation value,
compares it with the `open.gate` & `requestedAt` date and closes the gate:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

The objects that reference this gate will finish their ongoing reconciliation (if any), then pause.
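
One way to read the open and close examples above is that the most recent request wins, and that a request for the non-default state only holds for `spec.window`. The sketch below encodes that reading. It is an assumption drawn from the status examples in this RFC, not a specification, and `GateStatus` and `resolve_gate` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class GateStatus:
    opened: bool
    requested_at: Optional[datetime]
    reset_to_default_at: Optional[datetime]


def resolve_gate(
    default_opened: bool,
    window: timedelta,
    open_requested_at: Optional[datetime],
    close_requested_at: Optional[datetime],
    now: datetime,
) -> GateStatus:
    """Derive the gate state from its default, window and request annotations."""
    requests = []
    if open_requested_at is not None:
        requests.append((open_requested_at, True))
    if close_requested_at is not None:
        requests.append((close_requested_at, False))
    if not requests:
        # No request yet: the gate simply reflects spec.default.
        return GateStatus(default_opened, None, None)

    # Assumption: the most recent request takes precedence.
    requested_at, wants_open = max(requests, key=lambda request: request[0])
    if wants_open == default_opened:
        # Asking for the default state resets the gate immediately.
        return GateStatus(default_opened, requested_at, requested_at)

    # The non-default state holds until the window elapses.
    reset_at = requested_at + window
    opened = wants_open if now < reset_at else default_opened
    return GateStatus(opened, requested_at, reset_at)
```

Under this reading, closing a default-closed gate early sets `resetToDefaultAt` equal to `requestedAt`, which matches the status shown after the close request.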

> As a member of the SRE team, I want to block deployments in a particular time window.

To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h
```

To start the maintenance window you can annotate the gate with:

```sh
kubectl -n flux-system annotate --overwrite gate/maintenance \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO8601 date from the `close.gate`
annotation value and closes the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.
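
Such a CronJob would only need to set the close annotation on the gate. As a rough sketch, the payload could be a JSON merge patch built as follows. `gate_request_patch` is a hypothetical helper, and the annotation keys are the ones proposed in this RFC rather than an existing API.

```python
import json
from datetime import datetime, timezone

# Annotation keys as proposed in this RFC (not an existing Flux API).
_ANNOTATIONS = {
    "open": "open.gate.fluxcd.io/requestedAt",
    "close": "close.gate.fluxcd.io/requestedAt",
}


def gate_request_patch(action: str, at: datetime) -> str:
    """Build a JSON merge patch that requests opening or closing a gate."""
    if action not in _ANNOTATIONS:
        raise ValueError(f"action must be 'open' or 'close', got {action!r}")
    if at.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    stamp = at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({"metadata": {"annotations": {_ANNOTATIONS[action]: stamp}}})
```

The resulting string could be applied with `kubectl -n flux-system patch gate/maintenance --type merge -p "$PATCH"`, assuming the proposed `Gate` CRD is installed in the cluster.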

Review comment (@hassenius, Oct 6, 2022):

nvmnd...moved...

#### Story 2

> As a member of the SRE team, I want existing deployments to still be
> reconciled during a change freeze.

Gates can be used to block Flux sources from being refreshed, so that Flux
continues to reconcile existing approved desired states, whilst new changes
are held at a Flux source gate.

Review comment:

To clarify: does "block Flux sources from being refreshed" here mean that the gate prevents the source controller from making the new assets available, rather than suspending the polling in the source controller?

It is desirable for the source to still be able to detect changes, so that in-cluster logic tied to the gating mechanisms can be notified of available changes.

Review comment (Member):

If it stopped polling, it would avoid all the resource consumption (CPU, memory, network and storage) of an operation that would not cause any side effect on the cluster, which would streamline the controllers to better handle reconciliations that are not gated.

@hassenius is there any specific scenario you have in mind?

Example:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    - name: change-freeze # gate that enforces a change freeze time window
status:
  conditions:
    - lastTransitionTime: "2022-05-26T01:12:22Z"
      message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed."
      reason: GateClosed
      status: "True"
      type: Blocked
```

This would ensure that Gate changes would not impact the eventual consistency of
mid-flight reconciliations that were already deployed in the cluster. Flux would also
continue to re-create Flux managed objects that were manually deleted from the cluster.
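
The behaviour described here, where a closed gate holds back new revisions while the last approved artifact keeps being served, could be sketched as below. This is a simplified illustration under stated assumptions: `gated_source_revision` is a hypothetical function, the revision strings are placeholders, and the condition mirrors the `Blocked` example above rather than any existing source-controller status.

```python
from typing import Dict, Optional, Tuple


def gated_source_revision(
    current: Optional[str],
    upstream: str,
    gate_name: str,
    gate_opened: bool,
) -> Tuple[Optional[str], Optional[Dict[str, str]]]:
    """Return the revision to serve and, if held back, a Blocked condition."""
    if gate_opened:
        # Gate open: the source may advance to the upstream revision.
        return upstream, None
    # Gate closed: keep serving the last approved revision.
    condition = {
        "type": "Blocked",
        "status": "True",
        "reason": "GateClosed",
        "message": f"Reconciliation is blocked as gate '{gate_name}' is closed.",
    }
    return current, condition
```

Because Kustomizations and HelmReleases keep consuming the unchanged artifact, they can still correct drift, such as re-creating manually deleted objects, during the freeze.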

Review comment:

I would propose an additional use case, bringing in the reference dimension in addition to the time dimension for gates, such that a gate can be defined as open to a specific reference, either in spec, or as an annotation

Story 3

As a multi-cluster operator, I want to control how cluster configuration changes
are rolled out across the cluster estate from a single configuration source

Gates can be used to control the timing and order of where configurations are reconciled

For example

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: approved-revisions
  namespace: flux-system
spec:
  openTo: 4f08fc93e31

or

apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  annotations:
    open.gate.fluxcd.io/revision: 4f08fc93e31
  name: approved-revisions
  namespace: flux-system
spec:
  default: closed

### Alternatives

#### Users to implement gating outside of Flux

##### Before Flux source

Users could implement their own gating mechanisms as part of their development processes
ensuring that their custom rules are applied before the changes reach their Flux sources
(i.e. the target Git repository). For example, if deployments are not allowed on Fridays,
no PRs would be merged on those days.

The disadvantage is that some source types may not provide easy ways for users to enforce
such rules. When using different source types (e.g. Git, OCI, Helm), multiple implementations
may be required.

##### CronJobs and Flux Suspend

Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and using
the built-in suspend feature in Flux that allows for a Flux object to stop being reconciled
until it is resumed. This alternative does not scale well when considering hundreds of Flux
objects.

## Design Details

Review comment (Member):

In highly regulated environments, gating can be used to ensure that specific processes were observed before a new change made its way to a given environment. Flux controllers that support such a mechanism should extend their log messages to state which Gates were taken into account at the time of a reconciliation, and their states.

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs and code snippets.

The design details should address at least the following questions:
- How can this feature be enabled / disabled?
- Does enabling the feature change any default behavior?
- Can the feature be disabled once it has been enabled?
- How can an operator determine if the feature is in use?
- Are there any drawbacks when enabling this feature?
-->

Review comment (Member):

A few ideas on design details to be discussed:

Multiple Gates

Flux objects that support gating can specify multiple gates. By default,
all gates specified must be open for the reconciliation to go ahead. To
change the behavior, spec.gates.require can be set to oneOf instead:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    # all (default): all gates must be open for the reconciliation to go ahead.
    # oneOf: at least one of the gates must be open for the reconciliation to go ahead.
    require: oneOf # <all|oneOf>
    refs:
    - change-freeze # gate that enforces a change freeze time window
    - bypass-signoff # gate that allows other gates to be overridden.

When oneOf is used, a single open Gate is required for reconciliations
to proceed.
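
The all/oneOf rule described above could be evaluated roughly as follows. This is a sketch only: `gates_allow_reconciliation` is a hypothetical name, treating a referenced gate that cannot be found as closed is an assumption, and the returned reason string is merely one way to surface which gates decided the outcome.

```python
from typing import Mapping, Sequence, Tuple


def gates_allow_reconciliation(
    require: str,
    refs: Sequence[str],
    opened: Mapping[str, bool],
) -> Tuple[bool, str]:
    """Decide whether reconciliation may proceed, with a loggable reason."""
    if require not in ("all", "oneOf"):
        raise ValueError(f"require must be 'all' or 'oneOf', got {require!r}")
    if not refs:
        return True, "no gates referenced"

    # Assumption: a gate missing from the lookup is treated as closed.
    open_gates = [name for name in refs if opened.get(name, False)]
    closed_gates = [name for name in refs if not opened.get(name, False)]

    if require == "all":
        if closed_gates:
            return False, "closed gate(s): " + ", ".join(closed_gates)
        return True, "all gates open"
    if open_gates:
        return True, "open gate(s): " + ", ".join(open_gates)
    return False, "no referenced gate is open"
```

For the manifest above, an open `bypass-signoff` gate lets reconciliation proceed under `oneOf` even while `change-freeze` is closed, whereas `all` would block it.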

Recovering from wiped Storage

The source artifact storage lives in the running source controller instance. If its Pod
is recreated, all the artifacts will need to be reacquired. This process happens
automatically as part of each source reconciliation, however, when sources
are restricted by a gate they will require a new process.

In such a situation the controller will need to fetch the version that was used before the Gate was
closed and that was in storage before it was wiped. Not all source types will support this,
so the expected outcomes are:

  • GitRepository: restores artifact based on the last known revision (commit ID).
  • S3 Buckets: fail reconciliation with "artifact not found whilst under closed Gate".
  • OCI: TBC.
  • Helm Repository: TBC.
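
The outcomes listed above can be summarised as a small decision function. This is only a sketch of the list as written: `recovery_outcome` is a hypothetical name, the kind strings follow the Flux source kinds, failing a GitRepository that has no recorded revision is an assumption, and the OCI and Helm cases stay undecided because the list marks them TBC.

```python
from typing import Optional, Tuple


def recovery_outcome(kind: str, last_revision: Optional[str]) -> Tuple[str, str]:
    """Return (outcome, detail) for a gated source whose storage was wiped."""
    if kind == "GitRepository":
        if last_revision:
            return "restore", f"fetch last known revision {last_revision}"
        # Assumption: without a recorded revision there is nothing to restore.
        return "fail", "no last known revision recorded"
    if kind == "Bucket":
        return "fail", "artifact not found whilst under closed Gate"
    if kind in ("OCIRepository", "HelmRepository"):
        # Left open on purpose: the list above marks these as TBC.
        return "undecided", "TBC"
    raise ValueError(f"unknown source kind: {kind!r}")
```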

Review comment (Member, Author):

For OCI we have the upstream digest in status, so like with Git, we can use it to pull the data.

## Implementation History

<!--
Major milestones in the lifecycle of the RFC such as:
- The first Flux release where an initial version of the RFC was available.
- The version of Flux where the RFC graduated to general availability.
- The version of Flux where the RFC was retired or superseded.
-->