[RFC] Gating Flux reconciliation #3158

# RFC-XXXX Gating Flux reconciliation

**Status:** provisional

**Creation date:** 2022-09-28

**Last update:** 2022-10-04

## Summary

Flux should offer a mechanism for cluster admins and other teams involved in the release process
to manually approve the rollout of changes onto clusters. In addition, Flux should offer
a way to define maintenance time windows and other time-based gates, to allow better control
of application and infrastructure changes to critical systems.

## Motivation

Flux watches sources (e.g. GitRepositories, OCIRepositories, HelmRepositories, S3-compatible Buckets) and
automatically reconciles the changes onto clusters as described with Flux Kustomizations and HelmReleases.
The teams involved in the delivery process (e.g. dev, qa, sre) can decide when changes are delivered
to production by reviewing and approving the proposed changes collaboratively through pull requests.
Once a pull request is merged into a branch that defines the desired state of the production system,
Flux kicks off the reconciliation process.

There are situations when users want to have a gating mechanism after the desired state changes are merged in Git:

- Manual approval of container image updates (e.g. https://github.com/fluxcd/flux2/discussions/870)
- Manual approval of infrastructure upgrades (e.g. https://github.com/fluxcd/flux2/issues/959)
- Maintenance window (e.g. https://github.com/fluxcd/flux2/discussions/1004)
- Planned releases
- No Deploy Friday

### Goals

- Offer a dedicated API for defining time-based gates in a declarative manner.
- Introduce a `gating-controller` in the Flux suite that manages the `Gate` objects.
- Extend the current Flux APIs and controllers to support gating.

### Non-Goals

<!--
What is out of scope for this RFC? Listing non-goals helps to focus discussion
and make progress.
-->

## Proposal

In order to support manual gating, Flux could be extended with a dedicated API and controller
that would allow users to define `Gate` objects and perform operations like `open` and `close`.

A `Gate` object could be referenced in sources (Buckets, Git, Helm, OCI Repositories)
and syncs (Kustomizations, HelmReleases, ImageUpdateAutomation)
to block the reconciliation until the gate is opened.

A `Gate` can be opened or closed by annotating the object with a timestamp or by
calling a specific webhook receiver exposed by notification-controller.

A `Gate` can be configured to automatically close or open based on a time window defined in the `Gate` spec.

The `Gate` API would replace Flagger's current
[manual gating mechanism](https://docs.flagger.app/usage/webhooks#manual-gating).

**Review comment:** Additional audit trail for this feature:

### User Stories

#### Story 1

> As a member of the SRE team, I want to allow deployments to happen only
> in a particular time frame of my own choosing.

Define a gate that automatically closes after 1h from the time it has been opened:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
spec:
  interval: 30s
  default: closed
  window: 1h
```

When the gate is created in-cluster, the `gating-controller` uses `spec.default` to set the `Opened` condition:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Gate closed by default"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

While the gate is closed, all the objects that reference it will wait for an approval:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  gates:
    - name: sre-approval
    - name: qa-approval
status:
  conditions:
    - lastTransitionTime: "2021-03-26T10:09:26Z"
      message: "Reconciliation is waiting approval, gate 'flux-system/sre-approval' is closed."
      reason: GateClosed
      status: "False"
      type: Approved
```

**Review comment (on `kind: Kustomization`):** I propose gates are to be applied at sources only instead of
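
As a rough illustration of how a controller might derive the `Approved` condition shown above from the gates an object references, here is a minimal Python sketch. It is not taken from the RFC: the function name `approval_condition`, the `GatesOpened` reason and the "all gates must be opened" rule are assumptions made for the example.

```python
from typing import Dict, List


def approval_condition(gate_states: Dict[str, bool]) -> dict:
    """Build an `Approved` condition from the referenced gates.

    gate_states maps "<namespace>/<name>" to True (opened) or False (closed).
    Assumption: every referenced gate must be opened for approval.
    """
    closed: List[str] = [name for name, opened in gate_states.items() if not opened]
    if closed:
        # Report the first closed gate, mirroring the message in the RFC example.
        return {
            "type": "Approved",
            "status": "False",
            "reason": "GateClosed",
            "message": (
                "Reconciliation is waiting approval, "
                f"gate '{closed[0]}' is closed."
            ),
        }
    return {
        "type": "Approved",
        "status": "True",
        "reason": "GatesOpened",  # hypothetical reason, not defined by the RFC
        "message": "All referenced gates are opened.",
    }


condition = approval_condition(
    {"flux-system/sre-approval": False, "flux-system/qa-approval": True}
)
print(condition["message"])
```

With `sre-approval` closed and `qa-approval` opened, the sketch reports the object as not approved and names the closed gate.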

The SRE team can open the gate either by annotating the gate or by calling the notification-controller webhook:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
open.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO 8601 date from the `open.gate` annotation value,
sets `requestedAt` and `resetToDefaultAt` in the status, and opens the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-26T11:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for closing at 2021-03-26T11:00:00Z"
      reason: ReconciliationSucceeded
      status: "True"
      type: Opened
```

While the gate is opened, all the objects that reference it are approved to reconcile at their configured interval.
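
The relationship between `requestedAt`, `spec.window` and `resetToDefaultAt` can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names are hypothetical, and treating the window as closed exactly at `resetToDefaultAt` is a choice made for the example, not something the RFC specifies.

```python
from datetime import datetime, timedelta, timezone

# Timestamp format produced by `date -u +"%Y-%m-%dT%H:%M:%SZ"`.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_ts(value: str) -> datetime:
    """Parse an annotation timestamp as a timezone-aware UTC datetime."""
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)


def reset_to_default_at(requested_at: str, window: timedelta) -> str:
    """Return the time at which the gate falls back to spec.default."""
    return (parse_ts(requested_at) + window).strftime(TS_FORMAT)


def is_within_window(requested_at: str, window: timedelta, now: str) -> bool:
    """True while `now` lies in [requestedAt, requestedAt + window)."""
    start = parse_ts(requested_at)
    return start <= parse_ts(now) < start + window


window = timedelta(hours=1)  # spec.window: 1h
print(reset_to_default_at("2021-03-26T10:00:00Z", window))
print(is_within_window("2021-03-26T10:00:00Z", window, "2021-03-26T10:30:00Z"))
```

For the `sre-approval` gate this reproduces the `resetToDefaultAt` value from the status above.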

The SRE can decide to close the gate ahead of its schedule with:

```sh
kubectl -n flux-system annotate --overwrite gate/sre-approval \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO 8601 date from the `close.gate` annotation value,
compares it with the `open.gate` `requestedAt` date, and closes the gate:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: sre-approval
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:10:00Z"
  resetToDefaultAt: "2021-03-26T10:10:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:10:00Z"
      message: "Gate close requested"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

The objects that reference this gate will finish their ongoing reconciliation (if any), then pause.
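
One possible reading of the comparison between the `open.gate` and `close.gate` annotations is "the most recent request wins, and it only holds for `spec.window`". The Python sketch below illustrates that reading. The function `gate_is_open`, its parameters, and the rule that future-dated requests are ignored are assumptions for the example, not behaviour defined by the RFC.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an annotation timestamp as a timezone-aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def gate_is_open(
    default_open: bool,
    window: timedelta,
    now: str,
    open_requested_at: Optional[str] = None,
    close_requested_at: Optional[str] = None,
) -> bool:
    """Resolve the gate state from spec.default and the two annotations."""
    now_ts = parse_ts(now)
    requests = []
    if open_requested_at is not None:
        requests.append((parse_ts(open_requested_at), True))
    if close_requested_at is not None:
        requests.append((parse_ts(close_requested_at), False))
    # Assumption: requests dated in the future are not yet in effect.
    requests = [r for r in requests if r[0] <= now_ts]
    if not requests:
        return default_open
    # The most recent request wins.
    requested_at, requested_open = max(requests, key=lambda r: r[0])
    if requested_open == default_open:
        # Requesting the default state simply resets the gate.
        return default_open
    # A non-default state only holds until requestedAt + window.
    return requested_open if now_ts < requested_at + window else default_open


# Opened at 10:00, closed ahead of schedule at 10:10, evaluated at 10:15.
print(gate_is_open(
    default_open=False,
    window=timedelta(hours=1),
    now="2021-03-26T10:15:00Z",
    open_requested_at="2021-03-26T10:00:00Z",
    close_requested_at="2021-03-26T10:10:00Z",
))
```

Under these assumptions the same function also covers a gate that is opened by default and closed for a fixed window.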

> As a member of the SRE team, I want to block deployments in a particular time window.

To enforce a maintenance window of 24 hours, you can define a `Gate` that's opened by default:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
spec:
  interval: 30s
  default: opened
  window: 24h
```

To start the maintenance window you can annotate the gate with:

```sh
kubectl -n flux-system annotate --overwrite gate/maintenance \
close.gate.fluxcd.io/requestedAt="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
```

The `gating-controller` extracts the ISO 8601 date from the `close.gate`
annotation value and closes the gate for the specified window:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: maintenance
  namespace: flux-system
status:
  requestedAt: "2021-03-26T10:00:00Z"
  resetToDefaultAt: "2021-03-27T10:00:00Z"
  conditions:
    - lastTransitionTime: "2021-03-26T10:00:00Z"
      message: "Gate scheduled for opening at 2021-03-27T10:00:00Z"
      reason: ReconciliationSucceeded
      status: "False"
      type: Opened
```

You could also schedule "No Deploy Fridays" with a CronJob that closes the `maintenance` gate at `0 0 * * FRI`.

#### Story 2

> As a member of the SRE team, I want existing deployments to still be
> reconciled during a change freeze.

Gates can be used to block Flux sources from being refreshed, so that Flux
continues to reconcile existing approved desired states, whilst new changes
are held at a Flux source gate.

**Review comment:** To clarify that it is desirable for the source to still be able to detect changes, such that in-cluster logic that is tied to the gating mechanisms can be notified of available changes.

**Review comment:** If it stopped polling, it would avoid all the resource consumption (CPU, memory, network and storage) for an operation that would not cause any side effect to the cluster, which would streamline the controllers to better handle reconciliations that were not gated. @hassenius is there any specific scenario you have in mind?

Example:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    - name: change-freeze # gate that enforces a change freeze time window
status:
  conditions:
    - lastTransitionTime: "2022-05-26T01:12:22Z"
      message: "Reconciliation is blocked as gate 'flux-system/change-freeze' is closed."
      reason: GateClosed
      status: "True"
      type: Blocked
```

This would ensure that Gate changes would not impact the eventual consistency of
mid-flight reconciliations that were already deployed in the cluster. Flux would also
continue to re-create Flux-managed objects that were manually deleted from the cluster.
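
To make the "existing state keeps reconciling, new changes are held" behaviour concrete, here is a minimal Python sketch of a gated source refresh. It is an illustration only: `SourceState`, `refresh_source` and the `pending_revision` field are hypothetical names, and recording the held revision assumes the source keeps detecting upstream changes while gated, which is still under discussion in the review comments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceState:
    # Revision currently served to Kustomizations and HelmReleases.
    artifact_revision: str
    # Newer upstream revision held at the gate, if any (hypothetical field).
    pending_revision: Optional[str] = None


def refresh_source(state: SourceState, upstream_revision: str, gate_open: bool) -> SourceState:
    """Decide which revision the source exposes after polling upstream."""
    if upstream_revision == state.artifact_revision:
        # Nothing new upstream: keep serving the approved artifact.
        return SourceState(state.artifact_revision)
    if gate_open:
        # Gate opened: the new revision becomes the served artifact.
        return SourceState(upstream_revision)
    # Gate closed: keep the approved artifact and remember what is waiting.
    return SourceState(state.artifact_revision, pending_revision=upstream_revision)


held = refresh_source(SourceState("rev-a"), "rev-b", gate_open=False)
print(held)
released = refresh_source(held, "rev-b", gate_open=True)
print(released)
```

Because the served artifact does not change while the gate is closed, downstream objects would keep reconciling the previously approved revision.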

**Review comment:** I would propose an additional use case, bringing in the reference dimension in addition to the time dimension for gates, such that a gate can be defined as open to a specific reference, either in spec or as an annotation.

Story 3

Gates can be used to control the timing and order of where configurations are reconciled.

For example:

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  name: approved-revisions
  namespace: flux-system
spec:
  openTo: 4f08fc93e31
```

or

```yaml
apiVersion: gating.toolkit.fluxcd.io/v1alpha1
kind: Gate
metadata:
  annotations:
    open.gate.fluxcd.io/revision: 4f08fc93e31
  name: approved-revisions
  namespace: flux-system
spec:
  default: closed
```
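
A revision-scoped gate such as the one proposed in this comment could be checked roughly as follows. This Python sketch is hypothetical: `gate_admits` is an invented name, and matching the approved value as a prefix of the full revision is an assumption, since the comment does not say how `openTo` would be compared.

```python
from typing import Optional


def gate_admits(revision: str, open_to: Optional[str], default_open: bool = False) -> bool:
    """Return True if the gate lets `revision` through.

    open_to is the approved revision from `spec.openTo` or from the
    `open.gate.fluxcd.io/revision` annotation. Assumption: it may be an
    abbreviated commit SHA, so it is matched as a prefix.
    """
    if open_to:
        return revision.startswith(open_to)
    # No approved revision recorded: fall back to the gate's default state.
    return default_open


# Hypothetical full SHA that starts with the abbreviated value from the example.
full_sha = "4f08fc93e31" + "0" * 29
print(gate_admits(full_sha, "4f08fc93e31"))
print(gate_admits("deadbeef", "4f08fc93e31"))
```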

### Alternatives

#### Users to implement gating outside of Flux

##### Before Flux source

Users could implement their own gating mechanisms as part of their development processes
ensuring that their custom rules are applied before the changes reach their Flux sources
(i.e. the target Git repository). For example, if deployments are not allowed on Fridays,
no PRs would be merged on those days.

The disadvantage is that some source types may not provide easy ways for users to enforce
such rules. When using different source types (e.g. Git, OCI, Helm), multiple implementations
may be required.

##### CronJobs and Flux Suspend

Users can implement a gating mechanism within Kubernetes by leveraging CronJobs and using
the built-in suspend feature in Flux that allows for a Flux object to stop being reconciled
until it is resumed. This alternative does not scale well when considering hundreds of Flux
objects.

## Design Details

**Review comment:** In highly regulated environments, gating can be used to ensure specific processes were observed before a new change made its way to a given environment. Flux controllers that support such a mechanism should extend their log messages to express which Gates were taken into account at the time of a reconciliation, and their states.

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs and code snippets.

The design details should address at least the following questions:
- How can this feature be enabled / disabled?
- Does enabling the feature change any default behavior?
- Can the feature be disabled once it has been enabled?
- How can an operator determine if the feature is in use?
- Are there any drawbacks when enabling this feature?
-->

**Review comment:** A few ideas on design details to be discussed:

Multiple Gates

Flux objects that support gating can specify multiple gates. By default,

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  gates:
    # all (default): all gates must be open for the reconciliation to go ahead.
    # oneOf: at least one of the gates must be open for the reconciliation to go ahead.
    require: oneOf # <all|oneOf>
    refs:
      - change-freeze # gate that enforces a change freeze time window
      - bypass-signoff # gate that allows other gates to be overridden.
```

When Recovering from wiped Storage

The source artifact storage lives in the running source controller instance. If its Pod

In such situation the controller will need to fetch the version used before the Gate was

**Review comment:** For OCI we have the upstream digest in status, so like with Git, we can use it to pull the data.
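
The `require: <all|oneOf>` semantics proposed in the "Multiple Gates" comment could be evaluated along these lines. The Python sketch is illustrative only: `gates_allow` is a hypothetical helper, and treating unknown gates as closed and an empty `refs` list as "not gated" are assumptions made for the example.

```python
from typing import Dict, List


def gates_allow(require: str, refs: List[str], opened: Dict[str, bool]) -> bool:
    """Decide whether reconciliation may go ahead for the referenced gates.

    require: "all" (every gate must be open) or "oneOf" (at least one open).
    opened:  maps gate name to True (opened) or False (closed).
    """
    if not refs:
        # Assumption: an object without gate references is not gated.
        return True
    # Assumption: a referenced gate that cannot be found counts as closed.
    states = [opened.get(name, False) for name in refs]
    if require == "all":
        return all(states)
    if require == "oneOf":
        return any(states)
    raise ValueError(f"unsupported require mode: {require!r}")


refs = ["change-freeze", "bypass-signoff"]
opened = {"change-freeze": False, "bypass-signoff": True}
print(gates_allow("oneOf", refs, opened))
print(gates_allow("all", refs, opened))
```

With `oneOf`, the opened `bypass-signoff` gate overrides the closed `change-freeze` gate, which matches the intent described in the YAML comments.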

## Implementation History

<!--
Major milestones in the lifecycle of the RFC such as:
- The first Flux release where an initial version of the RFC was available.
- The version of Flux where the RFC graduated to general availability.
- The version of Flux where the RFC was retired or superseded.
-->

**Review comment:** I would suggest that the gates should not be exclusively time-based, but also support specific revisions.