This job silences alerts while Day2 configuration is loaded onto a cluster at initial provisioning, so that normal provisioning activity within the cluster does not page on-call SREs.
The silence initially takes effect for 1 hour.
We poll cluster health using osde2e health checks, rechecking once a minute (configurable) after a failure, until they all report healthy 20 times in a row (configurable) at 30-second intervals (configurable). By default, we will clear any active silence and exit successfully if the cluster is (or becomes) more than two hours old (configurable).
If the silence expires while health checks are failing, we reinstate it. (This means it is theoretically possible for alerts to fire for up to one minute if the silence expires right after a health check fails. FIXME.)
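The polling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: Config, Hooks, waitForHealthy, and fakeClock are hypothetical names, and the health check, cluster age, and silence handling are stand-in callbacks rather than real osde2e or Alertmanager calls.

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the tunables described above (field names are illustrative).
type Config struct {
	CleanCheckRuns      int           // consecutive healthy rounds required
	CleanCheckInterval  time.Duration // pause after a healthy round
	FailedCheckInterval time.Duration // pause after a failed round
	MaxClusterAge       time.Duration // stop waiting past this cluster age
}

// Hooks are hypothetical callbacks standing in for the real integrations.
type Hooks struct {
	Check         func() bool          // run one round of health checks
	Age           func() time.Duration // current age of the cluster
	Sleep         func(time.Duration)  // wait between rounds
	EnsureSilence func()               // re-create the silence if it expired
}

// waitForHealthy returns true once Check has passed CleanCheckRuns times in a
// row, or false if the cluster exceeds MaxClusterAge first.
func waitForHealthy(cfg Config, h Hooks) bool {
	streak := 0
	for h.Age() <= cfg.MaxClusterAge {
		if h.Check() {
			streak++
			if streak >= cfg.CleanCheckRuns {
				return true
			}
			h.Sleep(cfg.CleanCheckInterval)
			continue
		}
		streak = 0        // any failure resets the streak
		h.EnsureSilence() // keep alerts muted while the cluster is unhealthy
		h.Sleep(cfg.FailedCheckInterval)
	}
	return false
}

// fakeClock lets the loop run instantly in a dry run; it treats the cluster
// as having been created at the moment the loop starts.
type fakeClock struct{ elapsed time.Duration }

func (c *fakeClock) Age() time.Duration    { return c.elapsed }
func (c *fakeClock) Sleep(d time.Duration) { c.elapsed += d }

func main() {
	cfg := Config{CleanCheckRuns: 3, CleanCheckInterval: 30 * time.Second,
		FailedCheckInterval: 60 * time.Second, MaxClusterAge: 2 * time.Hour}
	results := []bool{true, false, true, true, true} // scripted check outcomes
	clk := &fakeClock{}
	next := 0
	healthy := waitForHealthy(cfg, Hooks{
		Check:         func() bool { r := results[next]; next++; return r },
		Age:           clk.Age,
		Sleep:         clk.Sleep,
		EnsureSilence: func() { fmt.Println("silence re-checked") },
	})
	fmt.Println("healthy:", healthy, "simulated time:", clk.elapsed)
}
```

Passing the clock and checks in as callbacks is only a convenience for the sketch: it lets the loop be exercised without a cluster. A false return corresponds to the "cluster too old" case, where the job would clear any active silence and exit successfully.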
make build
make docker-build
make docker-push
This builds the binary for Linux, builds the Docker image (which currently requires the binary to have been built beforehand, outside the image build), and then pushes the updated image to Quay.
If you wish to push to a specific repository, org, or image name, you may override the IMAGE_REPO, IMAGE_ORG, or IMAGE_NAME variables, respectively, when invoking the docker-build and docker-push targets.
For example, for development purposes, you may wish to export IMAGE_ORG=my_quay_namespace.
Deploy each of the manifests in the deploy/ folder in alphanumeric order.
If you are overriding any of the IMAGE_* variables for development purposes, be sure to (temporarily) edit the Job, setting the image appropriately.
You can iterate by deleting the Job (which will delete its Pod) and recreating it.
The following environment variables can be set in the container, e.g. by editing the Job to include them in spec.template.spec.containers[0].env.
Remember that the values must be strings, so numeric values must be quoted.
The maximum age of the cluster, in minutes, after which we will clear any silences and exit "successfully".
Default: "120" (two hours)
The number of consecutive health checks that must succeed before we declare the cluster truly healthy.
Default: "20"
The number of seconds to sleep between successful health checks. Once the cluster is truly healthy, you can expect the job to succeed after an interval of roughly:
CLEAN_CHECK_RUNS x (CLEAN_CHECK_INTERVAL_SECONDS + (time to run one iteration of health checks)) seconds
Default: "30" (seconds)
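As a worked example of the estimate above, the short sketch below plugs in the default values of 20 runs and 30 seconds. The 15-second duration for one round of health checks is an assumed figure chosen purely for illustration, and timeToSuccess is a hypothetical helper, not part of the job.

```go
package main

import (
	"fmt"
	"time"
)

// timeToSuccess applies the formula above:
// runs x (interval + time to run one iteration of health checks).
func timeToSuccess(runs int, interval, checkRound time.Duration) time.Duration {
	return time.Duration(runs) * (interval + checkRound)
}

func main() {
	// 20 runs and 30s are the documented defaults; 15s per round is assumed.
	fmt.Println(timeToSuccess(20, 30*time.Second, 15*time.Second))
}
```

With these numbers the job would finish about 15 minutes after the cluster first becomes healthy; slower health checks lengthen that proportionally.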
The number of seconds to sleep after a failed health check, before rechecking.
Default: "60" (one minute)
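Because these values arrive as strings, the job has to parse them and fall back to the defaults when a variable is unset. One way this could look is sketched below. It is an assumption about the approach, not the job's actual code: intSetting is a hypothetical helper, and only the two variable names that appear in this section are used.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// intSetting reads name via lookup and parses it as an integer, falling back
// to def when the variable is unset or empty. A non-numeric value is an error.
func intSetting(lookup func(string) (string, bool), name string, def int) (int, error) {
	raw, ok := lookup(name)
	if !ok || raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s=%q is not an integer: %w", name, raw, err)
	}
	return n, nil
}

func main() {
	runs, err := intSetting(os.LookupEnv, "CLEAN_CHECK_RUNS", 20)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	interval, err := intSetting(os.LookupEnv, "CLEAN_CHECK_INTERVAL_SECONDS", 30)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("runs=%d interval=%ds\n", runs, interval)
}
```

Taking the lookup function as a parameter keeps the parsing logic testable without touching the process environment; in the container it would simply be os.LookupEnv.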
This code runs health checks via a module dependency on github.com/openshift/osde2e.
That dependency is pinned to a specific commit in go.mod.
That commit must be modified manually to pick up changes in osde2e.
An easy way to bump to the latest commit is to run:
go get -u github.com/openshift/osde2e
Don't forget to build and test with the updated dependency before committing!
- Look for existing active silences before creating a new one
- Implement actual healthchecks (steal them from osde2e) to determine cluster stability
- Find out whether there is a better and more secure way to talk to the Alertmanager API using OAuth and ServiceAccount tokens.
- Make the default silence expiry shorter; and extend it when health checks fail (OSD-6384).