OSD Cluster Readiness Job


This job silences alerts while Day2 configuration is loaded onto a cluster at initial provisioning, so that on-call SREs are not paged for normal operations within the cluster.

The silence initially takes effect for 1 hour.

We poll cluster health using osde2e health checks until they all report healthy 20 times in a row (`CLEAN_CHECK_RUNS`), waiting 30 seconds between successful checks (`CLEAN_CHECK_INTERVAL_SECONDS`) and one minute after a failed check (`FAILED_CHECK_INTERVAL_SECONDS`). By default, we clear any active silence and exit successfully if the cluster is (or becomes) more than two hours old (`MAX_CLUSTER_AGE_MINUTES`).

If the silence expires while health checks are failing, we reinstate it. (This means it is theoretically possible for alerts to fire for up to one minute if the silence expires right after a health check fails. FIXME.)

Deploying the Image

make build
make docker-build
make docker-push

This builds the binary for Linux, builds the Docker image (which currently requires the binary to be built externally), and then pushes the updated image to Quay.

If you wish to push to a specific repository, org, or image name, you may override the IMAGE_REPO, IMAGE_ORG, or IMAGE_NAME variables, respectively, when invoking the docker-build and docker-push targets. For example, for development purposes, you may wish to export IMAGE_ORG=my_quay_namespace.

Deploying the Job

Deploy each of the manifests in the deploy/ folder in alphanumeric order.

If you are overriding any of the IMAGE_* variables for development purposes, be sure to (temporarily) edit the Job, setting the image appropriately.

You can iterate by deleting the Job (which will delete its Pod) and recreating it.

Tunables

The following environment variables can be set in the container, e.g. by editing the Job to include them in spec.template.spec.containers[0].env.

Remember that the values must be strings, so numeric values must be quoted.

`MAX_CLUSTER_AGE_MINUTES`

The maximum age of the cluster, in minutes, after which we will clear any silences and exit "successfully".

Default: "120" (two hours)

`CLEAN_CHECK_RUNS`

The number of consecutive health checks that must succeed before we declare the cluster truly healthy.

Default: "20"

`CLEAN_CHECK_INTERVAL_SECONDS`

The number of seconds to sleep between successful health checks. Once the cluster is truly healthy, you can expect the job to succeed after an interval of roughly:

CLEAN_CHECK_RUNS x (CLEAN_CHECK_INTERVAL_SECONDS + (time to run one iteration of health checks)) seconds

Default: "30" (seconds)

`FAILED_CHECK_INTERVAL_SECONDS`

The number of seconds to sleep after a failed health check, before rechecking.

Default: "60" (one minute)

Keeping up with osde2e

This code runs health checks via a module dependency on github.com/openshift/osde2e. That dependency is pinned to a specific commit in go.mod, and the pin must be updated manually to pick up changes in osde2e. An easy way to bump to the latest commit is to run:

go get -u github.com/openshift/osde2e

Don't forget to build and test with the updated dependency before committing!

TO DO

Look for existing active silences before creating a new one.
Implement actual health checks (steal them from osde2e) to determine cluster stability.
Find out whether there is a better and more secure way to talk to the Alertmanager API using OAuth and ServiceAccount tokens.
Make the default silence expiry shorter, and extend it when health checks fail (OSD-6384).
