This repository has been archived by the owner on Jun 30, 2022. It is now read-only.

iamkirkbater/osd-cluster-ready-job

OSD Cluster Readiness Job

This job silences alerts while Day2 configuration is loaded onto a cluster at initial provisioning, so that on-call SREs are not paged for normal operations within the cluster.

The silence initially takes effect for 1 hour.

We poll cluster health using osde2e health checks. After a failed check we wait one minute before rechecking (configurable). We keep polling until all checks report healthy 20 times in a row (configurable), at 30-second intervals (configurable). By default, we clear any active silence and exit successfully if the cluster is (or becomes) more than two hours old (configurable).

If the silence expires while health checks are failing, we reinstate it. (This means it is theoretically possible for alerts to fire for up to one minute if the silence expires right after a health check fails. FIXME.)

Deploying the Image

make build
make docker-build
make docker-push

This builds the binary for Linux, builds the Docker image (which currently requires the binary to be built externally), and then pushes the updated image to Quay.

If you wish to push to a specific repository, org, or image name, you may override the IMAGE_REPO, IMAGE_ORG, or IMAGE_NAME variables, respectively, when invoking the docker-build and docker-push targets. For example, for development purposes, you may wish to export IMAGE_ORG=my_quay_namespace.

Deploying the Job

Deploy each of the manifests in the deploy/ folder in alphanumeric order.

If you are overriding any of the IMAGE_* variables for development purposes, be sure to (temporarily) edit the Job, setting the image appropriately.

You can iterate by deleting the Job (which will delete its Pod) and recreating it.

Tunables

The following environment variables can be set in the container, e.g. by editing the Job to include them in spec.template.spec.containers[0].env.

Remember that the values must be strings, so numeric values must be quoted.

MAX_CLUSTER_AGE_MINUTES

The maximum age of the cluster, in minutes, after which we will clear any silences and exit "successfully".

Default: "120" (two hours)

CLEAN_CHECK_RUNS

The number of consecutive health checks that must succeed before we declare the cluster truly healthy.

Default: "20"

CLEAN_CHECK_INTERVAL_SECONDS

The number of seconds to sleep between successful health checks. Once the cluster is truly healthy, you can expect the job to succeed after an interval of roughly:

CLEAN_CHECK_RUNS x (CLEAN_CHECK_INTERVAL_SECONDS + (time to run one iteration of health checks)) seconds

Default: "30" (seconds)

FAILED_CHECK_INTERVAL_SECONDS

The number of seconds to sleep after a failed health check, before rechecking.

Default: "60" (one minute)

Keeping up with osde2e

This code runs health checks via a module dependency on github.com/openshift/osde2e. That dependency is pinned to a specific commit in go.mod, and the pin must be bumped manually to pick up changes in osde2e. An easy way to bump to the latest commit is to run:

go get -u github.com/openshift/osde2e

Don't forget to build and test with the updated dependency before committing!

TO DO

  • Look for existing active silences before creating a new one
  • Implement actual healthchecks (steal them from osde2e) to determine cluster stability
  • Find out whether there is a better, more secure way to talk to the Alertmanager API using OAuth and ServiceAccount tokens.
  • Make the default silence expiry shorter; and extend it when health checks fail (OSD-6384).
