Configuration Anomaly Detection (CAD) reduces manual SRE investigation by detecting cluster anomalies and sending relevant communications to the cluster owner.
To contribute to CAD, please see our CONTRIBUTING Document.
- cadctl -- Performs the workflow for 'cluster has gone missing' (CHGM) alerts.
- AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
- PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
- OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
- CAD is a command-line tool that runs in Tekton pipelines.
- The Tekton service runs on an app-sre cluster.
- CAD is triggered by PagerDuty webhooks configured on selected services, so every alert in those services triggers a CAD pipeline.
- CAD uses the data received via the webhook to determine which investigation to start.
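
The dispatch step described above can be sketched as follows. This is a minimal illustration and not CAD's actual code: the `webhookPayload` struct, its fields, the `selectInvestigation` function, and the title-matching rule are all hypothetical, and the real PagerDuty webhook payload has a richer schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// webhookPayload is a hypothetical, simplified stand-in for the data CAD
// receives from a PagerDuty webhook.
type webhookPayload struct {
	IncidentID string `json:"incident_id"`
	Title      string `json:"title"`
}

// selectInvestigation decodes the payload and returns the name of the
// investigation to start. Matching on the alert title is an assumption
// made for illustration only.
func selectInvestigation(raw []byte) (string, error) {
	var p webhookPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", fmt.Errorf("decoding webhook payload: %w", err)
	}
	if strings.Contains(p.Title, "has gone missing") {
		return "CHGM", nil
	}
	return "", fmt.Errorf("no investigation for alert %q", p.Title)
}

func main() {
	raw := []byte(`{"incident_id":"P123","title":"cluster-a has gone missing"}`)
	inv, err := selectInvestigation(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("starting investigation:", inv)
}
```

Returning an error for unmatched alerts keeps the "all alerts trigger a pipeline" behavior explicit: the pipeline still runs, but it can exit early when no investigation applies.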
- PagerDuty webhook receives CHGM alert from Dead Man's Snitch.
- PagerDuty sends a webhook to the Tekton EventListener, which triggers the CAD Tekton pipeline.
- Logs into AWS account of cluster and checks for stopped/terminated instances.
- If unable to access AWS account, posts "cluster credentials are missing" limited support reason.
- If stopped/terminated instances are found, pulls AWS CloudTrail events for those instances.
- If no stopped/terminated instances are found, escalates to SRE for further investigation.
- If the user of the event is:
- Authorized (SRE or OSD managed), escalates the alert to SRE for further investigation.
- Note: Authorized users have prefix RH-SRE, osdManagedAdmin, or have the ManagedOpenShift-Installer-Role.
- Not authorized (not SRE or OSD managed), posts the appropriate limited support reason and silences the alert.
- Adds notes with investigation details to the PagerDuty alert.
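
The decision logic of the CHGM workflow above can be sketched as a small pure function. This is a hedged illustration, not the cadctl implementation: the `action` type, the `stopEvent` struct, and the `isAuthorized` and `decide` functions are hypothetical names, and reading the authorization note as "username prefix `RH-SRE` or `osdManagedAdmin`, or role `ManagedOpenShift-Installer-Role`" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// action is a hypothetical label for the outcome of a CHGM investigation.
type action string

const (
	limitedSupportMissingCreds action = "post 'cluster credentials are missing' limited support reason"
	escalate                   action = "escalate to SRE"
	limitedSupportAndSilence   action = "post limited support reason and silence alert"
)

// stopEvent is a hypothetical, simplified view of the CloudTrail event that
// stopped or terminated an instance.
type stopEvent struct {
	Username string
	Role     string
}

// isAuthorized reports whether the event was caused by SRE or OSD-managed
// automation. The exact matching rules are an assumption based on the note
// in the workflow description.
func isAuthorized(e stopEvent) bool {
	return strings.HasPrefix(e.Username, "RH-SRE") ||
		strings.HasPrefix(e.Username, "osdManagedAdmin") ||
		strings.Contains(e.Role, "ManagedOpenShift-Installer-Role")
}

// decide mirrors the workflow steps. A nil event means no stopped or
// terminated instances were found.
func decide(awsAccessible bool, event *stopEvent) action {
	if !awsAccessible {
		return limitedSupportMissingCreds
	}
	if event == nil || isAuthorized(*event) {
		return escalate
	}
	return limitedSupportAndSilence
}

func main() {
	fmt.Println(decide(false, nil))
	fmt.Println(decide(true, nil))
	fmt.Println(decide(true, &stopEvent{Username: "RH-SRE-jdoe"}))
	fmt.Println(decide(true, &stopEvent{Username: "customer-admin"}))
}
```

Keeping the decision separate from the AWS, OCM, and PagerDuty calls makes each branch easy to unit test. Adding the investigation notes to the PagerDuty alert would happen after `decide` returns, whatever the outcome.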
- Update-Template -- Updating configuration-anomaly-detection-template.Template.yaml.
- OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.
Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.
- Tekton -- Installation/configuration of Tekton and triggering pipeline runs.
- Skip Webhooks -- Skipping the EventListener and creating the PipelineRun directly.
- Namespace -- Allowing the code to ignore the namespace.
- Boilerplate -- Conventions for OSD containers.
- PipelinePruner -- Documentation about PipelineRun pruning.