zmird-r/configuration-anomaly-detection

 
 

Configuration Anomaly Detection

About

Configuration Anomaly Detection (CAD) reduces manual SRE investigation work by detecting cluster anomalies and sending the relevant communications to the cluster owner.

Contributing

To contribute to CAD, please see our CONTRIBUTING Document.

Documentation

CAD CLI

  • cadctl -- Runs the investigation workflow for 'cluster has gone missing' (CHGM) alerts.

Integrations

  • AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
  • PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
  • OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.

Overview

  • CAD is a command-line tool that runs in Tekton pipelines.
  • The Tekton service runs on an app-sre cluster.
  • CAD is triggered by PagerDuty webhooks configured on selected services, so every alert in such a service triggers a CAD pipeline.
  • CAD uses the data received via the webhook to determine which investigation to start.
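
As an illustration of the last point, the following is a minimal sketch of how a webhook payload could be mapped to an investigation. It is not CAD's actual code: the `webhookPayload` struct, the `selectInvestigation` function, the `"chgm"` identifier, and the title-matching rule are all hypothetical, and the payload is reduced to the few fields the sketch needs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// webhookPayload is a hypothetical, heavily simplified view of the JSON
// that arrives via the webhook. Only the fields used below are declared.
type webhookPayload struct {
	Event struct {
		EventType string `json:"event_type"`
		Data      struct {
			Title string `json:"title"`
		} `json:"data"`
	} `json:"event"`
}

// selectInvestigation returns an identifier for the investigation to start.
// Matching on the alert title is an assumption made for this sketch.
func selectInvestigation(raw []byte) (string, error) {
	var p webhookPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", fmt.Errorf("decoding webhook payload: %w", err)
	}
	title := strings.ToLower(p.Event.Data.Title)
	if strings.Contains(title, "has gone missing") {
		return "chgm", nil
	}
	return "", fmt.Errorf("no investigation registered for alert %q", p.Event.Data.Title)
}

func main() {
	raw := []byte(`{"event":{"event_type":"incident.triggered","data":{"title":"cluster has gone missing"}}}`)
	name, err := selectInvestigation(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("investigation:", name)
}
```

Returning an error for unrecognized alerts keeps the "no matching investigation" case explicit, so a caller can decide whether to escalate or ignore it.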

[Figure: CAD Overview]

Alert firing investigation

  1. PagerDuty webhook receives CHGM alert from Dead Man's Snitch.
  2. CAD Tekton pipeline is triggered via PagerDuty sending a webhook to Tekton EventListener.
  3. CAD logs into the cluster's AWS account and checks for stopped or terminated instances.
    • If CAD cannot access the AWS account, it posts the "cluster credentials are missing" limited support reason.
  4. If stopped/terminated instances are found, pulls AWS CloudTrail events for those instances.
    • If no stopped/terminated instances are found, escalates to SRE for further investigation.
  5. If the user of the event is:
    • Authorized (SRE or OSD managed), runs the network verifier and escalates the alert to SRE for further investigation.
      • Note: Authorized users have prefix RH-SRE, osdManagedAdmin, or have the ManagedOpenShift-Installer-Role.
    • Not authorized (not SRE or OSD managed), posts the appropriate limited support reason and silences the alert.
  6. Adds notes with investigation details to the PagerDuty alert.
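
The branching in steps 3 to 5 can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not CAD's implementation: the `Findings` and `Action` types, the function names, and the exact way a user name is matched (prefix match for `RH-SRE` and `osdManagedAdmin`, substring match for `ManagedOpenShift-Installer-Role`) are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Action is a hypothetical label for each outcome named in the steps above.
type Action string

const (
	PostMissingCredentials   Action = "post 'cluster credentials are missing' limited support reason"
	EscalateToSRE            Action = "escalate to SRE"
	VerifyNetworkAndEscalate Action = "run network verifier, then escalate to SRE"
	LimitedSupportAndSilence Action = "post limited support reason and silence alert"
)

// Findings is a hypothetical summary of what the investigation observed.
type Findings struct {
	AWSAccessible    bool     // false if the AWS account could not be accessed
	StoppedInstances []string // IDs of stopped/terminated instances
	EventUser        string   // user recorded on the CloudTrail event
}

// isAuthorized applies the rule from the note in step 5. How the role is
// detected in the user string is an assumption made for this sketch.
func isAuthorized(user string) bool {
	for _, prefix := range []string{"RH-SRE", "osdManagedAdmin"} {
		if strings.HasPrefix(user, prefix) {
			return true
		}
	}
	return strings.Contains(user, "ManagedOpenShift-Installer-Role")
}

// decide maps the findings to one outcome, checking conditions in step order.
func decide(f Findings) Action {
	switch {
	case !f.AWSAccessible:
		return PostMissingCredentials
	case len(f.StoppedInstances) == 0:
		return EscalateToSRE
	case isAuthorized(f.EventUser):
		return VerifyNetworkAndEscalate
	default:
		return LimitedSupportAndSilence
	}
}

func main() {
	f := Findings{
		AWSAccessible:    true,
		StoppedInstances: []string{"i-0123456789abcdef0"}, // placeholder ID
		EventUser:        "customer-admin",                // placeholder user
	}
	fmt.Println(decide(f))
}
```

Keeping the decision separate from the AWS, OCM, and PagerDuty calls makes each branch easy to test without any external service.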

CHGM investigation overview

[Figure: CHGM investigation overview]

Templates

  • Update-Template -- Updating configuration-anomaly-detection-template.Template.yaml.
  • OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.

Dashboards

Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.

Deployment

  • Tekton -- Installation/configuration of Tekton and triggering pipeline runs.
  • Skip Webhooks -- Skipping the EventListener and creating the PipelineRun directly.
  • Namespace -- Allowing the code to ignore the namespace.

Boilerplate

PipelinePruner
