ADR to enable chaos testing of apps

This ADR proposes a way for end users to run chaos tests for their apps as part of integration testing.
konflux-ci · May 9, 2024 · 0ef410b · 0ef410b
1 parent 214c3a7
commit 0ef410b

Showing 3 changed files with 44 additions and 0 deletions.
diff --git a/ADR/0033-apps-continuous-chaos-testing.md b/ADR/0033-apps-continuous-chaos-testing.md
@@ -0,0 +1,44 @@
+# 33. Continuous Chaos Testing of Apps in AppStudio
+
+Date: 2024-03-05
+
+## Status
+
+In consideration
+
+## Context
+
+The chaos engineering strategy enables users to discover potential causes of service degradation. It helps users understand their app behavior under unpredictable conditions, identify areas to harden, and utilize performance data points to size and tune their application to handle failures, thereby minimizing downtime.
+
+There are two approaches to chaos testing in the CI/CD pipeline.
+
+### Resilience based Chaos scenario
+
+These chaos scenarios are expected to cause application failure. Example scenarios include simulating memory pressure, injecting storage errors, and killing random or dependent resources. The objective of these chaos test cases in the CI/CD pipeline is to assess whether the application can mitigate the failure and maintain reliability.
+
+![Architecture diagram of Resilience based Chaos test scenario](../diagrams/ADR-0033/chaos-resilience.png "Architecture diagram of Resilience based Chaos test scenario")
+
+### SLA based Chaos scenario
+
+Test the resiliency of an application under turbulent conditions by running tests that are designed to disrupt it while monitoring its adaptability and performance:
+
+- Establish and define your steady state and metrics: understand the behavior and performance under stable conditions and define the metrics that will be used to evaluate the application’s behavior. Then decide on acceptable outcomes before injecting chaos.
+- Analyze the statuses and metrics of all components during the chaos test runs.
+- Improve the areas that are not resilient and performant by comparing the key metrics and Service Level Objectives (SLOs) to the stable conditions before the chaos. For example, evaluate the API server latency or application uptime to see whether the key performance indicators and service level indicators are still within acceptable limits.
+
+![Architecture diagram of SLA based Chaos test scenario](../diagrams/ADR-0033/chaos-sla.png "Architecture diagram of SLA based Chaos test scenario")
+
+
+### Glossary
+
+- krkn: Chaos testing framework: <https://github.com/krkn-chaos/krkn>
+
+## Decision
+
+The Konflux user can execute chaos tests with a framework such as Krkn as part of IntegrationTestScenarios by using ephemeral clusters ([provisioning-ephemeral-openshift-clusters](https://github.com/konflux-ci/architecture/pull/172)) instead of ephemeral namespaces, for enhanced isolation and a more production-like testing environment. Furthermore, they can gather Prometheus metrics from the Thanos-querier-openshift-monitoring or Prometheus-k8s-openshift-monitoring endpoints.
+
+## Consequences
+
+* The service/user account associated with the ephemeral environment will have cluster-admin-level privileges to execute CRUD operations (for example, configuring RBAC permissions and Prometheus instances) within the ephemeral environment. At a minimum, the user should be granted administrator-level privileges for a namespace on the ephemeral cluster.
+* Using the service account token, it will be feasible to authenticate and query Prometheus metrics from either the Thanos-querier-openshift-monitoring or Prometheus-k8s-openshift-monitoring endpoints. The privilege could be granted by assigning the cluster-monitoring-view role to the associated account, which would permit the user to query Prometheus metrics across all namespaces in the cluster.
+* The user account shall have the capability to query the Thanos-querier-openshift-monitoring or Prometheus-k8s-openshift-monitoring route from the openshift-monitoring namespace. The account should have at least view level access to the openshift-monitoring namespace.
+* An additional OpenShift feature that is recommended, though not mandatory, is [monitoring for user-defined projects](https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-monitoring-for-user-defined-projects.html#accessing-metrics-from-outside-cluster_enabling-monitoring-for-user-defined-projects). The monitoring-rules-edit and monitoring-edit roles would be assigned to the account so that it can perform CRUD operations on PrometheusRule, ServiceMonitor and PodMonitor resources.
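
The token-based metrics access described in the Consequences could look roughly like the Python sketch below. It only builds an instant-query request for the standard Prometheus HTTP API path `/api/v1/query`, which the Thanos querier also serves. The host name, token value, PromQL expression, and the helper names `build_query_request` and `run_query` are assumptions made for illustration; none of them come from the ADR.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url: str, token: str, promql: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    # The service account token is passed as a bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_query(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a reachable endpoint)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical values: replace with the real route host and service account token.
request = build_query_request(
    "https://thanos-querier.example.invalid",
    "sa-token-placeholder",
    'up{namespace="my-app"}',
)
print(request.full_url)
```

Calling `run_query(request)` against a real route would return the decoded JSON response, provided the account holds a role such as cluster-monitoring-view.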
diff --git a/diagrams/ADR-0033/chaos-resilience.png b/diagrams/ADR-0033/chaos-resilience.png
diff --git a/diagrams/ADR-0033/chaos-sla.png b/diagrams/ADR-0033/chaos-sla.png
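
The SLA based scenario in the ADR compares metrics captured during chaos against the steady state. A minimal sketch of that comparison follows, assuming every metric is "lower is better" (latency, error rate) and that an acceptable outcome is expressed as a maximum relative regression over the steady-state value. The `Slo` type, the metric names, and the numbers are hypothetical; a metric such as uptime, where higher is better, would need the comparison inverted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str              # metric key shared by both measurement sets
    max_regression: float  # allowed relative increase over steady state (0.25 = 25%)


def evaluate(steady: dict[str, float], chaos: dict[str, float], slos: list[Slo]) -> dict[str, bool]:
    """Return, per metric, whether the chaos-run value stays within its limit."""
    results = {}
    for slo in slos:
        limit = steady[slo.name] * (1.0 + slo.max_regression)
        results[slo.name] = chaos[slo.name] <= limit
    return results


# Hypothetical measurements taken before and during a chaos run.
steady_state = {"api_latency_seconds": 0.20, "error_rate": 0.01}
during_chaos = {"api_latency_seconds": 0.23, "error_rate": 0.05}
slos = [Slo("api_latency_seconds", 0.25), Slo("error_rate", 1.0)]

results = evaluate(steady_state, during_chaos, slos)
print(results)
```

With these sample numbers the latency stays within its limit while the error rate exceeds it, so an integration test built on this check would fail on the error-rate metric.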