diff --git a/ADR/0033-apps-continuous-chaos-testing.md b/ADR/0033-apps-continuous-chaos-testing.md
new file mode 100644
index 00000000..f4b5cf66
--- /dev/null
+++ b/ADR/0033-apps-continuous-chaos-testing.md
@@ -0,0 +1,57 @@
+# 33. Continuous Chaos Testing of Apps in AppStudio
+
+Date: 2024-03-05
+
+## Status
+
+In consideration
+
+## Context
+
+Users often make false assumptions when operating and running their applications in distributed systems:
+
+- The software is resilient.
+- The network is reliable.
+- There is zero latency.
+- Bandwidth is infinite.
+- The network is secure.
+- Topology never changes.
+- The network is homogeneous.
+- Resource usage is consistent, with no spikes.
+- All shared resources are available from all places.
+
+These assumptions have led to a number of outages in production environments. Services suffered from poor performance or were inaccessible to customers, leading to missed Service Level Agreement (SLA) uptime promises, revenue loss, and a degradation in the perceived reliability of those services. How can we best prevent this from happening? This is where chaos testing can add value.
+
+Failures in production are costly. To help mitigate risk to service health, consider the following strategies and approaches to service testing:
+
+Be proactive rather than reactive. We have different types of test suites in place (unit, integration, and end-to-end) that help expose bugs in code in a controlled environment. By implementing a chaos engineering strategy, we can also discover potential causes of service degradation. We need to understand the systems’ behavior under unpredictable conditions in order to find the areas to harden, and use performance data points to size the clusters so that they can handle failures and keep downtime to a minimum.
+
+### Glossary
+
+- krkn: Chaos testing framework:
+
+## Decision
+
+There are two approaches to chaos testing in the CI/CD pipeline.
+
+### Resilience based Chaos scenario
+
+These chaos scenarios are expected to cause application failure. Example scenarios include simulating memory pressure, simulating storage errors, and killing random or dependent resources. The objective of these chaos test cases in the CI/CD pipeline is to assess whether the application can mitigate the failure and maintain reliability.
+
+![Architecture diagram of Resilience based Chaos test scenario](../diagrams/ADR-0033/chaos-resilience.png "Architecture diagram of Resilience based Chaos test scenario")
+
+### SLA based Chaos scenario
+
+Test the resiliency of an application under turbulent conditions by running tests that are designed to disrupt it while monitoring the application's adaptability and performance:
+- Establish and define your steady state and metrics: understand the behavior and performance under stable conditions and define the metrics that will be used to evaluate the application’s behavior. Then decide on acceptable outcomes before injecting chaos.
+- Analyze the statuses and metrics of all components during the chaos test runs.
+- Improve the areas that are not resilient and performant by comparing the key metrics and Service Level Objectives (SLOs) against the stable conditions before the chaos. For example, evaluate the API server latency or application uptime to see whether the key performance indicators and service level indicators are still within acceptable limits.
+
+![Architecture diagram of SLA based Chaos test scenario](../diagrams/ADR-0033/chaos-sla.png "Architecture diagram of SLA based Chaos test scenario")
+
+## Consequences
+
+The user should have sufficient authority (probably via a service account associated with the temporary namespace) to execute CRUD operations within that namespace. The associated RBAC privileges must also be extended to allow querying metrics through the thanos-querier service, by granting cluster-monitoring-operator and other required roles to the account, in order to collect Prometheus metrics related to the app running in the temporary namespace.
+
+- The RBAC for the temporary namespace should be expanded to allow the ability to disrupt objects only within the user's temporary namespace for the test duration.
+- Konflux functionality needs to be expanded to allow the user to query Prometheus metrics for their workload in the temporary namespace created during integration testing.
diff --git a/diagrams/ADR-0033/chaos-resilience.png b/diagrams/ADR-0033/chaos-resilience.png
new file mode 100644
index 00000000..8eb237e5
Binary files /dev/null and b/diagrams/ADR-0033/chaos-resilience.png differ
diff --git a/diagrams/ADR-0033/chaos-sla.png b/diagrams/ADR-0033/chaos-sla.png
new file mode 100644
index 00000000..80c96014
Binary files /dev/null and b/diagrams/ADR-0033/chaos-sla.png differ