Commit

Merge pull request #6 from dasmeta/DMVP-2489-alert-expressions
feat(DMVP-2489): Changed alert rules expression structure.
viktoryathegreat authored Jun 26, 2023
2 parents b667451 + ee6edde commit 41ed151
Showing 15 changed files with 195 additions and 21 deletions.
34 changes: 30 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -6,6 +6,17 @@ At this moment we support managing

More parts are coming soon.

## Tips
1. Alert conditions are built from the `$B` block and the `equation` and `threshold` parameters that users pass to the module.
The `equation` parameter accepts only these values:
- `lt` corresponds to `<`
- `gt` corresponds to `>`
- `e` corresponds to `=`
- `lte` corresponds to `<=`
- `gte` corresponds to `>=`
The `threshold` parameter is the number against which the `$B` block value is compared in the math expression.
2. Pass `null` to the `filters` variable when a Prometheus metric is queried without any filters.
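
As a rough illustration of tip 1, the sketch below shows how an `equation` and `threshold` pair could be combined into a math expression. The operator map mirrors the values listed above; the local names `equation`, `threshold`, and `expression`, and the output, are hypothetical and exist only for this example, not in the module's interface.

```hcl
# Illustrative sketch only: these local names are assumptions, not module inputs.
locals {
  # Accepted `equation` values and the operators they correspond to,
  # as listed in the tips above.
  comparison_operators = {
    lt  = "<"
    gt  = ">"
    e   = "="
    lte = "<="
    gte = ">="
  }

  equation  = "lt" # hypothetical sample value
  threshold = 1    # hypothetical sample value

  # "$B" stays literal because "$" is only special when followed by "{".
  expression = "$B ${local.comparison_operators[local.equation]} ${local.threshold}"
}

output "expression" {
  value = local.expression
}
```

With these sample values the result should correspond to the `$B < 1` condition that the examples used before this change.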

## Example for Alert Rules
```
module "grafana_alerts" {
@@ -22,7 +33,8 @@ module "grafana_alerts" {
deployment = "app-1-microservice"
}
function = "last"
condition = "$B < 1"
equation = "lt"
threshold = 1
},
{
name = "App_2 has 0 available replicas"
@@ -33,7 +45,19 @@ module "grafana_alerts" {
deployment = "app-2-microservice"
}
function = "last"
condition = "$B < 1"
equation = "lt"
threshold = 1
},
{
name = "Insufficient nodes in cluster"
summary = "Cluster is using fewer nodes than the required count"
folder_name = "Node Autoscaling"
datasource = "prometheus"
filters = null
metric_name = "sum(kube_node_info)"
function = "mean"
equation = "lte"
threshold = 2
}
]
}
@@ -79,7 +103,8 @@ module "grafana_alerts" {
deployment = "app-1-microservice"
}
function = "last"
condition = "$B < 1"
equation = "lt"
threshold = 1
},
{
name = "App_2 has 0 available replicas"
@@ -90,7 +115,8 @@ module "grafana_alerts" {
deployment = "app-2-microservice"
}
function = "last"
condition = "$B < 1"
equation = "lt"
threshold = 1
}
]
opsgenie_endpoints = [
14 changes: 13 additions & 1 deletion modules/alerts/README.md
@@ -1,5 +1,17 @@
## Usage
To enable some of these alerts for your applications, you just need to replace `App_1`, `App_2` and `App_3` with the actual names of your applications. You can refer to the Prometheus metrics to identify the available filters that can be used for each application. Additionally, modify the values in the conditions to reflect the real cases of your applications. These adjustments will ensure that the alerts accurately monitor your specific applications and their scaling needs.

## Tips
Alert conditions are built from the `$B` block and the `equation` and `threshold` parameters that users pass to the module.
The `equation` parameter accepts only these values:
- `lt` corresponds to `<`
- `gt` corresponds to `>`
- `e` corresponds to `=`
- `lte` corresponds to `<=`
- `gte` corresponds to `>=`

The `threshold` parameter is the number against which the `$B` block value is compared in the math expression.
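
Because `equation` accepts only a fixed set of values, a check could reject anything else before a rule is created. The following is a minimal sketch of such a check and is not something this commit adds. The variable name `rules` and its reduced object shape are assumptions made for illustration; the module itself takes `equation` inside each `alert_rules` object.

```hcl
# Hypothetical standalone variable, used only to illustrate validation.
variable "rules" {
  type = list(object({
    equation  = string
    threshold = number
  }))
  default = []

  validation {
    # Every rule must use one of the documented `equation` values.
    condition = alltrue([
      for rule in var.rules : contains(["lt", "gt", "e", "lte", "gte"], rule.equation)
    ])
    error_message = "Each equation must be one of: lt, gt, e, lte, gte."
  }
}
```

A check like this would surface a mistyped value at plan time instead of as a failed map lookup inside the module.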

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

@@ -30,7 +42,7 @@ No modules.
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_alert_interval_seconds"></a> [alert\_interval\_seconds](#input\_alert\_interval\_seconds) | The interval, in seconds, at which all rules in the group are evaluated. If a group contains many rules, the rules are evaluated sequentially. | `number` | `10` | no |
| <a name="input_alert_rules"></a> [alert\_rules](#input\_alert\_rules) | This varibale describes alert folders, groups and rules. | <pre>list(object({<br> name = string # The name of the alert rule<br> summary = optional(string, "") # Rule annotation as a summary<br> folder_name = optional(string, "Main Alerts") # Grafana folder name in which the rule will be created<br> datasource = string # Name of the datasource used for the alert<br> metric_name = string # Prometheus metric name which queries the data for the alert<br> filters = optional(any, {}) # Filters object to identify each service for alerting<br> function = optional(string, "mean") # One of Reduce functions which will be used in B block for alerting<br> condition = string # Math expression which compares B blocks value with a number and generates an alert if needed<br> }))</pre> | `[]` | no |
| <a name="input_alert_rules"></a> [alert\_rules](#input\_alert\_rules) | This variable describes alert folders, groups and rules. | <pre>list(object({<br> name = string # The name of the alert rule<br> summary = optional(string, "") # Rule annotation as a summary<br> folder_name = optional(string, "Main Alerts") # Grafana folder name in which the rule will be created<br> datasource = string # Name of the datasource used for the alert<br> metric_name = string # Prometheus metric name which queries the data for the alert<br> filters = optional(any, {}) # Filters object to identify each service for alerting<br> function = optional(string, "mean") # One of Reduce functions which will be used in B block for alerting<br> equation = string # The equation in the math expression which compares B blocks value with a number and generates an alert if needed. Possible values: gt, lt, gte, lte, e.<br> threshold = number # The value against which B blocks are compared in the math expression<br> }))</pre> | `[]` | no |

## Outputs

11 changes: 9 additions & 2 deletions modules/alerts/main.tf
@@ -1,6 +1,13 @@
locals {
folders = toset(distinct([for rule in var.alert_rules : rule.folder_name]))
alerts = { for member in local.folders : member => [for rule in var.alert_rules : rule if rule.folder_name == member] }
comparison_operators = {
gte : ">=",
gt : ">",
lt : "<",
lte : "<=",
e : "="
}
}

resource "grafana_folder" "rule_folder" {
@@ -39,7 +46,7 @@ resource "grafana_rule_group" "alert_rule" {
model = <<EOT
{
"editorMode": "code",
"expr": "${rule.value.metric_name}{${replace(join(", ", [for k, v in rule.value.filters : "${k}=\"${v}\""]), "\"", "\\\"")}}",
"expr": "${rule.value.metric_name}${(rule.value.filters != null && length(rule.value.filters) > 0) ? format("{%s}", replace(join(", ", [for k, v in rule.value.filters : "${k}=\"${v}\""]), "\"", "\\\"")) : ""}",
"hide": false,
"intervalMs": "1000",
"legendFormat": "__auto",
@@ -132,7 +139,7 @@ EOT
"type": "__expr__",
"uid": "__expr__"
},
"expression": "${rule.value.condition}",
"expression": "$B ${local.comparison_operators[rule.value.equation]} ${rule.value.threshold}",
"hide": false,
"intervalMs": 1000,
"maxDataPoints": 43200,
6 changes: 4 additions & 2 deletions modules/alerts/tests/autoscaling-max-usage/1-example.tf
@@ -12,7 +12,8 @@ module "this" {
deployment = "app-1-microservice"
}
function = "mean"
condition = "$B >= 20"
equation = "gte"
threshold = 20
},
{
name = "App_2 max autoscaling"
@@ -24,7 +25,8 @@ module "this" {
deployment = "app-2-microservice"
}
function = "mean"
condition = "$B >= 20"
equation = "gte"
threshold = 20
}
]
}
2 changes: 1 addition & 1 deletion modules/alerts/tests/autoscaling-max-usage/README.md
@@ -3,7 +3,7 @@ This test case demonstrates how to configure Grafana alerts for an application r

In this test, we have set up two alert rules for different microservices, `App_1` and `App_2`, within the `Autoscaling Test` folder. The alerts are triggered based on the Prometheus datasource and the metric `kube_deployment_status_replicas_available`.

For each microservice, we have specified a filter to match the deployment name (`app-1-microservice` and `app-2-microservice`). The `mean` function is applied to aggregate the metric values, and the condition `$B >= 20` is used to check if the replicas available are equal to or greater than 20.
For each microservice, we have specified a filter to match the deployment name (`app-1-microservice` and `app-2-microservice`). The `mean` function is applied to aggregate the metric values, and the `equation` and `threshold` parameters are used to check whether the available replicas are equal to or greater than 20.
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

6 changes: 4 additions & 2 deletions modules/alerts/tests/available-replica-count/1-example.tf
@@ -11,7 +11,8 @@ module "this" {
deployment = "app-1-microservice"
}
function = "last"
condition = "$B < 1"
equation = "lt"
threshold = 1
},
{
name = "App_2 has 0 available replicas"
@@ -22,7 +23,8 @@ module "this" {
deployment = "app-2-microservice"
}
function = "last"
condition = "$B < 1"
equation = "lt"
threshold = 1
}
]
}
2 changes: 1 addition & 1 deletion modules/alerts/tests/available-replica-count/README.md
@@ -5,7 +5,7 @@ In this test, we have set up two alert rules to detect when the available replic

For each microservice, we have specified a filter to match the deployment name (`app-1-microservice` and `app-2-microservice`). The `last` function is used to process the metric values.

The condition `$B < 1` is used to check if the available replicas fall below 1, indicating that the application doesn't have any replicas.
The `equation` and `threshold` parameters are used to check whether the available replicas fall below 1, indicating that the application doesn't have any replicas.
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

6 changes: 4 additions & 2 deletions modules/alerts/tests/container-restarts/1-example.tf
@@ -12,7 +12,8 @@ module "this" {
container = "app-1-container"
}
function = "mean"
condition = "$B > 2"
equation = "gt"
threshold = 2
},
{
name = "App_2 has too many restarts"
@@ -23,7 +24,8 @@ module "this" {
container = "app-2-container"
}
function = "mean"
condition = "$B >= 4"
equation = "gte"
threshold = 4
}
]
}
2 changes: 1 addition & 1 deletion modules/alerts/tests/container-restarts/README.md
@@ -5,7 +5,7 @@ In this test, we have set up two alert rules to monitor the restart count of mic

For each microservice, we have specified a filter to match the container name (`app-1-container` and `app-2-container`). The `mean` function is used to aggregate the restart count values.

The conditions `$B > 2` and `$B >= 4` are employed to check if the restart count exceeds the thresholds for each microservice. When the conditions are met, indicating a high restart count, the alerts will be triggered.
The `equation` and `threshold` parameters are used to check whether the restart count exceeds the threshold for each microservice. When a condition is met, indicating a high restart count, the alert is triggered.
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

23 changes: 19 additions & 4 deletions modules/alerts/tests/mixed-metrics/1-example.tf
@@ -12,7 +12,8 @@ module "this" {
container = "app-1-container"
}
function = "mean"
condition = "$B > 2"
equation = "gt"
threshold = 2
},
{
name = "App_2 max autoscaling"
@@ -24,7 +25,8 @@ module "this" {
deployment = "app-2-microservice"
}
function = "mean"
condition = "$B >= 20"
equation = "gte"
threshold = 20
},
{
name = "App_1 has 0 available replicas"
@@ -35,7 +37,8 @@ module "this" {
deployment = "app-1-microservice"
}
function = "mean"
condition = "$B < 1"
equation = "lt"
threshold = 1
},
{
name = "App_3 has 0 available replicas"
@@ -46,7 +49,19 @@ module "this" {
deployment = "app-3-microservice"
}
function = "mean"
condition = "$B < 1"
equation = "lt"
threshold = 1
},
{
name = "Maximum node utilization in cluster"
summary = "Cluster is using 8 available nodes"
folder_name = "Node Autoscaling"
datasource = "prometheus"
filters = null
metric_name = "sum(kube_node_info)"
function = "mean"
equation = "gte"
threshold = 8
}
]
}
15 changes: 15 additions & 0 deletions modules/alerts/tests/node-autoscaling/0-setup.tf
@@ -0,0 +1,15 @@
terraform {
required_providers {
test = {
source = "terraform.io/builtin/test"
}
grafana = {
source = "grafana/grafana"
}
}
}

provider "grafana" {
url = "https://grafana.example.com/"
auth = "xxxxxxxxxxx"
}
39 changes: 39 additions & 0 deletions modules/alerts/tests/node-autoscaling/1-example.tf
@@ -0,0 +1,39 @@
module "this" {
source = "../../"

alert_rules = [
{
name = "Maximum node utilization in cluster"
summary = "Cluster is using 8 available nodes"
folder_name = "Node Autoscaling"
datasource = "prometheus"
filters = null
metric_name = "sum(kube_node_info)"
function = "mean"
equation = "gte"
threshold = 8
},
{
name = "High node utilization in cluster"
summary = "Cluster is using 6 of the available 8 nodes"
folder_name = "Node Autoscaling"
datasource = "prometheus"
filters = null
metric_name = "sum(kube_node_info)"
function = "mean"
equation = "gte"
threshold = 6
},
{
name = "Insufficient nodes in cluster"
summary = "Cluster is using fewer nodes than the required count"
folder_name = "Node Autoscaling"
datasource = "prometheus"
filters = null
metric_name = "sum(kube_node_info)"
function = "mean"
equation = "lte"
threshold = 2
}
]
}
9 changes: 9 additions & 0 deletions modules/alerts/tests/node-autoscaling/2-assert.tf
@@ -0,0 +1,9 @@
resource "test_assertions" "dummy" {
component = "grafana-modules-alerts"

equal "scheme" {
description = "As module does not have any output and data just make sure the case runs. Probably can be thrown away."
got = "all good"
want = "all good"
}
}
44 changes: 44 additions & 0 deletions modules/alerts/tests/node-autoscaling/README.md
@@ -0,0 +1,44 @@
# Node Autoscaling
This test case demonstrates how to configure Grafana alerts for monitoring Node count in the cluster.

It notifies you when the node count reaches:
- its maximum, in our case `$B >= 8`,
- more than half of the maximum, in our case `$B >= 6`,
- its minimum, in our case `$B <= 2`.

Replace the values in the conditions with your real numbers.

## Usage
Note that we pass `null` to the `filters` variable. This is needed for Prometheus metrics that are queried without any filters.
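
To make the effect of an empty filter set concrete, here is a minimal sketch of how a label selector might be appended only when filters are present. It is loosely modelled on the conditional in `modules/alerts/main.tf` but is not a copy of it: all local names are hypothetical, the JSON escaping used in the module is omitted, and an empty map stands in for the no-filter case.

```hcl
# Illustrative sketch only: all local names here are assumptions.
locals {
  metric_with_filters = "kube_deployment_status_replicas_available"
  filters             = { deployment = "app-1-microservice" }

  metric_without_filters = "sum(kube_node_info)"
  no_filters             = {} # stands in for the "no filters" case

  # Build `{key="value", ...}` only when at least one filter is present;
  # otherwise append nothing to the metric name.
  selector = length(local.filters) > 0 ? format("{%s}", join(", ", [for k, v in local.filters : "${k}=\"${v}\""])) : ""

  empty_selector = length(local.no_filters) > 0 ? format("{%s}", join(", ", [for k, v in local.no_filters : "${k}=\"${v}\""])) : ""

  expr_with_filters    = "${local.metric_with_filters}${local.selector}"
  expr_without_filters = "${local.metric_without_filters}${local.empty_selector}"
}

output "expr_with_filters" {
  value = local.expr_with_filters
}

output "expr_without_filters" {
  value = local.expr_without_filters
}
```

Skipping the braces matters for a query such as `sum(kube_node_info)`, because a label selector placed after an aggregation is not valid PromQL.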

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_test"></a> [test](#provider\_test) | n/a |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_this"></a> [this](#module\_this) | ../../ | n/a |

## Resources

| Name | Type |
|------|------|
| test_assertions.dummy | resource |

## Inputs

No inputs.

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
3 changes: 2 additions & 1 deletion modules/alerts/variables.tf
@@ -13,7 +13,8 @@ variable "alert_rules" {
metric_name = string # Prometheus metric name which queries the data for the alert
filters = optional(any, {}) # Filters object to identify each service for alerting
function = optional(string, "mean") # One of Reduce functions which will be used in B block for alerting
condition = string # Math expression which compares B blocks value with a number and generates an alert if needed
equation = string # The equation in the math expression which compares B blocks value with a number and generates an alert if needed. Possible values: gt, lt, gte, lte, e.
threshold = number # The value against which B blocks are compared in the math expression
}))
default = []
description = "This variable describes alert folders, groups and rules."