fix StuckInCatchup/Bootstrap alerts #14619

Merged
@@ -255,25 +255,6 @@ groups:
description: "{{ $value }} blocks have been validated on network {{ $labels.testnet }} in the last hour (according to some node)."
runbook: "https://www.notion.so/minaprotocol/FewBlocksPerHour-47a6356f093242d988b0d9527ce23478"

- alert: StuckInBootstrap
expr: count by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "BOOTSTRAP"}[2h]) >= 7200000) > 0
for: ${alert_evaluation_duration}
labels:
testnet: "{{ $labels.testnet }}"
severity: critical
annotations:
summary: "One or more {{ $labels.testnet }} nodes are stuck at bootstrap for more than 2 hours"

- alert: StuckInCatchup
expr: count by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{syncStatus = "CATCHUP"}[2h]) >= 7200000) > 0
for: ${alert_evaluation_duration}
labels:
testnet: "{{ $labels.testnet }}"
severity: critical
annotations:
summary: "One or more {{ $labels.testnet }} nodes are stuck at catchup for more than 2 hours"


- name: Warnings
rules:
- alert: HighBlockGossipLatency
@@ -638,7 +619,25 @@ groups:
summary: "One or more {{ $labels.testnet }} nodes are stuck at an old block height (Observed block height did not increase in the last 30m)"
description: "{{ $value }} blocks have been validated on network {{ $labels.testnet }} in the last hour (according to some node)."
runbook: "https://www.notion.so/minaprotocol/FewBlocksPerHour-47a6356f093242d988b0d9527ce23478"


- alert: StuckInBootstrap
expr: max by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{${berkeley_testnet},syncStatus = "BOOTSTRAP"}[2h])) >= 6000000
Member

Just a note: I think this approach may trigger the alert if a node that was in bootstrap exits bootstrap and later re-enters it. The rate functions in Prometheus do not skip breaks in the data; they only filter out drops in value. Applying the syncStatus filter here causes a break in the data, and when the data matches the filter again, there is a large observed jump in process_uptime_ms_total across the break, which still shows up in the final output of this query.
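
To make the concern concrete, here is a minimal Python sketch of the situation. It is a simplified model rather than Prometheus itself: `simple_increase` only sums deltas between consecutive samples and treats a drop as a counter reset, and it omits the extrapolation to the window edges that the real `increase()` performs. The 60-second scrape interval, the timeline, and all function names are assumptions chosen for illustration.

```python
# Simplified model of how increase() sees a filtered uptime counter.
# Assumption: samples exist only while syncStatus == "BOOTSTRAP".

SCRAPE_S = 60            # hypothetical scrape interval, in seconds
THRESHOLD_MS = 6000000   # alert threshold from the rule (100 minutes)


def simple_increase(samples):
    """Sum deltas between consecutive samples; a drop counts as a reset."""
    total = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so count the new value.
        total += cur - prev if cur >= prev else cur
    return total


def time_in_contiguous_runs(samples, scrape_s):
    """Sum deltas only between samples that are at most one scrape apart."""
    total = 0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 - t0 <= scrape_s:
            total += v1 - v0
    return total


def uptime_ms(t_s):
    # The process started at t = 0 and never restarted in this scenario.
    return t_s * 1000


# Node is in BOOTSTRAP for the first 10 minutes, leaves, and re-enters
# BOOTSTRAP for the last 5 minutes of a 2-hour window.
bootstrap_samples = (
    [(t, uptime_ms(t)) for t in range(0, 601, SCRAPE_S)]
    + [(t, uptime_ms(t)) for t in range(6900, 7201, SCRAPE_S)]
)

observed_increase_ms = simple_increase(bootstrap_samples)
actual_bootstrap_ms = time_in_contiguous_runs(bootstrap_samples, SCRAPE_S)

print(observed_increase_ms)   # 7200000: the gap is counted as one big jump
print(actual_bootstrap_ms)    # 900000: only 15 minutes were spent in bootstrap
print(observed_increase_ms >= THRESHOLD_MS)  # True: the alert would fire
```

In this model the node spent 15 minutes in bootstrap, yet the increase over the window exceeds the 100-minute threshold because the uptime counter kept growing while the filtered series was absent.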

Let's merge this for now, since fixing it properly needs a new metric: something like current_sync_status_process_uptime_ms_total, a counter that resets every time syncStatus changes.
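
A rough sketch of how such a metric could behave, written in Python purely for illustration. The class name `SyncStatusUptime` and its `observe` method are hypothetical, this is not the node's actual code, and the proposed metric does not exist in this PR.

```python
class SyncStatusUptime:
    """Tracks how long the node has been in its current sync status.

    Hypothetical helper: the reported value restarts from zero whenever
    the sync status changes, as the proposed metric would.
    """

    def __init__(self, status, now_ms):
        self.status = status
        self.since_ms = now_ms

    def observe(self, status, now_ms):
        """Record the current status and return ms spent in it so far."""
        if status != self.status:
            # Status changed: restart the clock for the new status.
            self.status = status
            self.since_ms = now_ms
        return now_ms - self.since_ms


tracker = SyncStatusUptime("BOOTSTRAP", 0)
after_first_bootstrap = tracker.observe("BOOTSTRAP", 600000)   # 600000
after_sync = tracker.observe("SYNCED", 660000)                 # 0
on_reentry = tracker.observe("BOOTSTRAP", 6900000)             # 0
after_reentry = tracker.observe("BOOTSTRAP", 7200000)          # 300000
```

With a value like this, time spent in an earlier bootstrap phase would no longer leak into the current one, so a node that re-enters bootstrap would start again from zero.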

for: ${alert_evaluation_duration}
labels:
testnet: "{{ $labels.testnet }}"
severity: critical
annotations:
summary: "One or more {{ $labels.testnet }} nodes are stuck at bootstrap for more than 100 mins within the recent 2 hours"

- alert: StuckInCatchup
expr: max by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{${berkeley_testnet},syncStatus = "CATCHUP"}[2h])) >= 6000000
for: ${alert_evaluation_duration}
labels:
testnet: "{{ $labels.testnet }}"
severity: critical
annotations:
summary: "One or more {{ $labels.testnet }} nodes are stuck at catchup for more than 100 mins within the recent 2 hours"

- alert: HighBlockGossipLatency
expr: max by (testnet) (max_over_time(Coda_Block_latency_gossip_time {${berkeley_testnet},${synced_status_filter}} [${alert_timeframe}])) > 200
for: ${alert_evaluation_duration}