fix StuckInCatchup/Bootstrap alerts #14619

Merged


ghost-not-in-the-shell
Contributor

Explain your changes:
This PR fixes the StuckInCatchup and StuckInBootstrap alerts.

Explain how you tested your changes:
*

Checklist:

  • Dependency versions are unchanged
    • Notify Velocity team if dependencies must change in CI
  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them
  • Closes #0000

@ghost-not-in-the-shell
Contributor Author

!ci-build-me

@ghost-not-in-the-shell
Contributor Author

!ci-build-me

@ghost-not-in-the-shell
Contributor Author

!ci-build-me



- alert: StuckInBootstrap
  expr: max by (testnet) (increase(Coda_Runtime_process_uptime_ms_total{${berkeley_testnet},syncStatus = "BOOTSTRAP"}[2h])) >= 6000000
Member

Just a note: I think this approach may still trigger the alert if a node that was in bootstrap exits bootstrap and then re-enters it later, because the rate functions in Prometheus do not skip breaks in the data; they only filter out drops in value. Applying the syncStatus filter here causes a break in the data, and when the data matches the filter again, there is a massive observed jump in process_uptime_ms_total across the break, which still shows up in the final output of this query.
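
To illustrate the concern, here is a rough simulation in Python. It is only a sketch: `simple_increase` is a simplified stand-in for PromQL's `increase()` (it sums deltas between consecutive samples and treats a drop as a counter reset, ignoring extrapolation), and the one-minute scrape interval and the sync status timeline are made up for the example.

```python
def simple_increase(samples):
    """Simplified model of increase(): sum the deltas between consecutive
    samples, treating any drop in value as a counter reset."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

SCRAPE_MS = 60_000          # hypothetical scrape interval: one minute
THRESHOLD_MS = 6_000_000    # threshold from the alert expression

# Process uptime counter over a 2h window (121 scrapes), no restarts.
uptime = [i * SCRAPE_MS for i in range(121)]

# Hypothetical timeline: bootstrap briefly, sync, then bootstrap again.
status = ["BOOTSTRAP"] * 10 + ["SYNCED"] * 101 + ["BOOTSTRAP"] * 10

# The syncStatus="BOOTSTRAP" matcher keeps only these samples, leaving a gap.
filtered = [v for v, s in zip(uptime, status) if s == "BOOTSTRAP"]

# Time actually spent in bootstrap: intervals where both ends are BOOTSTRAP.
actual_bootstrap_ms = sum(
    SCRAPE_MS
    for a, b in zip(status, status[1:])
    if a == "BOOTSTRAP" and b == "BOOTSTRAP"
)

# The gap is bridged, so the whole uptime between the two spans is counted.
observed_ms = simple_increase(filtered)
alert_fires = observed_ms >= THRESHOLD_MS
```

In this model the node spends only about 18 minutes in bootstrap, yet the increase computed over the filtered samples covers the full two hours and crosses the threshold.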

Let's merge this for now, since fixing it properly requires a new metric: something like current_sync_status_process_uptime_ms_total, a counter that resets every time syncStatus changes.
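
A minimal Python sketch of how such a metric could behave, assuming it simply reports the milliseconds elapsed since the last syncStatus change. The `SyncStatusUptime` class, the `would_fire` helper, the one-minute scrape interval, and the timelines are all hypothetical and only illustrate the idea; this is not the node's actual metrics code.

```python
class SyncStatusUptime:
    """Tracks milliseconds spent in the current sync status.
    The value restarts from 0 every time the status changes."""

    def __init__(self):
        self.status = None
        self.since_ms = 0

    def observe(self, status, now_ms):
        if status != self.status:
            # Status changed: reset the reference point.
            self.status = status
            self.since_ms = now_ms
        return now_ms - self.since_ms


SCRAPE_MS = 60_000          # hypothetical scrape interval: one minute
THRESHOLD_MS = 6_000_000    # threshold from the alert expression


def would_fire(timeline):
    """Return True if any BOOTSTRAP sample reaches the threshold."""
    tracker = SyncStatusUptime()
    values = [tracker.observe(s, i * SCRAPE_MS) for i, s in enumerate(timeline)]
    return any(
        v >= THRESHOLD_MS for v, s in zip(values, timeline) if s == "BOOTSTRAP"
    )


# Node that leaves bootstrap and re-enters it later: should not fire.
flapping = ["BOOTSTRAP"] * 10 + ["SYNCED"] * 101 + ["BOOTSTRAP"] * 10

# Node that never leaves bootstrap during the 2h window: should fire.
stuck = ["BOOTSTRAP"] * 121

flapping_fires = would_fire(flapping)
stuck_fires = would_fire(stuck)
```

With a value that restarts on every status change, only a node that stays in bootstrap continuously reaches the threshold; a node that re-enters bootstrap starts counting from zero again.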

@ghost-not-in-the-shell ghost-not-in-the-shell merged commit 605b8c8 into develop Dec 18, 2023
1 check passed
@ghost-not-in-the-shell ghost-not-in-the-shell deleted the alert/fix-stuck-in-catchup-and-bootstrap branch December 18, 2023 17:24
2 participants