A failed rule run is not shown as a failure response in Rule Logs KPI #142910

Closed
EricDavisX opened this issue Oct 6, 2022 · 3 comments · Fixed by #142940
Labels: bug (Fixes for quality problems that affect the customer experience), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@EricDavisX
Contributor

EricDavisX commented Oct 6, 2022

Kibana version: 8.5.0 BC3 - tested on cloud prod

Browser version:
Chrome on 12.1 macOS - Version 106.0.5249.91 (Official Build) (x86_64)

Describe the bug:
A failed rule run is not shown as a failure in the new Rule Logs KPIs. A related problem: the UI appears to show only 1000 successful rule runs when we know there are more (the count is always incrementing). The underlying data is likely the problem here.

Steps to reproduce:

  1. Create a rule that runs every 1 second to build up logs quickly; base it on the Kibana 'web logs' sample data.
  2. Go to the Rules page -> Logs tab.
  3. See some KPIs for the rule runs.
  4. Wait about 17 minutes for the first 1000 rule runs to happen.
  5. Note that the 'success' counter gets stuck at exactly 1000; this is wrong either way.
  6. Then 'remove' the Kibana web logs data source from the 'home' page in Kibana; this will make the rule fail.
  7. After the rule's next run you'll see an error in the Logs view, but the KPI won't show it. That's the bug.
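
For reference, the "about 17 minutes" in step 4 follows from the run interval. A minimal sketch of the arithmetic, assuming the rule really does execute once per interval with no delays (the helper name `minutes_to_reach` is made up for illustration):

```python
def minutes_to_reach(run_count: int, interval_seconds: float) -> float:
    """Minutes needed to accumulate `run_count` runs at a fixed interval."""
    return run_count * interval_seconds / 60


# 1000 runs at one run per second
wait_minutes = minutes_to_reach(1000, 1)
print(f"{wait_minutes:.1f} minutes")  # prints "16.7 minutes"
```

In practice the task scheduler may run the rule less often than configured, so the wait could be longer.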

Expected behavior:

  • The KPI should reflect the error runs seen over the selected time frame.
  • The KPI counts shouldn't be capped at 1000 runs for any reason; they should be time based.
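
To make the difference concrete, here is a small simulation of the two behaviors. It is only a sketch: the actual cause inside Kibana is not stated in this issue, and `RuleRun`, `kpi_capped`, and `kpi_time_based` are hypothetical names. It assumes the buggy count only looks at the first 1000 runs in the window, which is one way the observed symptoms could arise.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class RuleRun(NamedTuple):
    timestamp: int  # seconds since the start of the simulation
    outcome: str    # "success" or "failure"


def kpi_capped(runs: Iterable[RuleRun], start: int, end: int, cap: int = 1000) -> Counter:
    """Assumed buggy behavior: only the first `cap` runs in the window are counted."""
    in_window = sorted(r for r in runs if start <= r.timestamp <= end)
    return Counter(r.outcome for r in in_window[:cap])


def kpi_time_based(runs: Iterable[RuleRun], start: int, end: int) -> Counter:
    """Expected behavior: every run inside the time window is counted."""
    return Counter(r.outcome for r in runs if start <= r.timestamp <= end)


# 1200 successful runs at one per second, then 5 failures once the data is removed.
runs = [RuleRun(t, "success") for t in range(1200)]
runs += [RuleRun(t, "failure") for t in range(1200, 1205)]

# Whole history: the capped count sticks at 1000 successes and misses the failures.
wide_capped = kpi_capped(runs, 0, 1204)
wide_full = kpi_time_based(runs, 0, 1204)

# Last 300 seconds: fewer than 1000 runs in the window, so both counts agree.
narrow_capped = kpi_capped(runs, 905, 1204)
narrow_full = kpi_time_based(runs, 905, 1204)

print(wide_capped["success"], wide_capped["failure"])  # 1000 0
print(wide_full["success"], wide_full["failure"])      # 1200 5
print(narrow_capped == narrow_full)                    # True
```

Under this assumption, a window holding fewer than 1000 runs never hits the cap, so its counts come out right.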

Screenshots (if relevant):
rule-is-failed

Errors in browser console (if relevant):
n/a

Any additional context:

  • If you set the time frame to be much smaller, so that it covers fewer than 1000 data points, then the KPIs are right. I tested this as noted; screenshot:
    fails-over-last-few-mins-works

  • I also tested this using the machine-learning-qa framework, which can create many rules and speed up data collection to the 1000-entry point. I was extrapolating how to recreate this from scratch; if the repro scenario doesn't work, please reach out.

here is a graphic showing that the totals won't show more than 1000 count:
totals-equals-1000

@EricDavisX EricDavisX added bug Fixes for quality problems that affect the customer experience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Oct 6, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@XavierM XavierM self-assigned this Oct 7, 2022
@XavierM XavierM moved this from Awaiting Triage to In Review in AppEx: ResponseOps - Rules & Alerts Management Oct 7, 2022
@EricDavisX
Contributor Author

The steps to reproduce didn't work for me when testing against 'main' 8.6, so I wanted to post these details:
To create a rule with an error, go to a data view you created (one that doesn't have an @timestamp field) and 'create an alert' based on it. You could even create a simple {"test": "doc"} connector post and run it a few times, so that the index exists for you to create a data view, browse it in Discover, and create the rule there. Then, when the rule runs, it will throw an error. This will be fixed when #135806 is done.

Repository owner moved this from In Review to Done in AppEx: ResponseOps - Rules & Alerts Management Oct 11, 2022
@EricDavisX
Contributor Author

Ah, the Kibana Discover UI is now smart enough to prevent you from selecting an index that doesn't have a 'date' in it. But you can select one that does, then set up your rule and fire it every second, etc., and then delete the index; the rule will error, so you can see a 'failure' and test this out. Tested on 8.5 BC4.

3 participants