
[Monitoring] Investigate/discuss the consequences of enabling alert management from within the actions UI #92128

Closed
igoristic opened this issue Feb 22, 2021 · 6 comments · Fixed by #106457
Labels: bug (Fixes for quality problems that affect the customer experience), discuss, Team:Monitoring (Stack Monitoring team), v7.12.0, v8.0.0

@igoristic
Contributor

Alerts in Stack Monitoring weren't designed to be managed outside of the SM app. There are a few problems I noticed when enabling alert management from the actions menu in Stack Management. We had to do this in order to accommodate the use cases described here: #91145

The main issues are:

  1. Our instance id is tied to alert.alertTypeId, and we only use the first item from api/alerts/_find, which means we ignore any other instances that might be created.

Possible solution: Go through all the alerts when searching by alertTypeId and create instances prefixed with alert.id instead.


  2. A Stack Monitoring alert instance cannot be deleted, since it will get recreated as soon as the SM page is visited again.

Possible solution: I think we should implement the ability to distinguish "manageable" alerts (created/managed in the actions UI) from "preconfigured" alerts (managed by the SM UI). A minimal sketch of this idea appears right after this list.
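A minimal sketch of that distinction, purely illustrative: the RawAlert shape and the preconfigured param are hypothetical names, not existing Stack Monitoring fields.

    interface RawAlert {
      id: string;
      alertTypeId: string;
      params: { preconfigured?: boolean | null };
    }

    // Only recreate missing SM-managed ("preconfigured") alerts on SM page load.
    // Alerts created or deleted from the actions UI are left alone, so deleting one
    // no longer brings it back the next time the SM page is visited.
    function alertTypesToRecreate(existing: RawAlert[], expectedTypeIds: string[]): string[] {
      const presentTypes = new Set(
        existing.filter((a) => a.params.preconfigured).map((a) => a.alertTypeId)
      );
      return expectedTypeIds.filter((typeId) => !presentTypes.has(typeId));
    }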


I think we should solve these issues as soon as possible, so they can make it into one of the 7.12.x releases.

@igoristic added the bug (Fixes for quality problems that affect the customer experience), discuss, Team:Monitoring (Stack Monitoring team), and v8.0.0 labels on Feb 22, 2021
@igoristic added this to the Stack Monitoring UI 7.13 milestone on Feb 22, 2021
@igoristic self-assigned this on Feb 22, 2021
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@igoristic
Contributor Author

As discussed, we will only be focusing on the first issue (creating/modifying/notifying new alert instances). Here is what we will need in order to make it work:

  1. Get all alerts for a specific alert type (CPU, Disk Usage, etc). We will need to change the following logic to map through all alerts and not assume that there's only one instance for an alert type:

    const [rawAlert] = alertClientAlerts.data as [Alert];

  2. Use the alert's actual unique id in all the places where we create/check instance state/notifications, e.g.:

    const instanceId = `${this.alertOptions.id}:${cluster.clusterUuid}:${firingNodeUuids}`;

    Basically, change this.alertOptions.id (which is the alertTypeId) to this.rawAlert.id (a sketch combining points 1 and 2 follows this list).

  3. Figure out the UI/UX of how we represent multiple instances of the same alert type. For example, if we have 3 CPU alerts with thresholds of 70%, 80%, and 90%, and a node is firing at 95%, all three alerts trigger. We can either show all three notifications of the same type or only show the most critical one (the one with the highest threshold), since the two lower thresholds/notifications become irrelevant. This will also execute all the hooks set for each alert's actions, so basically getting 3 emails, etc.
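A rough sketch of how points 1 and 2 could fit together; the types are simplified and buildInstanceIds is an illustrative helper, not the actual executor code:

    interface Alert {
      id: string;          // unique per configured alert (rawAlert.id)
      alertTypeId: string; // shared by every alert of the same type
    }

    function buildInstanceIds(
      alertClientAlerts: { data: Alert[] },
      clusterUuid: string,
      firingNodeUuids: string
    ): string[] {
      // Point 1: walk every alert returned for the type instead of `const [rawAlert] = ...`
      return alertClientAlerts.data.map(
        // Point 2: key the instance on the alert's own id rather than this.alertOptions.id,
        // so several alerts of the same type no longer collapse onto a single instance
        (rawAlert) => `${rawAlert.id}:${clusterUuid}:${firingNodeUuids}`
      );
    }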

I think the UI/UX (no. 3) will be the most complex to figure out and might take some discussion about what the flow/experience should be. We will also need "smart" logic to only show the most critical alert to reduce duplicates/noise.

1 and 2 are pretty trivial and shouldn't take too long to implement/test. That said, I think we should get 1 and 2 merged as soon as possible and then focus on the UI/UX part, since we can sneak that into other minors (if need be) as an enhancement.

cc: @ravikesarwani @chrisronline @jasonrhodes feel free to discuss/add to the list/points mentioned

@igoristic
Contributor Author

I have concluded that we should revert alert management (for now). I was testing this locally and it was working at one point; however, when I pointed it at the edge environment I got different results, mainly alerts going into active states and becoming sticky if modified/created from the Stack Management UI.

There's also a race condition with this.rawAlert.id when creating alerts (via SM UI page load): since we don't have an id initially, it's set as undefined (in the instance's namespace). This only worked for me initially because I was hot swapping Kibana nodes in a Docker environment that already had alerts created (thus skipping the creation step).

This is too risky to try to merge, and it still needs a little more investigation.

@ravikesarwani
Contributor

That is unfortunate. I guess we can revert, but continue to work on it to see if we can fix the issues we have found (alerts going into active states and becoming sticky if modified/created from the Stack Management UI).
Not being able to manage the SM alerts like other alerts is just a very confusing user experience.

@pmuellr
Member

pmuellr commented Mar 24, 2021

I'm interested in getting more info on this issue, since it does seem to be confusing to customers. I'm hoping we can find some functionality in the alerting framework to help you out.

One capability we recently added is being able to do flattened (keyword) searches over parameters. This kind of solves an open issue we had about allowing solutions to add some data to an alert that they could later use for their own search purposes. SIEM needed to do this and added their own generated entries to the alert's tags, but of course this clutters up the customer's use of tags, as well as opening up the possibility that they delete/change the tag or copy it elsewhere.

What you can do now is create a new parameter, which isn't exposed in the UI and should probably be schema.nullable(), and then when the solution creates an alert, it can set a value for that parameter. Presumably customers creating alerts outside the solution would NOT have this parameter set, though the semantics of setting it (or not) should probably be documented in "API docs" (which we don't really have right now for alert params).

Then at least you can distinguish between solution-created and alert-management-page-created (or HTTP API created) alerts.
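A rough sketch of that param, assuming the rule's params schema is built with @kbn/config-schema; the names createdBySolution and threshold are made up for illustration:

    import { schema } from '@kbn/config-schema';

    const paramsSchema = schema.object({
      threshold: schema.number(),
      // Hidden marker param, never surfaced in the UI. schema.nullable() lets alerts
      // created from the management UI or HTTP API simply omit it (it defaults to null).
      createdBySolution: schema.nullable(schema.string()),
    });

    // When the solution creates an alert it sets createdBySolution (e.g. to 'monitoring');
    // later code can check params.createdBySolution to tell solution-created alerts
    // from user-created ones.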

I'm not quite sure why you need to distinguish these though, especially since it appears you want to allow multiple alerts of the same alert type to co-exist.

This bit seems really hard:

Figure out the UI/UX of how we represent multiple instances of the same alert type. For example, if we have 3 CPU alerts with thresholds of 70%, 80%, and 90%, and a node is firing at 95%, all three alerts trigger. We can either show all three notifications of the same type or only show the most critical one (the one with the highest threshold), since the two lower thresholds/notifications become irrelevant. This will also execute all the hooks set for each alert's actions, so basically getting 3 emails, etc.

We don't have anything to help with coordinating actions between independent alerts. I have no idea how you'd easily be able to only show the "highest" trigger point here, without a second layer of persistence that wrote all the values out, and then another alert that would read those values, and then only send the highest, based over some time window. Super-complicated.

And there are certainly use cases where you'd want all the alerts to fire anyway - each individual alert might have an action that sends emails to different people, different slack channel, etc.

We do have a notion of alert groups, so you could define 3 "levels" as different action groups, each with an associated, monotonically increasing threshold. Call them warn, high, critical. Let the customer assign the numbers. The alert executor would then use these values to indicate it should schedule actions on the appropriate action group, where the customer picks the actions associated with each action group. Only one action group is "active" for an alert instance - an alert instance can't be in "warn" and "high" at the same time.
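A minimal sketch of that idea; the group ids, thresholds, and the pickActionGroup helper are illustrative rather than framework APIs:

    // Declared once on the rule type: three action groups the customer can attach actions to.
    const actionGroups = [
      { id: 'warn', name: 'Warn' },
      { id: 'high', name: 'High' },
      { id: 'critical', name: 'Critical' },
    ] as const;

    // Inside the executor: pick the single highest group whose threshold was crossed,
    // so an alert instance is only ever active in one action group at a time.
    function pickActionGroup(
      cpuPercent: number,
      thresholds: { warn: number; high: number; critical: number } // customer-assigned values
    ): 'warn' | 'high' | 'critical' | null {
      if (cpuPercent >= thresholds.critical) return 'critical';
      if (cpuPercent >= thresholds.high) return 'high';
      if (cpuPercent >= thresholds.warn) return 'warn';
      return null; // below every threshold: schedule no actions
    }

    // Example: pickActionGroup(95, { warn: 70, high: 80, critical: 90 }) === 'critical',
    // so only the actions attached to the "critical" group would run.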

@jasonrhodes
Member

@igoristic let's schedule a chat so you can show me exactly what broke here and we can figure out a way forward -- I'll schedule
