
[Monitoring] Investigate/discuss the consequences of enabling alert management from within the actions UI #92128

Closed
igoristic opened this issue Feb 22, 2021 · 6 comments · Fixed by #106457
Labels: bug (Fixes for quality problems that affect the customer experience), discuss, Team:Monitoring (Stack Monitoring team), v7.12.0, v8.0.0

@igoristic
Contributor

Alerts in Stack Monitoring weren't designed to be managed outside of the SM app. There are a few problems I noticed when enabling alert management from the actions menu in Stack Management. We had to do this in order to accommodate the use cases described here: #91145

The main issues are:

  1. Our instance id is tied to alert.alertTypeId, and we only use the first item from api/alerts/_find, which means we ignore any other instances that might be created.

Possible solution: Go through all the alerts when searching by alertTypeId and create instances prefixed with alert.id instead.


  2. A Stack Monitoring alert instance cannot be deleted, since it will get recreated as soon as the SM page is visited again.

Possible solution: I think we should implement the ability to distinguish "manageable" alerts (created/managed in the actions UI) from "preconfigured" alerts (managed by the SM UI). A minimal sketch of this idea appears right after this list.
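A minimal sketch of that distinction, purely illustrative: the RawAlert shape and the preconfigured param are hypothetical names, not existing Stack Monitoring fields.

    interface RawAlert {
      id: string;
      alertTypeId: string;
      params: { preconfigured?: boolean | null };
    }

    // Only recreate missing SM-managed ("preconfigured") alerts on SM page load.
    // Alerts created or deleted from the actions UI are left alone, so deleting one
    // no longer brings it back the next time the SM page is visited.
    function alertTypesToRecreate(existing: RawAlert[], expectedTypeIds: string[]): string[] {
      const presentTypes = new Set(
        existing.filter((a) => a.params.preconfigured).map((a) => a.alertTypeId)
      );
      return expectedTypeIds.filter((typeId) => !presentTypes.has(typeId));
    }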


I think we should solve these issues as soon as possible, so they can make it into one of the 7.12.x releases.

@igoristic added the bug (Fixes for quality problems that affect the customer experience), discuss, Team:Monitoring (Stack Monitoring team), and v8.0.0 labels on Feb 22, 2021
@igoristic added this to the Stack Monitoring UI 7.13 milestone on Feb 22, 2021
@igoristic self-assigned this on Feb 22, 2021
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@igoristic
Contributor Author

As discussed, we will only be focusing on the first issue (creating/modifying/notifying new alert instances). Here is what we will need in order to make it work:

  1. Get all alerts for a specific alert type (CPU, Disk Usage, etc). We will need to change the following logic to map through all alerts and not assume that there's only one instance for an alert type:

    const [rawAlert] = alertClientAlerts.data as [Alert];

  2. Use the alert's actual unique id in all the places where we create/check instance state/notifications, e.g.:

    const instanceId = `${this.alertOptions.id}:${cluster.clusterUuid}:${firingNodeUuids}`;

    Basically, change this.alertOptions.id (which is the alertTypeId) to this.rawAlert.id (a sketch combining points 1 and 2 follows this list).

  3. Figure out the UI/UX of how we represent multiple instances of the same alert type. For example, if we have 3 CPU alerts with thresholds of 70%, 80%, and 90%, and a node is firing at 95%, all three alerts trigger. We can either show all three notifications of the same type or only show the most critical one (the one with the highest threshold), since the two lower thresholds/notifications become irrelevant. This will also execute all the hooks set for each alert's actions, so basically getting 3 emails, etc.
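A rough sketch of how points 1 and 2 could fit together; the types are simplified and buildInstanceIds is an illustrative helper, not the actual executor code:

    interface Alert {
      id: string;          // unique per configured alert (rawAlert.id)
      alertTypeId: string; // shared by every alert of the same type
    }

    function buildInstanceIds(
      alertClientAlerts: { data: Alert[] },
      clusterUuid: string,
      firingNodeUuids: string
    ): string[] {
      // Point 1: walk every alert returned for the type instead of `const [rawAlert] = ...`
      return alertClientAlerts.data.map(
        // Point 2: key the instance on the alert's own id rather than this.alertOptions.id,
        // so several alerts of the same type no longer collapse onto a single instance
        (rawAlert) => `${rawAlert.id}:${clusterUuid}:${firingNodeUuids}`
      );
    }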

I think the UI/UX (no. 3) will be the most complex to figure out and might take some discussion about what the flow/experience should be. We will also need "smart" logic to only show the most critical alert to reduce duplicates/noise.

1 and 2 are pretty trivial and shouldn't take too long to implement/test. That said, I think we should get 1 and 2 merged as soon as possible and then focus on the UI/UX part, since we can sneak that into other minors (if need be) as an enhancement.

cc: @ravikesarwani @chrisronline @jasonrhodes feel free to discuss/add to the list/points mentioned

@igoristic
Contributor Author

I have concluded that we should revert alert management (for now). I was testing this locally and it was working at one point; however, when I pointed it at the edge environment I got different results, mainly alerts going into active states and becoming sticky if modified/created from the Stack Management UI.

There's also a race condition with this.rawAlert.id when creating alerts (via SM UI page load): since we don't have an id initially, it's set as undefined (in the instance's namespace). This only worked for me initially because I was hot swapping Kibana nodes in a Docker environment that already had alerts created (thus skipping the creation step).

This is too risky to try to merge, and it still needs a little more investigation.

@ravikesarwani
Contributor

That is unfortunate. I guess we can revert, but continue to work on it to see if we can fix the issues we have found (alerts going into active states and becoming sticky if modified/created from the Stack Management UI).
Not being able to manage the SM alerts like other alerts is just a very confusing user experience.

@pmuellr
Member

pmuellr commented Mar 24, 2021

I'm interested in getting more info on this issue, since it does seem to be confusing to customers. I'm hoping we can find some functionality in the alerting framework to help you out.

One capability we recently added is being able to do flattened (keyword) searches over parameters. This kind of solves an open issue we had about allowing solutions to add some data to an alert that they could later use for their own search purposes. SIEM needed to do this and added their own generated entries to the alert's tags, but of course this clutters up the customer's use of tags, as well as opening up the possibility that they delete/change the tag or copy it elsewhere.

What you can do now is create a new parameter, which isn't exposed in the UI and should probably be schema.nullable(), and then when the solution creates an alert, it can set a value for that parameter. Presumably customers creating alerts outside the solution would NOT have this parameter set, though the semantics of setting it (or not) should probably be documented in "API docs" (which we don't really have right now for alert params).

Then at least you can distinguish between solution-created and alert-management-page-created (or HTTP API created) alerts.
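A rough sketch of that param, assuming the rule's params schema is built with @kbn/config-schema; the names createdBySolution and threshold are made up for illustration:

    import { schema } from '@kbn/config-schema';

    const paramsSchema = schema.object({
      threshold: schema.number(),
      // Hidden marker param, never surfaced in the UI. schema.nullable() lets alerts
      // created from the management UI or HTTP API simply omit it (it defaults to null).
      createdBySolution: schema.nullable(schema.string()),
    });

    // When the solution creates an alert it sets createdBySolution (e.g. to 'monitoring');
    // later code can check params.createdBySolution to tell solution-created alerts
    // from user-created ones.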

I'm not quite sure why you need to distinguish these though, especially since it appears you want to allow multiple alerts of the same alert type to co-exist.

This bit seems really hard:

Figure out the UI/UX of how we represent multiple instances of the same alert type. For example, if we have 3 CPU alerts with thresholds of 70%, 80%, and 90%, and a node is firing at 95%, all three alerts trigger. We can either show all three notifications of the same type or only show the most critical one (the one with the highest threshold), since the two lower thresholds/notifications become irrelevant. This will also execute all the hooks set for each alert's actions, so basically getting 3 emails, etc.

We don't have anything to help with coordinating actions between independent alerts. I have no idea how you'd easily be able to only show the "highest" trigger point here, without a second layer of persistence that wrote all the values out, and then another alert that would read those values, and then only send the highest, based over some time window. Super-complicated.

And there are certainly use cases where you'd want all the alerts to fire anyway - each individual alert might have an action that sends emails to different people, different slack channel, etc.

We do have a notion of alert groups, so you could define 3 "levels" as different action groups, each with an associated, monotonically increasing threshold. Call them warn, high, critical. Let the customer assign the numbers. The alert executor would then use these values to indicate it should schedule actions on the appropriate action group, where the customer picks the actions associated with each action group. Only one action group is "active" for an alert instance - an alert instance can't be in "warn" and "high" at the same time.
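A minimal sketch of that idea; the group ids, thresholds, and the pickActionGroup helper are illustrative rather than framework APIs:

    // Declared once on the rule type: three action groups the customer can attach actions to.
    const actionGroups = [
      { id: 'warn', name: 'Warn' },
      { id: 'high', name: 'High' },
      { id: 'critical', name: 'Critical' },
    ] as const;

    // Inside the executor: pick the single highest group whose threshold was crossed,
    // so an alert instance is only ever active in one action group at a time.
    function pickActionGroup(
      cpuPercent: number,
      thresholds: { warn: number; high: number; critical: number } // customer-assigned values
    ): 'warn' | 'high' | 'critical' | null {
      if (cpuPercent >= thresholds.critical) return 'critical';
      if (cpuPercent >= thresholds.high) return 'high';
      if (cpuPercent >= thresholds.warn) return 'warn';
      return null; // below every threshold: schedule no actions
    }

    // Example: pickActionGroup(95, { warn: 70, high: 80, critical: 90 }) === 'critical',
    // so only the actions attached to the "critical" group would run.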

@jasonrhodes
Member

@igoristic let's schedule a chat so you can show me exactly what broke here and we can figure out a way forward -- I'll schedule
