[Stack Monitoring] Add stale status reporting for Kibana #132613
Conversation
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
@elastic/observability-design Since @katefarrar is out, I would love some feedback on this improvised design change!
Wondering. Does this end up firing in the event of a kibana instance replacement as well? Could probably launch it on ESS and scale kibana up/down to test.
@miltonhultgren I would like to propose the following design changes to the indication 👍 I'd propose to replace the
You can continue to use the I'd convert this to the
@formgeist Thanks for the swift feedback, will implement!
"since we heard" sounds like it's missing a helping verb. "since we have heard" or "since we've heard" sounds better.
@formgeist @smith Applied your feedback, thanks!
```diff
@@ -130,6 +130,7 @@ export default function ({ getService }: PluginFunctionalProviderContext) {
       'monitoring.kibana.collection.enabled (boolean)',
       'monitoring.kibana.collection.interval (number)',
       'monitoring.ui.ccs.enabled (boolean)',
+      'monitoring.ui.kibana.reporting.stale_status_threshold_seconds (number)',
```
Platform Security review: the `expectedExposedConfigKeys` integration test change LGTM.
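For context on that exposed key: a setting like this is typically declared in the monitoring plugin's server-side config schema and explicitly exposed to the browser. The sketch below is illustrative only — the nesting, names, and import path are assumptions rather than the PR's actual code — using `@kbn/config-schema` and the `exposeToBrowser` mechanism:

```ts
import { schema, TypeOf } from '@kbn/config-schema';
// Import path varies by Kibana version; assumed here for illustration.
import type { PluginConfigDescriptor } from '@kbn/core/server';

// Illustrative subset of the monitoring plugin config, not the real schema.
const configSchema = schema.object({
  ui: schema.object({
    kibana: schema.object({
      reporting: schema.object({
        // Default of 120 seconds, per the PR description.
        stale_status_threshold_seconds: schema.number({ defaultValue: 120 }),
      }),
    }),
  }),
});

export type MonitoringConfig = TypeOf<typeof configSchema>;

export const config: PluginConfigDescriptor<MonitoringConfig> = {
  schema: configSchema,
  // Exposing `ui` to the browser is what makes keys like
  // monitoring.ui.kibana.reporting.stale_status_threshold_seconds show up in
  // the expectedExposedConfigKeys list checked by the integration test above.
  exposeToBrowser: {
    ui: true,
  },
};
```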
@matschaffer I ran a test on Cloud: when the instances stop reporting, we mark them as stale as they rotate out.

- Scaled from 1 to 3: the original is killed and marked as stale, the 3 new ones are green.
- Scaled back down to 1 and changed the time range: the original is rotated out, one green, two stale.
- Changed the time range again: the two stale ones rotate out, one green, and the aggregate status is back to green.

I'm curious why scaling up gives me 3 new IDs but scaling down only kills 2 IDs and keeps 1 ID. I'd expect the scale up to keep the original instance running and simply add 2, but that's the Kubernetes way, so it might not apply.
There are a lot of variables to account for there, for example if the VM your first kibana was on got marked for removal by the cloud provider. So given the behavior, I'm a little concerned about what this will look like, since any migration of the kibana instance (even typical cloud maintenance migrations) will show as "stale". It'd be good if we could clarify that we have a mix of stale/good instances, I think.
In the scenario where I have intentionally decided not to collect Kibana metrics on one or more instances, should this appear? Or does this only address the scenario where something went wrong and the user would want to be alerted/notified like this? I noticed this after I enabled the kibana module in Metricbeat, then disabled it again and got the stale badge. Perhaps I'm not understanding something, but I notice that after the default 15-minute time window elapses, I no longer see the Kibana instance with the stale badge. Not sure how useful this is if it's just going to disappear outside of the time window anyway. The user would probably never see the kibana instance with the stale badge unless they made the time frame longer.
@elasticmachine merge upstream
Showing the warning feels correct; that instance we used to hear from is no longer reporting. @matschaffer We could add to that text something like "(1 of 4)"?
Yeah, that would make more sense to me. Not sure how to represent it visually (maybe @formgeist can help), but as an operator, if 3 are green and 1 is stale, I'd like to see that distinction called out. If I just see "stale", I'd presume all instances are stale.
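To make the "(1 of 4)" idea concrete, a hypothetical helper (not code from this PR) that derives such a label from per-instance staleness could look like this:

```ts
// Hypothetical helper: call out how many instances are stale so that a single
// stale instance doesn't read as "everything is stale".
interface InstanceStaleness {
  name: string;
  isStale: boolean;
}

function staleLabel(instances: InstanceStaleness[]): string | null {
  const staleCount = instances.filter((i) => i.isStale).length;
  if (staleCount === 0) {
    return null; // nothing to call out
  }
  return `Stale (${staleCount} of ${instances.length})`;
}

// Example: 3 green instances and 1 stale one -> "Stale (1 of 4)"
console.log(
  staleLabel([
    { name: 'kibana-0', isStale: false },
    { name: 'kibana-1', isStale: false },
    { name: 'kibana-2', isStale: false },
    { name: 'kibana-3', isStale: true },
  ])
);
```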
I'm not sure I understand. If you have never started to collect metrics, then we won't be aware of that instance at all, right?
That's 100% true, though the same happens for Elasticsearch and the whole cluster. If you turn off Metricbeat and wait for the 15-minute window to move past, you get the no-data/couldn't-find-cluster screen. I don't know what to do about this. It's the same problem as the Entity Model for Infra Metrics.
I'll fix that.
💚 Build Succeeded
Sorry. Yes, I meant what you described: trying to distinguish between intentionally turning off metrics vs. something going wrong. It seems a bit noisy and excessive to have the stale badges and warning icons if nothing is actually wrong, especially with the new "last seen" columns. I kind of feel like we're trying to replace the job of an alert notification here without the user opting for it. Also, the fact that we're only doing it for Kibana and not the other products will probably cause confusion.
Just a reminder: the user problem we really need to solve is reporting green when an instance is down. Whatever solution we choose, we have to choose one that fixes this problem, because it's a very embarrassing and, IMO, indefensible state to find ourselves in for a customer in an outage.
I understand how it looks this way, but we don't handle any of the aggregate statuses with this granularity. I believe if you have 4 instances and 3 are green and 1 is red, we will show "Status: Red", is that right? We should match that functionality in this ticket and revisit holistically if we don't like it, but I think aggregate statuses should show the worst case and entice you to dig in to see what the problem is.
Noise is a potential problem, I agree, but it's the flip side of the situation where we don't notify the user at all and then they think things are great during an outage. Stale isn't itself a warning state (this is why we originally left the status in place and applied this extra notification on top of it, because it isn't really a full-fledged status). It's a tip that we haven't heard from at least 1 instance in a given time range, which is something you may or may not be able to ignore. If we can find a better solution that solves the main problem, I'm definitely open, but absent that I think we should move forward with this one for now.
This is true and also okay, because of the problem we're solving. If the instance has disappeared from the window under investigation, we don't know it exists, so we don't show it. But if a user has a graph pinned to "Last 48 Hours" for some reason and a Kibana node goes down yet reports as "Green: Healthy" for 47 hours and 59 minutes, that's a scenario we can't defend.
It might be a good idea to log tickets for other components and try to implement the same logic before a big customer has an ES outage and asks us why the Stack Monitoring page was reporting their ES nodes as green/healthy for hours during an outage. If we find a better solution for Kibana, we can apply it across the board as well.
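For reference, the "show the worst status" behaviour described above can be sketched as follows (hypothetical helper, not the plugin's actual implementation):

```ts
// Sketch of "worst status wins" aggregation: with 3 green instances and 1 red
// one, the aggregate status comes out red.
type Status = 'green' | 'yellow' | 'red';

const severity: Record<Status, number> = { green: 0, yellow: 1, red: 2 };

function aggregateStatus(statuses: Status[]): Status {
  return statuses.reduce<Status>(
    (worst, s) => (severity[s] > severity[worst] ? s : worst),
    'green'
  );
}

console.log(aggregateStatus(['green', 'green', 'green', 'red'])); // 'red'
```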
Makes sense. @miltonhultgren had said in a comment that this was due to a bug and the status should actually have been grey if the last Kibana document is more than 10 minutes old. I kind of like this because grey feels like there is no current status. I was thinking fixing this could suffice without the extra UI stuff, but if design is happy, I'm not fussed.
Summary
Fixes #126386
This PR adds visual warnings in the Stack Monitoring UI when one or more Kibana instances have a delay in their stats reporting. The delay threshold can be configured with a `kibana.yml` setting and defaults to 120 seconds.

Cluster overview:

Kibana overview:

Kibana instances:

Kibana instances table row:
Kibana instance details:

How to test
Set `monitoring.ui.kibana.reporting.stale_status_threshold_seconds` to something low (like 10) in your `kibana.yml`.
To do
Checklist