Each node will need to keep track of the last few master-node changes (this should be fine in memory) and its local node information (a node might see the master node flapping even though the master node itself is fine). For example, if the master node has changed more than 3 times in the last 30 minutes, it is not stable; otherwise, there is nothing to report.
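The in-memory tracking described above can be sketched as a small bounded history of change timestamps. This is an illustrative sketch, not the actual Elasticsearch implementation; the class and method names, and the choice of a simple deque, are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keep timestamps of recent master-node changes in memory
// and report instability when more than 3 changes fall inside a 30-minute window.
class MasterChangeTracker {
    private static final int MAX_CHANGES = 3;
    private static final long WINDOW_MILLIS = 30 * 60 * 1000L;

    private final Deque<Long> changeTimestamps = new ArrayDeque<>();

    /** Record that the elected master changed at the given time. */
    void onMasterChange(long nowMillis) {
        changeTimestamps.addLast(nowMillis);
        evictOld(nowMillis);
    }

    /** True if the master changed more than 3 times in the last 30 minutes. */
    boolean isUnstable(long nowMillis) {
        evictOld(nowMillis);
        return changeTimestamps.size() > MAX_CHANGES;
    }

    private void evictOld(long nowMillis) {
        while (!changeTimestamps.isEmpty()
                && nowMillis - changeTimestamps.peekFirst() > WINDOW_MILLIS) {
            changeTimestamps.removeFirst();
        }
    }
}
```

Evicting entries older than the window on every call keeps memory bounded regardless of how long the node runs.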
The coordinating node might have to contact a master eligible node which in turn might have to contact other master eligible nodes, but this is the worst-case scenario and definitely does not involve fanning out to all nodes.
Store a view of the last 30 minutes of master history on each node, and add the ability to query any node for its view of master history (Adding a view of master history #85941).
None of them is the master, and we are not master eligible.
The RCA reaches out to a master-eligible node and runs the same checks (taking disconnect/timeout of the network call into account).
If we are master eligible, collect the information from all known master-eligible nodes (the information about term/version/voting config should be available in `ClusterFormationFailureHelper`; however, we'd need that exposed over the wire: Adding a transport action to get cluster formation info #87306). Take disconnect/timeout of the network call into account.
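The timeout-tolerant fan-out to master-eligible nodes could look roughly like the sketch below. This is not the actual transport-layer code: `fetchInfo` stands in for the real transport call, and the node ids, the 5-second timeout, and the `"unreachable"` marker are all assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hedged sketch: collect cluster-formation info from all known master-eligible
// nodes, treating per-node disconnects and timeouts as "unreachable" rather
// than failing the whole collection.
class FormationInfoCollector {
    static Map<String, String> collect(List<String> nodeIds,
                                       Function<String, CompletableFuture<String>> fetchInfo) {
        Map<String, String> results = new ConcurrentHashMap<>();
        CompletableFuture<?>[] futures = nodeIds.stream()
            .map(nodeId -> fetchInfo.apply(nodeId)
                .completeOnTimeout("unreachable", 5, TimeUnit.SECONDS)
                .exceptionally(e -> "unreachable")   // disconnects handled like timeouts
                .thenAccept(info -> results.put(nodeId, info)))
            .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join();
        return results;
    }
}
```

The key point is that one unreachable node degrades only its own entry in the result; the collector still returns whatever the reachable nodes reported.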
We need to not use dynamic keys (node IDs) in the `cluster_formation` map in the details within the result.
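The fix is to turn the map into an array of objects, each carrying its node ID as a field, so the response schema has fixed keys. The shape below is illustrative only (the field names and the placeholder node IDs are assumptions, not the actual API response):

```json
{
  "details": {
    "cluster_formation": [
      { "node_id": "node-id-1", "name": "node-1", "cluster_formation_message": "..." },
      { "node_id": "node-id-2", "name": "node-2", "cluster_formation_message": "..." }
    ]
  }
}
```

A fixed schema like this keeps the response mappable and documentable, whereas per-node-ID keys would make every cluster produce a different set of field names.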
We need to check that the selection of the master-eligible node when polling master-eligible nodes is truly random (we're currently using `getMasterEligibleNodes().stream().findAny()`).
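The concern is that `Stream.findAny()` is only *allowed* to return any element; on a sequential stream over an ordered source it typically just returns the first one, so it is not a uniform random pick. A minimal sketch of an explicitly random selection (the class name and node list are illustrative):

```java
import java.util.List;
import java.util.Random;

// Sketch: pick a master-eligible node uniformly at random by index,
// instead of relying on Stream.findAny(), which is nondeterministic
// by contract but in practice usually returns the first element.
class NodePicker {
    static String pickRandom(List<String> masterEligibleNodes, Random random) {
        return masterEligibleNodes.get(random.nextInt(masterEligibleNodes.size()));
    }
}
```

Using an explicit index draw also makes the selection testable with a seeded `Random`.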
In case the node has a master according to the above definition: check that the master has not gone to `null` repeatedly in the last 30 minutes; if it has not, report GREEN (Master stability health indicator part 1 (when a master has been seen recently) #86524).
In case the node does not have a master node:
Check if we know of any master-eligible nodes; if we don't know of any, report RED due to a discovery problem (include the witnessed master history). (Adding additional capability to the master_is_stable health indicator service #87482)
In case we know of some, use the `PeerFinder` to check: if we are master eligible, collect the information from all known master-eligible nodes (the information about term/version/voting config should be available in `ClusterFormationFailureHelper`; however, we'd need that exposed over the wire: Adding a transport action to get cluster formation info #87306). Take disconnect/timeout of the network call into account.
Document the settings we create for the master stability and cluster diagnostics service(s) (Documenting master_is_stable health API settings #87901).
Create `master-is-stable` indicator.