Each node will need to keep track of the last few master-node changes (this should be fine in memory) and its local node information (a node might see the master node flapping even though the master node itself is fine). For example, if the master node has changed more than 3 times in the last 30 minutes, it is not stable; otherwise, there is nothing to report.
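The in-memory tracking described above can be sketched as a small bounded history of change timestamps. This is an illustrative sketch, not the actual Elasticsearch implementation; the class and method names, and the choice of a simple deque, are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keep timestamps of recent master-node changes in memory
// and report instability when more than 3 changes fall inside a 30-minute window.
class MasterChangeTracker {
    private static final int MAX_CHANGES = 3;
    private static final long WINDOW_MILLIS = 30 * 60 * 1000L;

    private final Deque<Long> changeTimestamps = new ArrayDeque<>();

    /** Record that the elected master changed at the given time. */
    void onMasterChange(long nowMillis) {
        changeTimestamps.addLast(nowMillis);
        evictOld(nowMillis);
    }

    /** True if the master changed more than 3 times in the last 30 minutes. */
    boolean isUnstable(long nowMillis) {
        evictOld(nowMillis);
        return changeTimestamps.size() > MAX_CHANGES;
    }

    private void evictOld(long nowMillis) {
        while (!changeTimestamps.isEmpty()
                && nowMillis - changeTimestamps.peekFirst() > WINDOW_MILLIS) {
            changeTimestamps.removeFirst();
        }
    }
}
```

Evicting entries older than the window on every call keeps memory bounded regardless of how long the node runs.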
The coordinating node might have to contact a master eligible node which in turn might have to contact other master eligible nodes, but this is the worst-case scenario and definitely does not involve fanning out to all nodes.
Store a view of the last 30 minutes of master history on each node, and add the ability to query any node for its view of master history (Adding a view of master history #85941).
None of them is the master, and we are not master eligible.
The RCA reaches out to a master-eligible node and runs the same checks (taking disconnect/timeout of the network call into account).
If we are master eligible, collect the information from all known master-eligible nodes (the information about term/version/voting config should be available in `ClusterFormationFailureHelper`; however, we'd need that exposed over the wire: Adding a transport action to get cluster formation info #87306). Take disconnect/timeout of the network call into account.
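The timeout-tolerant fan-out to master-eligible nodes could look roughly like the sketch below. This is not the actual transport-layer code: `fetchInfo` stands in for the real transport call, and the node ids, the 5-second timeout, and the `"unreachable"` marker are all assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hedged sketch: collect cluster-formation info from all known master-eligible
// nodes, treating per-node disconnects and timeouts as "unreachable" rather
// than failing the whole collection.
class FormationInfoCollector {
    static Map<String, String> collect(List<String> nodeIds,
                                       Function<String, CompletableFuture<String>> fetchInfo) {
        Map<String, String> results = new ConcurrentHashMap<>();
        CompletableFuture<?>[] futures = nodeIds.stream()
            .map(nodeId -> fetchInfo.apply(nodeId)
                .completeOnTimeout("unreachable", 5, TimeUnit.SECONDS)
                .exceptionally(e -> "unreachable")   // disconnects handled like timeouts
                .thenAccept(info -> results.put(nodeId, info)))
            .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join();
        return results;
    }
}
```

The key point is that one unreachable node degrades only its own entry in the result; the collector still returns whatever the reachable nodes reported.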
We need to not use dynamic keys (node IDs) in the `cluster_formation` map in the details within the result.
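The fix is to turn the map into an array of objects, each carrying its node ID as a field, so the response schema has fixed keys. The shape below is illustrative only (the field names and the placeholder node IDs are assumptions, not the actual API response):

```json
{
  "details": {
    "cluster_formation": [
      { "node_id": "node-id-1", "name": "node-1", "cluster_formation_message": "..." },
      { "node_id": "node-id-2", "name": "node-2", "cluster_formation_message": "..." }
    ]
  }
}
```

A fixed schema like this keeps the response mappable and documentable, whereas per-node-ID keys would make every cluster produce a different set of field names.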
We need to check that the selection of the master-eligible node when polling master-eligible nodes is truly random (we're currently using `getMasterEligibleNodes().stream().findAny()`).
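The concern is that `Stream.findAny()` is only *allowed* to return any element; on a sequential stream over an ordered source it typically just returns the first one, so it is not a uniform random pick. A minimal sketch of an explicitly random selection (the class name and node list are illustrative):

```java
import java.util.List;
import java.util.Random;

// Sketch: pick a master-eligible node uniformly at random by index,
// instead of relying on Stream.findAny(), which is nondeterministic
// by contract but in practice usually returns the first element.
class NodePicker {
    static String pickRandom(List<String> masterEligibleNodes, Random random) {
        return masterEligibleNodes.get(random.nextInt(masterEligibleNodes.size()));
    }
}
```

Using an explicit index draw also makes the selection testable with a seeded `Random`.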
In case the node has a master according to the above definition: check that the master has not gone to `null` repeatedly in the last 30 minutes; if it has not, report GREEN (Master stability health indicator part 1 (when a master has been seen recently) #86524).
In case the node does not have a master node:
Check if we know of any master-eligible nodes; if we don't know of any, report RED due to a discovery problem (include the witnessed master history). (Adding additional capability to the master_is_stable health indicator service #87482)
In case we know of some, use the `PeerFinder` to check: if we are master eligible, collect the information from all known master-eligible nodes (the information about term/version/voting config should be available in `ClusterFormationFailureHelper`; however, we'd need that exposed over the wire: Adding a transport action to get cluster formation info #87306). Take disconnect/timeout of the network call into account.
Document the settings we create for the master stability and cluster diagnostics service(s) (Documenting master_is_stable health API settings #87901).
Create `master-is-stable` indicator.