-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SBR unreachable observer cleanup #7141
Fix SBR unreachable observer cleanup #7141
Conversation
….com/Arkatufus/akka.net into Fix-SBR-unreachable-observer-cleanup
…atufus/akka.net into Fix-SBR-unreachable-observer-cleanup
This reverts commit e5d8ce4.
….com/Arkatufus/akka.net into Fix-SBR-unreachable-observer-cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self review
// Detect floating unreachable islands in the reachability graph. | ||
#region Unreachable island detection | ||
|
||
// Collect all nodes that are reported as both record observer and unreachable | ||
// (possible indirect connection) | ||
var possibleIndirect = originalReachability.Records | ||
.Where(r => originalUnreachable.Contains(r.Observer)) | ||
.Select(r => r.Observer) | ||
.ToImmutableHashSet(); | ||
|
||
// For each possible islands, reach a consensus with all nodes in the SeenBy list | ||
// (reachable from the leader) that they also could not see the possible island node. | ||
var localSeenBy = SeenBy; | ||
var pruneList = new List<UniqueAddress>(); | ||
foreach (var address in possibleIndirect) | ||
{ | ||
var records = originalReachability.Records | ||
.Where(r => localSeenBy.Contains(r.Observer.Address) && r.Subject.Equals(address)) | ||
.Select(r => r.Observer) | ||
.ToImmutableHashSet(); | ||
|
||
// Add the node to the prune list if we reach a consensus | ||
if(records.Count == localSeenBy.Count) | ||
pruneList.Add(address); | ||
} | ||
#endregion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This algorithm collects record observer addresses that are known to be unreachable by consensus.
Reachability = Reachability.FilterRecords( | ||
r => | ||
// we only retain records for addresses that are still downable | ||
downable.Contains(r.Observer) && downable.Contains(r.Subject) && | ||
// prune out records that are known to be disconnected islands | ||
!pruneList.Contains(r.Observer) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prune out records that Observer are known to be unreachable by consensus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self Review
var allReachable = AllMembers.Select(m => m.UniqueAddress) | ||
.Where(a => !unreachable.Contains(a)) | ||
.ToImmutableHashSet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the all cluster members set, exclude all members that are inside the unreachable set. This gives us all of the cluster members that is reachable.
var possibleIndirect = reachability.Records | ||
.Where(r => unreachable.Contains(r.Observer)) | ||
.Select(r => r.Observer) | ||
.ToImmutableHashSet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From all the Reachability
Records
that we want to filter, find any Record.Observer
that is also inside the unreachable set. These are all potential poison records that needs to be cleaned.
var records = reachability.Records | ||
.Where(r => allReachable.Contains(r.Observer) && r.Subject.Equals(address)) | ||
.Select(r => r.Observer) | ||
.ToImmutableHashSet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the Reachability.Records
that we have, if all of the nodes inside the all reachable set agrees that they also could not reach the potential poison record, prune it from the Reachability.Records
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - this should harden the SBR system against indirectly connected node cases where an observer managed to mark some gossip as seen on its way out the door.
Fix edge case where cluster node become unreachable after gossiping about another unreachable node, causing the cluster leader to erroneously decide that the cluster is irreparably unstable and downs the whole cluster.
Changes
Add a check in
DowningStrategy.AdditionalNodesToDownWhenIndirectlyConnected
to prune out records where itsRecord.Observer
is proven to be unreachable by consensus with all other records that are known to be reachable.Log data