Fix SBR unreachable observer cleanup #7141

Merged

@Arkatufus Arkatufus commented Apr 5, 2024

Fixes an edge case where a cluster node becomes unreachable after gossiping about another unreachable node, causing the cluster leader to erroneously decide that the cluster is irreparably unstable and down the whole cluster.

  • Node-1 is the leader
  • Node-2 becomes unreachable to the whole cluster
  • Node-3 gossips that Node-2 is unreachable
  • Node-1's SBR records Node-3 as an observer
  • Node-3 becomes unreachable to the whole cluster before the SBR stable-after interval expires
  • Node-1's SBR stable-after interval expires
  • Node-1 thinks that Node-3 is indirectly connected to the cluster and cannot resolve the graph
  • Node-1's SBR downs all nodes
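
The sequence above can be reproduced on a toy model. This is a minimal sketch, not Akka.NET code: `Observation`, `ScenarioSketch`, and `UnreachableObservers` are hypothetical names, and plain strings stand in for `UniqueAddress`. It shows how Node-3 ends up as both an observer and an unreachable subject in the leader's records, which is the shape that looks like an indirect connection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a reachability record:
// Observer reports that it cannot reach Subject.
public record Observation(string Observer, string Subject);

public static class ScenarioSketch
{
    // Returns every observer that is itself reported unreachable by some record.
    public static HashSet<string> UnreachableObservers(IEnumerable<Observation> records)
    {
        var list = records.ToList();
        var unreachable = list.Select(r => r.Subject).ToHashSet();
        return list.Select(r => r.Observer)
                   .Where(o => unreachable.Contains(o))
                   .ToHashSet();
    }

    public static void Main()
    {
        var records = new[]
        {
            new Observation("Node-3", "Node-2"), // gossiped before Node-3 dropped out
            new Observation("Node-1", "Node-2"), // leader also lost Node-2
            new Observation("Node-1", "Node-3"), // leader later lost Node-3
        };

        // Node-3 is the only observer that is also unreachable.
        Console.WriteLine(string.Join(", ", UnreachableObservers(records)));
    }
}
```

In this model the stale `Node-3 -> Node-2` record is the only thing that makes Node-3 look like a participant in the graph rather than a node that is simply gone.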

Changes

Add a check in DowningStrategy.AdditionalNodesToDownWhenIndirectlyConnected that prunes records whose Record.Observer is proven unreachable by consensus among all nodes that are known to be reachable.

Log data

[01:07:26 WRN] SBR double Akka.Cluster.SBR.DownIndirectlyConnected decision, downing all instead. originalReachability: [Reachability([akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6009 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6003 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6003 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6003 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6011 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6011 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6011 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], 
[akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6013 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6013 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6013 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6004 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6004 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6004 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6001 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6001 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], 
[akka.tcp://AkkaCluster@localhost:6001 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6014 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6014 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6014 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (2)])], filtered reachability [Reachability([akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6010 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6009 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6003 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable 
[Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6003 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6003 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6011 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6011 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6011 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6015 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], [akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6002 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (2)][akka.tcp://AkkaCluster@localhost:6013 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], 
[akka.tcp://AkkaCluster@localhost:6013 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6013 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6004 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6004 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6004 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6001 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6001 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6001 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6014 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6014 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (2)], [akka.tcp://AkkaCluster@localhost:6014 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6006, 2095747650): Unreachable [Unreachable] (1)][akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932): Unreachable [Unreachable] (4)], [akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6008, 1606735766): Unreachable [Unreachable] (3)], [akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6005, 1615201291): Unreachable [Unreachable] (1)], 
[akka.tcp://AkkaCluster@localhost:6012 -> UniqueAddress: (akka.tcp://AkkaCluster@localhost:6007, 1615201291): Unreachable [Unreachable] (2)])], still indirectlyConnected: [UniqueAddress: (akka.tcp://AkkaCluster@localhost:6009, 1134654932)], seenBy: [akka.tcp://AkkaCluster@localhost:6011, akka.tcp://AkkaCluster@localhost:6014, akka.tcp://AkkaCluster@localhost:6010, akka.tcp://AkkaCluster@localhost:6012, akka.tcp://AkkaCluster@localhost:6003, akka.tcp://AkkaCluster@localhost:6015, akka.tcp://AkkaCluster@localhost:6002, akka.tcp://AkkaCluster@localhost:6001, akka.tcp://AkkaCluster@localhost:6013, akka.tcp://AkkaCluster@localhost:6004]

@Arkatufus Arkatufus left a comment

Self review

Comment on lines 233 to 258
// Detect floating unreachable islands in the reachability graph.
#region Unreachable island detection

// Collect all nodes that are reported as both record observer and unreachable
// (possible indirect connection)
var possibleIndirect = originalReachability.Records
    .Where(r => originalUnreachable.Contains(r.Observer))
    .Select(r => r.Observer)
    .ToImmutableHashSet();

// For each possible island, reach a consensus with all nodes in the SeenBy list
// (reachable from the leader) that they also could not see the possible island node.
var localSeenBy = SeenBy;
var pruneList = new List<UniqueAddress>();
foreach (var address in possibleIndirect)
{
    var records = originalReachability.Records
        .Where(r => localSeenBy.Contains(r.Observer.Address) && r.Subject.Equals(address))
        .Select(r => r.Observer)
        .ToImmutableHashSet();

    // Add the node to the prune list if we reach a consensus
    if (records.Count == localSeenBy.Count)
        pruneList.Add(address);
}
#endregion

This algorithm collects record observer addresses that are known to be unreachable by consensus.

Reachability = Reachability.FilterRecords(
    r =>
        // we only retain records for addresses that are still downable
        downable.Contains(r.Observer) && downable.Contains(r.Subject) &&
        // prune out records that are known to be disconnected islands
        !pruneList.Contains(r.Observer) &&

Prune out records whose Observer is known to be unreachable by consensus.
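
As an illustration of that filter step, here is a small sketch on a simplified model. It is an assumption-laden stand-in, not the Akka.NET implementation: `Observation` and `FilterSketch.Filter` are hypothetical, strings replace `UniqueAddress`, and only the predicate conditions visible in the excerpt above are modeled (the original predicate continues past the excerpt).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a reachability record.
public record Observation(string Observer, string Subject);

public static class FilterSketch
{
    // Keeps a record only if both ends are still downable and its
    // observer has not been pruned as a disconnected island.
    public static List<Observation> Filter(
        IEnumerable<Observation> records,
        ISet<string> downable,
        ISet<string> pruneList) =>
        records.Where(r =>
                downable.Contains(r.Observer) &&
                downable.Contains(r.Subject) &&
                !pruneList.Contains(r.Observer))
            .ToList();

    public static void Main()
    {
        var records = new[]
        {
            new Observation("Node-3", "Node-2"), // stale record from a node that is gone
            new Observation("Node-1", "Node-2"),
            new Observation("Node-1", "Node-3"),
        };
        var downable = new HashSet<string> { "Node-1", "Node-2", "Node-3" };
        var pruneList = new HashSet<string> { "Node-3" };

        // Only the two records observed by Node-1 remain.
        foreach (var r in Filter(records, downable, pruneList))
            Console.WriteLine($"{r.Observer} -> {r.Subject}");
    }
}
```

Note that pruning removes the records observed by Node-3, not the records about Node-3, so Node-3 still counts as unreachable afterwards.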

@Arkatufus Arkatufus marked this pull request as draft April 13, 2024 01:38
@Arkatufus Arkatufus marked this pull request as ready for review April 15, 2024 15:49

@Arkatufus Arkatufus left a comment

Self Review

Comment on lines +278 to +280
var allReachable = AllMembers.Select(m => m.UniqueAddress)
    .Where(a => !unreachable.Contains(a))
    .ToImmutableHashSet();

From the set of all cluster members, exclude every member that is in the unreachable set. This gives us all of the cluster members that are reachable.

Comment on lines +284 to +287
var possibleIndirect = reachability.Records
    .Where(r => unreachable.Contains(r.Observer))
    .Select(r => r.Observer)
    .ToImmutableHashSet();

From all the Reachability records that we want to filter, find any Record.Observer that is also in the unreachable set. These are all potential poison records that need to be cleaned.

Comment on lines +294 to +297
var records = reachability.Records
    .Where(r => allReachable.Contains(r.Observer) && r.Subject.Equals(address))
    .Select(r => r.Observer)
    .ToImmutableHashSet();

Based on the Reachability.Records that we have, if all of the nodes in the all-reachable set agree that they also cannot reach the observer of a potential poison record, prune it from the Reachability.Records.
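
The three commented hunks can be read together as one consensus check. The sketch below strings them together on a simplified model; it is not the merged code. `Observation`, `ConsensusPruneSketch`, and `FindPrunable` are hypothetical names, strings stand in for `UniqueAddress`, and the final comparison `witnesses.Count == allReachable.Count` is an assumption carried over from the earlier `SeenBy`-based revision, since the review excerpts do not show that line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a reachability record.
public record Observation(string Observer, string Subject);

public static class ConsensusPruneSketch
{
    public static HashSet<string> FindPrunable(
        IEnumerable<string> allMembers,
        IEnumerable<string> unreachable,
        IReadOnlyCollection<Observation> records)
    {
        var unreachableSet = unreachable.ToHashSet();

        // All members minus the unreachable ones.
        var allReachable = allMembers.Where(m => !unreachableSet.Contains(m)).ToHashSet();

        // Observers that are themselves unreachable: potential poison records.
        var possibleIndirect = records
            .Where(r => unreachableSet.Contains(r.Observer))
            .Select(r => r.Observer)
            .ToHashSet();

        var prune = new HashSet<string>();
        foreach (var address in possibleIndirect)
        {
            // Reachable members that report this candidate as unreachable.
            var witnesses = records
                .Where(r => allReachable.Contains(r.Observer) && r.Subject == address)
                .Select(r => r.Observer)
                .ToHashSet();

            // Assumed consensus rule: every reachable member must agree.
            if (witnesses.Count == allReachable.Count)
                prune.Add(address);
        }
        return prune;
    }

    public static void Main()
    {
        var members = new[] { "Node-1", "Node-2", "Node-3", "Node-4" };
        var unreachable = new[] { "Node-2", "Node-3" };
        var records = new[]
        {
            new Observation("Node-3", "Node-2"),
            new Observation("Node-1", "Node-2"),
            new Observation("Node-4", "Node-2"),
            new Observation("Node-1", "Node-3"),
            new Observation("Node-4", "Node-3"),
        };

        // Node-1 and Node-4 both report Node-3 unreachable, so Node-3 is prunable.
        Console.WriteLine(string.Join(", ", FindPrunable(members, unreachable, records)));
    }
}
```

If the `Node-4 -> Node-3` record is dropped from the example, only one of the two reachable members vouches for Node-3 being gone, so no consensus is reached and Node-3's records are kept as a possible genuine indirect connection.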

@Aaronontheweb Aaronontheweb added this to the 1.5.20 milestone Apr 16, 2024
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) April 16, 2024 14:03

@Aaronontheweb Aaronontheweb left a comment

LGTM - this should harden the SBR system against indirectly connected node cases where an observer managed to mark some gossip as seen on its way out the door.

@Aaronontheweb Aaronontheweb disabled auto-merge April 16, 2024 15:18
@Aaronontheweb Aaronontheweb merged commit f758869 into akkadotnet:dev Apr 16, 2024
9 of 12 checks passed