# Unstable connectivity and stream negotiation on test networks #1623
## Comments
At this point, I could fully isolate my BN from libp2p peers and only connect it to the app node, with a local LN that is only connected to the BN. In this environment, I can see precisely a 50% chance of successful connectivity after each node start. Debugging further down the call stack
Mentioned in Slack, but copying here as well. Let's add links to any relevant tests. If we haven't already, we should revisit these tests now that we have a more defined problem statement and see if there is a way to extend/expand them to replicate it.
Since v0.6.0 onwards (latest is v0.6.2), Testground tests have not been able to replicate the connectivity issue we are experiencing in the Arabica testnet. I was checking by running big network tests that included:
In both cases
regarding:
What is that default deadline? Did we try to increase it? My suspicion: either it is too short and, depending on IO, the timeout might bubble up; or fx fires up several goroutines, which sometimes leads to a deadlock-like scenario (which you also suspected above).
Is this box checked because it was fixed already?
What unit or integration tests, local to the node repo, cover these bug areas?
I wanted to give some further context on the issue (and how it manifests for a user):
Case 1:
Case 2:
Case 3:
All three of the above cases occur even though we've queried the bootstrappers, and they only have about 70-80 peers or so when this problem is observed, IIRC.
Yes
10 seconds, but increasing it does not help. It can block for an hour.
Every swamp test relies on successful connection and/or stream negotiation.
Yes, but are those connections made under similar conditions? Specifically, do we have a test that has a cluster of nodes running that are connected and synced to some height, and then a new node is added and tries to sync and catch up?
We have two tests with Swamp where a node is catching up. Not with a cluster, but one FN or BN. For such clustering we rely on Testground, which we can eventually get on CI. We could add it to the swamp as well, but I am a bit skeptical that it would show us the root cause. After we find the issue, we will definitely write a test to assert this issue never happens again.
It might be caused by the following libp2p issue: libp2p/go-libp2p#1987, due to which the
The core issue was with our discovery wrapper running on the Server, which has a loop that calls FindPeers and receives EvtPeerConnectednessChanged notifications. However, the loop called the
All this explains why the Server could make an outbound stream to the client while the client could not. This also explains why this happens only when the Server has many connections (the subscription channel gets full of events). Additionally, this probably explains why the issue happened only on the Arabica network: the network doesn't have enough full and bridge nodes, thus making
The fix is in #1639. However, tests show that it is not enough and doesn't entirely fix the connectivity, so we should continue our investigations. One of the ideas is to make the buffer size of the subscription configurable through options and to investigate other places where EvtPeerConnectednessChanged is used (edit: implemented in the same fixing PR).
So we were able to "stabilize" network connectivity via:
Stabilize because nodes are able to connect and initialize from a single start, but in quotes because we still observe that some bootstrappers are hesitant to receive connections.
Also, as discussed with a libp2p maintainer, I've added a warning log for the case when events are not read out by the application, so that future developers can find this bug much quicker than we did.
Note that issues (1) and (2) have the same root (inability to finish a connection because of the block on
We may want to make a PR so that the routed host returns the original connection error instead, or both, making it clear these are the same issue/error.
Anyway, we should have a retry mechanism for requests here (though long term it should be better than this implementation); this also helps with #1623 during the first initialization of a node.
As #1684 resolves instability with our bootstrappers, we no longer need to depend on @Wondertan's fork of libp2p. Related to #1623.
Provides `PIDStore` to the header module so that it can be used in `peerTracker`, and replaces the in-memory `peerstore.Peerstore` with an on-disk `peerstore.Peerstore`, so that `peerTracker` can quickly bootstrap itself with previously-seen peers and allow the syncer to initialise its sync target from tracked peers rather than trusted ones, so long as it has a subjective head within the trusting period.

Overrides #2133
Closes #1851, mitigates issues resulting from #1623
Swamp integration tests to follow (tracked in #2506)

### Future note:
This PR introduces a soon-to-be-deprecated feature from libp2p (on-disk peerstore). Once libp2p deprecates and removes this feature, the PIDStore will have to become a PeerAddrStore so that it can save the addr info of good peers to disk instead of just their IDs.
This was fixed a while ago. The issue was a slow event bus reader that made the whole libp2p swarm stall.
## Context
We have two testing networks, Arabica and Mocha. Mocha has proven stable, while Arabica has not. We have been fighting issues on the bootstrappers of the Arabica network for about a month, and it's finally time to document the steps we took, for future us.
## Main issues

Mainly, the instability manifested via:
1. `failed to find any peer in table` errors on node startup
2. `context.Deadline` errors during Header Store initialization on node startup

## Debug Sequence
### Resource Manager

In the beginning, all the problems started with enabling `ResourceManager` in the libp2p host, which we presumed to be buggy on the version we were running (v0.20.1) because of the libp2p discussions about broken auto-scaling in the resource manager, so we disabled it temporarily. Our logs were cluttered with unclear messages that streams were not established due to limits, and it wasn't clear where they were coming from until we updated our libp2p version, which included a fix that made the logs less confusing.

### Infra
Unfortunately, removing the resource manager and updating the libp2p version with the less confusing resource manager errors didn't help us resolve the instability, so we blamed the infrastructure, since Mocha was working fine. Still, we had doubts about that, as Mocha and Arabica were running different node versions (the diff was not too big, but still).
We presumed that Arabica, as a developer-focused network, would have a higher load. @sysrex tried vertically scaling the hardware of 3 bootstrappers to a massive size, which didn't help. Additionally, @sysrex analyzed resource usage on the machines, which was negligible. Finally, we compared the number of peers connected to bootstrappers on Mocha and Arabica. The results showed that Mocha has 10x more peers (yes, we should have seen this earlier; our network observability is in a poor state at the moment) with humble hardware for its bootstrappers, so the conclusion is that this is definitely not a hardware issue.
### Software

Next, we brought both networks to full parity in setup and software version to see if something in the diff could explain it. However, Arabica was still unstable. Meanwhile, @Bidon15 and @Wondertan went full debug mode and found out that the reason behind (3) is a bug inside the new version of header-ex, which was initially fixed by #1592 as an experiment and superseded by #1603.
Even with (3) fixed, we still have (1) and (2), which are very likely software issues as well and need to be reproduced in our testing environments (Testground; yet to be reconfirmed on v0.6.2). Both of these issues come from the deeper internals of libp2p:
1. The `failed to find any peer in table` error comes from `RoutedHost`, which fails to reconnect after the initial connection failure.
2. The `context.Deadline` error comes from deadlocked new-stream creation, which is blocked on the "identify protocol" that cannot negotiate with the remote side.

We think these two issues are interconnected and together describe the unstable connectivity. Version updates from `v0.20.1` to `v0.23.1` during debugging attempts could also contribute to this. Additionally, we contacted libp2p maintainers to support us in debugging this issue. It could be either a misconfiguration on our side or an actual bug inside libp2p (not yet found by other teams relying on it).

### Back to Infra
Meanwhile, @Bidon15 also found two other issues that contributed to the above issues (we are still automating everything, and manual work is prone to human error)
We also deployed an additional bootstrapper, Kaarina #1619
## Current work

The current focus is on debugging libp2p and seeing whether version differences contribute to the issues we face, refining the bootstrapper infrastructure, and improving debuggability via log aggregation by @sysrex.