Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstable connectivity and stream negotiation on test networks #1623

Closed
1 of 3 tasks
Wondertan opened this issue Jan 20, 2023 · 16 comments
Closed
1 of 3 tasks

Unstable connectivity and stream negotiation on test networks #1623

Wondertan opened this issue Jan 20, 2023 · 16 comments
Assignees
Labels
area:p2p bug Something isn't working

Comments

@Wondertan
Copy link
Member

Wondertan commented Jan 20, 2023

Context

We have two testing networks, Arabica and Mocha. Mocha has shown to be stable, while Arabica is still not. We have been fighting with issues on our bootstrappers of the Arabica network for about a month, and it's finally time to document what step we approached for future us.

Main issues

Mainly, unstableness manifested via

  • (1) failed to find any peer in table errors on node startup
  • (2) context.Deadline errors during Header Store initialization on node startup
  • (3) inability to sync headers when the previous two errors didn't appear.

Debug Sequence

Resource Manager

In the beginning, all the problems started with enabling ResourceManager in the libp2p host, which we presumed to be buggy(because of the libp2p discussions about broken auto-scaling in resource manager) on the version we were running(v0.20.1), so we disabled it, temporarily. Our logs were cluttered with unclear logs that streams were not established due to limits, and it wasn't clear where they were coming from until we updated our libp2p version, which included the fix to make logs less confusing.

Infra

Unfortunately, removing the resource manager and updating the libp2p version with less confusing resource manager errors didn't help us resolve unstableness, so we blamed the infrastructure due to Mocha working fine. Still, we had doubts over that as the Mocha and Arabica were running different node versions(The diff was not too big, but still).

We presumed that Arabica is a developer-focused network that should have a higher load. @sysrex tried vertically scaling the hardware of 3 bootstrappers to a massive size, which didn't help. Additionally, @sysrex analyzed resource usage on the machines, which was negligible. Finally, we compared the number of peers connected to bootstrappers on Mocha and Arabica. The results showed that Mocha has 10x more peers(yeah, we should have seen this earlier, and our network observability is in a poor state at this moment) with humble hardware for bootstrappers, so the conclusion is that this is def not a hardware issue.

Software

Next, we brought both networks to full parity in the setup and software version to see if something in the diff could explain it. However, Arabica was still unstable. Meanwhile, @Bidon15 and @Wondertan went full debug mode and found out that the reason behind (3) is a bug inside the new version of header ex, which was initially fixed by #1592 as an experiment and superseded with #1603.

Even we (3) being fixed, we still have (1) and (2) that are very likely to be software issues as well, which need to be reproducible in our testing environments(Testground(yet to be reconfirmed on v0.6.2). Both of those issues come from the deeper internals of libp2p.

  • The (1)failed to find any peer in table error is coming from RoutedHost, which fails to reconnect after the initial connection failure
  • The (2)context.Deadliner error is coming from deadlocked new stream creation, which is either blocked on "identify protocol" that cannot negotiate with the remote side.

We think those two issues are interconnected and describe the unstable connectivity. Version updates from v0.20.1 to v0.23.1 during debugging attempts could also contribute to this. Additionally, we contacted libp2p maintainers to support us in debugging this issue. It could be either a misconfiguration from our side or an actual bug inside libp2p(not yet found by other teams relying on it).

Back to Infra

Meanwhile, @Bidon15 also found two other issues that contributed to the above issues (we are still automating everything, and manual work is prone to human error)

  • Outdated images
  • One of the bootstrappers wasn't correctly connected to the app node so it couldn't serve the headers.

We also deployed an additional bootstrapper, Kaarina #1619

Current work

The current focus is on debugging libp2p and seeing if version differences contribute to the issues we face, refining bootstrapper infrastructure, and improving debuggability by logs aggregation by @sysrex.

@Wondertan Wondertan added the bug Something isn't working label Jan 20, 2023
@Wondertan
Copy link
Member Author

At this point, I could fully isolate my BN from libp2p peers and only connect it to the app node with a local LN that is only connected to BN. In this environment, I can see precisely a 50% chance of successful connectivity after each node start. Debugging further down the call stack

@MSevey
Copy link
Member

MSevey commented Jan 20, 2023

mentioned in slack but copying here as well.

Let's add links to any relevant tests. If we haven't already, we should revisit these test now that we have a more defined problem statement and see if there is a way to extend/expand the test to replicate.

@Bidon15
Copy link
Member

Bidon15 commented Jan 20, 2023

Since v0.6.0 onwards(latest is v0.6.2), testground tests were not able to replicate the connectivity issue we are experiencing in arabica testnet

I've added udp/quic-v1 in parallel with tcp to make node cfg as close to prod as possible. PR: celestiaorg/test-infra#146

I was checking with running big network tests that included:

  1. PFB from 1k Light nodes that are connected to 100 Bridge Nodes
  2. Block Reconstruction where we have 1412 Light Nodes connected to 10 Bridge Nodes and 2 Full Nodes

In both cases no instability was observed (the tests would obviously failed miserably at startup) in testground k8s.

@liamsi
Copy link
Member

liamsi commented Jan 23, 2023

regarding:

(2) context.Deadline errors during Header Store initialization on node startup

what is that default deadline? Did we try to increase it? My suspicion: either this is too short and depending on IO might bubble up, or, fx fires up several go-routines which lead to a deadlock-like scenario sometimes (which you also suspected above).

(3) inability to sync headers when the previous two errors didn't appear.

Is this box checked because it was fixed already?

@MSevey
Copy link
Member

MSevey commented Jan 23, 2023

What unit or integration test, local to the node repo, cover these bug areas?

@renaynay
Copy link
Member

I wanted to give some further context on the issue (and how it manifests for a user):

Case 1:
User connects to either network (mocha or arabica) with a fresh instance of a node (light or full) and receives a context deadline exceeded error which spits out that the initStore hook failed, meaning that the node was unable to fetch the block by the genesis hash from any bootstrapper of the hardcoded bootstrappers.

Case 2:
User connects to either network (mocha or arabica) with a node (light or full) that is NOT fresh (meaning has already synced some blocks to its datastore) and receives a context deadline exceeded error which spits out that the initStore hook failed, but the error here is incorrect because the failure actually occurs inside sync.Start where the node is unable to fetch a new network head from any of the hardcoded bootstrappers.

Case 3:
User connects to either network (mocha or arabica) with a node (light or full), fresh or not, and fails to start due to no peers in routing table.

All three of the above cases occur even though we've queried the bootstrappers and they only have about 70-80 peers or so when this problem is observed IIRC.

@Wondertan
Copy link
Member Author

@liamsi

Is this box checked because it was fixed already?

Yes

what is that default deadline? Did we try to increase it? My suspicion: either this is too short and depending on IO might bubble up, or, fx fires up several go-routines which lead to a deadlock-like scenario sometimes (which you also suspected above).

10 seconds, but increasing it does not help. In can block for an hour

@Wondertan
Copy link
Member Author

@MSevey

What unit or integration test, local to the node repo, cover these bug areas?

Every swamp test relies on successful connection and/or stream negotiation.

@MSevey
Copy link
Member

MSevey commented Jan 24, 2023

@Wondertan

Every swamp test relies on successful connection and/or stream negotiation.

yes but are those connections made under similar conditions? Specifically, do we have a test that has a cluster of nodes running that are connected and synced to some height, and then a new node is added and tries to sync and catch up?

@Wondertan
Copy link
Member Author

We have two tests with Swamp where a node is catching up. Not with a cluster, but one FN or BN. For such clustering we rely on Testground, which we eventually can get on CI. We could add it to the swamp as well, but I am a bit skeptical that it will show us root cause.

After we find the issue, we will definitely write a test to assert this issue never happens again.

@staheri14
Copy link
Collaborator

It might be caused by the following libp2p issue libp2p/go-libp2p#1987 due to which the ConnectionGater on the bootstrap nodes refuses incoming connections in one of its methods that implements the ConnectionGater interface specially the InterceptSecured(). This means the connection is actually refused by the server, yet the querying node assumes the connection is established and its further queries time out i.e., emitting context deadline exceeded.
To verify whether this is the root cause, you may implement a wrapper around the libp2p BasicConnectionGater which is used in your code, and additionally log the state of connections.

@Wondertan
Copy link
Member Author

Wondertan commented Jan 26, 2023

The core issue was with our discovery wrapper running on the Server, which has a loop that calls FindPeers and rcvs EvtPeerConnectednessChanged notifications.

However, the loop called the FindPeers without any timeout on the context and blocked forever, avoiding reads out of the event subscription -> making subscription channel full -> blocking the event emission -> blocking new connection handling logic. At this point, the connection is ready and only has to run the start method to accept inbound streams, but it wasn't run.

All this explains why the Server could make an outbound stream to the client while the client did not. This also explains why this happens only when the Server has many connections(the subscriptions channel gets full of events). Additionally, this may probably explain why the issue happened only on the Arabica network because the network doesn't have enough full and bridge nodes, thus making FIndPeers block forever and blocking the whole connection flow.

The fix is in #1639. However, tests show that it is not enough, and this doesn't entirely fix the connectivity, so we should continue our investigations. One of the ideas is to increase the buffer size of the subscription through options and investigate other places where EvtConnectednessChanged is used(edit: implemented in the same fixing PR).

@Wondertan
Copy link
Member Author

Wondertan commented Jan 28, 2023

So we were able to "stabilize" network connectivity via:

Stabilize because nodes are able to connect and initialize from a single start, but in quotes because we still observe that some bootstrappers are hesitant to receive connections.

@Wondertan
Copy link
Member Author

Also, I've made a warning log, as discussed with the libp2p maintainer, in case events are not read out by the application, s.t. future developers can see find bug much quicker than us.

@Wondertan
Copy link
Member Author

Note that issues (1) and (2) have the same root(inability to finish connection because of the block on IdentifyWait as per #1623 (comment)) but manifest differently. The failed to find any peer in table is coming from the routed host here and happens when we fail to connect and then routed host tries to find the address for the peer, but we are not connected at all and routing table is empty, thus the error.

We may want to make a PR to make the routed host return the original connection error instead or both, so it's clear these are the same issue/error context deadline exceeded coming from IdentifyWait.

Wondertan added a commit that referenced this issue Jan 31, 2023
Anyway, we should have a retry mechanism for requests here(but long term, it should be better than this implementation), and also, this is helping with #1623 during the first initialization of a node.
renaynay pushed a commit that referenced this issue Jan 31, 2023
Anyway, we should have a retry mechanism for requests here(but long term, it should be better than this implementation), and also, this is helping with #1623 during the first initialization of a node.
renaynay pushed a commit that referenced this issue Feb 1, 2023
Anyway, we should have a retry mechanism for requests here(but long term, it should be better than this implementation), and also, this is helping with #1623 during the first initialization of a node.
renaynay added a commit that referenced this issue Feb 8, 2023
As #1684 resolves instability with our bootstrappers, we no longer need
to depend on @Wondertan's fork of libp2p

Related to #1623
renaynay added a commit that referenced this issue Sep 5, 2023
Provides `PIDStore` to header module so that it can be used in
`peerTracker` and replaces mem `peerstore.Peerstore` with on-disk
`peerstore.Peerstore` so that `peerTracker` can quickly bootstrap itself
with previously-seen peers and allow syncer to initialise its sync
target from tracked peers rather than trusted so long as it has a
subjective head within the trusting period.

Overrides #2133 

Closes #1851, mitigates issues resulting from #1623

Swamp integration tests to follow (tracked in #2506)

### Future note: 

This PR introduces a soon-to-be deprecated feature from libp2p (on-disk
peerstore). Once libp2p deprecates and removes this feature, the
PIDStore will have to become a PeerAddrStore such that it can save addr
info of good peers to disk instead of just their IDs.
@Wondertan
Copy link
Member Author

This was fixed a while ago. The issue was with the slow event bus reader that made the whole libp2p swarm to stall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:p2p bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants