Investigate peering #983
Comments
What is a long time and what would be the ideal?
Multiple minutes
A few seconds? We have all the peer_ids on-chain so it's unclear why it's taking a long time to establish the peering. We just leave it for a "long enough" time and eventually it seems to figure it out, but this is opaque and unpredictable.
Is the peering serial at each node?
I don't understand the question, please can you elaborate?
Sorry. When each peer is creating a connection with each other peer, which I assume they do, are these connections created in series or in parallel? Again I would assume in parallel but wanted to confirm.
I don't know - it would presumably be possible to find out by looking at the underlying networking code. @AlastairHolmes has identified some strangeness in the libp2p trace logs which we are now investigating.
@andyjsbell Pretty sure it's asynchronous, i.e. it pushes the messages to the network interface in series, but of course it doesn't wait for the response to each one before sending the next.
I suppose I was referring to the network interface: whether it was scheduling requests in series, maybe even over a blocking socket or a pool of sockets.
Could be related to the fact that we are not storing IP addresses on-chain as was agreed, but are instead presumably maintaining a DHT on each node (the explicit thing that we wanted to avoid).
Ok, so the fact that we only store the peer_id means that we're waiting (1) for the peer_id to be registered but then also (2) for the matching IP address to propagate to all nodes on the network. Sounds like it could be an issue.
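To make the two waits above concrete, here is a minimal sketch of the difference between having only a peer_id on-chain and having a peer_id plus an address. It is not Substrate or libp2p code: `PeerRecord`, `DialPlan` and `plan_dials` are hypothetical names invented for illustration, and the on-chain registry is modelled as a plain slice.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical on-chain record: a peer id and, optionally, its address.
/// (Assumption: the thread only says peer_ids are stored today.)
#[derive(Debug, Clone, PartialEq)]
pub struct PeerRecord {
    pub peer_id: String,
    pub address: Option<SocketAddr>,
}

/// What a node can do for a given peer once it has read chain state.
#[derive(Debug, PartialEq)]
pub enum DialPlan {
    /// The address is known from chain state, so the node can dial at once.
    Direct(SocketAddr),
    /// Only the peer id is known, so the node must wait for some
    /// discovery mechanism (for example a DHT) to supply an address.
    NeedsDiscovery,
}

/// Decide, per peer, whether a direct dial is possible.
pub fn plan_dials(records: &[PeerRecord]) -> HashMap<String, DialPlan> {
    records
        .iter()
        .map(|record| {
            let plan = match record.address {
                Some(addr) => DialPlan::Direct(addr),
                None => DialPlan::NeedsDiscovery,
            };
            (record.peer_id.clone(), plan)
        })
        .collect()
}
```

Under these assumptions, every peer that comes back as `NeedsDiscovery` adds the second wait described above, while `Direct` peers only depend on the first.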
I am going to try implementing a quick test of this solution, in the chore/merge-config-option branch, so we can try it out. First I will finish looking at the logs, to confirm that the nodes don't know each other's IP addresses.
I am also aware of this problem, which seems to be related: paritytech/substrate#9827
Downgrading this since we're going to roll the testnet out gradually, so instant peering is less of an issue. We should however check our assumptions re. Backup Validators. If we're not forcing peering prior to a Keygen between Backup Validators, then it's possible that Keygen could fail because of a peering issue, even though no Validator is really at fault. Not sure how to log this one accurately; it might warrant its own issue instead. Re. this current issue, we're pretty sure that the problem (slowness of complete peering) is caused by libp2p's internals putting DHT updates on a timer. We don't blame them, but it's probably not an easy thing to configure without making and maintaining some changes to Substrate itself (not out of the question, but could cause some nasty merges down the line).
@morelazers Not going to look at this now, but I would like to change this issue to: Handle p2p connections inside the CFE instead of using substrate to avoid slow peering. TODO Plan.
Noted @AlastairHolmes. Will bin this and make a new issue for cleanliness.
Description
Keygen requires us to be peered with everyone else on the network. If we're not peered, we cannot complete Keygen (though nothing will stop us from starting it).
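As a rough illustration of the precondition above, a node could check its peer set before starting Keygen rather than discovering the gap halfway through. This is a sketch only: `missing_peers` and `ready_for_keygen` are hypothetical helpers, not part of the CFE or Substrate, and peers are identified by plain strings.

```rust
use std::collections::HashSet;

/// Peers that Keygen needs but that we are not connected to yet.
/// Our own id is skipped, since a node does not peer with itself.
pub fn missing_peers(
    required: &HashSet<String>,
    connected: &HashSet<String>,
    own_id: &str,
) -> Vec<String> {
    let mut missing: Vec<String> = required
        .iter()
        .filter(|id| id.as_str() != own_id && !connected.contains(*id))
        .cloned()
        .collect();
    // Sort so that log output and comparisons are deterministic.
    missing.sort();
    missing
}

/// True only when every other required participant is already peered.
pub fn ready_for_keygen(
    required: &HashSet<String>,
    connected: &HashSet<String>,
    own_id: &str,
) -> bool {
    missing_peers(required, connected, own_id).is_empty()
}
```

Logging the result of `missing_peers` when the check fails would also make it easier to tell a peering problem apart from a Validator actually being at fault.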
When we boot up 150 nodes, it takes a long time for all the peers to connect to each Substrate node, even though all nodes are currently in the same datacenter.
It would be good to answer the following two questions: