Investigate peering #983

Closed
morelazers opened this issue Dec 10, 2021 · 16 comments

Comments

@morelazers

Description

Keygen requires us to be peered with everyone else on the network. If we're not peered, we cannot complete Keygen (though nothing will stop us from starting it).

When we boot up 150 nodes, it takes a long time for all the peers to connect to each Substrate node, even though all nodes are currently in the same datacenter.
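For a sense of scale, here is a rough count of what "peered with everyone" means at 150 nodes. It assumes a full mesh (every node connected to every other node), which is what the Keygen requirement above implies. The function name is illustrative only.

```python
def full_mesh_connections(node_count: int) -> int:
    """Number of distinct node pairs in a fully connected network."""
    return node_count * (node_count - 1) // 2


nodes = 150
peers_per_node = nodes - 1                      # every node must reach all the others
total_connections = full_mesh_connections(nodes)
print(peers_per_node, total_connections)        # 149 11175
```

So each node has 149 peers to find and dial, and the network as a whole needs 11175 connections before Keygen can succeed.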

It would be good to answer the following two questions:

  • Why does it take so long to peer with everyone?
  • Is it possible to "force" peering?
@morelazers
Author

After multiple minutes of the network being live, and all peer_ids being published on-chain:

image

@andyjsbell

What is a long time and what would be the ideal?

@morelazers
Author

> What is a long time

Multiple minutes

> what would be the ideal?

A few seconds? We have all the peer_ids on-chain so it's unclear why it's taking a long time to establish the peering. We just leave it for a "long enough" time and eventually it seems to figure it out, but this is opaque and unpredictable.

@andyjsbell

Is the peering serial at each node?

@morelazers
Author

I don't understand the question, please can you elaborate?

@andyjsbell

> I don't understand the question, please can you elaborate?

Sorry. When each peer is creating a connection with each other peer, which I assume they do, are these connections created in series or in parallel? Again I would assume in parallel but wanted to confirm.

@morelazers
Author

I don't know - it would presumably be possible to find out by looking at the underlying networking code.

@AlastairHolmes has identified some strangeness in the libp2p trace logs which we are now investigating.

@dandanlen
Collaborator

@andyjsbell pretty sure it's asynchronous, i.e. it pushes the messages to the network interface in series, but of course it doesn't wait for the response to each one before sending the next.
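To make the series vs. parallel distinction concrete, here is a minimal sketch using Python's `asyncio`. It is not the Substrate or libp2p code: `dial` is a hypothetical stand-in for a connection attempt, and the 20 ms latency is an arbitrary assumption.

```python
import asyncio
import time


async def dial(peer_id: str, latency: float = 0.02) -> str:
    # Hypothetical stand-in for a connection attempt; the sleep models the round trip.
    await asyncio.sleep(latency)
    return peer_id


async def dial_serially(peer_ids):
    # Waits for each response before starting the next dial.
    return [await dial(p) for p in peer_ids]


async def dial_concurrently(peer_ids):
    # Starts the dials one after another, then waits for all responses together.
    return await asyncio.gather(*(dial(p) for p in peer_ids))


def timed(coro_fn, peer_ids):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(peer_ids))
    return result, time.perf_counter() - start


peers = [f"peer-{i}" for i in range(10)]
serial_result, serial_secs = timed(dial_serially, peers)
concurrent_result, concurrent_secs = timed(dial_concurrently, peers)
print(f"serial: {serial_secs:.3f}s, concurrent: {concurrent_secs:.3f}s")
```

In the serial case the total time grows with the number of peers; in the concurrent case it stays close to a single round trip. If the real dialing behaves like the second function, slow peering would have to come from somewhere other than the dialing itself.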

@andyjsbell

I suppose I was referring to the network interface: whether it schedules requests in series, perhaps through a blocking socket or a pool of sockets.

@morelazers
Author

Could be related to the fact that we are not storing IP addresses on-chain as was agreed, but are instead presumably maintaining a DHT on each node (the explicit thing that we wanted to avoid).

#960

@dandanlen
Collaborator

Ok, so the fact that we only store the peer_id means that we're waiting (1) for the peer_id to be registered but then also (2) for the matching IP address to propagate to all nodes on the network. Sounds like it could be an issue.
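The two conditions above can be sketched as a simple lookup. This is an illustration only: the function names, peer IDs and addresses are made up, and the real address book lives inside the networking layer rather than in a plain dictionary.

```python
def dialable_peers(registered_peer_ids, known_addresses):
    """Peers we can connect to now: registered on-chain AND address known locally."""
    return {p: known_addresses[p] for p in registered_peer_ids if p in known_addresses}


def waiting_on_address(registered_peer_ids, known_addresses):
    """Peers that are registered on-chain but whose address has not reached us yet."""
    return [p for p in registered_peer_ids if p not in known_addresses]


registered = ["peer-a", "peer-b", "peer-c"]   # condition (1): peer_ids on-chain
known = {"peer-a": "10.0.0.1:30333"}          # condition (2): addresses learned so far

print(dialable_peers(registered, known))      # {'peer-a': '10.0.0.1:30333'}
print(waiting_on_address(registered, known))  # ['peer-b', 'peer-c']
```

If the IP addresses were stored on-chain next to the peer_ids, `known` would be filled as soon as registration is visible and the second list would be empty.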

@AlastairHolmes
Contributor

AlastairHolmes commented Dec 13, 2021

I am going to try implementing a quick test of this solution in the chore/merge-config-option branch so we can try it out. First I will finish looking at the logs to confirm that the nodes don't know each other's IP addresses.

@AlastairHolmes
Contributor

AlastairHolmes commented Dec 13, 2021

I am also aware of this problem which seems to be related: paritytech/substrate#9827

@morelazers
Author

Downgrading this since we're going to roll the testnet out gradually, so instant peering is less of an issue.

We should however check our assumptions re. Backup Validators. If we're not forcing peerage prior to a Keygen between Backup Validators, then it's possible that Keygen could fail because of a peering issue, even though no Validator is really at fault.
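One way to picture such a check is a guard that refuses to start Keygen while any participant is unconnected. This is a hypothetical sketch, not the existing Keygen code: `missing_peers`, `can_start_keygen` and the IDs below are invented for illustration.

```python
def missing_peers(participants, connected_peers, own_id):
    """Keygen participants (other than ourselves) that we are not connected to."""
    return sorted(set(participants) - set(connected_peers) - {own_id})


def can_start_keygen(participants, connected_peers, own_id):
    """True only when every other participant is already a connected peer."""
    return not missing_peers(participants, connected_peers, own_id)


participants = ["node-1", "node-2", "node-3"]
connected = ["node-2"]

print(missing_peers(participants, connected, own_id="node-1"))     # ['node-3']
print(can_start_keygen(participants, connected, own_id="node-1"))  # False
```

Reporting the missing peers separately would also make it possible to tell a peering failure apart from a Validator that is really at fault.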

Not sure how to log this one accurately, it might warrant its own issue instead.

Re. this current issue, we're pretty sure that the problem (slowness of complete peering) is caused by libp2p's internals putting DHT updates on a timer. Don't blame them, but it's probably not an easy thing to configure without making and maintaining some changes to Substrate itself (not out of the question, but could cause some nasty merges down the line).

@AlastairHolmes AlastairHolmes self-assigned this Jan 3, 2022
@AlastairHolmes
Contributor

@morelazers Not going to look at this now, but I would like to change this issue to:

Handle p2p connections inside the CFE instead of using Substrate, to avoid slow peering. TODO: Plan.

@morelazers
Author

Noted @AlastairHolmes. Will bin this and make a new issue for cleanliness.

#1103.
