Investigate peering #983

morelazers · 2021-12-10T13:36:36Z

Description

Keygen requires us to be peered with everyone else on the network. If we're not peered, we cannot complete Keygen (though nothing will stop us from starting it).

When we boot up 150 nodes, it takes a long time for all the peers to connect to each Substrate node, even though all nodes are currently in the same datacenter.

It would be good to answer the following two questions:

Why does it take so long to peer with everyone?
Is it possible to "force" peering?

morelazers · 2021-12-10T13:37:29Z

After multiple minutes of the network being live, and all peer_ids being published on-chain:

andyjsbell · 2021-12-10T13:45:56Z

What is a long time and what would be the ideal?

morelazers · 2021-12-10T13:53:11Z

> What is a long time

Multiple minutes

> what would be the ideal?

A few seconds? We have all the peer_ids on-chain so it's unclear why it's taking a long time to establish the peering. We just leave it for a "long enough" time and eventually it seems to figure it out, but this is opaque and unpredictable.

andyjsbell · 2021-12-10T14:58:50Z

Is the peering serial at each node?

morelazers · 2021-12-10T15:01:40Z

I don't understand the question, please can you elaborate?

andyjsbell · 2021-12-10T15:13:59Z

> I don't understand the question, please can you elaborate?

Sorry. When each peer is creating a connection with each other peer, which I assume they do, are these connections created in series or in parallel? Again I would assume in parallel but wanted to confirm.

morelazers · 2021-12-10T15:35:33Z

I don't know - it would presumably be possible to find out by looking at the underlying networking code.

@AlastairHolmes has identified some strangeness in the libp2p trace logs which we are now investigating.

dandanlen · 2021-12-10T15:38:13Z

@andyjsbell pretty sure it's asynchronous, i.e. it pushes the messages to the network interface in series, but of course it doesn't wait for the response to each one before sending the next.

andyjsbell · 2021-12-10T15:50:45Z

I suppose I was referring to the network interface: whether it schedules requests in series, maybe even over a blocking socket, or uses a pool of sockets.

morelazers · 2021-12-13T08:59:50Z

Could be related to the fact that we are not storing IP addresses on-chain as was agreed, but are instead presumably maintaining a DHT on each node (the explicit thing that we wanted to avoid).

#960

dandanlen · 2021-12-13T09:10:19Z

Ok, so the fact that we only store the peer_id means that we're waiting (1) for the peer_id to be registered but then also (2) for the matching IP address to propagate to all nodes on the network. Sounds like it could be an issue.
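A minimal sketch of the two-step wait described above, assuming a hypothetical `PeerRecord` type and a `discovered` map standing in for whatever addresses have propagated so far. None of these names come from the CFE or Substrate; they only illustrate why a peer with a registered peer_id can still be undialable until its address arrives.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical view of one registered validator. `address` is `None` when
/// only the peer_id is stored on-chain.
#[derive(Debug, Clone, PartialEq)]
pub struct PeerRecord {
    pub peer_id: String,
    pub address: Option<SocketAddr>,
}

/// Splits registered peers into those we can dial immediately and those
/// still waiting for an address to arrive via discovery.
/// `discovered` is an assumed stand-in for addresses propagated so far.
pub fn split_dialable(
    records: &[PeerRecord],
    discovered: &HashMap<String, SocketAddr>,
) -> (Vec<(String, SocketAddr)>, Vec<String>) {
    let mut dialable = Vec::new();
    let mut waiting = Vec::new();
    for record in records {
        // Prefer an on-chain address; fall back to a discovered one.
        let address = record
            .address
            .or_else(|| discovered.get(&record.peer_id).copied());
        match address {
            Some(addr) => dialable.push((record.peer_id.clone(), addr)),
            None => waiting.push(record.peer_id.clone()),
        }
    }
    (dialable, waiting)
}
```

With only peer_ids on-chain, every record starts with `address: None`, so the `waiting` list shrinks only as fast as discovery fills in `discovered`. If addresses were stored alongside the peer_ids, every peer would be dialable as soon as it is registered.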

AlastairHolmes · 2021-12-13T10:52:21Z

I am going to try implementing a quick test of this solution, in the chore/merge-config-option branch, so we can try it out. First I will finish looking at the logs, to confirm that the nodes don't know each other's IP addresses.

AlastairHolmes · 2021-12-13T10:56:53Z

I am also aware of this problem which seems to be related: paritytech/substrate#9827

morelazers · 2021-12-14T21:58:12Z

Downgrading this since we're going to roll the testnet out gradually, so instant peering is less of an issue.

We should however check our assumptions re. Backup Validators. If we're not forcing peering prior to a Keygen between Backup Validators, then it's possible that Keygen could fail because of a peering issue, even though no Validator is really at fault.
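One way to make that distinction visible is a connectivity check before Keygen starts. The following is only a sketch under assumptions: `PeeringStatus` and `check_peering` are hypothetical names, and the set of connected peer_ids is assumed to be available from the networking layer in some form.

```rust
use std::collections::BTreeSet;

/// Outcome of a hypothetical pre-Keygen connectivity check.
#[derive(Debug, PartialEq)]
pub enum PeeringStatus {
    /// Every other participant is currently connected.
    Ready,
    /// These participants are not connected yet (in input order).
    MissingPeers(Vec<String>),
}

/// Compares the Keygen participant list against the peers we are
/// currently connected to, skipping our own id.
pub fn check_peering(
    our_id: &str,
    participants: &[&str],
    connected: &BTreeSet<String>,
) -> PeeringStatus {
    let missing: Vec<String> = participants
        .iter()
        .filter(|id| **id != our_id && !connected.contains(**id))
        .map(|id| (*id).to_owned())
        .collect();
    if missing.is_empty() {
        PeeringStatus::Ready
    } else {
        PeeringStatus::MissingPeers(missing)
    }
}
```

A `MissingPeers` result before the ceremony would suggest that a later failure is a peering problem rather than a fault of any particular Validator.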

Not sure how to log this one accurately, it might warrant its own issue instead.

Re. this current issue, we're pretty sure that the problem (slowness of complete peering) is caused by libp2p's internals putting DHT updates on a timer. Don't blame them, but it's probably not an easy thing to configure without making and maintaining some changes to Substrate itself (not out of the question, but could cause some nasty merges down the line).

AlastairHolmes · 2022-01-03T10:36:39Z

@morelazers Not going to look at this now, but I would like to change this issue to:

Handle p2p connections inside the CFE instead of using substrate to avoid slow peering. TODO Plan.

morelazers · 2022-01-03T10:39:31Z

Noted @AlastairHolmes. Will bin this and make a new issue for cleanliness.

#1103.

morelazers added p1-asap State Chain labels Dec 10, 2021

morelazers added p0-dropeverything p2-somedaysoon and removed p1-asap p0-dropeverything labels Dec 13, 2021

AlastairHolmes self-assigned this Jan 3, 2022

AlastairHolmes mentioned this issue Jan 3, 2022

RPC Websocket buffer overflowing #966

Closed

morelazers mentioned this issue Jan 3, 2022

Handle p2p connections inside the CFE instead of using substrate to avoid slow peering. #1103

Closed

morelazers closed this as completed Jan 3, 2022
