-
Notifications
You must be signed in to change notification settings - Fork 769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node keeps loosing peers. #528
Comments
@dmitry-markin and @altonen could both of you please both look into this? |
To share intermediate results: I was able to reproduce the issue on polkadot+substrate master, but also on polkadot-0.9.28 & polkadot-0.9.16. So, it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to be suspicious that the issue might be caused by some external factors. |
I saw this quite a while ago on my PC as well, but thought it was normal. Like dropping peers from 30 to 10 or even less, and then slowly creeping back up. |
I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as paritytech/substrate#12105, namely keep alive timeout. |
I tried reducing keep alive timeout from 10 sec to 1 sec (locally) to see if it affects the peer count, and it looks like it doesn't. Overall, the peer count jitter seems purely random: sometimes the peer count is stable on 20-30 peers, and sometimes it starts dropping to almost 0. |
@dmitry-markin curious if you saw any errors. Or it just silently disconnects? Surely libp2p has some "debug" flag. |
I don't recall anything suspicious in the logs, but it's been a while and I'm unsure if I run the node with something like |
Tried the latest substrate / polkadot Some of the errors I saw:
The primary source of the majority of disconnects appears to be |
I've found two reasons as to why the peer count drops:
After a while the peer count has dropped to 0-2 range and stays there. Genesis mismatch sounds like a bug in the ancestry search code but I'm not sure why the requests are refused. From the logs it looks like at some point node sends the same request a third time soon after the other two after which the connection is closed. This could explain why the connection is closed because syncing code considers three same requests from the peer as fatal error. Related issue: #531 |
That's quite strange that ancestry search even starts in case of genesis mismatch — we now have genesis hash hardcoded in substrate & polkadot protocol names, so the nodes must not talk to each other at all. |
They do talk and exchange genesis hashes and then proceed to sync. I think ancestry search has bug which incorrectly determines the nodes have different genesis hashes. |
May be that's because of the legacy fallback protocol name which doesn't contain the genesis hash: And we should phase-out the legacy names? |
Apart from this it's also a good idea to check a genesis hash in ancestry search, of course. |
The genesis hash is also exchanged in the |
I think the issue might be this function. It allows the peer to send the same request multiple times since it doesn't properly keep track of which blocks have been requested and keeps resending requests, sometimes in very rapid succession. This causes the remote node to close the connection when it notices that a peer has sent the same request multiple times. Ancestor search bug is probably related to the fact that the peer state is tied to the latest sent request but there is no notion of a stale response so if local node has sent ancestor search requests before without hearing any response and then sends one more, this time starting from genesis (here) and then receives receives a response to some previous request (i.e., the response is stale), the code can fail with genesis mismatch. I noticed another issues with block requests: local node requests blocks |
Care to elaborate? Is there a test or a log that demonstrate this?
IIRC Request-respnse protocol should guarantee that each response is matched to a request with an ID.
What do you mean by "too far in the future"? As in block timestamp is too far in the future? |
I will try to reproduce it with a test but here is a log showing the bug:
My interpretation of this is that node sends a gap sync request starting from block 16128, at 16:54:51.486, then a response to some previous block request is received and as a result of the response, another block request is sent at 16:54:51.758 starting from block 16128 again while the previous request is still in flight.
That is true. But right now
The slot is too far in the future, as reported by Aura. |
To me it looks like the response it actully correct and contains the blocks that were actually requested. Notice that the request for blocks in descending order. Response id also matches request id. The interesting bit is that the first request was issued for the gap sync ("New gap block request for ..") and then another request was issued for the main sync ("New block request for ..") for the same blocks. So there's clearly a bug there with the gap sync interfering the the main sync.
Yeah, ideally this should be handled by the import queue. |
I think this is source of the problem. Peer is sending multiple requests and apparently the cancelling behavior ( I'll start working on the fixes. Thanks for the information.
So the import queue should hold on to the block if it's in a future slot and not disconnect the peer, right? |
If there isn't a huge drift in the local clock, this isn't possible (or should not happen). Where did you observe this behavior? The import queue should reject these peers as they are trying too fool you at least from our own local view. But this should also only happen when we are syncing on the tip of the chain. |
I had some clock drift because NTP was not enabled. Enabling it fixed the importing issue. |
In addition to sending duplicate requests, this bug is also related to #556. Peer count is reported incorrectly to be 40 and blocks requests work because the block request handler doesn't check if the peer is accepted by the syncing code (which to me sounds like a security bug) and syncing works for the duration of warp sync but when it starts to download blocks and announcing them, it notices that the block announcement substream has been closed and then terminates the substream on its end too. That issue is currently blocked by paritytech/substrate#12828 but when it's merged, I'll continue the implementation of custom handshakes. I think I also found reason the why the code sends duplicate requests even if there are measures in place to prevent that. If peer sends a block request and then syncing gets restarted, the block request sent before the restart is not discarded but received normally which will then reset the peer to be available again and send another request. Then the request after the sync restart is received which does the same thing which basically results two requests being constantly in active. The block request scheduling can then request the same range in two different requests from the same peer which causes the connection to be closed. |
@altonen Is there a way to reproduce this log? The duplicate request sure looks suspicious but I can't reproduce it. |
I've been sidetracked with another issue that I think is the central issue of this bug report. The problem seems to be that there nearly all full node slots are occupied on the network but this is not noticed until the local node sends a block announcement. The reason for duplicate requests was a recent change in moving block requests to |
this one? paritytech/substrate#12739 |
* migrate back to Substrate master * fmt * clippy
Hi, @altonen. We are experiencing similar issues and are looking for potential solutions to stabilize peer connectivity. Do we have any clue how to resolve the problem? |
|
Hi @felixshu could you post the logs of the moment when your node loses peers (+ a couple of minutes before), the command line that you use to start the node, and the version/commit-id you are running? |
Start a new warp sync using latest polkadot+substrate master with an empty db dir in GCP. The nodes keeps loosing almost all peers every couple of minutes.
Notice how the number of peers jumps from 20+ to 2.
Unfortunately our network protocol does not include reason for why other peers decide to disconnect. However it looks like this is happening after we announce latest blocks to other peers. It could be that something was added to the block announcement that broke network protocol compatibility because of this code here:
https://github.com/paritytech/substrate/blob/ded44948e2d5a398abcb4e342b0513cb690961bb/primitives/consensus/common/src/block_validation.rs#L89
The text was updated successfully, but these errors were encountered: