Frequent small reorgs caused by heimdall <-> bor interaction #1118
Comments
On the nodes that I configured to use the HeimdallApp in-process client, I experienced bad merkle root issues on the bor side at block 51737759.
If you are not running a validator node, then your chain won't be locked.
Ok, got it. Edited the original post to reflect this. The problem still exists on normal full nodes, as demonstrated by the logs. Sentries would suffer from the same issue, so if a validator is only running a single sentry, propagation of a block would be halted for the duration of the call. From a network perspective, the result would be the same as the validator not producing blocks for this period.
Hey @VAIBHAVJINDAL3012, just wanted to follow up on this, given that the issue happening on sentry nodes would have the same effect as delayed block production on validators.
Hi @CitizenSearcher. Even on a validator, the lock is acquired on the block at depth 16, so it can't cause 1-2 block reorgs and won't cause any delay in block production. Reorgs of 1-2 blocks do occur sometimes on the Bor network due to P2P latency.
@VAIBHAVJINDAL3012 Thanks for the explanation. As I posted in the logs, this function was getting called on my normal full node, and was causing client-side reorgs on it. If it's not supposed to be getting called on a full node, perhaps I had something misconfigured?
Just a data point: while I'm also seeing way more reorgs than what could be considered acceptable (many per hour), I've tried your patch on one of my nodes and it made no difference for me.
Your Heimdall shouldn't call this function. Can you send me your Heimdall logs related to the milestone module?
What is the average length of the reorgs you are facing?
Looking at the past 24 hours on a random one of my nodes, the vast majority are of depth 1, maybe 10% of depth 2, and 3% of depth 3-8. The total number is around 250.
Can you please check the number of connected peers? It seems to me that you are facing this issue due to P2P latency.
The node in question is currently connected to 668 peers. It's running on a 64-core EPYC 7702P (but limited to 24 cores) with 512GB of memory and a RAID0 of Intel NVMe drives.
I have the same observation as kaber2, which motivated the original post / investigation. There are several small reorgs per hour. The vast majority are 1 block reorgs, with occasional 2 or 3 block ones. My nodes also have good hardware and are well connected. Most of the network reorgs I see occur on all my nodes, which generally receive the blocks within 200ms or so of each other. The ones caused client side by the race condition are rarer and more arbitrary, not appearing on all nodes. With regard to logs related to milestone, here's a sample of recent logs I'm seeing:
Additionally, bor logs related to milestone:
Hi @CitizenSearcher. You can ignore these Heimdall logs for now; they don't have any impact on reorgs. What I am saying is that sometimes the block proposed by the miner doesn't reach you in time, and meanwhile the secondary block reaches you earlier. This causes a 1-2 block reorg. Reorgs of 1-2 blocks are acceptable, as P2P latency causes them with high probability.
As per the logs I posted, this issue most certainly was causing a reorg on my client. The primary's block arrived before the secondary's block, but the lock caused the import of both to be held up until after the secondary had arrived. The issue I'm pointing out is that this 5 second lock causes delays in block importing, and if this happens on validator sentries they'd be unable to propagate the block during this time. In your last post, you mentioned that this function shouldn't get called on regular full nodes, and I'm noticing that it gets called fairly often. Either my nodes are misconfigured, or there's some bug in heimdall. Either way, I'm glad your PR to remove the milestone check on the bor side is merged in, as it should fix it. I hope that this will make it into the next release.
There is no locking on sentry nodes, and even on validators the lock is taken on the tip-16th block, so it can't cause any issue in importing.
Can you please send me the Heimdall signer address that is on your machine?
Further examples of logs:
And the heimdall signer info:
To deploy nodes, I'm just using the configs from |
This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days. |
Hi @CitizenSearcher, I checked at my end.
Yeah, I saw. Closing the thread now. The lock being maintained throughout the retry loop is probably suboptimal, but I doubt it matters after your change.
System information
Bor client version: v1.1.0
Heimdall client version: v1.0.3
OS & Version: Linux
Environment: Polygon Mainnet
Type of node: Full
Additional Information: Client modified to add telemetry to block imports.
Overview of the problem
I noticed that my client was experiencing many small reorgs (1-2 blocks), so I decided to dig in and diagnose the problem. I run several nodes, and see this problem on all of them. My setup is a local heimdall client, which bor communicates with via the HTTP client.
I narrowed down the issue to the communication between bor and heimdall, specifically in the FetchMilestoneID method.
The problem is as follows:
The forker.ValidateReorg is called in the insertChain function here:
bor/core/blockchain.go
Line 1940 in 2ee3919
This is ultimately routed down to the downloader's whitelist milestone, where a read lock is obtained here:
bor/eth/downloader/whitelist/milestone.go
Line 61 in 2ee3919
Concurrently, the GetVoteOnHash function is called by heimdall at an arbitrary interval. This function also calls down into the downloader's whitelist and obtains a full (write) lock on it, and the FetchMilestoneID function is then called on the HeimdallClient that's been loaded.
bor/eth/bor_api_backend.go
Line 74 in 2ee3919
bor/eth/downloader/whitelist/milestone.go
Line 136 in 2ee3919
The issue is that when using either the GRPC or HTTP clients, the FetchMilestoneID function will go into a retry loop and block until a successful request is made. This keeps the whitelist locked during the retry, stopping insertChain from obtaining its read lock, thereby stopping any blocks from being inserted.
bor/core/blockchain.go
Line 1713 in 2ee3919
bor/core/blockchain.go
Line 1888 in 2ee3919
bor/miner/worker.go
Line 799 in 2ee3919
Using the heimdall HTTP client, the first request will almost always fail, with the following error:
WARN [12-29|04:05:03.529] an error while trying fetching from Heimdall path="/milestone/ID/4fb6bfe2-eec4-4366-9f18-ef272ee4c338 - 0x4951435518fa374ac10951fee3aeffa1d04aded8" attempt=1 error="error while fetching data from Heimdall: response code 500
but the second will be successful. With the retry delay at 5 seconds, this causes a 5 second import delay on blocks, or a 5 second production delay for validators. When this occurs it can cause a client-side reorg, as it's enough time for a backup validator's block to have been propagated, and a race then occurs between the backup's and the primary validator's blocks (see logs below).
Reproduction Steps
Run bor and heimdall as separate processes communicating via HTTP. Looking at the normal bor logs, you should notice several small reorgs (1-2 blocks) per hour.
Here are some telemetry logs I added for block imports. Here is an example of a client-side reorg:
Here we see block 51706222 from the primary proposer arrive at 2023-12-29T12:24:27.099. It is blocked for 4 seconds, and obtains the milestone read lock at 2023-12-29T12:24:31.869.
In the meantime, two competing blocks for 51706223 have arrived from the primary and backup validators at 2023-12-29T12:24:29.092 and 2023-12-29T12:24:31.711 respectively. Even though the primary's block correctly arrived first, the lock delayed the prior block's import for so long that both were now in the import cycle. A race condition then occurs between the primary's and the backup's blocks. In this case the backup's block wins, but it is almost immediately reorged out once the primary's import finishes.
Additional Information
The client-side problem is eliminated by using the HeimdallApp in-process client, since the FetchMilestoneID function then returns immediately. Even after removing the client-side problem this way, I still observe a concerning number of 1-2 block reorgs, occurring where a primary proposer's block is around 4-6 seconds late. Given the timing and frequency, I'd expect that many of them are caused by this problem.