Client/Block: stabilize block fetcher #3240
Conversation
_task: JobTask
): { destroyFetcher: boolean; banPeer: boolean; stepBack: bigint } {
const stepBack = BIGINT_0
const destroyFetcher = !(error.message as string).includes(
`Blocks don't extend canonical subchain`
@g11tech what is the reason to destroy the fetcher if there is another error? Not sure here 🤔
this is the only error that is expected if the peer didn't give you the correct chain; for all other errors something went wrong in the fetcher itself, and hence the fetcher needs to be cleared out and a new fetcher restarted as a means to be robust against issues. Why do you want to remove it?
So if validation errors can now come in, that means we also expect those and should not destroy the fetcher (which would just lead to re-queuing of the job)
I should study the Fetcher stack a bit more before I can enlighten :) I read somewhere that if you destroy the fetcher it is a "critical" error, I would assume that then the fetcher is broken or something. Will study it some more and will get back to this later.
yes, so the fetcher will be/should be reinitiated (if our peer's latest/best is correct, or we handle it in a better way)
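For readers of this thread, a minimal sketch of the decision being discussed; the function name, the `JobTask` shape, and the `banPeer` handling are assumptions for illustration — only the `destroyFetcher` check follows the diff above:

```typescript
const BIGINT_0 = BigInt(0)

// Hypothetical task shape, for illustration only.
interface JobTask {
  first: bigint
  count: number
}

function processStoreError(
  error: Error,
  _task: JobTask
): { destroyFetcher: boolean; banPeer: boolean; stepBack: bigint } {
  const stepBack = BIGINT_0
  // Only one error is "expected": the peer served blocks that don't extend
  // our canonical subchain. Keep the fetcher alive for that case; any other
  // error indicates something went wrong inside the fetcher itself, so it is
  // destroyed and re-initialized as a robustness measure.
  const destroyFetcher = !(error.message as string).includes(
    `Blocks don't extend canonical subchain`
  )
  const banPeer = true
  return { destroyFetcher, banPeer, stepBack }
}
```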
@@ -86,7 +86,6 @@ export class BlockFetcher extends BlockFetcherBase<Block[], Block> {
}
// Supply the common from the corresponding block header already set on correct fork
const block = Block.fromValuesArray(values, { common: headers[i].common })
await block.validateData()
I think it's ok to not validate data if this is PoS, because of the parent-child relationship validation that will happen while storing into the skeleton, so we can do an if here
I think I agree. For the backfill process we can (especially for the reverse block fetcher) just accept the blocks, right? If we validate that the reported blocks have the block hash which we expect in the end, then we can accept them, and we don't have to validate all data such as: does every txn have a valid signature? If the tx trie matches the block hash, we know that the CL expects this block to be valid; if it were invalid, the CL would be broken, since it gave us the wrong chain tip block.
yes, just validating the hash here is good enough
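A hedged sketch of the hash-only acceptance being agreed on here; the helper name and the source of `expectedHash` are assumptions:

```typescript
import { equalsBytes } from '@ethereumjs/util'
import type { Block } from '@ethereumjs/block'

// Accept a backfilled block purely by hash: if the block hashes to the value
// we already expect (e.g. the parent hash of the previously accepted block
// during reverse backfill), its header and body are committed to by that hash
// and need no per-tx signature checks here.
function acceptByHash(block: Block, expectedHash: Uint8Array): void {
  if (!equalsBytes(block.hash(), expectedHash)) {
    throw new Error(`Blocks don't extend canonical subchain`)
  }
}
```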
What is the latest state here? Is this ready for review, i.e. can @g11tech have a final look? Or not yet?
From my side this is ready for review :)
// Upon putting blocks into blockchain (for BlockFetcher), `validateData` is called again
// In ReverseBlockFetcher we do not need to validate the entire block, since CL
// expects us to sync with the requested chain tip header
await block.validateDataIntegrity()
Is it really necessary to add a wrapper function that's only used in one place? Feels like we should just call block.verifyData and then explain in the comments why we're not validating transactions.
Do not have a very strong opinion here but would cautiously agree
(Or is there somewhat strong reasoning (which there might be) that the API is better off with (yet) another explicit validation method?)
Not really a strong reason to do this. I will remove the method and directly call into verifyData
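A sketch of what that direct call could look like, using the `verifyData(onlyHeader, verifyTxs)` signature quoted in the PR description; the wrapper function here is only for illustration:

```typescript
import type { Block } from '@ethereumjs/block'

// Hedged sketch: call verifyData directly instead of going through a one-off
// validateDataIntegrity() wrapper; signature as quoted in the PR description.
async function verifyForBackfill(block: Block): Promise<void> {
  // Skip per-transaction validation: during backfill the CL-supplied chain
  // tip hash already commits to the block contents (see discussion above).
  await block.verifyData(false /* onlyHeader */, false /* verifyTxs */)
}
```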
Have addressed
Thanks for the updates, LGTM!
This PR attempts to stabilize the ReverseBlockFetcher.
`block.validateData` takes a lot of time. This means that once we request big blocks, this validation loop will block the event queue, so the fetcher job actually expires and the results are not written. A related problem is that the validation loop also blocks the network I/O loops (this still happens after this PR):
TRACE[01-18|05:34:44.616] Failed RLPx handshake addr=127.0.0.1:30303 conn=staticdial err="read tcp 127.0.0.1:42862->127.0.0.1:30303: i/o timeout"
Note: the ReverseBlockFetcher writes to the skeleton chain, which itself does not re-perform block validation. However, in the `BlockFetcher`, if one stores the blocks by writing to `chain.blockchain` (so a blockchain object, not the skeleton), this will internally re-run `block.validateData()`. So for the `BlockFetcher` the block is actually verified twice: once upon request, and once upon storing it. So we can safely (?) remove this from the `BlockFetcher`.

Still WIP, but this actually got my client un-stuck! I got to 655624 (tail block) in about 10 minutes. Before this, I would always get stuck at 656724 since the fetcher would expire my jobs.
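A simplified sketch of the double-validation path described above; `fetchAndStore` and the structurally-typed blockchain parameter are illustrative stand-ins, not the actual client code:

```typescript
import type { Block } from '@ethereumjs/block'

// Illustrative stand-in for the BlockFetcher flow.
async function fetchAndStore(
  blocks: Block[],
  blockchain: { putBlocks(blocks: Block[]): Promise<void> }
): Promise<void> {
  for (const block of blocks) {
    // First validation, right after the blocks come in from the peer.
    // This is the call this PR removes for the BlockFetcher.
    await block.validateData()
  }
  // Storing via the blockchain object validates each block again internally,
  // which is why the upfront pass above is redundant.
  await blockchain.putBlocks(blocks)
}
```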
Side question: is there a way to put the `validateData` job at the end of the Node.js event queue? If we can do this, it means we validate one block, then reply to the devp2p messages (ping messages, for instance), and are also open for sockets to receive other jobs (this is the error listed above: Geth tries to write to our socket, but since we do not read, it will time out at some point). I tried `setImmediate`, `process.nextTick`, and `setTimeout`, but this does not seem to work. Thoughts? (Try this locally too!! It should get your client unstuck 😄 )
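For illustration, a hedged sketch of the yielding idea from the side question: await a macrotask between blocks so queued socket I/O gets a turn. Since validating one big block is itself synchronous CPU work, this likely only helps between blocks, which may be why these attempts did not resolve the stall:

```typescript
import { setImmediate as yieldToEventLoop } from 'node:timers/promises'
import type { Block } from '@ethereumjs/block'

// Yield back to the event loop after each block so pending devp2p pings and
// socket reads can be served before the next validation starts.
async function validateWithYield(blocks: Block[]): Promise<void> {
  for (const block of blocks) {
    await block.validateData()
    // Schedule the next iteration as a macrotask, letting queued I/O run first.
    await yieldToEventLoop()
  }
}
```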
This PR does 3 things (can cherry-pick them out):

- Removes the upfront `block.validateData()` call from the `BlockFetcher` (the blocks are re-validated when they are stored into the blockchain anyway).
- Adds the `verifyTxs` parameter to block's `verifyData(onlyHeader: boolean = false, verifyTxs: boolean = true)`. Setting this to `false` skips tx validation.
- Changes the `best()` method to always return a peer if one is available.
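Finally, a hedged sketch of what the `best()` change could look like; the `Peer` shape, the `minHeight` parameter, and the selection logic are assumptions for illustration:

```typescript
// Hypothetical peer shape, for illustration only.
interface Peer {
  idle: boolean
  latest?: bigint
}

// Sketch of "always return a peer if one is available": if no idle peer meets
// the target height, fall back to any idle peer instead of returning
// undefined, so fetcher jobs are never starved for lack of a qualified peer.
function best(peers: Peer[], minHeight: bigint): Peer | undefined {
  const idle = peers.filter((p) => p.idle)
  const qualified = idle.find((p) => (p.latest ?? BigInt(0)) >= minHeight)
  return qualified ?? idle[0]
}
```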