
Client/Block: stabilize block fetcher #3240

Merged
merged 16 commits into master from stabilize-fetcher on Jan 29, 2024
Conversation


@jochem-brouwer (Member) commented Jan 18, 2024

This PR attempts to stabilize the ReverseBlockFetcher.

block.validateData takes a lot of time, so once we request big blocks, the validation loop blocks the event queue. As a result, the fetcher job actually expires and the results are never written.

A related problem is that the validation loop also blocks network I/O (this still happens after this PR):

TRACE[01-18|05:34:44.616] Failed RLPx handshake addr=127.0.0.1:30303 conn=staticdial err="read tcp 127.0.0.1:42862->127.0.0.1:30303: i/o timeout"

Note: the ReverseBlockFetcher writes to the skeleton chain, which itself does not re-perform block validation. In BlockFetcher, however, storing the blocks means writing to chain.blockchain (a blockchain object, not the skeleton), which internally re-runs block.validateData(). So for BlockFetcher the block is actually verified twice: once upon request and once upon storing it. We can therefore safely (?) remove the first validation from the BlockFetcher.

Still WIP, but this actually got my client unstuck! I went from 656724 to 655624 (tail block) in about 10 minutes. Before this, I would always get stuck and stay at 656724 since the fetcher would expire my jobs.

Side question: is there a way to put the validateData job at the end of the Node.js event queue? If we could do this, we would validate one block, then reply to devp2p messages (ping messages, for instance), and also stay open for sockets to receive other jobs (this is the error listed above: Geth tries to write to our socket, but since we do not read, it times out at some point). I tried setImmediate, process.nextTick, and setTimeout, but none of these seem to work.
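
For reference, a minimal sketch of the yielding idea, assuming per-block validation runs in a loop (yieldToEventLoop is a hypothetical helper, not part of the codebase). Note this only helps if each individual validateData() call is short; a single long synchronous computation still blocks the loop, which may be why the attempts above had no effect:

const yieldToEventLoop = () => new Promise<void>((resolve) => setImmediate(resolve))

for (const block of blocks) {
  await block.validateData()
  // Give queued I/O callbacks (devp2p pings, socket reads) a chance to run
  await yieldToEventLoop()
}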

Thoughts? (Try this locally too!! It should get your client unstuck 😄 )

This PR does 3 things (they can be cherry-picked out):

  • Speeds up the BlockFetcher by only verifying the data integrity, not checking if each tx is signed (motivation is in the comments)
  • Adds a verifyTxs parameter to block's verifyData(onlyHeader: boolean = false, verifyTxs: boolean = true); setting verifyTxs to false skips tx validation (see the sketch after this list)
  • (temporarily) updates the BeaconSync best() method to always return a peer if one is available
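
A hedged usage sketch of the amended API from the second item; exact naming in the merged code may differ (the PR title later briefly referenced a new Block.validateData() API method):

// Full validation: data integrity plus per-tx signature verification
await block.verifyData()

// Integrity only: the tx trie root in the header still commits to every tx,
// so the block hash still anchors the tx data even without signature checks
await block.verifyData(false, false)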

  _task: JobTask
): { destroyFetcher: boolean; banPeer: boolean; stepBack: bigint } {
  const stepBack = BIGINT_0
  const destroyFetcher = !(error.message as string).includes(
    `Blocks don't extend canonical subchain`
  )
jochem-brouwer (Member Author)

@g11tech what is the reason to destroy the fetcher if there is another error? Not sure here 🤔

@g11tech (Contributor) Jan 18, 2024

This is the only error that is expected if the peer didn't give you the correct chain. For all other errors, something went wrong in the fetcher itself, so the fetcher needs to be cleared out and a new fetcher restarted as a means to be robust against issues. Why do you want to remove it?

So if validation errors can now come in, that means we also expect those and should not destroy the fetcher (which will just lead to re-queuing of the job).
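
A hypothetical illustration of the policy being discussed (the second message string is invented for the example; only the subchain error appears in the diff above):

// Expected, recoverable errors keep the fetcher alive (job is re-queued);
// anything else destroys the fetcher so a fresh one can be restarted
const expectedErrors = [
  `Blocks don't extend canonical subchain`,
  `invalid block data`, // hypothetical, stands in for the newly expected validation errors
]
const destroyFetcher = !expectedErrors.some((msg) => (error.message as string).includes(msg))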

jochem-brouwer (Member Author)

I should study the Fetcher stack a bit more before I can enlighten :) I read somewhere that if you destroy the fetcher it is a "critical" error, I would assume that then the fetcher is broken or something. Will study it some more and will get back to this later.

g11tech (Contributor)

Yes, so the fetcher will be/should be reinitiated (if our peer's latest/best is correct, or we handle it in a better way).

@@ -86,7 +86,6 @@ export class BlockFetcher extends BlockFetcherBase<Block[], Block> {
      }
      // Supply the common from the corresponding block header already set on correct fork
      const block = Block.fromValuesArray(values, { common: headers[i].common })
-     await block.validateData()
g11tech (Contributor)

I think it's ok to not validate data if this is PoS, because the parent-child relationship validation will happen while storing in the skeleton, so we can do an if here.
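
A hypothetical shape for that conditional, assuming the client's Common instance is reachable as this.config.chainCommon and treating PoS as at-or-past the Paris (Merge) hardfork:

import { Hardfork } from '@ethereumjs/common'

// Pre-PoS blocks still get full validation; under PoS the skeleton's
// parent-child checks (anchored by the CL-supplied tip) cover integrity
if (!this.config.chainCommon.gteHardfork(Hardfork.Paris)) {
  await block.validateData()
}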

jochem-brouwer (Member Author)

I think I agree. For the backfill process (especially for the reverse block fetcher) we can just accept the blocks, right? If we validate that the reported block has the block hash which we expect, then we can accept it, and we don't have to validate all data (such as: does every tx have a valid signature?). If the tx trie matches the block hash, we know the CL expects this block to be valid; if it were invalid, the CL would be broken, since it gave us the wrong chain tip block.
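
A sketch of that acceptance argument, with expectedHash standing in for the CL/skeleton-supplied hash: since the header commits to the tx trie root, a matching block hash already covers every tx body:

import { equalsBytes } from '@ethereumjs/util'

// expectedHash: the hash the CL/skeleton told us to backfill to (assumed in scope).
// If the block hashes to it, the txTrie root in the header commits to all tx
// bodies; per-tx signature checks add nothing (they could only catch a broken CL)
const accept = equalsBytes(block.hash(), expectedHash)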

g11tech (Contributor)

yes, just validating the hash here is good enough


codecov bot commented Jan 19, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (68c4fb9) 87.84% compared to head (3efbd77) 87.88%.


Flag          Coverage            Δ
block         88.57% <75.00%>     (+0.96%) ⬆️
blockchain    91.60% <ø>          (ø)
client        84.56% <100.00%>    (+<0.01%) ⬆️
common        98.26% <ø>          (ø)
devp2p        82.12% <ø>          (ø)
ethash        ∅ <ø>               (∅)
evm           76.92% <ø>          (ø)
genesis       99.98% <ø>          (ø)
rlp           ∅ <ø>               (∅)
statemanager  86.57% <ø>          (ø)
trie          89.67% <ø>          (+0.28%) ⬆️
tx            95.89% <ø>          (ø)
util          89.13% <ø>          (ø)
vm            80.26% <ø>          (ø)
wallet        88.35% <ø>          (ø)

Flags with carried forward coverage won't be shown.

@holgerd77 (Member)

What is the latest state here? Is this ready for review, i.e. can @g11tech have a final look? Or not yet?

@jochem-brouwer (Member Author)

From my side this is ready for review :)

// Upon putting blocks into blockchain (for BlockFetcher), `validateData` is called again
// In ReverseBlockFetcher we do not need to validate the entire block, since CL
// expects us to sync with the requested chain tip header
await block.validateDataIntegrity()
Contributor

Is it really necessary to add a wrapper function that's only used in one place? Feels like we should just call block.verifyData and then explain in the comments why we're not validating transactions.

Member

Do not have a very strong opinion here but would cautiously agree

@holgerd77 (Member) Jan 24, 2024

(Or is there a somewhat strong reason, which there might be, that the API is better off with (yet) another explicit validation method?)

jochem-brouwer (Member Author)

Not really a strong reason to do this. I will remove the method and directly call into verifyData!

jochem-brouwer (Member Author)

Have addressed

@holgerd77 (Member) left a comment

Thanks for the updates, LGTM!

@holgerd77 merged commit 6ca6fd5 into master Jan 29, 2024
45 of 46 checks passed
@holgerd77 deleted the stabilize-fetcher branch January 29, 2024 11:35
@holgerd77 changed the title Client: stabilize block fetcher → Client/Block: stabilize block fetcher / new Block.validateData() API Method Jan 30, 2024
@holgerd77 changed the title Client/Block: stabilize block fetcher / new Block.validateData() API Method → Client/Block: stabilize block fetcher Jan 30, 2024