
Client: Fix Sync Bugs and Error Messages #1075

Merged: 5 commits from fix-client-sync-bugs into master on Feb 2, 2021

Conversation

holgerd77 (Member)

This PR aims to fix the latest (obvious) client sync errors, specifically:

  • Two common peer errors that just indicate a disconnect ("Unknown forkhash", "Invalid type") have been added to the list of ignored RLPxServer errors, 7c58eeb and 0d64db5
  • A safeguard has been added to avoid "socket already destroyed" errors in the devp2p peer module (I haven't seen this error over quite a number of blocks synced after adding this change, though this might need some final confirmation or further observation; see the sketch after this list), 05059c8
  • Block validation in the client blockchain is activated; this should avoid getting the state DB into inconsistent states when malformed blocks are received, and should at least partially solve Client: "Error: invalid transaction trie" leading to corrupted state DB #1066
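Roughly, the safeguard amounts to something like the following sketch (class and method names here are illustrative only, not the actual devp2p peer API):

```ts
import { Socket } from 'net'

// Illustrative sketch of a "socket already destroyed" safeguard: never write to
// or destroy a socket that is already gone, which would otherwise surface as
// "socket already destroyed" errors. Not the actual devp2p code.
class GuardedPeerSocket {
  constructor(private socket: Socket) {}

  send(data: Buffer): boolean {
    if (this.socket.destroyed) {
      // Peer already disconnected: drop the write instead of throwing
      return false
    }
    return this.socket.write(data)
  }

  close() {
    if (!this.socket.destroyed) {
      this.socket.destroy()
    }
  }
}
```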

I'll keep this as "WIP" for another hour or up to half a day; it should be open for review soon though.

codecov bot commented on Jan 29, 2021

Codecov Report

Merging #1075 (ce2cb43) into master (55f6400) will decrease coverage by 0.17%.
The diff coverage is 39.13%.

Impacted file tree graph

Flag Coverage Δ
block 81.35% <100.00%> (ø)
blockchain 82.83% <ø> (-0.08%) ⬇️
client 87.22% <29.41%> (-0.43%) ⬇️
common 86.28% <ø> (ø)
devp2p 82.98% <0.00%> (ø)
ethash 82.08% <ø> (ø)
tx 90.00% <ø> (ø)
vm 83.09% <ø> (ø)

Flags with carried forward coverage won't be shown.

holgerd77 force-pushed the fix-client-sync-bugs branch from dd19dd9 to 7e7c34a on January 29, 2021 at 22:14
holgerd77 (Member Author)

Ok, this is now ready for review. 😄

I've added one last commit dd19dd9 that brings some consistency to the handling of execution failures. Before (behavior taken over from the VM.runBlockchain() method), the block where an execution failed was deleted while syncing continued (which was both annoying and made no sense). On a re-run an error was triggered since the iterator head for the VM pointed to a non-existent block, yet syncing still continued.

I've now added a long comment to the execution logic distinguishing between the cases where a) a bad block is served and b) the VM has a consensus or other error. At some point we will likely need to handle both cases (suggestions are in the comment mentioned); for now I've concentrated on case b), since all occurrences up to now have been of this type and this will likely largely remain so until we have fixed the most severe bugs.

So on the iterator run the block where the error occurred is no longer deleted. Instead the error is logged to the console, the iterator head is set to the parent block, and syncing is stopped. With this setup it becomes possible to reliably re-trigger the same error on the next run, which should greatly help us with debugging.
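In simplified pseudocode the flow now looks roughly like this (not the actual VMExecution code; the iterator and logger shapes are assumptions, only setIteratorHead() corresponds to the real call shown in the diff excerpts below):

```ts
// Rough sketch of the new behavior, using simplified/assumed APIs
async function runWithErrorHandling(blockchain: any, vm: any, logger: any) {
  let errorBlock: any

  await blockchain.iterator('vm', async (block: any) => {
    try {
      await vm.runBlock({ block })
    } catch (error) {
      // Do not delete the failing block: keeping it means the same error can be
      // reproduced reliably on the next client run
      logger.warn(`Execution of block number=${block.header.number} failed`)
      logger.error(error)
      errorBlock = block
    }
  })

  if (errorBlock) {
    // Point the VM iterator head at the parent of the failing block and stop,
    // so the next run re-executes the same block and hits the same error
    await blockchain.setIteratorHead('vm', errorBlock.header.parentHash)
    // ...stop execution / syncing here
  }
}
```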

Here is an example of a client run:

INFO [01-29|23:09:24] Started eth service.
INFO [01-29|23:10:05] Imported blocks count=50 number=227309 hash=2a34e2d2... hardfork=chainstart peers=2
WARN [01-29|23:10:05] Execution of block number=226522 hash=63de8ba9... failed
ERROR [01-29|23:10:05] Error: invalid receiptTrie
    at VM.runBlock (/EthereumJS/ethereumjs-vm/packages/vm/dist/runBlock.js:88:19)
    at async /EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/execution/vmexecution.js:101:21
    at async /EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:909:21
    at async Blockchain.runWithLock (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:262:27)
    at async Blockchain.initAndLock (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:250:16)
    at async Blockchain._iterator (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:891:16)
    at async VMExecution.run (/EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/execution/vmexecution.js:147:28)
    at async Chain.<anonymous> (/EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/fullsync.js:32:17)
INFO [01-29|23:10:05] Stopped execution.
INFO [01-29|23:10:12] Synchronized
INFO [01-29|23:10:13] Stopped synchronization.

This actually concretely prepares for our first consensus bug in block number 226522 😜 (invalid receipt trie); I will open a dedicated issue on that.

@@ -112,10 +117,26 @@ export class VMExecution extends Execution {
      // set as new head block
      headBlock = block
    } catch (error) {
      // TODO: determine if there is a way to differentiate between the cases
      // a) a bad block is served by a bad peer -> delete the block and restart sync
Member

If you delete the block and immediately restart sync, then the VM will probably not do anything: the current block number is not directly available (it is available when served by a peer), and the previous block was already validated, so there's not much to do.

If there is a bad block, we should ban the peer. It would be useful to have some sort of construct where we can figure out which peer served it. (But I'm not sure how this would work; it doesn't seem elegant to write to the DB which peer served the block (so we can distinguish if we restart the client), and we can also only cache a few blocks...)

holgerd77 (Member Author), Feb 1, 2021

Yes, please read the things I added for a) only as some thought inspiration; this is likely not yet the way to go, just the first thing that came to my mind.

Yes, banning the peer is a good idea. Maybe we can do the following as a first simple but working mechanism:

  1. The VMExecution module emits a badblock event (or something similar) which can be picked up by the block fetcher (?) and also by the peer pool (this might also be an occasion to pick up again on the idea of a centralized event bus we recently discussed, so we avoid this event chaining; I will open a new issue on that directly after writing here)
  2. In the peer pool the peer would be banned (yes, I also would not write this to DB; caching alongside the active peers should be enough, and we can compare the bad blocks by hash and block number)
  3. The block fetcher and/or the chain (not sure atm where things happen) can then delete the assumed bad block and re-trigger syncing from the respective parent block as head, now with the presumed bad peer excluded from syncing. (If we make the banning periods long enough this would even mitigate a bit against an attack by several peers, since they would just be banned one after the other; this of course has its limits but should be very much OK for a first round.)
  4. Yeah, and then things should work. 😄 (A rough sketch of this flow follows below.)

This procedure would actually already be some procedure for both cases, a bad block as well as a consensus error. For a consensus error the behavior would just be (somewhat) as it is now with this PR, except that the retry would not only happen once the client restarts but repeatedly along the sync. To finally distinguish between the two cases we can simply count the attempts and stop the sync after some number of them (let's say: 10); at that point there is a high likelihood that it is just a consensus error (which we need to fix with a bugfix release, no way around that).
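As a rough sketch of steps 1.-4. (all names here, the badblock event, the peer pool shape, the attempt counting, are assumptions for illustration, not existing client code):

```ts
import { EventEmitter } from 'events'

const MAX_ATTEMPTS = 10
const BAN_PERIOD_MS = 5 * 60 * 1000

const bus = new EventEmitter() // stand-in for the discussed centralized event bus
const attempts = new Map<string, number>() // block hash -> failed execution attempts

// Hypothetical minimal peer pool with a ban method (stand-in for the real PeerPool)
const peerPool = {
  banned: new Map<string, number>(),
  ban(peerId: string, periodMs: number) {
    this.banned.set(peerId, Date.now() + periodMs)
  },
}

// 1. + 2. The peer pool picks up the event and bans the peer that served the block
bus.on('badblock', ({ blockHash, peerId }: { blockHash: string; peerId?: string }) => {
  const count = (attempts.get(blockHash) ?? 0) + 1
  attempts.set(blockHash, count)

  if (count > MAX_ATTEMPTS) {
    // Repeated failures on the same block point to a consensus error rather than
    // a bad peer -> stop syncing and wait for a bugfix release
    console.error(`Block ${blockHash} failed ${count} times, likely a consensus error`)
    return
  }
  if (peerId) {
    peerPool.ban(peerId, BAN_PERIOD_MS)
  }
  // 3. Fetcher/chain side (not shown): delete the presumed bad block and
  //    re-trigger syncing from its parent block with the banned peer excluded
})

// Emitted by the execution module when a block fails to execute
function reportBadBlock(blockHash: string, peerId?: string) {
  bus.emit('badblock', { blockHash, peerId })
}
```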

        }
      },
      this.NUM_BLOCKS_PER_ITERATION
    )
    numExecuted = (await this.vmPromise) as number

    // eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
    if (errorBlock) {
      await this.chain.blockchain.setIteratorHead('vm', (errorBlock as Block).header.parentHash)
Member

We are not deleting the block? If you restart sync now, it will re-execute the wrong block, right?

holgerd77 (Member Author), Feb 1, 2021

Yes, that's correct.

The main point here is to put us in a better debugging position. As stated in the comment above, IF there is a consensus error like in #1076 we just have to fix it and do a new release; there is no way to develop around this case.

I only implemented the case above in this PR since I very much assume that for some time this will be our main (if not only) concern of the two cases. If there is an "unintentionally bad peer" which just serves malformed blocks due to a software error, this is now covered by having activated block validation on the blockchain (though I'm not sure if the sync will recover properly in such a case, but that is something for another PR).

On mainnet it is highly unlikely that someone will try to trick us onto a deviating chain path in the lower block number range, so far away from the latest mainnet block (say, someone sending out a malformed (but validated) block nr. 1,000,000 just to confuse our few 1, 2, 3 client instances). On the PoA chains (so I would say these are the two cases we both primarily target atm, mainnet and PoA chains) bad blocks from attackers are just not a thing at all. So we can safely concentrate on the consensus part right now; on mainnet, being able to sync close to the chain head is still at least weeks, or rather months, away I would assume (if we get there at all, there are many unknowns on the performance side).

'Invalid MAC',

// Client
'Handshake timed out', // Protocol handshake
Member

Not related to this PR but it would be nice to know why it is OK to ignore these errors.

holgerd77 (Member Author)

I can't answer for all cases (mainly not sure about the peer socket connection errors), but at least the vast majority of the other cases are either cases where the peer is sending malformed data ('Invalid MAC', 'Hash verification failed', all the other DPT stuff (except timeouts)) for unknown reasons, or cases where we don't want certain data by protocol decision, with 'NetworkId mismatch' as the most obvious example where we just don't want to connect to a peer which is on a different network.

All this rests on the assumption that our own devp2p implementation is correct along these kinds of failures. So we generally don't want these errors displayed again and again when they are simply to be expected. If they occur too often though, this might be an indicator that something is wrong with our implementation; that is sometimes hard to distinguish. But it's likely a good idea to deactivate these filters temporarily from time to time and see which errors come through, and in what quantity.
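As a minimal sketch of what such a filter boils down to (the handler shape is an assumption; the error strings are the ones discussed in this PR):

```ts
// Suppress "expected" peer errors instead of logging them; temporarily emptying
// this list makes all of them visible again for debugging
const IGNORED_ERRORS = [
  'Invalid MAC',
  'Handshake timed out',
  'Unknown forkhash',
  'Invalid type',
  'NetworkId mismatch',
]

function handlePeerError(error: Error, log: (msg: string) => void) {
  const ignore = IGNORED_ERRORS.some((msg) => error.message.includes(msg))
  if (!ignore) {
    log(`Peer error: ${error.message}`)
  }
}
```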

cgewecke (Contributor) left a comment

This looks great...

Is the proposal to ban bad peers a separate PR?

The improvements to the error management here will be super helpful 🙂

holgerd77 (Member Author)

@cgewecke Thanks!

Yes, banning bad peers would be a separate PR. As stated, my personal judgement is that this shouldn't be too urgent, so we can very well address it in 3-4 weeks or so; on the other hand, it should also not be overly complex to tackle earlier.
