Client: Fix Sync Bugs and Error Messages #1075
Conversation
Ok, this is now ready for review. 😄 I've added one last commit dd19dd9 bringing some consistency to the handling of execution failures. Before - taken over from the previous client logic - the block where execution failed was simply deleted. I've now added a long comment to the execution logic distinguishing between the cases where a) a bad block is served and b) the VM has some consensus or other error. At some point we will likely need to handle both cases (suggestions are in the comment mentioned); for now I've concentrated on case b), since all occurrences up till now were of this type and will likely remain so to a greater extent until we have fixed the most severe bugs. So on the iterator run the block where the error occurred is now not deleted any more. Instead the error is logged to the console, the iterator head is set to the parent block and syncing is stopped. With this setup it becomes possible to reliably re-trigger the same error on the next run, which should greatly help us with debugging. Here is an example from a client run:
INFO [01-29|23:09:24] Started eth service.
INFO [01-29|23:10:05] Imported blocks count=50 number=227309 hash=2a34e2d2... hardfork=chainstart peers=2
WARN [01-29|23:10:05] Execution of block number=226522 hash=63de8ba9... failed
ERROR [01-29|23:10:05] Error: invalid receiptTrie
at VM.runBlock (/EthereumJS/ethereumjs-vm/packages/vm/dist/runBlock.js:88:19)
at async /EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/execution/vmexecution.js:101:21
at async /EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:909:21
at async Blockchain.runWithLock (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:262:27)
at async Blockchain.initAndLock (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:250:16)
at async Blockchain._iterator (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:891:16)
at async VMExecution.run (/EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/execution/vmexecution.js:147:28)
at async Chain.<anonymous> (/EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/fullsync.js:32:17)
INFO [01-29|23:10:05] Stopped execution.
INFO [01-29|23:10:12] Synchronized
INFO [01-29|23:10:13] Stopped synchronization.
This is actually concrete preparation for our first consensus bug in block number 226522 😜 (invalid receipt trie), I will open a dedicated issue on that.
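For illustration, here is a minimal TypeScript sketch of the new flow. Only the 'vm' iterator tag, setIteratorHead() and the errorBlock variable appear in the actual diff below; the simplified interfaces and the wrapper function are assumptions of the sketch, not the real VMExecution code.

```ts
// Sketch only: simplified stand-ins for the real client/blockchain classes.
interface BlockHeader { number: bigint; parentHash: Uint8Array }
interface Block { header: BlockHeader }

interface BlockchainLike {
  iterator(tag: string, onBlock: (block: Block) => Promise<void>): Promise<number>
  setIteratorHead(tag: string, hash: Uint8Array): Promise<void>
}

async function runWithErrorHandling(
  blockchain: BlockchainLike,
  runBlock: (block: Block) => Promise<void>
): Promise<number> {
  let errorBlock: Block | undefined
  let numExecuted = 0

  try {
    numExecuted = await blockchain.iterator('vm', async (block) => {
      try {
        await runBlock(block)
      } catch (error) {
        // New behavior: keep the block, log the error, remember the failing block
        errorBlock = block
        console.error(error)
        throw error // assumed to abort the iterator in this sketch
      }
    })
  } catch {
    // iteration aborted on the failing block
  }

  if (errorBlock) {
    // Reset the iterator head to the parent of the failing block so the same
    // block is re-executed (and the same error re-triggered) on the next run.
    await blockchain.setIteratorHead('vm', errorBlock.header.parentHash)
  }
  return numExecuted
}
```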
@@ -112,10 +117,26 @@ export class VMExecution extends Execution {
// set as new head block
headBlock = block
} catch (error) {
// TODO: determine if there is a way to differentiate between the cases
// a) a bad block is served by a bad peer -> delete the block and restart sync
If you deleted the block and immediately restarted sync, the VM would probably not do anything: the current block number is not directly available (it is available when served by a peer), and the previous block was already validated, so there's not much to do.
If there is a bad block, we should ban the peer. It would be useful to have some sort of construct to figure out which peer served it. (But I'm not sure how this would work; it doesn't seem elegant to write which peer served the block to the DB (so we can still distinguish after a client restart), and we can also only cache a few blocks...)
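A hypothetical sketch of that "cache a few blocks" idea, just to show one possible shape of such a construct; none of these names exist in the client.

```ts
// Remember which peer served which block hash, bounded to the last N entries,
// so a bad block can be traced back to the peer that delivered it.
class RecentBlockSource {
  private order: string[] = []
  private byHash = new Map<string, string>() // block hash (hex) -> peer id

  constructor(private readonly maxEntries = 256) {}

  record(blockHashHex: string, peerId: string): void {
    if (!this.byHash.has(blockHashHex)) {
      this.order.push(blockHashHex)
      // evict the oldest entry once the cache is full
      if (this.order.length > this.maxEntries) {
        const evicted = this.order.shift()!
        this.byHash.delete(evicted)
      }
    }
    this.byHash.set(blockHashHex, peerId)
  }

  // Returns the peer that served the block, if it is still cached.
  servedBy(blockHashHex: string): string | undefined {
    return this.byHash.get(blockHashHex)
  }
}
```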
Yes, please read the things I added to a) only as some thought inspiration; this is likely not yet the way to go, just the first thing that came to my mind.
Yes, banning the peer is a good idea, maybe we can do the following as a first simple but working mechanism:

- We emit an event 'badblock' (or something) from the VMExecution module which can be picked up by the block fetcher (?) and also the peer pool (this might also be an occasion to pick up again on the idea of a centralized event bus we recently discussed so we avoid this event chaining, will directly open a new issue on that after writing here)
- On the peer pool the peer would be banned (yes, I also would not write this to DB, caching along the active peers should be enough and we can compare the bad blocks by hash and block number)
- The block fetcher and/or the chain (not sure atm where things happen) can then delete the assumed bad block and re-trigger syncing from the respective parent block as head, now with the eventual bad peer excluded. (If we make the banning periods long enough this would even mitigate a bit against an attack by several peers, these would just be banned one after the other; this of course has its limits but should be very much ok for a first round)
- Yeah, and then things should work. 😄

This procedure would actually already be some procedure for both cases, a bad block as well as a consensus error. For a consensus error the behavior would just be (somewhat) as it is now with this PR, except that there would not only be a retry once the client restarts but repeated retries along the sync. To finally distinguish between both cases we can just count the attempts and stop the sync after some number of attempts (let's say: 10); then there is a high likelihood that it is just a consensus error (which we need to fix with a bugfix release, no way around that). A rough sketch of this flow follows below.
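Purely as illustration, a TypeScript sketch of how this could hang together. Only the 'badblock' event name comes from the proposal above; the event bus, the MAX_EXECUTION_ATTEMPTS constant and all other names are assumptions of the sketch, not existing client code.

```ts
import { EventEmitter } from 'events'

// Sketch only: a shared event bus with simplified stand-ins for the peer pool
// and fetcher sides. All names besides the 'badblock' event are hypothetical.
const MAX_EXECUTION_ATTEMPTS = 10 // after this many retries, assume a consensus error

interface BadBlockEvent {
  hash: string // block hash (hex)
  number: bigint
  peerId?: string // peer that served the block, if known
}

const bus = new EventEmitter()

// VMExecution side: report the failing block instead of deleting it directly.
function reportBadBlock(event: BadBlockEvent): void {
  bus.emit('badblock', event)
}

// Peer pool side: ban the serving peer, in memory only (no DB write).
const bannedPeers = new Set<string>()
bus.on('badblock', (event: BadBlockEvent) => {
  if (event.peerId !== undefined) bannedPeers.add(event.peerId)
})

// Fetcher/chain side: delete the assumed bad block, re-sync from its parent,
// and give up after MAX_EXECUTION_ATTEMPTS (then it is likely a consensus error).
const attempts = new Map<string, number>()
bus.on('badblock', (event: BadBlockEvent) => {
  const count = (attempts.get(event.hash) ?? 0) + 1
  attempts.set(event.hash, count)
  if (count >= MAX_EXECUTION_ATTEMPTS) {
    console.error(`Block ${event.number} keeps failing, stopping sync (consensus error?)`)
    return
  }
  // deleteBlockAndResyncFromParent(event) would be triggered here in a real implementation
})

// Example call with made-up values (hash truncated, peer id hypothetical):
reportBadBlock({ hash: '0x63de8ba9', number: BigInt(226522), peerId: 'some-peer-id' })
```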
}
},
this.NUM_BLOCKS_PER_ITERATION
)
numExecuted = (await this.vmPromise) as number

// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
if (errorBlock) {
await this.chain.blockchain.setIteratorHead('vm', (errorBlock as Block).header.parentHash)
We are not deleting the block? If you restarted sync now, it would re-execute the wrong block, right?
Yes, that's correct.
The main point here is to put us in a better debugging position. As stated in the comment above, IF there is a consensus error like in #1076 we just have to fix it and do a new release; there is no way to develop around this case.
I only implemented this case in this PR since I very much assume that for some time it will be our main (if not: only) concern of the two. If there is an "unintentionally bad peer" which is just serving malformed blocks due to a software error, this is now covered by having activated block validation on the blockchain (not sure though if the sync will recover properly in such a case, but that is something for another PR).
On mainnet it is highly unlikely that someone will try to trick us onto a deviating chain path in the lower block number range, being so far away from the latest mainnet block (so, let's say: someone sending out a malformed (but validated) block nr. 1,000,000 just to confuse our few 1, 2, 3 client instances). On the PoA chains (so I would say: these are the two cases we both primarily target atm, mainnet and PoA chains) bad blocks from attackers are just not a thing at all. So we can safely concentrate on the consensus part right now; on mainnet, being able to sync close to the chain head is still at least weeks - or rather months - away I would assume (if we get there at all, many unknowns on the performance side).
'Invalid MAC',

// Client
'Handshake timed out', // Protocol handshake
Not related to this PR but it would be nice to know why it is OK to ignore these errors.
Can't answer for all cases (mainly not sure about the peer socket connection errors), but at least most of the other cases are either ones where the peer is sending malformed data ('Invalid MAC', 'Hash verification failed', all the other DPT stuff (except timeout)) for unknown reasons, or where we don't want certain data by protocol decision, with 'NetworkId mismatch' as the most obvious case where we just don't want to connect to a peer on a different network.
All this is taking place under the assumption that our actual devp2p implementation around these kinds of failures is correct. So we generally don't want these errors displayed again and again when they are simply to be expected. If they are occurring too often though, this might be an indicator that there is something wrong with our implementation. That is sometimes hard to distinguish. But it's likely a good idea to deactivate these filters temporarily from time to time and see what errors come through and in what quantity.
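To make the mechanism concrete, a small sketch of such an ignore list with a temporary "show everything" switch. The two error strings come from the diff above; the function and flag names are made up for the example.

```ts
// Sketch: known/expected error messages that are filtered from the log output.
// Only the two string entries appear in the actual diff; the rest is illustrative.
const IGNORED_ERRORS = [
  'Invalid MAC',
  // Client
  'Handshake timed out', // Protocol handshake
]

// Temporarily set this to true (or drive it via an env variable) to deactivate
// the filter and see which errors come through and in what quantity.
const SHOW_ALL_ERRORS = false

function shouldLogError(error: Error): boolean {
  if (SHOW_ALL_ERRORS) return true
  return !IGNORED_ERRORS.some((msg) => error.message.includes(msg))
}

// Usage: only surface errors that are not on the expected/ignored list.
function logPeerError(error: Error): void {
  if (shouldLogError(error)) {
    console.error(error)
  }
}
```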
This looks great...
Is the proposal to ban bad peers a separate PR?
The improvements to the error management here will be super helpful 🙂
@cgewecke Thanks! Yes, banning bad peers would be a separate PR. As stated, my personal judgement on this is that it shouldn't be too urgent, so we could very well address it in 3-4 weeks or so; on the other hand it also shouldn't be overly complex to tackle earlier.
This PR aims to fix the latest (obvious) client sync errors, specifically:
Will keep this as "WIP" for another hour or up to half a day; it should be open for review soon though.