Client: Fix Sync Bugs and Error Messages #1075
Conversation
Ok, this is now ready for review. 😄 I've added one last commit dd19dd9 bringing some consistency to the handling of execution failures. Before - taken over from the previous client logic - the block where execution failed was simply deleted. I've now added a long comment to the execution logic distinguishing between the cases where a) a bad block is served and b) the VM has some consensus or other error. At some point we will likely need to handle both cases (suggestions are in the comment mentioned); for now I've concentrated on case b), since all occurrences up till now were of this type and will likely remain so to a greater extent until we have fixed the most severe bugs. So on the iterator run the block where the error occurred is now not deleted any more. Instead the error is logged to the console, the iterator head is set to the parent block and syncing is stopped. With this setup it becomes possible to reliably re-trigger the same error on the next run, which should greatly help us with debugging. Here is an example from a client run:
INFO [01-29|23:09:24] Started eth service.
INFO [01-29|23:10:05] Imported blocks count=50 number=227309 hash=2a34e2d2... hardfork=chainstart peers=2
WARN [01-29|23:10:05] Execution of block number=226522 hash=63de8ba9... failed
ERROR [01-29|23:10:05] Error: invalid receiptTrie
at VM.runBlock (/EthereumJS/ethereumjs-vm/packages/vm/dist/runBlock.js:88:19)
at async /EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/execution/vmexecution.js:101:21
at async /EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:909:21
at async Blockchain.runWithLock (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:262:27)
at async Blockchain.initAndLock (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:250:16)
at async Blockchain._iterator (/EthereumJS/ethereumjs-vm/packages/blockchain/dist/index.js:891:16)
at async VMExecution.run (/EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/execution/vmexecution.js:147:28)
at async Chain.<anonymous> (/EthereumJS/ethereumjs-vm/packages/client/dist/lib/sync/fullsync.js:32:17)
INFO [01-29|23:10:05] Stopped execution.
INFO [01-29|23:10:12] Synchronized
INFO [01-29|23:10:13] Stopped synchronization.
This is actually concrete preparation for our first consensus bug in block number 226522 😜 (invalid receipt trie), I will open a dedicated issue on that.
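For illustration, here is a minimal TypeScript sketch of the new flow. Only the 'vm' iterator tag, setIteratorHead() and the errorBlock variable appear in the actual diff below; the simplified interfaces and the wrapper function are assumptions of the sketch, not the real VMExecution code.

```ts
// Sketch only: simplified stand-ins for the real client/blockchain classes.
interface BlockHeader { number: bigint; parentHash: Uint8Array }
interface Block { header: BlockHeader }

interface BlockchainLike {
  iterator(tag: string, onBlock: (block: Block) => Promise<void>): Promise<number>
  setIteratorHead(tag: string, hash: Uint8Array): Promise<void>
}

async function runWithErrorHandling(
  blockchain: BlockchainLike,
  runBlock: (block: Block) => Promise<void>
): Promise<number> {
  let errorBlock: Block | undefined
  let numExecuted = 0

  try {
    numExecuted = await blockchain.iterator('vm', async (block) => {
      try {
        await runBlock(block)
      } catch (error) {
        // New behavior: keep the block, log the error, remember the failing block
        errorBlock = block
        console.error(error)
        throw error // assumed to abort the iterator in this sketch
      }
    })
  } catch {
    // iteration aborted on the failing block
  }

  if (errorBlock) {
    // Reset the iterator head to the parent of the failing block so the same
    // block is re-executed (and the same error re-triggered) on the next run.
    await blockchain.setIteratorHead('vm', errorBlock.header.parentHash)
  }
  return numExecuted
}
```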
@@ -112,10 +117,26 @@ export class VMExecution extends Execution {
// set as new head block
headBlock = block
} catch (error) {
// TODO: determine if there is a way to differentiate between the cases
// a) a bad block is served by a bad peer -> delete the block and restart sync
If you deleted the block and immediately restarted sync, the VM would probably not do anything: the current block number is not directly available (it is available when served by a peer), and the previous block was already validated, so there's not much to do.
If there is a bad block, we should ban the peer. It would be useful to have some sort of construct to figure out which peer served it. (But I'm not sure how this would work; it doesn't seem elegant to write which peer served the block to the DB (so we can still distinguish after a client restart), and we can also only cache a few blocks...)
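A hypothetical sketch of that "cache a few blocks" idea, just to show one possible shape of such a construct; none of these names exist in the client.

```ts
// Remember which peer served which block hash, bounded to the last N entries,
// so a bad block can be traced back to the peer that delivered it.
class RecentBlockSource {
  private order: string[] = []
  private byHash = new Map<string, string>() // block hash (hex) -> peer id

  constructor(private readonly maxEntries = 256) {}

  record(blockHashHex: string, peerId: string): void {
    if (!this.byHash.has(blockHashHex)) {
      this.order.push(blockHashHex)
      // evict the oldest entry once the cache is full
      if (this.order.length > this.maxEntries) {
        const evicted = this.order.shift()!
        this.byHash.delete(evicted)
      }
    }
    this.byHash.set(blockHashHex, peerId)
  }

  // Returns the peer that served the block, if it is still cached.
  servedBy(blockHashHex: string): string | undefined {
    return this.byHash.get(blockHashHex)
  }
}
```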
Yes, please read the things I added to a) only as some thought inspiration; this is likely not yet the way to go, just the first thing that came to my mind.
Yes, banning the peer is a good idea, maybe we can do the following as a first simple but working mechanism:

- We emit an event 'badblock' (or something) from the VMExecution module which can be picked up by the block fetcher (?) and also the peer pool (this might also be an occasion to pick up again on the idea of a centralized event bus we recently discussed so we avoid this event chaining, will directly open a new issue on that after writing here)
- On the peer pool the peer would be banned (yes, I also would not write this to DB, caching along the active peers should be enough and we can compare the bad blocks by hash and block number)
- The block fetcher and/or the chain (not sure atm where things happen) can then delete the assumed bad block and re-trigger syncing from the respective parent block as head, now with the eventual bad peer excluded. (If we make the banning periods long enough this would even mitigate a bit against an attack by several peers, these would just be banned one after the other; this of course has its limits but should be very much ok for a first round)
- Yeah, and then things should work. 😄

This procedure would actually already be some procedure for both cases, a bad block as well as a consensus error. For a consensus error the behavior would just be (somewhat) as it is now with this PR, except that there would not only be a retry once the client restarts but repeated retries along the sync. To finally distinguish between both cases we can just count the attempts and stop the sync after some number of attempts (let's say: 10); then there is a high likelihood that it is just a consensus error (which we need to fix with a bugfix release, no way around that). A rough sketch of this flow follows below.
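Purely as illustration, a TypeScript sketch of how this could hang together. Only the 'badblock' event name comes from the proposal above; the event bus, the MAX_EXECUTION_ATTEMPTS constant and all other names are assumptions of the sketch, not existing client code.

```ts
import { EventEmitter } from 'events'

// Sketch only: a shared event bus with simplified stand-ins for the peer pool
// and fetcher sides. All names besides the 'badblock' event are hypothetical.
const MAX_EXECUTION_ATTEMPTS = 10 // after this many retries, assume a consensus error

interface BadBlockEvent {
  hash: string // block hash (hex)
  number: bigint
  peerId?: string // peer that served the block, if known
}

const bus = new EventEmitter()

// VMExecution side: report the failing block instead of deleting it directly.
function reportBadBlock(event: BadBlockEvent): void {
  bus.emit('badblock', event)
}

// Peer pool side: ban the serving peer, in memory only (no DB write).
const bannedPeers = new Set<string>()
bus.on('badblock', (event: BadBlockEvent) => {
  if (event.peerId !== undefined) bannedPeers.add(event.peerId)
})

// Fetcher/chain side: delete the assumed bad block, re-sync from its parent,
// and give up after MAX_EXECUTION_ATTEMPTS (then it is likely a consensus error).
const attempts = new Map<string, number>()
bus.on('badblock', (event: BadBlockEvent) => {
  const count = (attempts.get(event.hash) ?? 0) + 1
  attempts.set(event.hash, count)
  if (count >= MAX_EXECUTION_ATTEMPTS) {
    console.error(`Block ${event.number} keeps failing, stopping sync (consensus error?)`)
    return
  }
  // deleteBlockAndResyncFromParent(event) would be triggered here in a real implementation
})

// Example call with made-up values (hash truncated, peer id hypothetical):
reportBadBlock({ hash: '0x63de8ba9', number: BigInt(226522), peerId: 'some-peer-id' })
```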
}
},
this.NUM_BLOCKS_PER_ITERATION
)
numExecuted = (await this.vmPromise) as number

// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
if (errorBlock) {
await this.chain.blockchain.setIteratorHead('vm', (errorBlock as Block).header.parentHash)
We are not deleting the block? If you restarted sync now, it would re-execute the wrong block, right?
Yes, that's correct.
The main point here is to put us in a better debugging position. As stated in the comment above, IF there is a consensus error like in #1076 we just have to fix it and do a new release; there is no way to develop around this case.
I only implemented this case in this PR since I very much assume that for some time it will be our main (if not: only) concern of the two. If there is an "unintentionally bad peer" which is just serving malformed blocks due to a software error, this is now covered by having activated block validation on the blockchain (not sure though if the sync will recover properly in such a case, but that is something for another PR).
On mainnet it is highly unlikely that someone will try to trick us onto a deviating chain path in the lower block number range, being so far away from the latest mainnet block (so, let's say: someone sending out a malformed (but validated) block nr. 1,000,000 just to confuse our few 1, 2, 3 client instances). On the PoA chains (so I would say: these are the two cases we both primarily target atm, mainnet and PoA chains) bad blocks from attackers are just not a thing at all. So we can safely concentrate on the consensus part right now; on mainnet, being able to sync close to the chain head is still at least weeks - or rather months - away I would assume (if we get there at all, many unknowns on the performance side).
'Invalid MAC',

// Client
'Handshake timed out', // Protocol handshake
Not related to this PR but it would be nice to know why it is OK to ignore these errors.
Can't answer for all cases (mainly not sure about the peer socket connection errors), but at least most of the other cases are either ones where the peer is sending malformed data ('Invalid MAC', 'Hash verification failed', all the other DPT stuff (except timeout)) for unknown reasons, or where we don't want certain data by protocol decision, with 'NetworkId mismatch' as the most obvious case where we just don't want to connect to a peer on a different network.
All this is taking place under the assumption that our actual devp2p implementation around these kinds of failures is correct. So we generally don't want these errors displayed again and again when they are simply to be expected. If they are occurring too often though, this might be an indicator that there is something wrong with our implementation. That is sometimes hard to distinguish. But it's likely a good idea to deactivate these filters temporarily from time to time and see what errors come through and in what quantity.
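To make the mechanism concrete, a small sketch of such an ignore list with a temporary "show everything" switch. The two error strings come from the diff above; the function and flag names are made up for the example.

```ts
// Sketch: known/expected error messages that are filtered from the log output.
// Only the two string entries appear in the actual diff; the rest is illustrative.
const IGNORED_ERRORS = [
  'Invalid MAC',
  // Client
  'Handshake timed out', // Protocol handshake
]

// Temporarily set this to true (or drive it via an env variable) to deactivate
// the filter and see which errors come through and in what quantity.
const SHOW_ALL_ERRORS = false

function shouldLogError(error: Error): boolean {
  if (SHOW_ALL_ERRORS) return true
  return !IGNORED_ERRORS.some((msg) => error.message.includes(msg))
}

// Usage: only surface errors that are not on the expected/ignored list.
function logPeerError(error: Error): void {
  if (shouldLogError(error)) {
    console.error(error)
  }
}
```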
This looks great...
Is the proposal to ban bad peers a separate PR?
The improvements to the error management here will be super helpful 🙂
@cgewecke Thanks! Yes, banning bad peers would be a separate PR. As stated, my personal judgement on this is that it shouldn't be too urgent, so we could very well address it in 3-4 weeks or so; on the other hand it also shouldn't be overly complex to tackle earlier.
This PR aims to fix the latest (obvious) client sync errors, specifically:
Will keep this as "WIP" for another hour or up to half a day; it should be open for review soon though.