Bad signature synchronization error of past blocks with Aura #10103
Comments
@remzrn we face a similar issue on our networks. Did you get it resolved already?
@DylanVerstraete try to revert this PR: #9132, it may be the problem.
Reverting only that PR does not really work, since this code depends on the
The important point is probably to add a call to initialize the block before fetching the authorities.
Something like this? https://github.com/DylanVerstraete/substrate/blob/c66577f77b10269f4f7b2070fad7f16803170e60/client/consensus/aura/src/import_queue.rs#L208 I tried it, but it doesn't work; I still get the same error...
I will try to explain this with a simple example: at block 10 there are 2 authorities (Bob, Alice), in block 11 a call was made to add an authority (Ferdie), and in block 12 the grandpa authorities changed. The client can import block 10 and block 11, but it fails at block 12. When block 12 is being imported, it tries to read the authorities from the parent block, see:
When I patch this code to request the authorities from the currently imported block (12) I get the following error:
All of this does not really explain why it worked on Substrate V3, so I am more confused than ever now.
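A minimal, self-contained sketch of the two lookup strategies discussed in this thread; the types and names below are made up for illustration and are not the substrate API. The 4.0.0-dev import queue reads the authorities from the parent block's state as-is, while the behaviour being asked for initializes the block first, so a change enacted by the imported block is already visible.

```rust
// Illustrative sketch only; not the actual substrate code. `State` stands in
// for the runtime state at the parent block, and `pending_change` for an
// authority change that only becomes effective once the imported block is
// initialized.
struct State {
    authorities: Vec<&'static str>,
    pending_change: Option<Vec<&'static str>>,
}

// What the 4.0.0-dev import queue is described as doing in this thread:
// read the set straight from the parent state.
fn authorities_from_parent_state(parent: &State) -> Vec<&'static str> {
    parent.authorities.clone()
}

// The proposed fix / old behaviour: initialize the block first (enacting any
// pending change), then read the set.
fn authorities_after_initialize_block(parent: &State) -> Vec<&'static str> {
    parent
        .pending_change
        .clone()
        .unwrap_or_else(|| parent.authorities.clone())
}

fn main() {
    // Parent state of "block 12" in the example above: Ferdie was scheduled in
    // block 11 but only becomes active when block 12 is initialized.
    let parent = State {
        authorities: vec!["alice", "bob"],
        pending_change: Some(vec!["alice", "bob", "ferdie"]),
    };
    assert_eq!(authorities_from_parent_state(&parent), ["alice", "bob"]);
    assert_eq!(
        authorities_after_initialize_block(&parent),
        ["alice", "bob", "ferdie"]
    );
}
```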
Okay, I think I found the issue behind all of this. For adding/removing validators we use https://github.com/gautamdhameja/substrate-validator-set, which implements the
I reworked the implementation of authorities in our runtime API: https://github.com/threefoldtech/tfchain/pull/462/files and upgraded the runtime; the sync now properly continues because it can fetch the authorities from the substrate-validator-set pallet instead of from Aura. Still not sure why it worked on Substrate V3 but not on
@bkchr Seems like the fix from my last comment is not actually working. What could I be missing here?
@DylanVerstraete : I thought I did, but we discovered a couple of days ago that, unfortunately, we didn't. @bkchr : The problem actually gets nastier when initializing the block in the Aura import queue. It synchronizes the past perfectly, but then you can get into a situation where the validator set changes in such a way that the validator due to produce the block is one that gets excluded on the era change. That bricks the chain: the validator produces the block, but in the header-checking function the authorities have already been updated by the block initialization, so the block is seen as invalid because it was produced by an unknown authority. All validators then drop it, because the new authority set is used for verification, but the block that should enact this change is never accepted, which totally bricks the chain. I don't see any elegant way to solve this, as the logic changed in substrate at that time, so we have to choose between not being able to sync from scratch if we use the current substrate, or risking a chain halt at any point if we modify it to initialize the block.
@remzrn My applied fix does not work because when syncing history the node uses the runtime at that block. What we have noticed in our research about this issue is the following: we use https://github.com/gautamdhameja/substrate-validator-set for adding/removing validators. The previous version (which was compatible with substrate 3.0.0) used to force-rotate a session when you added a validator, so the new validator would create a block right after getting added. This now causes issues when trying to sync past such a block with any substrate 4.0.0-dev version, because the author of that block is not recognised in the Aura authorities. We migrated to a substrate 4.0.0-dev version some weeks ago. To sync our network from 0 we use the latest binary that was still on substrate 3.0.0. This syncs properly until the block where we upgraded our network to substrate 4.0.0; it stops syncing when it reaches the point where we performed the upgrade, and from there we need to use our latest binary (substrate 4.0.0-dev) to sync up to the current head of our chain. We chose not to modify any substrate-related packages because we don't really know in detail how they work together.
@DylanVerstraete : Yes, it makes sense that a change in the runtime does not help with the past, which is why we tried to slightly amend substrate.
@remzrn I'm not really following why your change would not work for future blocks that contain an authority set change. Isn't the point of your change to circumvent this?
The point of the change, at a meta level, is just to be able to comply with the logic that was enforced when the chain history was built, and initializing the block seems to achieve this. The problem is the disconnect between who builds the block and who the authorities are. I will try to give an example: three validators v0, v1, v2 are in the set. Block 100 is an era change at which v0 and v1 stop, v3 and v4 are set to join, and v2 remains. Block 100 will be produced by, say, v1, which sends the block to everybody (in the new substrate logic the validators are still the old ones at the turn block, so v1 can produce the block). The other validators receive this block and check it. They call the verify function, which checks the header, but with the "fix" it initializes the block, which applies the authority change, so v1 is seen as illegitimate and they all reject block 100. Each validator in its turn will then attempt to produce another one, but since the authority change did not get enacted (the block was not approved), we still have the old validators trying to produce blocks and subsequently rejecting them. Only v2 can actually produce blocks properly, because it is part of both the new and the old set. PS: This example is simplified, but we hit this case on a testnet just a few days before our planned release. It may sound convoluted, but it does happen and it will happen.
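A tiny, purely illustrative simulation of that scenario (no substrate types involved): once the verifier enacts the authority change before checking the header, every block authored by a validator that is only in the old set is rejected, so the enacting block never lands and only the overlap of the two sets can still get blocks accepted.

```rust
// Illustrative only: models the era-change deadlock described above.
fn verifier_accepts(enacted_set: &[&str], author: &str) -> bool {
    // With the block-initializing "fix", the verifier has already applied the
    // authority change, so it only accepts authors from the new set.
    enacted_set.contains(&author)
}

fn main() {
    let old_set = ["v0", "v1", "v2"]; // still the producers at the turn block
    let new_set = ["v2", "v3", "v4"]; // what the verifier checks against

    for author in old_set {
        let accepted = verifier_accepts(&new_set, author);
        println!("block produced by {author}: accepted = {accepted}");
    }
    // Only v2's blocks are accepted, so the chain stalls whenever the slot
    // belongs to v0 or v1, and the block enacting the change never gets in.
}
```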
This is true, we have done this on 2 networks already and it all went smoothly, since no authority changes were made between running the old binary and the new binary (we don't have a notion of eras on our chain, only sessions).
The problem with your fix is that you did not revert the commit/PR I mentioned. If you take a close look at the code that I removed back then, I removed code that fetched the authorities from the cache. This cache was filled on import here. So, while writing this, I realized that we actually never called
@bkchr will this code be reverted in the next release?
@DylanVerstraete which code?
@bkchr the removal of the cache: substrate/client/consensus/aura/src/lib.rs, lines 549 to 555 in ede9bc1
No, I removed this on purpose. Back then I didn't think that anyone would use Aura in production. However, we are working on a solution for you. @DylanVerstraete in what way did you fix your problem? Or have you not fixed it yet? I don't really get why the old binary stops syncing. Do you have a db from right before the syncing stops? Then I could try this on my own.
@bkchr Sorry for the confusion, but we did not really fix our problem. We circumvent it by running the old binary until the block where we upgraded our network to polkadot-0.9.24 dependencies (syncing stops there automatically). We then run our upgraded version of our chain to sync the remainder. To get to the point where the syncing stops you can do this:
Syncing will stop at height:
This explains why the syncing stops: #10103 (comment)
@bkchr : Thanks for the pointers, but I think it's not the only problem. I thought about the digest logs and the cache, but at the time I concluded that it was not where the information about the authorities could come from, because the cache does not seem to be implemented for full clients. substrate/client/api/src/backend.rs, line 156 in ede9bc1
This trait is only implemented in 3 places: the database, the in-memory client, and the light client, but only the light client actually implements it (I suppose that, for the purpose of caching auth data, it should be relevant only in memory). light.rs: substrate/client/light/src/backend.rs, lines 304 to 306 in ede9bc1
db: substrate/client/db/src/lib.rs, lines 786 to 788 in ede9bc1
in_mem.rs: substrate/client/api/src/in_mem.rs, line 555 in ede9bc1
I checked the history yesterday, and it seems that there used to be an in-memory cache that got removed by this commit: Could it be that the consensus actually got broken by this (the authority changes would not get cached, so a validator from the old authorities could try to produce a block, but that block would fail verification later on, because the block was still initialized before being verified, so the expected author belongs to the new authorities), and that your later removal of the block initialization actually fixed it (though by taking a different convention at the turn of the epoch than was applicable before)? Despite this, it also looks like the caching traits etc. have all been removed by further commits, so re-enabling the cache would not be possible without forking at least the blockchain primitives, their backend, and a big part of the client, and forward-porting the old logic there.
This PR should fix your problem: #12492. It introduces a compatibility mode. I added some docs and explanations to the PR and the code itself. If you have more questions, I'm here to help you ;)
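For later readers, a self-contained sketch of the idea behind that compatibility mode; the enum is written out here only for illustration, so check #12492 and the sc-consensus-aura docs for the real type and how it is wired into the import queue and block authoring. Below a configured block number the verifier reproduces the old behaviour of initializing the block before reading the authorities; at and above it, it uses the current behaviour.

```rust
// Illustrative restatement of the compatibility-mode idea from #12492; the
// real type lives in sc-consensus-aura and is passed to the import queue and
// the block authoring worker. Check the PR for the exact names and semantics.
#[allow(dead_code)]
enum CompatibilityMode<N> {
    // Always use the current behaviour (read authorities from the parent state).
    None,
    // Below the given block number, initialize the block before reading the
    // authorities, like the pre-4.0.0-dev code effectively did.
    UseInitializeBlock { until: N },
}

fn use_initialize_block(mode: &CompatibilityMode<u32>, importing: u32) -> bool {
    match mode {
        CompatibilityMode::None => false,
        CompatibilityMode::UseInitializeBlock { until } => importing < *until,
    }
}

fn main() {
    // Hypothetical: the chain switched to the new logic around block 1_000_000.
    let mode = CompatibilityMode::UseInitializeBlock { until: 1_000_000 };
    assert!(use_initialize_block(&mode, 999_999)); // historic blocks
    assert!(!use_initialize_block(&mode, 1_000_000)); // blocks after the switch
}
```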
@bkchr awesome, thanks!
On a testnet it seems the behaviour is as expected before and after the block change (I have tried to put it in the middle of an era, to avoid specific turn behaviours), and it also seems to sync the past blocks as expected. I will let it run to see if I reach the tip of the chain correctly, but usually the problems, if any, pop up rather early, so it seems all good!
Fixed by #12492
Hello everyone,
During the migration of Edgeware, which uses Aura, to substrate 4, I encountered the following issue: upon authority changes at each epoch, the chain fails with a bad signature.
Here is how I traced it:
The error message happens only here (and in babe, but Edgeware does not use babe)
substrate/client/consensus/aura/src/lib.rs, line 512 in 632b323
And the enum is only present (for aura) in the check_header function here:
substrate/client/consensus/aura/src/import_queue.rs, line 106 in 632b323
So I built two nodes, an old one and a new one, with respectively modified substrate versions that printed out this information. It seems that at block 700 (at the end of the first session) the authority set of the working (old) version is extended, ending up with an expected_author which is different from that of my upgraded version, which keeps the same authority set. But I don't really understand what I missed during the migration that caused this, so I went on a bit more:
The function check_header is only used in aura (also in pow and babe, but whatever), and the authorities are fetched there:
substrate/client/consensus/aura/src/import_queue.rs, line 213 in 632b323
Which seems to reference this function:
substrate/client/consensus/aura/src/lib.rs, line 544 in 632b323
Printing the two outputs also shows the difference in the authority set, with 4.0.0-dev not picking up the change, while it was correct in substrate 3.
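To make the expected_author difference concrete, here is a simplified, self-contained sketch of Aura's round-robin author selection; the strings stand in for authority keys and the sets are hypothetical. With a stale authority set of a different size, the index computed for a slot can point at a different authority than the one that actually signed the block, and the signature check then fails with a bad signature.

```rust
// Simplified sketch of the round-robin used by Aura to determine the expected
// author of a slot; the real implementation works on authority public keys.
fn slot_author<'a>(slot: u64, authorities: &'a [&'a str]) -> Option<&'a str> {
    if authorities.is_empty() {
        return None;
    }
    let idx = (slot % authorities.len() as u64) as usize;
    authorities.get(idx).copied()
}

fn main() {
    // Hypothetical sets: the old (working) node picked up the extension of the
    // authority set at block 700, the upgraded node did not.
    let stale_set = ["auth_a", "auth_b"];
    let extended_set = ["auth_a", "auth_b", "auth_c"];
    let slot = 701;
    // Different set sizes make the round-robin land on different authors, so
    // the signature is checked against the wrong key and is reported as bad.
    assert_ne!(slot_author(slot, &stale_set), slot_author(slot, &extended_set));
    println!(
        "stale set expects {:?}, extended set expects {:?}",
        slot_author(slot, &stale_set),
        slot_author(slot, &extended_set)
    );
}
```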
I migrated the chain following the examples in node-template, and it does not seem like any change is required in the runtime API implementation for the Aura API. Besides, since I am syncing past blocks, I suppose the runtime that is executed is the one that was effective at production time, so the authorities should be the same.
Unfortunately, I could not get any further since the Runtime API construction implementation is too complicated for me to understand and involves macros etc.
I also have a weird feeling that the issue might be somehow related to this one too, with some information from a block not being processed or not going through the adequate call chain.
Any help or pointer would be really appreciated. I am ready to try things to trace further, but since the runtime that is executed is a very old one, I cannot print the state further once the API is called.