feat(resharding): fix 4 places related to outgoing receipts #12598
Conversation
LGTM
Although I admit I might not fully grasp the implications of the most subtle changes.
chain/chain/src/chain.rs
Outdated
@@ -2689,6 +2696,9 @@ impl Chain {
    ));
}
root_proofs.push(root_proofs_cur);
(target_shard_id, _) = self
Why is `target_shard_id` set here and not right after

let ReceiptProofResponse(block_hash, receipt_proofs) = receipt_response;
let block_header = self.get_block_header(block_hash)?.clone();
let block = self.get_block(block_hash)?;

?
Because `target_shard_id` was already known for `block_hash`, and this function moves the shard id one block back, so it must happen at the end. (Anyway, I reverted the changes in this file.)
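To illustrate the ordering argument, here is a minimal, self-contained sketch (made-up helper and types, not nearcore's API): the shard id is valid for the block currently being processed, so remapping it to the previous block's shard id has to happen at the end of each iteration, not at the start.

```rust
type ShardId = u64;

// Stand-in for a "map the shard id one block back" lookup; in a real resharding
// this could map a child shard back to its parent.
fn prev_shard_id(_prev_hash: &str, shard_id: ShardId) -> ShardId {
    shard_id
}

fn walk_back(block_hashes_newest_first: &[&str], mut target_shard_id: ShardId) {
    for block_hash in block_hashes_newest_first {
        // `target_shard_id` is the shard id valid at `block_hash`; use it here.
        println!("processing block {block_hash} for shard {target_shard_id}");
        // Only after the block is processed do we step the shard id one block back.
        target_shard_id = prev_shard_id(block_hash, target_shard_id);
    }
}

fn main() {
    walk_back(&["h3", "h2", "h1"], 5);
}
```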
@@ -468,6 +477,7 @@ fn validate_source_receipt_proofs(
    epoch_manager: &dyn EpochManagerAdapter,
    source_receipt_proofs: &HashMap<ChunkHash, ReceiptProof>,
    receipt_source_blocks: &[Block],
    target_shard_layout: ShardLayout,
Maybe adjust the comment on `validate_source_receipt_proofs`. Seems we do filtering now.
.collect_vec();

let anchor_hash = get_anchor_hash(&clients);
let nonce = get_next_nonce(&test_loop_data, &node_datas, &sender);
👍
chain/chain/src/store/mod.rs
Outdated
    }
};

// If the shard layout changed we need to filter receipts to
So now, is it more like "in some places" we need to filter receipts and "in other places" (state witness) we don't 😄 ?
🤯 🤯
removed the comment
LGTM, great stuff, it's really unblocking us!
Please see the comment about `get_prev_shard_id`; this is the only serious one here.
@@ -2573,14 +2573,12 @@ impl Chain {
let sync_prev_block = self.get_block(sync_block_header.prev_hash())?;
There is a "consistency rules" comment above - is it still accurate? If not, can you update it please?
I'll skip it for now because, as Marcelo showed, this part of the code shouldn't be changed, so I reverted it. The fix is only related to receipts aggregation.
Actually, I'd better rename (prev_shard_id, prev_shard_index) to (sync_prev_shard_id, sync_prev_shard_index) to keep it in line with sync_prev_block. Otherwise the shard id manipulation is scary.
chain/chain/src/chain.rs
Outdated
// Chunk header here is the same chunk header as at the `current` height.
let sync_prev_hash = sync_prev_block.hash();
let chunks = sync_prev_block.chunks();
let chunk_header = chunks.get(prev_shard_index).ok_or(Error::InvalidShardId(shard_id))?;
let chunk_header = chunks.get(shard_index).ok_or(Error::InvalidShardId(shard_id))?;
Do you need to support both old epoch state sync and current epoch state sync here, depending on the protocol version?
Wait, hmm, was the previous code here really a bug? We're indexing into `sync_prev_block.chunks()` here. In both kinds of state sync (prev epoch and current epoch), we want to get the shard index that corresponds to `sync_prev_block`. In the two cases:

Old prev-epoch state sync: `sync_prev_block` and `sync_block` are in different epochs. So the shard index for `shard_id` should be taken from `prev_epoch_id`'s shard layout. If there's a resharding, we have a problem because `shard_id` might not have existed in `prev_epoch_id`, but we assume we're not doing any resharding with prev epoch state sync. So here, since we assume there hasn't been any resharding, I guess we can assume the shard layouts for `prev_epoch_id` and `sync_block_epoch_id` are the same, and it doesn't really matter which one we use, but using the one for `prev_epoch_id` feels more correct. (Can the shard index <-> shard ID mapping change without a resharding? In some test code or something?)

New current-epoch state sync: `sync_prev_block` and `sync_block` are in the same epoch, like you mention in the PR description. So in that case it probably doesn't really even matter which shard layout we use, since they're going to be the same. But even though it doesn't matter, it still feels a bit more correct to just go with `prev_epoch_id` (which is sort of a misleading name now, since it's really the epoch ID of the prev block, not the prev epoch ID).
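To make the shard index <-> shard ID point concrete, here is a minimal stand-in sketch (simplified types, not nearcore's actual ShardLayout API): the same shard id can map to a different index, or to no index at all, depending on which epoch's layout you consult, which is why the lookup should use the layout of `sync_prev_block`'s epoch.

```rust
type ShardId = u64;

// Simplified stand-in for a shard layout: just an ordered list of shard ids.
struct ShardLayoutLike {
    shard_ids: Vec<ShardId>,
}

impl ShardLayoutLike {
    fn shard_index(&self, shard_id: ShardId) -> Option<usize> {
        self.shard_ids.iter().position(|id| *id == shard_id)
    }
}

fn main() {
    let prev_epoch_layout = ShardLayoutLike { shard_ids: vec![0, 1, 2, 3] };
    // Hypothetical post-resharding layout where shard 3 was split into 4 and 5.
    let current_epoch_layout = ShardLayoutLike { shard_ids: vec![0, 1, 2, 4, 5] };

    assert_eq!(prev_epoch_layout.shard_index(3), Some(3));
    // The same shard id may not even exist under the other epoch's layout.
    assert_eq!(current_epoch_layout.shard_index(3), None);
}
```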
Yeah, sorry, it wasn't a bug. Reverted - thank you very much for reviewing!
I think I automatically assumed that prev_epoch_id means the previous epoch. But there is also the consistency rules comment... I'll review it some day.
- renamed parameters to sync_prev_shard_index, etc.
chain/chain/src/chain.rs
Outdated
    chunk_proofs.get(prev_shard_index).ok_or(Error::InvalidShardId(shard_id))?.clone();
let block_header =
    chunk_proofs.get(shard_index).ok_or(Error::InvalidShardId(shard_id))?.clone();
let chunk_block_header =
gotta love missing chunks 💀
chain/chain/src/chain.rs
Outdated
let (prev_shard_id, prev_shard_index) = self
    .epoch_manager
    .get_prev_shard_id(chunk_block_header.prev_hash(), shard_id)?;
It took me a while to convince myself that this is indeed prev and not prev prev.
You mean that `get_prev_shard_id` accepts `prev_hash`? Yeah... I'll go ahead and rename it to `get_prev_shard_id_from_prev_hash`.
Again, kind of like the above comment: with current epoch state sync, `prev_block` should be in the same epoch as `sync_block`, so `get_prev_shard_id()` should just return `shard_id` here. Not against adding this though, I guess it's more robust/clear.
Lol. Yeah, I agree.
We really, really need to make this logic simpler and try to extract common parts. This block iteration and receipts aggregation is kinda duplicated in many places, and where we try to use common functions, there are usually too many parameters. I had some trouble with `get_state_witness_block_range`.
I also think it's worth showing my thinking process.
First - I didn't know (forgot) that the new sync hash guarantees prev_shard_id = shard_id, so I added this.
Second - I understood your comment, but thought that we still need to call get_prev_shard_id to support the old sync hash.
Third - I realised that we are likely to enable CurrentEpochStateSync no later than SimpleNightshadeV4, so we don't need to call it.
But that's too much thinking, which justifies the need for refactoring. Anyway, I guess yeah, the change makes the logic slightly better.
chain/chain/src/chain.rs
Outdated
let prev_chunk_header = prev_block
    .chunks()
    .get(prev_shard_index)
    .ok_or(Error::InvalidShardId(shard_id))?
    .ok_or(Error::InvalidShardId(prev_shard_id))?
nice, my bad
After all, it isn't actually a bug, because we discussed below that prev_shard_id = shard_id. But I'll keep it this way for readability.
chain/chain/src/store/mod.rs
Outdated
/// TODO validate if this logic works for Resharding V3.
fn reassign_outgoing_receipts_for_resharding(
pub fn reassign_outgoing_receipts_for_resharding(
Is the TODO done? Can you remove it if so?
}
latest_height.set(tip.height);

let mut rng = rand::thread_rng();
This is debatable, but it's more common in our tests to use seedable rng to make it all deterministic and reproducible. If you want the extra randomness you can add multiple tests with different seeds.
Agree, perhaps it's better to find a seed which triggers the error (without the fix) and always use that one.
But is it more common? A search for `rand::thread_rng()` gives more results than one for `SeedableRng`.
And fixing the seed defeats the purpose of randomness. If we fix the seed, we might as well find the specific pattern and hardcode that pattern instead of keeping the randomness. Without a fixed seed, the test will keep finding patterns for us in order to catch other bugs.
Then it turns into a question - what is better: CI suddenly finding a failure for us, or CI being more stable and reproducible? I think the first is better.
What I probably should do already is run 10 iterations of this test to reduce flakiness. But I need more evidence for why a seedable random is better.
Ah perhaps I just saw more of it than the pure random.
I think we're mixing two different types of testing. The first one is the unit / integration test where determinism should be preferred in my opinion. The second is randomized / fuzzing tests where we want to cover as many cases as possible. Given we don't have a good infra for fuzzing tests I'm fine with keeping the integration tests random.
I do have a small request though - can you pick a seed at random, log it, and then use a deterministic rng? This way, if the test breaks, we will still be able to reproduce it easily by hardcoding the seed to the logged one.
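A minimal sketch of that request, assuming the rand 0.8 API (`thread_rng` to draw the seed, `StdRng` for the deterministic part):

```rust
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

fn main() {
    // Draw a fresh seed per run, but log it so a failing run can be reproduced
    // later by hardcoding the logged value here.
    let seed: u64 = rand::thread_rng().gen();
    println!("test seed: {seed}");

    // Drive the rest of the randomized test with a deterministic generator.
    let mut rng = StdRng::seed_from_u64(seed);
    let sample: u32 = rng.gen_range(0..100);
    println!("first sample: {sample}");
}
```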
let mut rng = rand::thread_rng();
for _ in 0..NUM_TRANSFERS_PER_BLOCK {
    let sender = account_ids.choose(&mut rng).unwrap().clone();
    let receiver = account_ids.choose(&mut rng).unwrap().clone();
❤️ This is awesome, we totally need tests with some random traffic across all shards.
    (ShardUId { shard_id: 4, version: 3 }, 0..1),
]);
let mut params = TestReshardingParameters::new()
    .base_shard_layout_version(1)
I would advise either:
- use the default
- make this test parametrized by the version and have one test for each (a sketch follows below)
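A sketch of the second option (illustrative test names and helper only, not the actual harness): one body parametrized by the base shard layout version, with a thin #[test] wrapper per version.

```rust
// Hypothetical helper: the real body would build TestReshardingParameters with
// .base_shard_layout_version(version) and run the resharding scenario.
fn run_resharding_test_with_base_layout(version: u64) {
    assert!(version >= 1, "unsupported base shard layout version {version}");
    // ... build the test parameters and run the scenario here ...
}

#[test]
fn test_resharding_v3_base_layout_v1() {
    run_resharding_test_with_base_layout(1);
}

#[test]
fn test_resharding_v3_base_layout_v2() {
    run_resharding_test_with_base_layout(2);
}
```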
test_loop_data: &TestLoopData,
node_datas: &[TestData],
Just out of curiosity, why this change? Did you have some trouble with `mut`?
It's because I didn't have `TestLoopEnv` for event handling, and I also removed `mut` because it is a read-only call.
Looks sane to me, thanks for fixing it! I'm just going to try to check the receipts logic change in `compute_state_header()` to make sure it still works with `set_state_finalize()`, because it feels tricky to me.
Force-pushed from 4b4f4c6 to 5c86b52 (Compare)
        done = false;
    }
}
let done = num_new_chunks.iter().all(|num_chunks| *num_chunks >= 2);
Ahh, was this a bug before too? So this change will push out the sync hash one more block (with no missed chunks, the 4th block of the epoch instead of the 3rd). Was the previous one incorrect somewhere? The logic in that state sync header generation is a bit tricky, I could definitely have missed something there.
Oh, never mind, I see your message on Zulip, yeah this is a good fix, thanks!
I think this was never caught because the state_sync.rs tests set up skipped chunks, but this won't be hit there because there's no resharding. And the resharding shard shuffling test doesn't skip chunks. So maybe it would be good to add some resharding tests with drop_chunks_by_height() or something.
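For reference, a tiny self-contained restatement of the changed condition from the diff above (simplified types): a block qualifies as the sync point only once every shard has accumulated at least two new chunks since the epoch start.

```rust
// `num_new_chunks[i]` counts new chunks for shard index `i` since the epoch start.
fn is_sync_block_candidate(num_new_chunks: &[u64]) -> bool {
    num_new_chunks.iter().all(|num_chunks| *num_chunks >= 2)
}

fn main() {
    assert!(!is_sync_block_candidate(&[2, 1, 2])); // one shard still needs a chunk
    assert!(is_sync_block_candidate(&[2, 2, 3])); // all shards have at least two
}
```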
Nice, LGTM on the state sync stuff, although I am not super familiar with the chunk state witness code.
Force-pushed from 38617e7 to c1d671f (Compare)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #12598      +/-   ##
==========================================
+ Coverage   70.47%   70.50%   +0.02%
==========================================
  Files         845      845
  Lines      172024   172064      +40
  Branches   172024   172064      +40
==========================================
+ Hits       121241   121306      +65
+ Misses      45672    45647      -25
  Partials     5111     5111

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
I'm not doing it separately because for my test setups partial fixes didn't work, and we are shipping soon... So I'll explain everything here.

On the shard split boundary, we must filter outgoing receipts from parent chunks, because not all of them actually target the first chunk in our shard. But the chunk proof we send to chunk validators must include all receipts. That's why we need to add the awful parameter `filter_receipts` to skip filtering when we generate the proof and to enable filtering when we apply the chunk. Also, filtering was completely missing on chunk validation; added that. The `verify_path` assertion was failing.
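To illustrate the filtering idea, a self-contained sketch with simplified stand-in types (the real code works with nearcore's shard layout and receipt types): on the split boundary, a receipt inherited from the parent chunk is kept only if its receiver maps to the target shard under the new layout, while the proof sent to chunk validators keeps everything.

```rust
type ShardId = u64;

// Simplified receipt: only the receiver account matters for routing here.
struct Receipt {
    receiver_id: String,
}

// Stand-in for mapping an account id to its shard under the (post-split) layout:
// route by the first letter against a boundary account prefix.
fn account_to_shard(account_id: &str, boundary: char) -> ShardId {
    if account_id.chars().next().unwrap_or('a') < boundary { 0 } else { 1 }
}

// Keep only receipts that actually target `target_shard_id` under the new layout;
// proof generation would skip this step and keep everything.
fn filter_receipts_for_shard(
    receipts: Vec<Receipt>,
    target_shard_id: ShardId,
    boundary: char,
) -> Vec<Receipt> {
    receipts
        .into_iter()
        .filter(|r| account_to_shard(&r.receiver_id, boundary) == target_shard_id)
        .collect()
}

fn main() {
    let receipts = vec![
        Receipt { receiver_id: "alice.near".to_string() },
        Receipt { receiver_id: "zoe.near".to_string() },
    ];
    let kept = filter_receipts_for_shard(receipts, 0, 'm');
    assert_eq!(kept.len(), 1); // only alice.near routes to shard 0
}
```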
[OUTDATED] `compute_state_response_header` wasn't fully reflecting the resharding logic after moving the sync hash to the current epoch. First, the last chunk's shard layout corresponds to the same epoch id, not the previous one. Second, on aggregating incoming receipts, the shard layout actually **may** change, so we need to maintain the target shard id properly. CurrentEpochStateSync provides a sync_hash for which there are 2 chunks for each shard. However, the state sync code goes 1 block back, so it gives only 1 chunk. To get incoming receipts for this chunk, we will cross the epoch and resharding boundary. So I move sync_hash one block forward for now.
These two cases are caught by `test_resharding_v3_shard_shuffling_intense`, which creates nontrivial outgoing receipt distributions on the resharding boundary. The next two cases are more insidious and are caught by patch #12597. `test_resharding_v3_outgoing_receipts_from_splitted_shard` and similar tests were useful. I tried to make a proper test, but I think I need to enable single shard tracking everywhere. For that I tried to add an rpc node tracking all shards, but it wasn't simple because we have assumptions on what the rpc node account id is.
Once I understood what `ChainStore::reassign_outgoing_receipts_for_resharding` is for, I understood as well that it must be called for chunk validators. It's better to read its definition during review. Without it I was getting `InvalidReceiptsProof` for one of the parent shards.

`MainStateTransitionCache` must cache results by child shard id, not by parent, otherwise there is a collision. For shard merging we probably need to key entries by both shard ids.

Maybe we can un-ignore some of the tests after that, I didn't check.
It also uses #12586.
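To illustrate the MainStateTransitionCache point, a minimal sketch with simplified types (not the actual cache): keying by the child shard id keeps the two children of a split parent from colliding on one entry.

```rust
use std::collections::HashMap;

type BlockHash = [u8; 32];
type ShardId = u64;

// Cached state-transition data, keyed by (block, child shard id). Keying by the
// parent shard id instead would make both children of a split overwrite each other.
#[derive(Default)]
struct StateTransitionCacheSketch {
    entries: HashMap<(BlockHash, ShardId), Vec<u8>>,
}

impl StateTransitionCacheSketch {
    fn insert(&mut self, block: BlockHash, child_shard_id: ShardId, data: Vec<u8>) {
        self.entries.insert((block, child_shard_id), data);
    }

    fn get(&self, block: &BlockHash, child_shard_id: ShardId) -> Option<&Vec<u8>> {
        self.entries.get(&(*block, child_shard_id))
    }
}

fn main() {
    let mut cache = StateTransitionCacheSketch::default();
    let block = [0u8; 32];
    // Two children of the same split parent get distinct entries.
    cache.insert(block, 4, vec![1]);
    cache.insert(block, 5, vec![2]);
    assert_eq!(cache.get(&block, 4), Some(&vec![1]));
    assert_eq!(cache.get(&block, 5), Some(&vec![2]));
}
```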