
feat(resharding): fix 4 places related to outgoing receipts #12598

Merged
merged 11 commits into near:master from rv3-orx-bugs on Dec 13, 2024

Conversation

Member

@Longarithm Longarithm commented Dec 11, 2024

I'm not landing these fixes separately because partial fixes didn't work for my test setups, and we are shipping soon... So I'll explain everything here.

  1. Filtering receipts

On the shard split boundary, we must filter outgoing receipts from parent chunks, because not all of them actually target the first chunk in our shard. But the chunk proof we send to chunk validators must include all receipts. That's why we need to add the awful parameter filter_receipts: skip filtering when we generate the proof, and enable filtering when we apply the chunk.
Filtering was also completely missing on chunk validation; added that.
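
For illustration, a rough self-contained sketch of the idea behind filter_receipts; Receipt, ShardId and shard_for_account below are simplified stand-ins, not the actual nearcore types or shard-layout lookup:

```rust
// Simplified stand-ins for illustration only; not the real nearcore types.
type ShardId = u64;

#[derive(Clone, Debug)]
struct Receipt {
    receiver_account: String,
}

// Toy shard assignment by first byte of the account id, just for the example.
fn shard_for_account(account: &str, num_shards: u64) -> ShardId {
    account.as_bytes().first().map(|b| u64::from(*b) % num_shards).unwrap_or(0)
}

/// Outgoing receipts of the parent chunk, as seen by the child shard.
/// Proof generation keeps everything (`filter_receipts = false`);
/// chunk application keeps only receipts targeting `target_shard_id`.
fn collect_outgoing_receipts(
    parent_receipts: &[Receipt],
    target_shard_id: ShardId,
    num_shards: u64,
    filter_receipts: bool,
) -> Vec<Receipt> {
    parent_receipts
        .iter()
        .filter(|r| {
            !filter_receipts
                || shard_for_account(&r.receiver_account, num_shards) == target_shard_id
        })
        .cloned()
        .collect()
}

fn main() {
    let receipts = vec![
        Receipt { receiver_account: "alice.near".into() },
        Receipt { receiver_account: "bob.near".into() },
    ];
    // All receipts go into the proof; only shard-0 receipts are applied.
    let for_proof = collect_outgoing_receipts(&receipts, 0, 2, false);
    let for_apply = collect_outgoing_receipts(&receipts, 0, 2, true);
    println!("proof: {}, apply: {}", for_proof.len(), for_apply.len());
}
```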

  2. State dumper fix

verify_path assertion was failing.

[OUTDATED] `compute_state_response_header` wasn't fully reflecting resharding logic after moving the sync hash to the current epoch. First, the last chunk's shard layout corresponds to the same epoch id, not the previous one. Second, while aggregating incoming receipts, the shard layout actually **may** change, so we need to maintain the target shard id properly.

CurrentEpochStateSync provides a sync_hash for which there are 2 chunks for each shard. However, the state sync code goes 1 block back, so it gives only 1 chunk. To get incoming receipts for that chunk, we would cross the epoch and resharding boundary. So I move sync_hash one block forward for now.

These two cases are caught by test_resharding_v3_shard_shuffling_intense, which creates nontrivial outgoing receipt distributions on the resharding boundary. The next two cases are more insidious and are caught by patch #12597; test_resharding_v3_outgoing_receipts_from_splitted_shard and similar tests were useful.
I tried to make a proper test, but I think I need to enable single-shard tracking everywhere. For that I tried to add an rpc node tracking all shards, but it wasn't simple because we have assumptions about what the rpc node's account id is.

  3. Reassignment

Once I understood what ChainStore::reassign_outgoing_receipts_for_resharding is for, I realised that it must be called for chunk validators as well. It's better to read its definition during review. Without it I was getting InvalidReceiptsProof for one of the parent shards.
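
For review context, a hedged self-contained sketch of what such a reassignment step can look like; it assumes an "all parent receipts go to the lowest-id child" rule, and the authoritative logic is ChainStore::reassign_outgoing_receipts_for_resharding itself, which is worth reading directly:

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; not the real nearcore types.
type ShardId = u64;

#[derive(Clone, Debug, Default)]
struct ReceiptProof {
    receipt_ids: Vec<String>,
}

/// Assumed rule for this sketch: after a split, every outgoing receipt that was
/// addressed to the parent shard is attributed to the lowest-id child shard,
/// and the remaining children start with empty incoming receipts.
fn reassign_for_resharding(
    proofs_by_shard: &mut HashMap<ShardId, ReceiptProof>,
    parent_shard: ShardId,
    children: &[ShardId],
) {
    let Some(lowest_child) = children.iter().min().copied() else { return };
    if let Some(parent_proof) = proofs_by_shard.remove(&parent_shard) {
        proofs_by_shard
            .entry(lowest_child)
            .or_default()
            .receipt_ids
            .extend(parent_proof.receipt_ids);
    }
    for child in children {
        proofs_by_shard.entry(*child).or_default();
    }
}

fn main() {
    let mut proofs = HashMap::from([(1, ReceiptProof { receipt_ids: vec!["r0".into()] })]);
    reassign_for_resharding(&mut proofs, 1, &[2, 3]);
    assert_eq!(proofs[&2].receipt_ids, vec!["r0".to_string()]);
    assert!(proofs[&3].receipt_ids.is_empty());
}
```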

  4. Cache by target shard id

MainStateTransitionCache must cache results by child shard id, not by the parent's, otherwise there is a collision. For shard merging we will probably need to key entries by both shard ids.
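
As a toy illustration of the collision (this is not the real MainStateTransitionCache): keying by the parent shard id lets the two children overwrite each other, while keying by the child shard id, or by (parent, child) once shard merging exists, keeps the entries apart:

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; not the real nearcore types.
type ShardId = u64;
type StateRoot = [u8; 32];

#[derive(Default)]
struct StateTransitionCache {
    // Keyed by (parent shard, child shard): unambiguous for splits today and
    // still unambiguous if shard merging ever maps several parents to one child.
    entries: HashMap<(ShardId, ShardId), StateRoot>,
}

impl StateTransitionCache {
    fn insert(&mut self, parent: ShardId, child: ShardId, root: StateRoot) {
        self.entries.insert((parent, child), root);
    }
    fn get(&self, parent: ShardId, child: ShardId) -> Option<&StateRoot> {
        self.entries.get(&(parent, child))
    }
}

fn main() {
    let mut cache = StateTransitionCache::default();
    // Two children of parent shard 1; keying only by the parent would collide.
    cache.insert(1, 2, [0u8; 32]);
    cache.insert(1, 3, [1u8; 32]);
    assert_ne!(cache.get(1, 2), cache.get(1, 3));
}
```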

Maybe we can un-ignore some of the tests after that; I didn't check.
This PR also uses #12586.

@Longarithm Longarithm marked this pull request as ready for review December 11, 2024 08:28
@Longarithm Longarithm requested a review from a team as a code owner December 11, 2024 08:28
Contributor

@Trisfald Trisfald left a comment

LGTM

Although I admit I might not fully grasp the implications of the most subtle changes.

@@ -2689,6 +2696,9 @@ impl Chain {
));
}
root_proofs.push(root_proofs_cur);
(target_shard_id, _) = self
Contributor

Why is target_shard_id set here and not right after

            let ReceiptProofResponse(block_hash, receipt_proofs) = receipt_response;
            let block_header = self.get_block_header(block_hash)?.clone();
            let block = self.get_block(block_hash)?;

?

Member Author

Because target_shard_id was already known for block_hash, and this function moves the shard id 1 block back, so it must happen at the end. (Anyway, I reverted the changes in this file.)

@@ -468,6 +477,7 @@ fn validate_source_receipt_proofs(
epoch_manager: &dyn EpochManagerAdapter,
source_receipt_proofs: &HashMap<ChunkHash, ReceiptProof>,
receipt_source_blocks: &[Block],
target_shard_layout: ShardLayout,
Contributor

Maybe adjust the comment on validate_source_receipt_proofs. Seems we do filtering now.

.collect_vec();

let anchor_hash = get_anchor_hash(&clients);
let nonce = get_next_nonce(&test_loop_data, &node_datas, &sender);
Contributor

👍

}
};

// If the shard layout changed we need to filter receipts to
Contributor

So now, is it more like "in some places" we need to filter receipts and "in other places" (state witness) we don't 😄 ?

🤯 🤯

Member Author

removed the comment

Contributor

@wacban wacban left a comment

LGTM, great stuff, it's really unblocking us!

Please see the comment about get_prev_shard_id; that's the only serious one here.

@@ -2573,14 +2573,12 @@ impl Chain {
let sync_prev_block = self.get_block(sync_block_header.prev_hash())?;
Contributor

There is a "consistency rules" comment above - is it still accurate? If not, can you update it please?

Member Author

I'll skip it for now because, as Marcelo showed, this part of the code shouldn't be changed, so I reverted it. The fix is only related to receipt aggregation.

Member Author

Actually I'd better rename (prev_shard_id, prev_shard_index) to (sync_prev_shard_id, sync_prev_shard_index) to keep it in line with sync_prev_block. Otherwise the shard id manipulation is scary.


// Chunk header here is the same chunk header as at the `current` height.
let sync_prev_hash = sync_prev_block.hash();
let chunks = sync_prev_block.chunks();
let chunk_header = chunks.get(prev_shard_index).ok_or(Error::InvalidShardId(shard_id))?;
let chunk_header = chunks.get(shard_index).ok_or(Error::InvalidShardId(shard_id))?;
Contributor

Do you need to support both old epoch state sync and current epoch state sync here depending on the protocol version?

Contributor

Wait hmm, was the previous code here really a bug? We're indexing into sync_prev_block.chunks() here. In both kinds of state sync (prev epoch and current epoch), we want to get the shard index that corresponds to sync_prev_block. In the two cases:

old prev-epoch state sync:

sync_prev_block and sync_block are in different epochs. So the shard index for shard_id should be taken from prev_epoch_id's shard layout. If there's a resharding, we have a problem because shard_id might not have existed in prev_epoch_id, but we assume we're not doing any resharding with prev-epoch state sync.

So here, since we assume there hasn't been any resharding, I guess we can assume the shard layouts for prev_epoch_id and sync_block_epoch_id are the same, and it doesn't really matter which one we use, but using the one for prev_epoch_id feels more correct. (Can the shard index <-> shard ID mapping change without a resharding? In some test code or something?)

new current-epoch state sync:

sync_prev_block and sync_block are in the same epoch, like you mention in the PR description. So in that case it probably doesn't even matter which shard layout we use, since they're going to be the same. But even though it doesn't matter, it still feels a bit more correct to just go with prev_epoch_id (which is sort of a misleading name now, since it's really the epoch ID of the prev block, not the prev epoch ID).
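
As a toy illustration of the invariant being discussed (not the real ShardLayout API): the shard_id -> shard_index mapping belongs to one epoch's shard layout, so chunks of a block must be indexed with the layout of that block's own epoch.

```rust
use std::collections::HashMap;

type ShardId = u64;
type ShardIndex = usize;

// A shard layout reduced to the id -> index mapping it defines for one epoch.
struct ShardLayout {
    index_of: HashMap<ShardId, ShardIndex>,
}

impl ShardLayout {
    fn shard_index(&self, shard_id: ShardId) -> Option<ShardIndex> {
        self.index_of.get(&shard_id).copied()
    }
}

fn main() {
    // Before the split: shard 1 sits at index 1.
    let old_layout = ShardLayout { index_of: HashMap::from([(0, 0), (1, 1)]) };
    // After splitting shard 1 into shards 2 and 3, the indices shift.
    let new_layout = ShardLayout { index_of: HashMap::from([(0, 0), (2, 1), (3, 2)]) };
    // Indexing an old-epoch block's chunks with the new layout would be wrong:
    // shard 1 does not even exist there anymore.
    assert_eq!(old_layout.shard_index(1), Some(1));
    assert_eq!(new_layout.shard_index(1), None);
}
```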

Member Author

Yeah, sorry, it wasn't a bug. Reverted - thank you very much for reviewing!
I think I automatically assumed that prev_epoch_id means the previous epoch. But there is also the consistency rules comment... I should review it some day.

Member Author

  • renamed parameters to sync_prev_shard_index, etc.

chunk_proofs.get(prev_shard_index).ok_or(Error::InvalidShardId(shard_id))?.clone();
let block_header =
chunk_proofs.get(shard_index).ok_or(Error::InvalidShardId(shard_id))?.clone();
let chunk_block_header =
Contributor

gotta love missing chunks 💀

Comment on lines 2604 to 2606
let (prev_shard_id, prev_shard_index) = self
.epoch_manager
.get_prev_shard_id(chunk_block_header.prev_hash(), shard_id)?;
Contributor

It took me a while to convince myself that this is indeed prev and not prev prev.

Member Author

You mean that get_prev_shard_id accepts prev_hash? Yeah... I'll go ahead and rename it to get_prev_shard_id_from_prev_hash.

Contributor

Again, kind of like the above comment: with current epoch state sync, prev_block should be in the same epoch as sync_block, so get_prev_shard_id() should just return shard_id here. Not against adding this though; I guess it's more robust/clear.

Member Author

@Longarithm Longarithm Dec 12, 2024

Lol. Yeah, I agree.

We really really need to make this logic simpler and try to extract common parts. This block iteration and receipts aggregation is kinda duplicated in many places; and where we try to use common functions, there are usually too many parameters. I had some trouble with get_state_witness_block_range.
I also think it's worth showing my thinking process.

First - I didn't know (forgot) that new sync hash guarantees prev_shard_id = shard_id, so I added this.
Second - I understood your comment, but thought that we still need to call get_prev_shard_id to support old sync hash.
Third - I realised that we are likely to enable CurrentEpochStateSync not later than SimpleNightshadeV4, so we don't need to call it.

But that's too much thinking, which justifies the need for refactoring. Anyway, I guess yeah, the change makes the logic slightly better.

let prev_chunk_header = prev_block
.chunks()
.get(prev_shard_index)
.ok_or(Error::InvalidShardId(shard_id))?
.ok_or(Error::InvalidShardId(prev_shard_id))?
Contributor

nice, my bad

Member Author

After all, it isn't actually a bug, because we discussed below that prev_shard_id = shard_id. But I'll keep it this way for readability.

Comment on lines 583 to 584
/// TODO validate if this logic works for Resharding V3.
fn reassign_outgoing_receipts_for_resharding(
pub fn reassign_outgoing_receipts_for_resharding(
Contributor

Is the TODO done? Can you remove it if so?

}
latest_height.set(tip.height);

let mut rng = rand::thread_rng();
Contributor

This is debatable, but it's more common in our tests to use seedable rng to make it all deterministic and reproducible. If you want the extra randomness you can add multiple tests with different seeds.

Contributor

Agree, perhaps it's better to find a seed which triggers the error (without the fix) and always use that one.

Member Author

But is it more common? A search for rand::thread_rng() gives more results than one for SeedableRng.
Also, fixing the seed defeats the purpose of randomness.
If we fix the seed, we might as well find the specific pattern and hardcode that pattern instead of keeping any randomness.
Without a fixed seed, the test will keep finding patterns for us in order to catch other bugs.
Then it turns into a question of what is better: CI suddenly finding a failure for us, or CI being more stable and reproducible? I think the first is better.

What I probably should do already is run 10 iterations of this test to reduce flakiness. But I need more evidence for why a seedable rng is better.

Contributor

Ah, perhaps I just saw more of it than the pure random one.

I think we're mixing two different types of testing. The first is the unit / integration test, where determinism should be preferred in my opinion. The second is randomized / fuzzing tests, where we want to cover as many cases as possible. Given we don't have good infra for fuzzing tests, I'm fine with keeping the integration tests random.

I do have a small request though: can you pick a seed at random, log it, and then use the deterministic rng? This way, if the test breaks, we will still be able to reproduce it easily by hardcoding the seed to the logged one.
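
For illustration, a minimal sketch of that pattern using the rand crate; this is not the actual test-loop code:

```rust
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Draw a random seed, log it, and build a deterministic rng from it.
/// To reproduce a failure, replace the drawn seed with the logged value.
fn make_test_rng() -> StdRng {
    let seed: u64 = rand::thread_rng().gen();
    println!("test rng seed: {seed}");
    StdRng::seed_from_u64(seed)
}

fn main() {
    let mut rng = make_test_rng();
    // The test body would use `rng` everywhere instead of rand::thread_rng().
    let sample: u32 = rng.gen_range(0..100);
    println!("sample: {sample}");
}
```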

Comment on lines 493 to 496
let mut rng = rand::thread_rng();
for _ in 0..NUM_TRANSFERS_PER_BLOCK {
let sender = account_ids.choose(&mut rng).unwrap().clone();
let receiver = account_ids.choose(&mut rng).unwrap().clone();
Contributor

❤️ This is awesome, we totally need tests with some random traffic across all shards.

(ShardUId { shard_id: 4, version: 3 }, 0..1),
]);
let mut params = TestReshardingParameters::new()
.base_shard_layout_version(1)
Contributor

I would advise either (see the sketch below):

  • use the default
  • make this test parametrized by the version and have one test for each
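
A hypothetical sketch of the second option; run_outgoing_receipts_test and the test names below are placeholders, not the real helpers:

```rust
// Placeholder body: build TestReshardingParameters::new()
//     .base_shard_layout_version(base_shard_layout_version)
//     ... and run the test loop.
fn run_outgoing_receipts_test(base_shard_layout_version: u64) {
    let _ = base_shard_layout_version;
}

#[test]
fn test_outgoing_receipts_base_layout_v1() {
    run_outgoing_receipts_test(1);
}

#[test]
fn test_outgoing_receipts_base_layout_v2() {
    run_outgoing_receipts_test(2);
}
```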

Comment on lines +61 to +62
test_loop_data: &TestLoopData,
node_datas: &[TestData],
Contributor

Just out of curiosity why this change? Did you have some trouble with mut?

Member Author

It's because I didn't have TestLoopEnv for event handling, and I also removed mut because it is a read-only call.

Contributor

@marcelo-gonzalez marcelo-gonzalez left a comment

Looks sane to me, thanks for fixing it! I'm just going to try to check the receipts logic change in compute_state_header() to make sure it still works with set_state_finalize(), because it feels tricky to me.



done = false;
}
}
let done = num_new_chunks.iter().all(|num_chunks| *num_chunks >= 2);
Contributor

Ahh, was this a bug before too? So this change will push out the sync hash one more block (with no missed chunks, the 4th block of the epoch instead of the 3rd). Was the previous one incorrect somewhere? The logic in that state sync header generation is a bit tricky; I could definitely have missed something there.

Contributor

Oh never mind, I see your message on Zulip; yeah, this is a good fix, thanks!

I think this was never caught because the state_sync.rs tests set up skipped chunks, but this won't be hit there because there's no resharding. And the resharding shard shuffling test doesn't skip chunks. So maybe it would be good to add some resharding tests with drop_chunks_by_height() or something.
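
For reference, a tiny self-contained sketch of the rule in the changed line, with num_new_chunks as a per-shard count of new (non-missed) chunks since the epoch start; the surrounding sync-hash selection is simplified away:

```rust
/// The sync block is only chosen once every shard has accumulated at least two
/// new chunks, so the block one step back still has one new chunk per shard.
fn is_sync_block(num_new_chunks: &[u64]) -> bool {
    num_new_chunks.iter().all(|n| *n >= 2)
}

fn main() {
    assert!(!is_sync_block(&[2, 1, 2])); // shard 1 still waiting for its 2nd chunk
    assert!(is_sync_block(&[2, 2, 3]));
}
```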

@marcelo-gonzalez
Contributor

nice, LGTM on the state sync stuff, although I am not super familiar with the chunk state witness code

@Longarithm Longarithm enabled auto-merge December 13, 2024 05:49

codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 88.28829% with 13 lines in your changes missing coverage. Please review.

Project coverage is 70.50%. Comparing base (c010dc0) to head (64b72a1).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...chain/src/stateless_validation/chunk_validation.rs | 80.85% | 0 Missing and 9 partials ⚠️ |
| chain/chain/src/chain.rs | 60.00% | 0 Missing and 2 partials ⚠️ |
| ...src/stateless_validation/state_witness_producer.rs | 75.00% | 0 Missing and 1 partial ⚠️ |
| tools/replay-archive/src/cli.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12598      +/-   ##
==========================================
+ Coverage   70.47%   70.50%   +0.02%     
==========================================
  Files         845      845              
  Lines      172024   172064      +40     
  Branches   172024   172064      +40     
==========================================
+ Hits       121241   121306      +65     
+ Misses      45672    45647      -25     
  Partials     5111     5111              
| Flag | Coverage Δ |
|---|---|
| backward-compatibility | 0.16% <0.00%> (-0.01%) ⬇️ |
| db-migration | 0.16% <0.00%> (-0.01%) ⬇️ |
| genesis-check | 1.29% <0.00%> (-0.01%) ⬇️ |
| linux | 69.38% <81.08%> (-0.01%) ⬇️ |
| linux-nightly | 70.04% <88.28%> (-0.01%) ⬇️ |
| pytests | 1.60% <0.00%> (-0.01%) ⬇️ |
| sanity-checks | 1.41% <0.00%> (-0.01%) ⬇️ |
| unittests | 70.33% <88.28%> (+0.02%) ⬆️ |
| upgradability | 0.20% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.


@Longarithm Longarithm added this pull request to the merge queue Dec 13, 2024
Merged via the queue into near:master with commit 9d3a7dc Dec 13, 2024
27 checks passed
@Longarithm Longarithm deleted the rv3-orx-bugs branch December 13, 2024 09:12