feat(bandwidth_scheduler) - include parent's receipts in bandwidth requests #12728
Conversation
When making a bandwidth request to a child shard which has been split from a parent shard, we have to include the receipts stored in the outgoing buffer to the parent shard in the bandwidth request for sending receipts to the child shard. Forwarding receipts from the buffer aimed at the parent uses bandwidth granted for sending receipts to one of the children. Not including the parent's receipts in the bandwidth request could lead to a situation where a receipt can't be sent because the grant for sending receipts to a child is too small to send out a receipt from the buffer aimed at the parent.
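In other words, when computing the request for a child, the sizes of receipts buffered toward the parent have to be counted too. A minimal sketch of that idea (the function and parameter names here are hypothetical, not the actual nearcore API):

```rust
/// Hypothetical sketch: compute the total size to request for a child shard
/// from the receipts buffered toward the child itself plus the receipts still
/// sitting in the outgoing buffer aimed at the pre-split parent, since
/// forwarding those consumes the child's bandwidth grant.
fn requested_bandwidth_to_child(
    child_buffer_sizes: impl Iterator<Item = u64>,
    parent_buffer_sizes: impl Iterator<Item = u64>,
) -> Option<u64> {
    // Parent receipts come first: resharding drains the parent buffer first
    // (see the chaining-order discussion below).
    let total: u64 = parent_buffer_sizes.chain(child_buffer_sizes).sum();
    // No receipts buffered toward either the child or the parent: no request.
    (total > 0).then_some(total)
}
```

The actual request format is richer than a single total, but the key point is that both iterators feed the same request.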
Implementing it was the easy part, now how do I test it...
Codecov Report. Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #12728      +/-   ##
==========================================
+ Coverage   70.72%   70.76%   +0.04%
==========================================
  Files         849      849
  Lines      174675   174818     +143
  Branches   174675   174818     +143
==========================================
+ Hits       123539   123716     +177
+ Misses      45957    45925      -32
+ Partials     5179     5177       -2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```diff
@@ -2053,6 +2053,7 @@ impl Runtime {
     let pending_delayed_receipts = processing_state.delayed_receipts;
     let processed_delayed_receipts = process_receipts_result.processed_delayed_receipts;
     let promise_yield_result = process_receipts_result.promise_yield_result;
+    let shard_layout = epoch_info_provider.shard_layout(&apply_state.epoch_id)?;
```
This might break replayability, but AFAIU we now support only the last two versions.
Why would it break replayability?
It's fine to have this here as long as we are not doing something different for past protocol versions.
> Why would it break replayability?

Previously the call to `epoch_info_provider.shard_layout()` was gated by `ProtocolFeature::SimpleNightshadeV4.enabled(protocol_version)`. I don't know what would happen on older protocol versions, would it throw an error?
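For context, the previously gated call presumably looked something like this (a sketch, not the exact removed code):

```rust
// Sketch of the old, feature-gated lookup (approximate reconstruction).
let shard_layout = if ProtocolFeature::SimpleNightshadeV4.enabled(protocol_version) {
    Some(epoch_info_provider.shard_layout(&apply_state.epoch_id)?)
} else {
    None
};
```

With the gate removed, the lookup now runs on every protocol version, which is what the replayability question is about.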
Yeah, I probably wouldn't be too worried 😅
LGTM but see the comment about chaining order
```rust
{
    let parent_receipt_sizes_iter =
        parent_metadata.iter_receipt_group_sizes(trie, side_effects);
    receipt_sizes_iter = Box::new(receipt_sizes_iter.chain(parent_receipt_sizes_iter));
}
```
I think resharding tries to empty the parent buffers first - (double check please) - so I would feel better to do it in the same order here. It may be important.
Agree, it might start to matter what exactly we fill into the bandwidth request, since it's based on the order.
Ah good catch, I wanted to have the parent buffer first but I did the opposite by mistake
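Presumably the fix is just flipping the chain so the parent's receipt sizes come first (a sketch of the corrected line):

```rust
// Parent buffer first, matching the order in which resharding drains buffers.
receipt_sizes_iter = Box::new(parent_receipt_sizes_iter.chain(receipt_sizes_iter));
```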
yeah, please see the testloop tests for resharding_v3, those should be usable for your case
Looks good, thanks!
I added a test.

The test actually found some bugs in the initial implementation. The main bug was that bandwidth requests were only generated for shards which had an outgoing buffer, meaning that there had to have been at least one receipt aimed at the shard in the past. New shards created after resharding didn't have an outgoing buffer, and because of that requests were not generated for them. The other bug was that the code short-circuited when there were no receipts in the outgoing buffer. That made sense before, but now we also have to check the receipts in the parent buffer before we can decide that there's no need to make a request, so I rewrote that logic.
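Roughly, the fixed request generation would look like this (a sketch; `OutgoingBuffers` and its helpers are hypothetical stand-ins, only `ShardLayout::shard_ids` matches the real API):

```rust
// Sketch with hypothetical buffer helpers, not the actual nearcore code.
fn generate_bandwidth_requests(
    shard_layout: &ShardLayout,
    buffers: &OutgoingBuffers,
) -> Vec<BandwidthRequest> {
    let mut requests = Vec::new();
    // Fix for bug 1: iterate over every shard in the current layout, not just
    // the shards that already happen to have an outgoing buffer.
    for shard_id in shard_layout.shard_ids() {
        // Fix for bug 2: only skip the request when both the shard's own
        // buffer and the buffer aimed at its pre-split parent are empty.
        if buffers.is_empty(shard_id) && buffers.parent_is_empty(shard_id) {
            continue;
        }
        requests.push(buffers.make_request(shard_id));
    }
    requests
}
```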
LGTM, thanks!
```rust
/// This test sends large (3MB) receipts from a stable shard to a shard that will be split into two.
/// These large receipts are buffered, and at the resharding boundary the stable shard's outgoing
/// buffer contains receipts to the shard that was split. Bandwidth requests to the child where the
/// receipts will be sent must include the receipts stored in the outgoing buffer to the parent shard,
/// otherwise there will be no bandwidth grants to send them.
```
Just to further my understanding, can you confirm the following?
The stable shard will contain a mix of receipts for both children. The parent buffer will be included in the request calculation for both child shards, and so the requests will be overestimated. That will continue until the parent buffer is emptied.
Yes that's correct 👍
```rust
let memtrie =
    get_memtrie_for_shard(&client_actor.client, &shard_uid, &tip.prev_block_hash);
```
Why get the memtrie for the prev block? Or is this the API?
If the latter IMO those methods should include something like "from_prev_block" in the name. It's all too easy to make an off by one error as is. Not related to your PR, just ranting ;)
The argument is called `prev_block_hash`:

```rust
pub fn get_memtrie_for_shard(
    client: &Client,
    shard_uid: &ShardUId,
    prev_block_hash: &CryptoHash,
) -> Trie {
    let state_root =
        *client.chain.get_chunk_extra(prev_block_hash, shard_uid).unwrap().state_root();
    // Here memtries will be used as long as client has memtries enabled.
    let memtrie = client
        .runtime_adapter
        .get_trie_for_shard(shard_uid.shard_id(), prev_block_hash, state_root, false)
        .unwrap();
    assert!(memtrie.has_memtries());
    memtrie
}
```
AFAIU we get the post-state root of the chunk included in the previous block, which means that it's the pre-state root of the chunk at the current height.
But tbh I just copied it from `check_buffered_receipts_exist_in_memtrie`. In this test being off by one doesn't really matter; what matters is that there are only receipts in the outgoing buffer to the parent shard.
```rust
let receipt_size = match receipt {
    Ok(ReceiptOrStateStoredReceipt::StateStoredReceipt(state_stored_receipt)) => {
        state_stored_receipt.metadata().congestion_size
    }
    _ => panic!("receipt is {:?}", receipt),
};
```
mini nit: This could also be written as `let <pattern> = receipt else { panic!(..) };`.
Not sure if it's any better but it may be a bit shorter.
I feel like `let else` works well with short patterns, but for longer ones I find it harder to read than a standard `match`. And here it would even be longer:

```rust
let receipt_size = if let Ok(
    ReceiptOrStateStoredReceipt::StateStoredReceipt(state_stored_receipt),
) = receipt
{
    state_stored_receipt.metadata().congestion_size
} else {
    panic!("receipt is {:?}", receipt)
};
```
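For completeness, the `let`-`else` form from the nit would be roughly this (a sketch; note the `&receipt` borrow, since a `let`-`else` else-arm cannot use a value that the pattern consumed):

```rust
// let-else sketch of the same logic; matching on &receipt keeps `receipt`
// available for the panic message in the else arm.
let Ok(ReceiptOrStateStoredReceipt::StateStoredReceipt(state_stored_receipt)) = &receipt
else {
    panic!("receipt is {:?}", receipt)
};
let receipt_size = state_stored_receipt.metadata().congestion_size;
```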
```rust
        let cur_epoch_estimated_end = cur_epoch_start + cur_epoch_length;
        cur_epoch_estimated_end
    }
    _ => tip.height + 99999999999999, // Not in the next epoch, set to infinity into the future
```
mini nit: maybe BlockHeight::MAX?
I initially did that, but then I had to add something to `estimated_resharding_height` and that'd overflow, so I changed it to something smaller, but still very far into the future. I think there is no more adding now, but I'd prefer to keep it the way it is; less likely to cause trouble in the future.
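The overflow concern, as a standalone illustration (`BlockHeight` is a `u64` alias in nearcore):

```rust
type BlockHeight = u64; // nearcore's BlockHeight alias

fn main() {
    // Using BlockHeight::MAX as the "never" sentinel means any later addition
    // overflows: it panics in debug builds and wraps silently in release.
    // let _ = BlockHeight::MAX + 10; // would overflow
    // A large-but-not-MAX sentinel leaves headroom for such arithmetic:
    let sentinel: BlockHeight = 99_999_999_999_999;
    let _later = sentinel + 10; // fine, still effectively "infinitely far"
}
```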
```rust
let cur_epoch_length =
    epoch_manager.get_epoch_config(&tip.epoch_id).unwrap().epoch_length;
let cur_epoch_estimated_end = cur_epoch_start + cur_epoch_length;
cur_epoch_estimated_end
```
Are you following the convention of resharding_block being the last block of the old shard layout?
Hmm you're right, `resharding_height` is the last block of the old epoch, but `cur_epoch_start + cur_epoch_length` is actually the first block of the new one. I'll change it to match `resharding_height`.
```rust
{
    for signer_id in &signer_ids {
        for receiver_id in &receiver_ids {
            // Send a 3MB cross-shard receipt from signer_id's shard to receiver_id's shard.
```
nit: Maybe move the contents to a helper method?
It'd have to be a function with 8 arguments :/
`call_burn_gas_contract` also has the tx sending logic inlined, I think it's good enough.
```rust
/// Checks status of the provided transactions. Panics if transaction result is an error.
/// Removes transactions that finished successfully from the list.
pub fn check_txs_remove_successful(txs: &Cell<Vec<(CryptoHash, u64)>>, client: &Client) {
```
nit: `u64` -> `BlockHeight`
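i.e. the signature would become (using nearcore's `near_primitives::types::BlockHeight` alias for `u64`):

```rust
use near_primitives::types::BlockHeight;

pub fn check_txs_remove_successful(txs: &Cell<Vec<(CryptoHash, BlockHeight)>>, client: &Client) {
    // body unchanged
}
```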