
feat(bandwidth_scheduler) - include parent's receipts in bandwidth requests #12728

Merged
merged 8 commits into master from jan_bandwidth_request_resharding on Jan 20, 2025

Conversation

jancionear
Contributor

When making a bandwidth request to a child shard which has been split from a parent shard, we have to include the receipts stored in the outgoing buffer to the parent shard in the bandwidth request for sending receipts to the child shard. Forwarding receipts from the buffer to the parent uses bandwidth granted for sending receipts to one of the children. Not including the parent receipts in the bandwidth request could lead to a situation where a receipt can't be sent because the grant for sending receipts to a child is too small to send out a receipt from the buffer aimed at the parent.
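To make the failure mode concrete, here is a toy illustration of the problem (all names and numbers below are made up for illustration; this is not the bandwidth scheduler API):

```rust
// Toy illustration (made-up names and numbers, not the nearcore bandwidth scheduler API).
// A receipt that is still buffered towards the *parent* shard is forwarded using the
// bandwidth granted for one of the *children*, so if the request for the child ignores
// the parent buffer, the resulting grant can be too small to ever send that receipt.
fn can_forward(granted_to_child: u64, receipt_size: u64) -> bool {
    receipt_size <= granted_to_child
}

fn main() {
    let parent_receipt: u64 = 3_000_000; // a 3 MB receipt buffered towards the parent shard

    // Request built only from the (empty) buffer to the child -> a grant that is too small.
    let grant_ignoring_parent: u64 = 100_000;
    assert!(!can_forward(grant_ignoring_parent, parent_receipt));

    // Request that also counts the parent-buffer receipt -> a grant large enough to forward it.
    let grant_including_parent: u64 = 3_000_000;
    assert!(can_forward(grant_including_parent, parent_receipt));
}
```

Including the sizes of the receipts buffered towards the parent when generating the request for the child avoids that stuck-receipt scenario.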

@jancionear jancionear requested a review from a team as a code owner January 13, 2025 16:47
@jancionear
Contributor Author

Implementing it was the easy part; now, how do I test it...
I guess I could modify some resharding test and put a lot of big receipts that require bandwidth requests in the outgoing buffer to the parent?

codecov bot commented Jan 13, 2025

Codecov Report

Attention: Patch coverage is 91.11111% with 16 lines in your changes missing coverage. Please review.

Project coverage is 70.76%. Comparing base (515b5fa) to head (af293e3).
Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ntegration-tests/src/test_loop/utils/resharding.rs | 93.57% | 5 Missing and 2 partials ⚠️ |
| ...egration-tests/src/test_loop/utils/transactions.rs | 70.58% | 4 Missing and 1 partial ⚠️ |
| runtime/runtime/src/congestion_control.rs | 95.65% | 1 Missing and 1 partial ⚠️ |
| runtime/runtime/src/lib.rs | 33.33% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12728      +/-   ##
==========================================
+ Coverage   70.72%   70.76%   +0.04%     
==========================================
  Files         849      849              
  Lines      174675   174818     +143     
  Branches   174675   174818     +143     
==========================================
+ Hits       123539   123716     +177     
+ Misses      45957    45925      -32     
+ Partials     5179     5177       -2     
| Flag | Coverage Δ |
|---|---|
| backward-compatibility | 0.16% <0.00%> (-0.01%) ⬇️ |
| db-migration | 0.16% <0.00%> (-0.01%) ⬇️ |
| genesis-check | 1.35% <0.00%> (-0.01%) ⬇️ |
| linux | 69.15% <11.66%> (-0.04%) ⬇️ |
| linux-nightly | 70.35% <91.11%> (+0.02%) ⬆️ |
| pytests | 1.64% <0.00%> (-0.01%) ⬇️ |
| sanity-checks | 1.46% <0.00%> (-0.01%) ⬇️ |
| unittests | 70.60% <91.11%> (+0.04%) ⬆️ |
| upgradability | 0.20% <0.00%> (-0.01%) ⬇️ |


@@ -2053,6 +2053,7 @@ impl Runtime {
         let pending_delayed_receipts = processing_state.delayed_receipts;
         let processed_delayed_receipts = process_receipts_result.processed_delayed_receipts;
         let promise_yield_result = process_receipts_result.promise_yield_result;
+        let shard_layout = epoch_info_provider.shard_layout(&apply_state.epoch_id)?;
Contributor Author

This might break replayability, but AFAIU we now support only the last two versions.

Contributor

Why would it break replayability?

Contributor

It's fine to have this here as long as we are not doing something different for past protocol versions.

Contributor Author

> Why would it break replayability?

Previously the call to epoch_info_provider.shard_layout() was gated by ProtocolFeature::SimpleNightshadeV4.enabled(protocol_version). I don't know what would happen on older protocol versions; would it throw an error?

Contributor

Yeah, I probably wouldn't be too worried 😅

Contributor

@wacban wacban left a comment

LGTM but see the comment about chaining order

{
    let parent_receipt_sizes_iter =
        parent_metadata.iter_receipt_group_sizes(trie, side_effects);
    receipt_sizes_iter = Box::new(receipt_sizes_iter.chain(parent_receipt_sizes_iter));
Contributor

I think resharding tries to empty the parent buffers first (double check please), so I would feel better doing it in the same order here. It may be important.

Contributor

Agree, it might start to matter what exactly we fill into the bandwidth request, since it's based on the order.

Contributor Author

Ah, good catch. I wanted to have the parent buffer first, but I did the opposite by mistake.
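For illustration, a minimal sketch of the ordering change discussed above, using plain Vecs instead of the boxed iterators from the real snippet (chain_request_sizes is a made-up helper name):

```rust
// Minimal sketch of the ordering fix (simplified types; the real code chains boxed
// iterators over receipt group sizes rather than Vecs).
fn chain_request_sizes(child_sizes: Vec<u64>, parent_sizes: Vec<u64>) -> Vec<u64> {
    // Before the fix: child buffer first, parent buffer appended at the end.
    // child_sizes.into_iter().chain(parent_sizes).collect()

    // After the fix: parent buffer first, matching the order in which resharding
    // is expected to drain the buffers (parent buffer before the child buffer).
    parent_sizes.into_iter().chain(child_sizes).collect()
}
```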


@wacban
Contributor

wacban commented Jan 14, 2025

> Implementing it was the easy part; now, how do I test it... I guess I could modify some resharding test and put a lot of big receipts that require bandwidth requests in the outgoing buffer to the parent?

Yeah, please see the testloop tests for resharding_v3; those should be usable for your case.

Contributor

@shreyan-gupta shreyan-gupta left a comment

Looks good, thanks!

@jancionear
Contributor Author

I added a resharding_v3 test which sends 3MB receipts from one shard to a shard that is split into two children. The 3MB receipts are buffered in the buffer aimed at the parent shard, and bandwidth request generation needs to generate the proper requests to the children in order to forward them. There are no receipts in the buffers to the children, to ensure that the requests to the children take into account the buffer to the parent.

The situation looks like this:

new block #10 shards: [5, 3, 6] chunk mask [true, true, true] block hash QnHFtFZPfSGvQHxVB9G1JAhq4ZD1G4sibtA1kdAUy1D epoch id 2xnDZ56HJtq4ZkyYtqjcRoHyb6FdtMdVCAh46CUBFHhv
 1.129s  INFO test: outgoing buffers from shard 5: {}
 1.129s  INFO test: outgoing buffers from shard 3: {}
 1.129s  INFO test: outgoing buffers from shard 6: {}
 1.129s  INFO test: Sending 3MB receipt from account1 to account4. tx_hash: GqzwERTKRzRwAXfuSm1A4ywGqTxySRnNw6sEyn2MFNMD
 1.129s  INFO test: Sending 3MB receipt from account1 to account6. tx_hash: 6sPB9k96Rh7P8GsXT1jtsoLe7DnQbNuTJYa6GjJiqrWq
new block #11 shards: [5, 3, 6] chunk mask [true, true, true] block hash 37gf65nGFyumvuWL53uaViCgu7t9xHGKtjPDgf4bW3PH epoch id 2xnDZ56HJtq4ZkyYtqjcRoHyb6FdtMdVCAh46CUBFHhv
 1.203s  INFO test: outgoing buffers from shard 5: {}
 1.203s  INFO test: outgoing buffers from shard 3: {}
 1.203s  INFO test: outgoing buffers from shard 6: {}
 1.204s  INFO test: Sending 3MB receipt from account1 to account4. tx_hash: 6ULGwQoMeg6puNTqBK9pfqrPSKgk6Hjdag2bi3wWWLFz
 1.204s  INFO test: Sending 3MB receipt from account1 to account6. tx_hash: J5p1X9NGrUFUarbB3KNm3zjn2bqcJkUgRbhxWEYit85c
new block #12 shards: [5, 3, 6] chunk mask [true, true, true] block hash 4zYVCVQJjS8Dw9tPSRXwhXnkoU4DufmJAhUwhstSAi7r epoch id 2xnDZ56HJtq4ZkyYtqjcRoHyb6FdtMdVCAh46CUBFHhv
 1.380s  INFO test: outgoing buffers from shard 5: {}
 1.380s  INFO test: outgoing buffers from shard 3: {}
 1.380s  INFO test: outgoing buffers from shard 6: {}
 1.380s  INFO test: Sending 3MB receipt from account1 to account4. tx_hash: APiSRseffDPizHykMtbrU9NsNVuLeM7aMawU8x3gfYeT
 1.380s  INFO test: Sending 3MB receipt from account1 to account6. tx_hash: Hyvop8jvYfyUEui3dgfnGdQd6VwpXZX63pZos8Vh9PLz
new block #13 shards: [5, 3, 6] chunk mask [true, true, true] block hash AZPcFxXtF97cCL7gDEVG4QYFVUQmeoLWu3TYw2AYvig5 epoch id 2xnDZ56HJtq4ZkyYtqjcRoHyb6FdtMdVCAh46CUBFHhv
 1.932s  INFO test: outgoing buffers from shard 5: {}
 1.933s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB, 3.0 MB]}
 1.933s  INFO test: outgoing buffers from shard 6: {}
new block #14 shards: [5, 3, 6] chunk mask [true, true, true] block hash 9qzQqmYPVgcck3T74HbyLjNPk8u4W1rx6BMzMZ1rU8e epoch id 2xnDZ56HJtq4ZkyYtqjcRoHyb6FdtMdVCAh46CUBFHhv
 2.759s  INFO test: outgoing buffers from shard 5: {}
 2.759s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB]}
 2.759s  INFO test: outgoing buffers from shard 6: {}
new block #15 shards: [5, 3, 6] chunk mask [true, true, true] block hash yWrM9DLwivyW97thyBX8VAjVfpkTTEEYuLUQzNCdNZa epoch id 2xnDZ56HJtq4ZkyYtqjcRoHyb6FdtMdVCAh46CUBFHhv
 3.285s  INFO test: outgoing buffers from shard 5: {}
 3.287s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB, 3.0 MB, 3.0 MB, 3.0 MB]}
 3.287s  INFO test: outgoing buffers from shard 7: {}
 3.287s  INFO test: outgoing buffers from shard 8: {}
new block #16 shards: [5, 3, 7, 8] chunk mask [true, true, true, true] block hash HSJ6nUBoSWRzaVkuLiT17TcfeXx7NJKKwzyyzS1VovNG epoch id 4N2CxMwS3G7uyYjXRPPNGgHzgxp1HZkJ3r8RZ1zYQAiU
 3.480s  INFO test: outgoing buffers from shard 5: {}
 3.482s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB, 3.0 MB, 3.0 MB, 3.0 MB]}
 3.482s  INFO test: outgoing buffers from shard 7: {}
 3.482s  INFO test: outgoing buffers from shard 8: {}
new block #17 shards: [5, 3, 7, 8] chunk mask [true, true, true, true] block hash 7zr4maTR9BTzHU9uU6ciphy3UAxxN5WXpw58paHCapLQ epoch id 4N2CxMwS3G7uyYjXRPPNGgHzgxp1HZkJ3r8RZ1zYQAiU
 3.955s  INFO test: outgoing buffers from shard 5: {}
 3.957s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB, 3.0 MB, 3.0 MB, 3.0 MB]}
 3.957s  INFO test: outgoing buffers from shard 7: {}
 3.957s  INFO test: outgoing buffers from shard 8: {}
new block #18 shards: [5, 3, 7, 8] chunk mask [true, true, true, true] block hash 9BQoNysb5fB8pf7cHfdnU5gfKjmwgnQodG3HZKVfC9Ae epoch id 4N2CxMwS3G7uyYjXRPPNGgHzgxp1HZkJ3r8RZ1zYQAiU
 4.731s  INFO test: outgoing buffers from shard 5: {}
 4.732s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB, 3.0 MB, 3.0 MB]}
 4.732s  INFO test: outgoing buffers from shard 7: {}
 4.732s  INFO test: outgoing buffers from shard 8: {}
new block #19 shards: [5, 3, 7, 8] chunk mask [true, true, true, true] block hash A9knKwbPMujvHzd1qGjXBgPKvB6Et9orRT1ejdfGdWZy epoch id 4N2CxMwS3G7uyYjXRPPNGgHzgxp1HZkJ3r8RZ1zYQAiU
 5.515s  INFO test: outgoing buffers from shard 5: {}
 5.516s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB, 3.0 MB]}
 5.516s  INFO test: outgoing buffers from shard 7: {}
 5.516s  INFO test: outgoing buffers from shard 8: {}
new block #20 shards: [5, 3, 7, 8] chunk mask [true, true, true, true] block hash AefEKam8WJ98qvLyrSczMXTGWwJuZQpsnwczWE9g68mX epoch id 4N2CxMwS3G7uyYjXRPPNGgHzgxp1HZkJ3r8RZ1zYQAiU
 6.293s  INFO test: outgoing buffers from shard 5: {}
 6.293s  INFO test: outgoing buffers from shard 3: {6: [3.0 MB]}
 6.293s  INFO test: outgoing buffers from shard 7: {}
 6.293s  INFO test: outgoing buffers from shard 8: {}
 new block #21 shards: [5, 3, 7, 8] chunk mask [true, true, true, true] block hash 2Zyb5eG975zvjSMeNofEZbYvLpSfucfAkJfy7Ar4DPHB epoch id 4N2CxMwS3G7uyYjXRPPNGgHzgxp1HZkJ3r8RZ1zYQAiU
 6.669s  INFO test: outgoing buffers from shard 5: {}
 6.669s  INFO test: outgoing buffers from shard 3: {}
 6.669s  INFO test: outgoing buffers from shard 7: {}
 6.669s  INFO test: outgoing buffers from shard 8: {}

The test actually found some bugs in the initial implementation. The main bug was that bandwidth requests were only generated for shards which already had an outgoing buffer, i.e. there had to have been at least one receipt aimed at the shard in the past. New shards created after resharding didn't have an outgoing buffer, so no requests were generated for them. The other bug was that the code short-circuited when there were no receipts in the outgoing buffer. That made sense before, but now we also have to check the receipts in the parent buffer before we can decide that there's no need to make a request.

I rewrote generate_bandwidth_request to better reflect the new logic.
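A loose sketch of the reworked flow after those two fixes (illustrative types and names only, not the actual generate_bandwidth_request signature):

```rust
use std::collections::BTreeMap;

type ShardId = u64;

// Loose sketch of the reworked flow (illustrative types/names, not the real
// generate_bandwidth_request): requests are generated for every shard in the
// current layout, even ones without an outgoing buffer yet, and a request is
// skipped only when neither the shard's own buffer nor the parent's buffer
// (after a resharding) contributes any receipts.
fn generate_requests_sketch(
    layout_shards: &[ShardId],
    buffers: &BTreeMap<ShardId, Vec<u64>>,  // receipt sizes per outgoing buffer
    parent_of: &BTreeMap<ShardId, ShardId>, // child shard -> parent shard after a split
) -> BTreeMap<ShardId, Vec<u64>> {
    let mut requests = BTreeMap::new();
    for &shard in layout_shards {
        // Fix for bug 1: iterate over the shards in the layout instead of over the
        // existing buffers, so freshly created child shards are not skipped.
        let own = buffers.get(&shard).cloned().unwrap_or_default();
        // Fix for bug 2: even when `own` is empty, receipts buffered towards the
        // parent shard may have to be forwarded to this child, so count them too.
        let parent = parent_of
            .get(&shard)
            .and_then(|p| buffers.get(p))
            .cloned()
            .unwrap_or_default();
        let sizes: Vec<u64> = parent.into_iter().chain(own).collect();
        if !sizes.is_empty() {
            requests.insert(shard, sizes);
        }
    }
    requests
}
```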

Contributor

@wacban wacban left a comment

LGTM, thanks!

Comment on lines +1031 to +1035
/// This test sends large (3MB) receipts from a stable shard to a shard that will be split into two.
/// These large receipts are buffered, and at the resharding boundary the stable shard's outgoing
/// buffer contains receipts to the shard that was split. Bandwidth requests to the child where the
/// receipts will be sent must include the receipts stored in the outgoing buffer to the parent shard,
/// otherwise there will be no bandwidth grants to send them.
Contributor

Just to further my understanding, can you confirm the following?

The stable shard will contain a mix of receipts for both children. The parent buffer will be included in the request calculation for both child shards, and so the requests will be overestimated. That will continue until the parent buffer is emptied.

Contributor Author

Yes that's correct 👍

Comment on lines +375 to +376
let memtrie =
    get_memtrie_for_shard(&client_actor.client, &shard_uid, &tip.prev_block_hash);
Contributor

Why get the memtrie for the prev block? Or is this the API?

If the latter, IMO those methods should include something like "from_prev_block" in the name. It's all too easy to make an off-by-one error as is. Not related to your PR, just ranting ;)

Contributor Author

The argument is called prev_block_hash:

pub fn get_memtrie_for_shard(
    client: &Client,
    shard_uid: &ShardUId,
    prev_block_hash: &CryptoHash,
) -> Trie {
    let state_root =
        *client.chain.get_chunk_extra(prev_block_hash, shard_uid).unwrap().state_root();

    // Here memtries will be used as long as client has memtries enabled.
    let memtrie = client
        .runtime_adapter
        .get_trie_for_shard(shard_uid.shard_id(), prev_block_hash, state_root, false)
        .unwrap();
    assert!(memtrie.has_memtries());
    memtrie
}

AFAIU we get the post-state root of the chunk included in the previous block, which means that it's the pre-state root of the chunk at the current height.
But tbh I just copied it from check_buffered_receipts_exist_in_memtrie. In this test being off by one doesn't really matter; what matters is that there are only receipts in the outgoing buffer to the parent shard.

Comment on lines +381 to +386
let receipt_size = match receipt {
    Ok(ReceiptOrStateStoredReceipt::StateStoredReceipt(
        state_stored_receipt,
    )) => state_stored_receipt.metadata().congestion_size,
    _ => panic!("receipt is {:?}", receipt),
};
Contributor

mini nit: This could also be written as a `let <pattern> = receipt else { panic!(..) };`. Not sure if it's any better but it may be a bit shorter.

Contributor Author

@jancionear jancionear Jan 20, 2025

I feel like let else works well with short patterns, but for longer ones I find it harder to read than a standard match.

And here it would actually end up longer:

let receipt_size = if let Ok(
    ReceiptOrStateStoredReceipt::StateStoredReceipt(state_stored_receipt),
) = receipt
{
    state_stored_receipt.metadata().congestion_size
} else {
    panic!("receipt is {:?}", receipt)
};

            let cur_epoch_estimated_end = cur_epoch_start + cur_epoch_length;
            cur_epoch_estimated_end
        }
        _ => tip.height + 99999999999999, // Not in the next epoch, set to infinity into the future
Contributor

mini nit: maybe BlockHeight::MAX?

Contributor Author

I initially did that, but then I had to add something to estimated_resharding_height and that would overflow, so I changed it to something smaller but still very far into the future. I think there is no more adding now, but I'd prefer to keep it the way it is; it's less likely to cause trouble in the future.
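To spell out the overflow concern, a tiny sketch assuming BlockHeight is the usual u64 alias (as in near_primitives):

```rust
type BlockHeight = u64; // assumption: BlockHeight is an alias for u64, as in near_primitives

fn main() {
    let tip_height: BlockHeight = 1_000;

    // The approach used in the PR: very far in the future, but later additions still fit.
    let far_future = tip_height + 99999999999999;
    assert!(far_future.checked_add(10).is_some());

    // With BlockHeight::MAX, any later addition overflows (a panic in debug builds).
    assert!(BlockHeight::MAX.checked_add(10).is_none());
}
```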

            let cur_epoch_length =
                epoch_manager.get_epoch_config(&tip.epoch_id).unwrap().epoch_length;
            let cur_epoch_estimated_end = cur_epoch_start + cur_epoch_length;
            cur_epoch_estimated_end
Contributor

Are you following the convention of resharding_block being the last block of the old shard layout?

Contributor Author

@jancionear jancionear Jan 20, 2025

Hmm you're right, resharding_height is the last block of the old epoch, but cur_epoch_start + cur_epoch_length is actually the first block of the new one. I'll change it to match resharding_height.

{
    for signer_id in &signer_ids {
        for receiver_id in &receiver_ids {
            // Send a 3MB cross-shard receipt from signer_id's shard to receiver_id's shard.
Contributor

nit: Maybe move the contents to a helper method?

Contributor Author

It'd have to be a function with 8 arguments :/
call_burn_gas_contract also has the tx sending logic inlined; I think it's good enough.


/// Checks status of the provided transactions. Panics if transaction result is an error.
/// Removes transactions that finished successfully from the list.
pub fn check_txs_remove_successful(txs: &Cell<Vec<(CryptoHash, u64)>>, client: &Client) {
Contributor

nit: u64->BlockHeight

@jancionear jancionear added this pull request to the merge queue Jan 20, 2025
Merged via the queue into master with commit b12af57 Jan 20, 2025
28 checks passed
@jancionear jancionear deleted the jan_bandwidth_request_resharding branch January 20, 2025 15:23