Smarter block deadlines for speculative blocks #1481

heifner · 2023-08-07T16:32:27Z

Change: Start speculative blocks every block interval but not more often. Fixes a corner case of rapid start-block/abort block burst for speculative blocks. Also create speculative blocks even when state is stale (over 5 seconds old).

Before this PR, block deadline was set for speculative blocks the same as production blocks for block production time. Because of when blocks are processed this could result in small block speculative windows. When a block came in close to deadline it would cause us to repeatedly start block with a very small deadline, abort block because of deadline, start another block and immediately abort again until the block time window was passed.

In addition, speculative blocks were not started on "stale" state (head block older than 5 seconds). Note this would be the case if a block producer was down and missed a production round (6 seconds).

The conditions where the stale condition would most likely be triggered (block producer being down or node disconnected from network for some time) are situations in which it most likely does undesirable things for both read-only transactions and non-read-only transactions. It unnecessarily delays a read-only transaction (and dry-run transaction) that may not care that they are potentially getting back stale data. And if they did care, they could protect themselves by setting the expiration appropriately.

It also causes similar problems with write transactions. If the stale condition is due to a block producer being down rather than the node being disconnected from the network, with this PR changes, a successfully executed write transaction can propagate through the network and reach a live block producer that queues up the transaction for when it is its turn to produce a block. Before this PR, the write transaction is simply queued up to be speculatively executed once a new block arrives. If the expiration time of the transaction was set low, there is a good chance the transaction just fails due to expiration by that point.

But it isn't all beneficial. There is a downside with this PR to write transactions, if the stale condition is due to a temporary network disruption (say for 10 seconds). In that case, queuing the write transaction for when the network is restored makes it more likely for that transaction to get into the block (assuming using the normal send_transaction endpoint without retries). But with this PR, the network disruption leads to the write transaction being silently dropped despite a successful trace being returned to the client. This is because the transaction is speculatively executed and attempts to send out the transaction which would be dropped if there are no currently available connected peers.

Includes refactor of producer_watermarks, calculate_producer_wake_up_time, calculate_next_block_slot into block_timing_util. No change to functionality. Created new tests around the moved functions.

Resolves #1432

…, calculate_next_block_slot into block_timing_util

…or can produce.

plugins/producer_plugin/producer_plugin.cpp

plugins/producer_plugin/test/test_block_timing_util.cpp

plugins/producer_plugin/include/eosio/producer_plugin/block_timing_util.hpp

greg7mdp · 2023-08-10T13:54:08Z

Speculative blocks on non-producer nodes will not restart unless a new block is received or it is exhausted. Speculative blocks on producing nodes will not restart until a production window is reached or it is exhausted.

What is the rationale for these two changes?

greg7mdp · 2023-08-10T13:57:39Z

plugins/producer_plugin/producer_plugin.cpp

+                                                                          _producer_watermarks);
+      _pending_block_deadline = wake_time ? *wake_time : fc::time_point::maximum();
+   } else {
+      _pending_block_deadline = fc::time_point::maximum();


I wonder if it should be:

_pending_block_deadline = chain.head_block_time() + fc::seconds(5);

To be consistent with line 1883:

if (in_speculating_mode()) { auto head_block_age = now - chain.head_block_time(); if (head_block_age > fc::seconds(5)) return start_block_result::waiting_for_block; }

Excellent point. I wrote up a paragraph why it shouldn't correspond to the other 5 seconds and in doing so convinced myself of the opposite. Thanks.

I do wonder the wisdom of the current 5 second limit on speculative block production. The idea is that you don't want to speculative execute on too old of state. However, if a block producer misses a round then this node will stop processing trxs and wait for a block. Since everyone is doing that, then all trxs are held up until a block producer produces a block and sends it along.

I wonder if the new IF provides us with a different liveliness we could use to help make better determination. If we know that a series of blocks were not produced we could safely speculative execute trxs on older state.

If the node is connected to multiple peers, shouldn't we assume that its state will be up-to-date, as it will receive blocks when they are produced? So I don't see the rationale to assume that our state is old vs the whole chain (again provided that we are connected to multiple peers).

See updated PR description.

plugins/producer_plugin/include/eosio/producer_plugin/block_timing_util.hpp

…dule

…spond to existing limit of 5secs on speculative execution on stale state.

heifner · 2023-08-11T17:03:40Z

Speculative blocks on non-producer nodes will not restart unless a new block is received or it is exhausted. Speculative blocks on producing nodes will not restart until a production window is reached or it is exhausted.

What is the rationale for these two changes?

Before this PR, block deadline was set for speculative blocks the same as production blocks for block production time. Because of when blocks are processed this could result in small block speculative windows. When a block came in close to deadline it would cause us to repeatedly start block with a very small deadline, abort block because of deadline, start another block and immediately abort again until the block time window was passed.

greg7mdp · 2023-08-11T18:24:06Z

Please update comment at line 47:
// approach, the block time points would become -500, -100, 300, 700, 1200 ...

1200 -> 1100

plugins/producer_plugin/include/eosio/producer_plugin/block_timing_util.hpp

…figured with producers

linh2931 · 2023-08-14T21:39:06Z

plugins/producer_plugin/test/test_block_timing_util.cpp

+         BOOST_CHECK_EQUAL(calculate_producer_wake_up_time(full_cpu_effort, 2, block_timestamp, producers, active_schedule, empty_watermarks), block_time);
+      }
+   }
+   { // We have all producers in active_schedule in active_schedule of 21 (plus a couple of extra producers configured), we should produce every block


Very minor: change in active_schedule in active_schedule to in active_schedule

heifner added 5 commits August 5, 2023 13:18

GH-1432 Refactor producer_watermarks, calculate_producer_wake_up_time…

dc4adfa

…, calculate_next_block_slot into block_timing_util

GH-1432 Add some initial tests for calculate_producer_wake_up_time

2bb1e0d

GH-1432 Set _pending_block_deadline according to if we are producing …

7a3801a

…or can produce.

GH-1432 Add additional tests including watermark

e2247da

GH-1432 Wake up time is start block time already.

6dcb96e

heifner requested review from linh2931 and greg7mdp August 7, 2023 16:32