Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smarter block deadlines for speculative blocks #1481

Merged
merged 12 commits into from
Aug 15, 2023
Merged

Conversation

heifner
Copy link
Member

@heifner heifner commented Aug 7, 2023

Change: Start speculative blocks every block interval but not more often. Fixes a corner case of rapid start-block/abort block burst for speculative blocks. Also create speculative blocks even when state is stale (over 5 seconds old).

Before this PR, block deadline was set for speculative blocks the same as production blocks for block production time. Because of when blocks are processed this could result in small block speculative windows. When a block came in close to deadline it would cause us to repeatedly start block with a very small deadline, abort block because of deadline, start another block and immediately abort again until the block time window was passed.

In addition, speculative blocks were not started on "stale" state (head block older than 5 seconds). Note this would be the case if a block producer was down and missed a production round (6 seconds).

The conditions where the stale condition would most likely be triggered (block producer being down or node disconnected from network for some time) are situations in which it most likely does undesirable things for both read-only transactions and non-read-only transactions. It unnecessarily delays a read-only transaction (and dry-run transaction) that may not care that they are potentially getting back stale data. And if they did care, they could protect themselves by setting the expiration appropriately.

It also causes similar problems with write transactions. If the stale condition is due to a block producer being down rather than the node being disconnected from the network, with this PR changes, a successfully executed write transaction can propagate through the network and reach a live block producer that queues up the transaction for when it is its turn to produce a block. Before this PR, the write transaction is simply queued up to be speculatively executed once a new block arrives. If the expiration time of the transaction was set low, there is a good chance the transaction just fails due to expiration by that point.

But it isn't all beneficial. There is a downside with this PR to write transactions, if the stale condition is due to a temporary network disruption (say for 10 seconds). In that case, queuing the write transaction for when the network is restored makes it more likely for that transaction to get into the block (assuming using the normal send_transaction endpoint without retries). But with this PR, the network disruption leads to the write transaction being silently dropped despite a successful trace being returned to the client. This is because the transaction is speculatively executed and attempts to send out the transaction which would be dropped if there are no currently available connected peers.

Includes refactor of producer_watermarks, calculate_producer_wake_up_time, calculate_next_block_slot into block_timing_util. No change to functionality. Created new tests around the moved functions.

Resolves #1432

@heifner heifner requested review from linh2931 and greg7mdp August 7, 2023 16:32
@greg7mdp
Copy link
Contributor

Speculative blocks on non-producer nodes will not restart unless a new block is received or it is exhausted. Speculative blocks on producing nodes will not restart until a production window is reached or it is exhausted.

What is the rationale for these two changes?

_producer_watermarks);
_pending_block_deadline = wake_time ? *wake_time : fc::time_point::maximum();
} else {
_pending_block_deadline = fc::time_point::maximum();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it should be:

_pending_block_deadline = chain.head_block_time() + fc::seconds(5);

To be consistent with line 1883:

   if (in_speculating_mode()) {
      auto head_block_age = now - chain.head_block_time();
      if (head_block_age > fc::seconds(5))
         return start_block_result::waiting_for_block;
   }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point. I wrote up a paragraph why it shouldn't correspond to the other 5 seconds and in doing so convinced myself of the opposite. Thanks.

I do wonder the wisdom of the current 5 second limit on speculative block production. The idea is that you don't want to speculative execute on too old of state. However, if a block producer misses a round then this node will stop processing trxs and wait for a block. Since everyone is doing that, then all trxs are held up until a block producer produces a block and sends it along.

I wonder if the new IF provides us with a different liveliness we could use to help make better determination. If we know that a series of blocks were not produced we could safely speculative execute trxs on older state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the node is connected to multiple peers, shouldn't we assume that its state will be up-to-date, as it will receive blocks when they are produced? So I don't see the rationale to assume that our state is old vs the whole chain (again provided that we are connected to multiple peers).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See updated PR description.

@heifner heifner added the OCI Work exclusive to OCI team label Aug 10, 2023
@heifner
Copy link
Member Author

heifner commented Aug 11, 2023

Speculative blocks on non-producer nodes will not restart unless a new block is received or it is exhausted. Speculative blocks on producing nodes will not restart until a production window is reached or it is exhausted.

What is the rationale for these two changes?

Before this PR, block deadline was set for speculative blocks the same as production blocks for block production time. Because of when blocks are processed this could result in small block speculative windows. When a block came in close to deadline it would cause us to repeatedly start block with a very small deadline, abort block because of deadline, start another block and immediately abort again until the block time window was passed.

@heifner heifner requested review from greg7mdp and linh2931 August 11, 2023 17:20
@greg7mdp
Copy link
Contributor

greg7mdp commented Aug 11, 2023

Please update comment at line 47:
// approach, the block time points would become -500, -100, 300, 700, 1200 ...

1200 -> 1100

BOOST_CHECK_EQUAL(calculate_producer_wake_up_time(full_cpu_effort, 2, block_timestamp, producers, active_schedule, empty_watermarks), block_time);
}
}
{ // We have all producers in active_schedule in active_schedule of 21 (plus a couple of extra producers configured), we should produce every block
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor: change in active_schedule in active_schedule to in active_schedule

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speculative block repeatedly restarted when near block boundary
3 participants