IF: Update finalizer safety information and rules for how a finalizer signs #2070

Closed
Tracked by #2110
arhag opened this issue Jan 10, 2024 · 3 comments · Fixed by #2135
@arhag
Member

arhag commented Jan 10, 2024

The finalizer safety information should be updated to track the following data for each finalizer:

  • The range of timestamps covered by their last vote.
  • The block ID of the last block that they voted on.
  • The block ID of the block that they are locked on.

Whenever a finalizer signs a new block, the finalizer safety information must be updated and durably committed before propagating the signature in a vote message.

Whenever a finalizer considers a new block to sign, it must consult its existing finalizer safety information to determine whether it should sign, and if so whether it should sign strongly or weakly. The rules for this are captured in the pseudo-code:

VoteDecision decide_vote(finalizer_safety_information& fsi, block_handle p) {
    bool monotony_check = false;
    bool safety_check   = false;
    bool liveness_check = false;

    b_phases = get_qc_chain(p);
    b2 = b_phases[2]; // first phase, prepare
    b1 = b_phases[1]; // second phase, precommit
    b  = b_phases[0]; // third phase, commit

    if (fsi.last_vote_block_ref != sha256.empty()) {
        if (p.timestamp > fork_db.get_block_by_id(fsi.last_vote_block_ref).timestamp) {
            monotony_check = true;
        }
    }
    else monotony_check = true; // if I have never voted on a proposal, the protocol feature just activated and we can proceed

    if (fsi.locked_block_ref != sha256.empty()) {
        // Safety check: does this proposal extend the proposal we're locked on?
        if (extends(p, fork_db.get_block_by_id(fsi.locked_block_ref))) safety_check = true;

        // Liveness check: is this proposal's justification newer (by timestamp) than the proposal I'm locked on?
        // This allows restoration of liveness if a replica is locked on a stale proposal.
        if (fork_db.get_block_by_height(p.block_id, p.last_qc_block_height).timestamp > fork_db.get_block_by_id(fsi.locked_block_ref).timestamp) liveness_check = true;
    }
    else {
        // if we're not locked on anything, the protocol feature just activated and we can proceed
        liveness_check = true;
        safety_check   = true;
    }

    if (monotony_check && (liveness_check || safety_check)) {
        uint32_t requested_vote_range_lower_bound = fork_db.get_block_by_height(p.block_id, p.last_qc_block_height).timestamp;
        uint32_t requested_vote_range_upper_bound = p.timestamp;

        bool time_range_interference = fsi.last_vote_range_lower_bound < requested_vote_range_upper_bound && requested_vote_range_lower_bound < fsi.last_vote_range_upper_bound;
        // my last vote was on (t9, t10_1], I'm asked to vote on t10 : t9 < t10 && t9 < t10_1;  // time_range_interference == true, correct
        // my last vote was on (t9, t10_1], I'm asked to vote on t11 : t9 < t11 && t10 < t10_1; // time_range_interference == false, correct
        // my last vote was on (t7, t9],    I'm asked to vote on t10 : t7 < t10 && t9 < t9;     // time_range_interference == false, correct

        bool enough_for_strong_vote = false;
        if (!time_range_interference || extends(p, fork_db.get_block_by_id(fsi.last_vote_block_ref))) enough_for_strong_vote = true;

        // fsi.is_last_vote_strong = enough_for_strong_vote;
        fsi.last_vote_block_ref = p.block_id; // v_height
        if (b1.timestamp > fork_db.get_block_by_id(fsi.locked_block_ref).timestamp) fsi.locked_block_ref = b1.block_id; // commit phase on b1
        fsi.last_vote_range_lower_bound = requested_vote_range_lower_bound;
        fsi.last_vote_range_upper_bound = requested_vote_range_upper_bound;

        if (enough_for_strong_vote) return VoteDecision::StrongVote;
        else return VoteDecision::WeakVote;
    }
    else return VoteDecision::NoVote;
}
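The strong/weak decision in the pseudo-code hinges on whether two half-open time intervals overlap. Below is a minimal sketch of just that test, reproducing the three cases from the comments. The names `vote_range` and `ranges_interfere` are illustrative and not Leap identifiers, and timestamps are modelled as plain integers.

```cpp
#include <cstdint>

// Hypothetical stand-in for the (lower, upper] time range covered by a vote.
struct vote_range {
    std::uint32_t lower; // exclusive: timestamp of the block claimed by the last QC
    std::uint32_t upper; // inclusive: timestamp of the block being voted on
};

// Two half-open ranges (a.lower, a.upper] and (b.lower, b.upper] overlap
// exactly when each one starts before the other one ends.
constexpr bool ranges_interfere(const vote_range& last, const vote_range& requested) {
    return last.lower < requested.upper && requested.lower < last.upper;
}
```

With integer timestamps, a last vote on (9, 10] interferes with a request for (9, 10], but not with a request for (10, 11]. A last vote on (7, 9] does not interfere with a request for (9, 10].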

Related to #2069.

For this issue, update the finalizer signing process to consider the changes described above and in the pseudo-code. When signing weakly, the digest to sign should be a hash of the concatenation of the finalizer_digest and the string WEAK.
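As a rough sketch of the weak-signing rule, the code below builds the byte string `finalizer_digest || "WEAK"` and hands it to a caller-supplied hash. The 32-byte digest size, the raw ASCII encoding of the suffix, and the names `digest_bytes`, `weak_vote_preimage` and `weak_vote_digest` are assumptions made for illustration. The hash is a template parameter because the C++ standard library has no SHA-256.

```cpp
#include <array>
#include <cstdint>
#include <string_view>
#include <vector>

using digest_bytes = std::array<std::uint8_t, 32>; // assumed 256-bit digest

// Builds the byte string finalizer_digest || "WEAK" that is hashed for a weak vote.
inline std::vector<std::uint8_t> weak_vote_preimage(const digest_bytes& finalizer_digest) {
    constexpr std::string_view suffix = "WEAK";
    std::vector<std::uint8_t> out(finalizer_digest.begin(), finalizer_digest.end());
    out.insert(out.end(), suffix.begin(), suffix.end());
    return out;
}

// `Hash` is any callable mapping a byte vector to a digest (hypothetical hook).
template <typename Hash>
auto weak_vote_digest(const digest_bytes& finalizer_digest, Hash&& hash) {
    return hash(weak_vote_preimage(finalizer_digest));
}
```

A strong vote would sign `finalizer_digest` directly, so the suffix is what keeps the two signatures from being interchangeable.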

@enf-ci-bot enf-ci-bot moved this to Todo in Team Backlog Jan 10, 2024
@arhag arhag added 👍 lgtm and removed triage labels Jan 10, 2024
@heifner heifner added the OCI Work exclusive to OCI team label Jan 15, 2024
@heifner heifner moved this from Todo to In Progress in Team Backlog Jan 15, 2024
@heifner heifner added this to the Leap v6.0.0-rc1 milestone Jan 15, 2024
@greg7mdp greg7mdp assigned greg7mdp and unassigned heifner Jan 16, 2024
@arhag arhag removed the OCI Work exclusive to OCI team label Jan 17, 2024
@arhag arhag changed the title IF: Unification: Update finalizer safety information and rules for how a finalizer signs IF: Update finalizer safety information and rules for how a finalizer signs Jan 19, 2024
@arhag
Member Author

arhag commented Jan 22, 2024

When starting nodeos, if there are no entries for finalizer keys that are configured for nodeos, then nodeos should automatically create entries for them.

The values in this new entry should be:

  • last_vote_time_range: set to an interval whose lower and upper bounds are both the current wall-clock time, rounded up to the 0.5s block time.
  • last_vote_block_id: set to nullopt
  • locked_block_id: set to LIB ID
  • locked_block_timestamp: set to LIB timestamp
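The rounding in the first bullet can be sketched as below. This assumes times are held as `std::chrono::milliseconds` since some epoch and are non-negative; Leap's actual block timestamp type is not modelled, and `round_up_to_block_time` is a hypothetical helper name.

```cpp
#include <chrono>

using std::chrono::milliseconds;

// Block interval stated in the text: 0.5 s.
constexpr milliseconds block_interval{500};

// Rounds a non-negative time since epoch up to the next block boundary.
// A time already on a boundary is returned unchanged.
constexpr milliseconds round_up_to_block_time(milliseconds t) {
    const auto rem = t % block_interval;
    return rem == milliseconds::zero() ? t : t + (block_interval - rem);
}
```

For instance, 1001 ms and 1499 ms both round up to 1500 ms, while 1000 ms stays at 1000 ms.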

@greg7mdp
Contributor

written by Areg

Instant Finality transition, finalizer safety information, and disaster recovery

IF transition

The transition to IF has a start block and an end block where the end block is a descendant of the start block.

Within any given branch of the blockchain that contains an IF transition, the start block is indicated by being the first block in that branch that has a finality header extension. It is defined as the first block in the branch in which a successful set_finalizer host function call is made. The end block is indicated by being the last block in the branch that has a non-zero action Merkle root in the block header. It is defined as the first block in the branch by which enough data and signatures were provided by the block header to advance the LIB from a block that was an ancestor of the start block to either the start block or a descendant of it.

When the end block is processed and irreversibility advances forward to include the start block as an irreversible block, the block_state_legacy structs of the blocks which are descendants of the start block need to be converted to block_state structs. The blocks that are descendants of start block but are not descendants of the new LIB (which may be the start block but may also be a descendant of it) can be pruned. The block_state_legacy of the start block should be converted to a block_state. Remaining descendants of the start block should generate the block_header_state part of their converted block_state by using the next member function called on their parent block’s block_header_state.

Note that all of the finality header extensions in blocks up to and including the end block should have last_qc_block_num set to the block number of the start block and is_last_qc_strong set to false. Additionally, the block_header_state for all of these blocks will have a core that includes that same last_qc_block_num but also includes the timestamp of the referenced block (last_qc_block_timestamp).

After the conversion, it is possible to receive votes on the start block or blocks descendant from it. But it is likely that the first votes will be received only on the new blocks generated after the conversion occurs, e.g. the block that has the end block as its parent.

Finalizer safety information

Initialization on Leap startup

When Leap starts up, it reads its configuration whether specified through the config.ini or through command line options. This configuration includes the path to the directory that may hold finalizer safety information and any finalizer keys provided to that Leap instance. Leap also initializes an in-memory associative map between each of the provided finalizer keys and their corresponding finalizer safety information (which may not be present). The initialization rules are discussed further below.

Leap must make no attempt to read or modify the finalizer safety information if there are no finalizer keys provided. However, if there is at least one finalizer key provided, Leap must attempt to read the finalizer safety information file from its appropriate path on startup. If the file exists, the values read from that file will be used to initialize the in-memory associative map. If the file does not exist, the in-memory associative map initially consists of the provided finalizer keys, each mapping to std::nullopt.

Regardless of whether the file exists or not, in the case where at least one finalizer key is provided, after the initialization of the in-memory associative map is completed Leap should write out the contents of that associative map to the finalizer safety information file (creating it and the required directories as needed) before continuing with the rest of the startup process, unless it knows that there would be no changes that need to be written out.

The initialization of the in-memory associative map uses the values read from the finalizer safety information file to appropriately set the values mapped to by any provided finalizer keys that were represented in the file. Additionally, any finalizer keys in the file that are not provided through Leap's configuration as finalizer keys should also be loaded with their corresponding finalizer safety information values (which should never be std::nullopts since finalizer keys associated to std::nullopt values would never be written to the file) so that subsequent writes to the file do not unintentionally erase that information. Finally, any provided finalizer keys that are associated with a std::nullopt value need to be given consideration for proper initialization.
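A minimal sketch of this merge step, assuming keys are represented as strings and using a deliberately simplified record type. `fsi_sketch`, `fsi_map` and `init_fsi_map` are hypothetical names, not Leap's.

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>

// Hypothetical, simplified safety record; the real one holds block references.
struct fsi_sketch {
    std::string lock_id;
};

using fsi_map = std::map<std::string, std::optional<fsi_sketch>>;

// `from_file` holds entries read from the safety file (empty if no file existed).
// `configured` holds the finalizer keys given to this node.
inline fsi_map init_fsi_map(const std::map<std::string, fsi_sketch>& from_file,
                            const std::set<std::string>& configured) {
    fsi_map result;
    // Keep every entry from the file, including keys that are not configured,
    // so that later writes of the map do not erase their information.
    for (const auto& [key, info] : from_file) result[key] = info;
    // Configured keys missing from the file start out as std::nullopt.
    for (const auto& key : configured) result.try_emplace(key, std::nullopt);
    return result;
}
```

Only the entries that are still `std::nullopt` after this step need the state-dependent initialization described next.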

Proper initialization of the finalizer safety information to replace the std::nullopt value of provided finalizer keys depends on the state of the blockchain loaded by Leap. If the last irreversible block (LIB) is an ancestor of the start block (which can be determined easily by checking if the header does not have a finality header extension), then these std::nullopt values must be left alone during the startup process. Otherwise, they should all be initialized to the same value:

  • last_vote_range_start = std::nullopt
  • last_vote = std::nullopt
  • lock = proposal_ref{.id = /* LIB ID */, .timestamp = /* LIB timestamp */}
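The rule above can be sketched as a single function. The struct layouts below are simplified assumptions that mirror the field names in the bullets, and `initial_fsi` is a hypothetical helper.

```cpp
#include <cstdint>
#include <optional>
#include <string>

struct proposal_ref_sketch {
    std::string   id;
    std::uint64_t timestamp;
};

struct fsi_sketch {
    std::optional<std::uint64_t>       last_vote_range_start;
    std::optional<proposal_ref_sketch> last_vote;
    proposal_ref_sketch                lock;
};

// Returns the startup value for a configured key that has no entry yet.
// `lib_has_finality_extension` is false while the LIB is still an ancestor
// of the start block; in that case the entry must stay std::nullopt.
inline std::optional<fsi_sketch> initial_fsi(bool lib_has_finality_extension,
                                             const proposal_ref_sketch& lib) {
    if (!lib_has_finality_extension) return std::nullopt;
    return fsi_sketch{std::nullopt, std::nullopt, lib};
}
```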

Finally, before the end of the Leap startup process, the current wall-clock time should be captured and used to determine a startup time lock on voting. The calculation requires adding some small amount of time (maybe 1 second) to the captured wall-clock time, possibly rounding up to the nearest 0.5 second (to be compatible with the block timestamp), and comparing with the head block's timestamp to pick the larger of the two. This determines the time that is saved on startup and remains untouched for the rest of the lifetime of the node process. The startup time lock on voting prevents Leap from using any finalizer key to vote on a block that has a timestamp earlier than that saved startup time, regardless of what the finalizer safety information for the finalizer may allow. Additionally, even if voting on a block is allowed, it prevents a strong vote if it would imply a voting time interval containing the startup time.
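The startup time lock and the two restrictions it imposes could look roughly like this. Timestamps are simplified to non-negative millisecond counts, the 1 second margin is the tentative value from the text, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

using ts_ms = std::int64_t; // timestamps as milliseconds; a simplification

constexpr ts_ms block_interval_ms = 500;
constexpr ts_ms startup_margin_ms = 1000; // the "maybe 1 second" from the text

// Rounds a non-negative timestamp up to a multiple of the block interval.
constexpr ts_ms round_up(ts_ms t) {
    return ((t + block_interval_ms - 1) / block_interval_ms) * block_interval_ms;
}

// Startup time lock: the later of (wall clock + margin, rounded up) and head time.
constexpr ts_ms startup_time_lock(ts_ms wall_clock, ts_ms head_block_ts) {
    return std::max(round_up(wall_clock + startup_margin_ms), head_block_ts);
}

// Voting is refused for blocks timestamped earlier than the startup time.
constexpr bool may_vote(ts_ms block_ts, ts_ms t_startup) {
    return block_ts >= t_startup;
}

// A strong vote implies the interval (last_qc_ts, block_ts]; it is refused
// when that interval contains the startup time.
constexpr bool may_vote_strong(ts_ms last_qc_ts, ts_ms block_ts, ts_ms t_startup) {
    return may_vote(block_ts, t_startup) &&
           !(last_qc_ts < t_startup && t_startup <= block_ts);
}
```

These checks apply on top of the finalizer safety information: they can only turn an allowed vote into a weak vote or no vote, never the reverse.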

Changes to the finalizer safety information

There are two ways the finalizer safety information associated with a finalizer key can be modified.

The first way, also the typical way, is if that finalizer key is used for a vote (strong or weak).

The second way only applies during the IF transition. If the end block is processed and irreversibility advances forward to include the start block as an irreversible block, all of the entries in the in-memory associative map that have a std::nullopt value of the finalizer safety information need to be initialized to the appropriate value before the end of the conversion process. That appropriate value is:

  • last_vote_range_start = std::nullopt
  • last_vote = proposal_ref{.id = b1.id, .timestamp = b1.timestamp}
  • lock = proposal_ref{.id = b2.id, .timestamp = b2.timestamp}

In the above, b1 refers to the start block and b2 refers to the last irreversible block (either the start block or a descendant of it) that is reached right after processing the end block and triggering the conversion process.

Any time the finalizer safety information is modified, the changes should be persisted by writing them out to the finalizer safety information file before continuing. In the first case (a vote), the changes should be persisted before the vote is sent out to the network. In the second case (the IF transition), the changes should be persisted before continuing on after the end of the conversion process.
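One way to enforce the persist-before-send ordering is sketched below. The write-to-temporary-then-rename technique is an assumption of this sketch rather than something the text prescribes, the serialized format is left opaque, and a truly durable commit would also need an OS-level sync (such as fsync), which is not shown. `persist_fsi` and `persist_then_send` are hypothetical names.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Writes `serialized` to `path` via a temporary file and a rename, so a crash
// mid-write does not leave a half-written safety file in place of the old one.
inline bool persist_fsi(const fs::path& path, const std::string& serialized) {
    const fs::path tmp = path.string() + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << serialized;
        out.flush();
        if (!out) return false;
    } // the stream is closed here, before the rename
    std::error_code ec;
    fs::rename(tmp, path, ec); // replaces an existing regular file at `path`
    return !ec;
}

// Sends the vote only after the updated safety information has been written.
inline bool persist_then_send(const fs::path& path, const std::string& serialized,
                              const std::function<void()>& send_vote) {
    if (!persist_fsi(path, serialized)) return false; // never vote on a failed write
    send_vote();
    return true;
}
```

The point of the wrapper is that the send callback is unreachable unless the write has succeeded, so a vote can never leave the node without a matching record on disk.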

Finalizer voting during and shortly after transition

A few different scenarios can be explored to verify how the above rules for setting up and changing the finalizer safety information enable finalizer voting during and after the IF transition.

In all of the scenarios below, let t_startup be the time recorded for the startup time lock on voting, let b1 refer to the start block, and let b2 refer to the new LIB after processing the end block (which may be the start block or may be a descendant of it).

A finalizer sets up a new Leap instance prior to the start of the IF transition

The finalizer safety information for the finalizer would not be initialized. Instead initialization would be delayed until the IF transition at which point it would be set to:

  • last_vote_range_start = std::nullopt
  • last_vote = proposal_ref{.id = b1.id, .timestamp = b1.timestamp}
  • lock = proposal_ref{.id = b2.id, .timestamp = b2.timestamp}

Here it is assumed that t_startup < b1.timestamp.

Until sufficient QCs were achieved to advance finality according to IF, the root of the fork database would remain b2. Also, b1.timestamp <= b2.timestamp. Therefore, all reversible blocks would satisfy both the monotonicity and safety conditions, and so it would be okay for the finalizer to vote for them. In fact, the finalizer could vote strongly on any of those blocks because the last vote range time interval (-infinity, b1.timestamp] would not overlap with the voting time interval (b1.timestamp, b.timestamp] implied by a strong vote for block b (note that b1.timestamp <= b.timestamp).

After the finalizer votes strongly on block b, their finalizer safety information would be modified (and persisted to the file) to be:

  • last_vote_range_start = b1.timestamp
  • last_vote = proposal_ref{.id = b.id, .timestamp = b.timestamp}
  • lock = proposal_ref{.id = b2.id, .timestamp = b2.timestamp}

A finalizer sets up a new Leap instance during the IF transition

The finalizer safety information for the finalizer would be initialized on startup to:

  • last_vote_range_start = std::nullopt
  • last_vote = std::nullopt
  • lock = proposal_ref{.id = b2.id, .timestamp = b2.timestamp}

While the root of the fork database would be b2 at this point, the block header state would have a core containing a last_qc_block_num equal to the block number of b1 and a last_qc_block_timestamp equal to b1.timestamp, i.e. b2.core.last_qc_block_timestamp == b1.timestamp.

Until sufficient QCs were achieved to advance finality according to IF, the root of the fork database would remain b2. Therefore, all reversible blocks would satisfy the safety condition. However, b2.timestamp < t_startup, so the Leap instance may have to wait until more recent blocks arrived before it could satisfy the monotonicity condition.

Assume that eventually a linkable block b arrived where t_startup < b.timestamp. The monotonicity and safety conditions would be satisfied for block b and so the finalizer could vote on it. However, it would not be possible to vote strongly because b1.timestamp < t_startup < b.timestamp and so the voting time interval (b1.timestamp, b.timestamp] implied by a strong vote would contain t_startup.

After the finalizer votes weakly on block b, their finalizer safety information would be modified (and persisted to the file) to be:

  • last_vote_range_start = b.timestamp
  • last_vote = proposal_ref{.id = b.id, .timestamp = b.timestamp}
  • lock = proposal_ref{.id = b2.id, .timestamp = b2.timestamp}

A finalizer sets up a new Leap instance after the IF transition

The finalizer safety information for the finalizer would be initialized on start up to:

  • last_vote_range_start = std::nullopt
  • last_vote = std::nullopt
  • lock = proposal_ref{.id = b3.id, .timestamp = b3.timestamp}

In the above, b3 is the last irreversible block when Leap started up.

At that moment in time, it would be possible for the finalizer to vote on a linkable block that had a timestamp greater than t_startup since it would satisfy the safety condition and also the monotonicity condition would be trivially satisfied due to the last_vote being a std::nullopt. However, it is likely that as the Leap instance processes new QCs (whether the ones attached in the block extension of incoming blocks from the network or QCs aggregated by the node from vote messages coming from the network) the last irreversible block would advance.

Assume the last irreversible block advanced to b4 before the finalizer had a chance to vote on a block. So b3 would be removed from the fork database as the new root block became b4.

Now consider a block b that is a descendant of b4 and has a timestamp greater than t_startup. The monotonicity condition is again trivially satisfied because last_vote is still std::nullopt. But now the safety condition is not satisfied because there is not sufficient data in the fork database to check whether b is a descendant of b3 (the block referenced by lock.id).

However, the liveness condition is satisfied because lock.timestamp (which is b3.timestamp) is less than b.core.last_qc_block_timestamp. This must be the case because the only way the last irreversible block could have advanced to b4 was if there was a block b5 (which is a descendant of b4 and an ancestor of b) that claimed b4 as its last QC block and where a QC was reached on b5. And so, since b is a descendant of b5, whichever last QC it claims must either be b5 or some other block in between b5 and b. This means that b3.timestamp < b4.timestamp < b5.timestamp <= b.core.last_qc_block_timestamp.

So with monotonicity and liveness satisfied, the finalizer can vote on b. Whether it can vote strongly or weakly depends on how t_startup compares to b.core.last_qc_block_timestamp. If t_startup <= b.core.last_qc_block_timestamp, then the finalizer can vote strongly on block b. Otherwise, if t_startup > b.core.last_qc_block_timestamp, then the finalizer is only allowed to vote weakly on block b.

Assuming the finalizer votes weakly on block b, their finalizer safety information would be modified (and persisted to the file) to be:

  • last_vote_range_start = b.timestamp
  • last_vote = proposal_ref{.id = b.id, .timestamp = b.timestamp}
  • lock = proposal_ref{.id = b4.id, .timestamp = b4.timestamp}

On the other hand, assuming the finalizer votes strongly on block b, their finalizer safety information would be modified (and persisted to the file) to be:

  • last_vote_range_start = b.core.last_qc_block_timestamp
  • last_vote = proposal_ref{.id = b.id, .timestamp = b.timestamp}
  • lock = proposal_ref{.id = new_lock.id, .timestamp = new_lock.timestamp}

In the above, new_lock may refer to some block that is a descendant of b4 but an ancestor of b with a block number of b.core.final_on_strong_qc_block_num.value() (assuming b.core.final_on_strong_qc_block_num.has_value()). But if b.core.final_on_strong_qc_block_num.has_value() == false, then new_lock would simply refer to b4, which is the current last irreversible block.

Notice that the lock after voting weakly is always b4 in this scenario regardless of the state of b.core.final_on_strong_qc_block_num. This is because weakly voting does not force the finalizer to update their lock. However, because they were locked on an old block that is not even in the fork database anyway and the finalizer safety information was being updated due to voting, it is preferred to update the lock forward to the last irreversible block. This makes it more likely for the future considerations of voting on a block to pass due to the safety condition being met rather than only relying on the liveness condition. Note that if the lock was already ahead of the last irreversible block, then weakly voting must not change lock.
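The lock update described in this scenario might be sketched as follows. `strong_target` stands in for the block numbered `final_on_strong_qc_block_num` when that optional has a value. Requiring the candidate to be newer than the current lock on a strong vote as well is an assumption carried over from the timestamp guard in the original pseudo-code. All names are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <string>

struct proposal_ref_sketch {
    std::string   id;
    std::uint64_t timestamp;
};

// Chooses the lock to store after a vote.
// Strong vote: candidate is the final-on-strong-QC block, or the LIB if absent.
// Weak vote:   candidate is the LIB, which only pulls a stale lock forward.
// In both cases a lock that is already ahead of the candidate is kept.
inline proposal_ref_sketch next_lock(const proposal_ref_sketch& current_lock,
                                     const proposal_ref_sketch& lib,
                                     const std::optional<proposal_ref_sketch>& strong_target,
                                     bool voting_strong) {
    const proposal_ref_sketch candidate =
        (voting_strong && strong_target) ? *strong_target : lib;
    return candidate.timestamp > current_lock.timestamp ? candidate : current_lock;
}
```

In the scenario above, a weak vote with lock b3 and LIB b4 yields b4, while a lock that is already ahead of b4 is returned unchanged.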

A finalizer restarts a Leap instance that crashed during the IF transition from a snapshot prior to the IF transition and syncs

TBC

A finalizer restarts a Leap instance that crashed during the IF transition from a snapshot during the IF transition and syncs

TBC

A finalizer restarts a Leap instance that crashed during the IF transition from a snapshot after the IF transition and syncs

TBC

A finalizer restarts a post-IF Leap instance from a snapshot prior to the IF transition and syncs

TBC

A finalizer restarts a post-IF Leap instance from a snapshot during the IF transition and syncs

TBC

A finalizer restarts a post-IF Leap instance from a snapshot after the IF transition and syncs

TBC

Disaster recovery

There are several disaster recovery scenarios the IF consensus algorithm should be designed to support:

  1. The Leap nodes of some finalizers crash while enough of the other finalizers needed to reach a quorum are not impacted.
  2. The Leap nodes of enough finalizers crash that the network is not able to reach consensus without them. At least one live node in the network (not necessarily even a finalizer) retains the latest reversible blocks and associated QCs.
  3. The Leap nodes of enough finalizers crash that the network is not able to reach consensus without them. There is no single live node in the network that retains all of the latest reversible blocks and associated QCs.

In all three scenarios there are cases A, B, C, and D to consider:
Case A (think Awesome): The finalizer safety information file of the Leap node that experienced the crash remains uncorrupted and retains the most up-to-date state prior to the crash.
Case B (think Backwards): There is an uncorrupted finalizer safety information file for the Leap node that experienced the crash but it does not contain the most up-to-date state prior to the crash.
Case C (think Corrupted): There is a finalizer safety information file for the Leap node that experienced the crash but it is corrupted and not reliably readable.
Case D (think Destroyed): The Leap node that experienced the crash is restarted without any finalizer safety information file. Either the file was lost or it was intentionally destroyed.

Leap should not start up with a corrupted finalizer safety information file if it has any finalizer keys provided. In this case (case C) it should force the node operator to make a decision: either they intentionally destroy the file so that Leap can start up (forcing it into case D) or they copy over an older finalizer safety information file that wasn't corrupted. If they make the latter decision, they are forcing it into either case A (unlikely: the backup file they copied over happened to somehow have the exact state prior to the crash), case B (likely: the backup file is relevant to the node but it is a little stale so it does not have the most up-to-date information), or effectively case D (unlikely: they copy over the wrong file meant for other finalizer keys, and so for the finalizer keys provided to this Leap instance the file has no finalizer safety information relevant to it).

In all cases for scenarios 1 and 2, there is at least one live node in the network that can eventually provide enough blocks to either help the restarted node recover the fork database it had prior to the crash (perhaps with additional blocks it did not have before) or a different fork database with a later root block which causes some of the blocks held in the prior iteration of the fork database to be removed (either because they are orphaned or because they are ancestors of the new last irreversible block). We can refer to the first situation as the recovered fork database situation. We can refer to the second as the progressed fork database situation.

TODO: Classify the various scenarios, e.g. 1A, 2A, etc., into classes such as "theoretically safe automatic recovery scenarios", "practically safe automatic recovery scenarios", "possibly manual recovery scenarios", etc. and expand on the details in each case.

@BenjaminGormanPMP BenjaminGormanPMP moved this from In Progress to Awaiting Review in Team Backlog Feb 13, 2024
@github-project-automation github-project-automation bot moved this from Awaiting Review to Done in Team Backlog Feb 15, 2024
@greg7mdp
Contributor

From Areg

So the safety condition for a target block a finalizer is considering voting for could be checked using just the block header state.

If lock.block_num() is less than target_block->core.last_final_block_num() or greater than or equal to target_block->core.current_block_num(), then the safety condition is not satisfied. Otherwise, use target_block->core.get_block_reference(lock.block_num()) to get the block_ref and compare it against the block ID (or finality digest) stored in lock to check whether it is the correct block. If it is, then the safety condition is satisfied. If not, then the safety condition is not satisfied.

Similarly, the liveness condition can use core.get_block_reference to look up the block reference using the QC claim block number.
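A toy version of the header-only safety check is sketched below. `core_sketch` is a stand-in that stores ancestor references for block numbers in [last_final_block_num, current_block_num); its layout is an assumption, and only the three accessor names come from the comment above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct block_ref_sketch {
    std::uint32_t block_num;
    std::string   block_id;
};

// Toy stand-in for the finality core: references to ancestors in the range
// [last_final_block_num, current_block_num), stored in ascending order.
struct core_sketch {
    std::uint32_t                 last_final_block_num;
    std::uint32_t                 current_block_num;
    std::vector<block_ref_sketch> refs;

    const block_ref_sketch& get_block_reference(std::uint32_t num) const {
        return refs.at(num - last_final_block_num);
    }
};

// Safety holds when the locked block is one of the target block's ancestors.
inline bool safety_check(const core_sketch& target_core, const block_ref_sketch& lock) {
    if (lock.block_num < target_core.last_final_block_num ||
        lock.block_num >= target_core.current_block_num) {
        return false; // the lock lies outside the range the core can answer for
    }
    return target_core.get_block_reference(lock.block_num).block_id == lock.block_id;
}
```

Because the check reads only the target block's core, it does not need the locked block to still be present in the fork database.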
