-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update fsi and decide_vote to solve weak masking issue without regression in liveness #621
Comments
heifner
added a commit
that referenced
this issue
Aug 23, 2024
…ol votes_forked_since_latest_strong_vote for vote decision
heifner
added a commit
that referenced
this issue
Aug 23, 2024
greg7mdp
added a commit
that referenced
this issue
Aug 26, 2024
heifner
added a commit
that referenced
this issue
Aug 26, 2024
heifner
added a commit
that referenced
this issue
Aug 27, 2024
[1.0] Use other_branch_latest_time for liveness
heifner
added a commit
that referenced
this issue
Aug 27, 2024
heifner
added a commit
that referenced
this issue
Aug 27, 2024
[1.0 -> main] Use other_branch_latest_time for liveness
greg7mdp
added a commit
that referenced
this issue
Aug 28, 2024
[1.0] Add testcase validating the new fsi design from issue #621.
heifner
added a commit
that referenced
this issue
Aug 29, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Depends on #617.
The fix in #534 for the weak masking issue respected a more conservative version of rule 2. This solved the safety concerns due to the weak masking issue, but it was unnecessarily restrictive with respect to liveness.
A finalizers that ends up with
votes_forked_since_latest_strong_vote == true
may be stuck voting weak if the QC is not formed quickly enough. Consider a chain where consistently a weak QC is formed on a block and it is claimed not in the next block but the block after. When a finalizer is consider to vote on the block that claims the weak QC on its grandparent block, it needs to consider the implied time span from the grandparent block's timestamp to the timestamp of the block it is deciding to vote on. To respect rule 2, the finalizer cannot vote strong on the block if that implied time span overlaps with the timestamp of any other blocks the finalizer has votes on previously unless those blocks are ancestors of the block it is deciding to vote on.But the finalizer would most likely have just recently voted on the parent block which would have updated the
last_vote
member in its fsi to that parent block. And sincevotes_forked_since_latest_strong_vote
is true, the finalizer would assume the worst about the overlap between the implied time span and thelast_vote
timestamp, even though thelast_vote
block is an ancestor of the block it is deciding to vote on. Therefore, under that conservative assumption, it would decide to vote weak, thus repeating the problem indefinitely. This could lead to a "live lock" situation, where the finalizer (and perhaps others like it) only vote weak, and thus only weak QCs are formed and finality does not advance.Finality could still advance if a block was able to claim a weak QC on its parent block. Then the implied time span would not conflict with the
last_vote
and the finalizer could vote strong on the block breaking the cycle. But this requires a QC to be formed in less than 500 ms. Prior to the change of #534, there was no requirement for QCs to be formed within any particular time window. So the changes of #534 caused a regression in liveness while fixing the safety problem.It is possible to keep the original safety problem fixed while avoiding the regression in liveness. The naive approach would be for a finalizer to keep track of its latest strong vote and all of its weak votes of since its latest strong vote. When deciding whether to vote strong or weak, the finalizer could first filter out all tracked votes with a timestamp less than or equal to the latest QC claim timestamp. Then it would check all remaining votes to ensure they were on blocks that are ancestors of the block it is deciding to vote on. If those conditions were satisfied, then it would know rule 2 is not violated and so it can vote strong; otherwise it would be forced to vote weak.
The naive approach requires storing
block_ref
s for each vote which can be unbounded. The goal with the changes of #534 were to keep the fsi record fixed size. In fact, the serializer for the fsi file assumed each record would serialize to a fixed size. That limitation on the serializer will be lifted with the changes in #617 which introduces a version 1 for the fsi file. It would appear that the only way to avoid limiting liveness while solving the weak masking problem is to allow the fsi record to be unbounded in size. However, there are more efficient approaches than the naive approach of tracking all votes.Consider the following section of the existing
decide_vote
implementation:spring/libraries/chain/finality/finalizer.cpp
Lines 63 to 93 in 68de44f
Instead of tracking
votes_forked_since_latest_strong_vote
in the fsi record, we could have a memberstd::vector<block_ref> other_branches
. This vector is not meant to track all of the blocks that the finalizer votes on. It avoids tracking any blocks that are ancestors of other tracked blocks in that vector or inlast_vote
. Soother_branches
tracks the branches (other than the one headed bylast_vote
) containing blocks that the finalizer has voted weak on since its latest strong vote; but it also tracks the block that the finalizer has last voted strong on, unless that block is already tracked bylast_vote
.other_branches
keeps theblock_ref
s sorted in ascending timestamp; alsolast_vote
has a timestamp no less than the one forother_branches.back()
. Whenever a finalizer decides to vote strong on a block,other_branches
is cleared.So when deciding to vote strong or weak, the finalizer only needs to pay attention to
last_vote
andother_branches.back()
(assuming it exists). If the timestamp oflast_vote
is less than or equal to the latest QC timestamp (which means the same holds for allblock_ref
s inother_branches
), then the finalizer can vote strong. Iflast_vote
has a timestamp greater than the latest QC timestamp but is for a block that is an ancestor of the block the finalizer is deciding how to vote on, then it is still possible to vote strong; it requires either thanother_branches
is empty orother_branches.back()
has a timestamp less than or equal to the latest QC timestamp. If these conditions are not met, then the finalizer must vote weak.This approach, while more efficient than the naive approach, still has an unbounded size. But we can do even better and bound the size without harming liveness. Notice that in the prior approach the decision to vote strong or weak does not actually require
other_branches
. It only requireslast_vote
andother_branches.back().timestamp
(assuming it exists). The fullother_branches
is only needed to figure out which (if any) of the branches will be extended by the new vote so thatother_branches
can correctly be kept up to date. But if all we care about isother_branches.back().timestamp
, then that value can be kept up to date without needing the rest ofother_branches
.So instead of tracking
votes_forked_since_latest_strong_vote
in the fsi record, we could have a memberstd::optional<block_timestamp_type> other_branch_latest_time
. This member is reset tostd::nullopt
on a strong vote. If present, it represents the timestamp of the most recent block that was voted on but is also not the block referenced bylast_vote
and not known to be ancestor of the block referenced bylast_vote
. If not present, it either means that the most recent vote was a strong vote or that all weak votes since the latest strong vote have been on blocks extending one after the other. Basicallyother_branch_latest_time.has_value()
is equivalent to the priorvotes_forked_since_latest_strong_vote
. But notice that there is now additional information (the timestamp) present in the fsi in the case where we would have hadvotes_forked_since_latest_strong_vote == true
.But the actual implementation to go with is to replace
votes_forked_since_latest_strong_vote
in the fsi record with the memberblock_timestamp_type other_branch_latest_time
. Notice there is nostd::optional
anymore. The case of this not being present in the prior paragraph can easily be represented by setting the earliest allowed timestamp. There are no issues with ambiguity because the condition to allow a strong vote does not care whether this timestamp is not present or if it is less than or equal to the latest QC timestamp (which the earliest allowed timestamp always would be). This also means that we preserve the property that the serialization of the fsi record is always the same size regardless of the values they hold.So the referenced code from
decide_vote
above can now instead become something like the following:To support the new
other_branch_latest_time
member of the fsi record, the fsi record type must be changed. We do not wish to break all compatibility with the existing version 0 fsi record that was introduced in Spring v1.0.0-rc1 and was still the only version supported in Spring v1.0.0-rc2. So a new version for the fsi record must be introduced. #617 introduces a new version 1 fsi record which this issue assumes has not yet been released in a version of the Spring software following v1.0.0-rc2. If so, the version 1 fsi record type can simply be modified to meet the needs of this issue.It is not possible to properly upgrade a version 0 fsi record to a version 1 fsi record because the version 0 fsi record is missing critical information (though only in the case when
votes_forked_since_latest_strong_vote
is true). Since the version 0 fsi record was introduced in an release candidate for Spring 1.0.0, and this issue will be resolved and included in another release candidate for Spring 1.0.0, it is okay to support this upgrade in the fsi record version in a manner that does not conservatively preserve safety. This is because it is very unlikely for safety to be violated in the brief window of time from when the upgrade happens to the next time the finalizer votes strong (which would reset things back to a good state) and also because even in the unlikely event the upgrade causes the finalizer to commit a rule violation, it does not really matter anyway since the release candidates are only used for test networks.However, there is a way to upgrade the record to conservatively preserve safety while avoiding the liveness regression. The trick is to initialize
other_branch_latest_time
tolast_vote.timestamp
within the upgraded version 1 fsi record, but only ifvotes_forked_since_latest_strong_vote
was true. Ifvotes_forked_since_latest_strong_vote
was false, thenother_branch_latest_time
should be initialized within the upgraded version 1 fsi record to its default value of the earliest allowed timestamp.Note that while this issue changes the behavior of
decide_vote
and also the fsi format, it is not a consensus protocol change. The changes proposed here are consensus compatible with Spring v1.0.0-rc1 and Spring v1.0.0-rc2. The changes simply adjust the protections finalizers employ to enable more situations in which they can (still safely) vote strong and therefore improves liveness of the network without a protocol change.The changes implemented as part of this issue should be accompanying by test cases that demonstrate the liveness issue caused by #534 which this issue should fix. Additionally, the test cases should ensure all the code paths through the above proposed implementation in
decide_vote
are exercised. Both of these objectives can be accomplished with a single test case.In the above example, things are moving along normally until time t4 when a microfork occurs. Instead of building block B4 off of block B3, the producer builds block B4 off of block B2. And then going forward, for some reason, it takes slightly longer for votes to propagate that a QC on a block cannot be formed in time to be included in the very next block; instead the QC goes in the block after.
The finalizer of interest is voting on all of the blocks as they come. For this example, it is sufficient to only have one finalizer. The first time the finalizer is forced to vote weak is on block B4. As the other blocks continue to build on that new branch, it votes on them appropriately and the producer collects the vote and forms a QC as soon as it can, which always remains one block late. The finalizer should begin voting strong again starting with block B6. However, prior to the changes described in this issue, the finalizer would remain stuck voting weak indefinitely.
The expected state of the fsi record for the finalizer after each vote is provided below. It also records what the new LIB should be after processing the block. In addition to checking that the blocks have the claims as required above and the LIB as noted below, the test should also check that the fsi record after each vote is as expected below.
The text was updated successfully, but these errors were encountered: