-
Notifications
You must be signed in to change notification settings - Fork 4.5k
add additional vote lockout stake threshold #34120
add additional vote lockout stake threshold #34120
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #34120 +/- ##
=========================================
- Coverage 81.9% 81.9% -0.1%
=========================================
Files 819 819
Lines 219884 219919 +35
=========================================
+ Hits 180263 180270 +7
- Misses 39621 39649 +28 |
aea64cc
to
3c07f5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to add some tests that pass the 8 deep check but fail the 4 deep check.
core/src/consensus.rs
Outdated
if vote.confirmation_count() as usize > threshold_depth { | ||
for old_vote in &vote_state.votes { | ||
if old_vote.slot() == vote.slot() | ||
&& old_vote.confirmation_count() == vote.confirmation_count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was the issue: #7904 (comment)
If you bounce back to voting on the main fork after not voting for a while, your latest vote N
on the main fork will pop off a lot of the stake of votes in the tower that would have rolled up to earlier votes in the tower. I think this is because for every vote state we apply the proposed vote N
latest before roling up to voted_stakes
Added 3 tests:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me but i'll wait for carl to weigh in.
A couple of things:
- I think we should test this out with the delay active.
- Although not strictly necessary we might want to feature flag this as it changes voting behavior. perhaps running with the delay and a 50/50 split with the new threshold turned on could give us some insight on whether it's necessary.
core/src/consensus.rs
Outdated
pub fn check_vote_stake_threshold( | ||
/// Checks a single vote threshold for `slot` | ||
fn check_vote_stake_threshold( | ||
vote: Option<&Lockout>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit maybe rename to threshold_vote
so that it's not implied that the vote referenced by vote_state_before_applying_vote
is this one
core/src/consensus.rs
Outdated
fork_stake, | ||
total_stake | ||
); | ||
if vote.confirmation_count() as usize > threshold_depth { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized that at this point t if vote is Some
it has to be greater than threshold_depth
, but makes sense to minimize changes here
core/src/consensus.rs
Outdated
if vote.confirmation_count() as usize > threshold_depth { | ||
for old_vote in &vote_state.votes { | ||
if old_vote.slot() == vote.slot() | ||
&& old_vote.confirmation_count() == vote.confirmation_count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was the issue: #7904 (comment)
If you bounce back to voting on the main fork after not voting for a while, your latest vote N
on the main fork will pop off a lot of the stake of votes in the tower that would have rolled up to earlier votes in the tower. I think this is because for every vote state we apply the proposed vote N
latest before roling up to voted_stakes
Definitely seems like something we should feature flag because I don't think it's something that can be adequately tested by slow rollout, i.e. the percentage of people following one threshold or the other probably matters We should also spin up a cluster with ~20 nodes and run the partition tests with 50-50 split and 30-70 split. I think those are generally good at finding any consensus issues |
Added feature gate to control checking this additional threshold: #34158 |
f4cc88f
to
cbcf3d4
Compare
I ❤️ this. I fully support it. I've been referring to this concept as Votes Funnel, drawing an analogy with Sales Funnel. However, Forks Funnel might be a more accurate term. 😃 |
sdk/src/feature_set.rs
Outdated
@@ -732,6 +732,10 @@ pub mod enable_zk_transfer_with_fee { | |||
solana_sdk::declare_id!("zkNLP7EQALfC1TYeB3biDU7akDckj8iPkvh9y2Mt2K3"); | |||
} | |||
|
|||
pub mod additional_vote_stake_threshold { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes shouldn't need a feature gate. the fd team and operators would i'm sure be happy if we didn't break consensus unnecessarily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the feature gate
26d6b66
to
361ef93
Compare
Addressed latest feedback to re-order stake threshold checks and remove the feature gate. Changes have been tested against (1) 10 node cluster partition testing w/ 50/50 and 70/30 stake split and (2) 10 node cluster w/ broadcast delay. Cluster seems to be able to recover from partitions lasting up to several minutes. Lockouts also appear to be limited to 4-deep in the broadcast delay case. |
361ef93
to
e355d0d
Compare
e355d0d
to
3b9c8e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll await Carl's final approval, but the successful results observed in the test clusters gives me high confidence about the change
Following up on this: Looked at tower height metrics on testnet over the last few days (2/25/24 to 2/27/24) and things are looking much better than they did a few months back. I only saw a couple of places where tower height rose to ~80 and neither of them appeared to be due to 8-deep lockout issues (seemed like some short lived partitioning combined w/ delinquent leaders). We previously were seeing several second long root stalls a few times every day, so it seems like the additional check is helping 👍 |
Problem
We see several cases where cluster gets partitioned in such a way that some set of validators end up voting down the minority fork and hit deep lockouts (up to 8). This can stall OC for over a minute in the worst cases. See #34107 for one example.
It would be good to resolve all the reasons why validators can vote down the minority fork (deterministic fork selection, fixing reasons why leader's broadcast gets delayed, etc.) but it could take some time to resolve all of these. Treating the symptom by reducing how far down the minority fork validators will vote can mitigate lockout period significantly.
Summary of Changes
When checking vote stake threshold as part of voting criteria, check for both of the following criteria:
The reason switch threshold is selected for the threshold is because at that point the voting validator can be confident that either:
We often see cases where the validators that voted on the "majority" fork weren't able to land their votes because they got sent to a leader building off the "minority" fork. However, the minority fork has less than 38% and so those validators aren't allowed to switch over. This is the exact situation this new criteria should prevent because either: