Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Approval Voting loop does not handle 'pre-covered' no-shows correctly #3847

Closed
rphmeier opened this issue Sep 14, 2021 · 4 comments
Closed
Labels
I3-bug Fails to follow expected behavior.

Comments

@rphmeier
Copy link
Contributor

rphmeier commented Sep 14, 2021

We observed a situation on Rococo where a candidate had 6 assignments and 5 approvals, and all nodes saw it as unapproved indefinitely, with no further wakeups pending.

Needed-approvals: 4
Tranche-0 assignments: A, B, C, D
Tranche-1 assignments: E, F

All validators issued assignments at the same time and A no-showed. This lead to a technically approved state of 3 tranche-0 approvals and 1 tranche-0 no-show, which is covered by 2 tranche-1 approvals.

What I think happened is that we imported all assignments and then when A no-showed, it was immediately covered by the validators of tranche 1. There are two issues with this:

  1. No further time was given to other validators of tranche 1 to broadcast their assignments. We need to account for the case where E and F aren't the only validators in tranche 1 but the others were waiting for a no-show before triggering. This can happen organically due to clock drift or malicious actors in early tranches could attack more easily by 'pre-covering' tranche-0 validators they intend to DoS.
  2. Even though the block technically had an 'approved' state, it wasn't marked as approved in the node's databases, leading to a finality stall.

Both the wakeup state machine and approval-counting procedures need to be adjusted to account for this:

  1. At the moment, candidates can only be marked as approved in the database in the code path that imports an approval vote. In scenarios such as the one in this issue where all approvals were imported before the no-show, this means that the block can never be considered approved.
  2. There is no state in RequiredTranches for waiting for new assignments after a no-show. We should have a new state that indicates that we've just encountered a no-show but all assignments covering it were received before we detected the no-show and therefore we should wait NO_SHOW_REACTION_TICKS ~= 3 before marking as approved.
@rphmeier rphmeier added the I3-bug Fails to follow expected behavior. label Sep 14, 2021
@rphmeier rphmeier changed the title Approval Voting loop may not handle 'pre-covered' no-shows correctly Approval Voting loop does not handle 'pre-covered' no-shows correctly Sep 14, 2021
@burdges
Copy link
Contributor

burdges commented Sep 14, 2021

Why did the tranche one validators broadcast? You think 0.5 s maybe too fast for tranches? Ugh

@rphmeier
Copy link
Contributor Author

Yeah, 0.5s might be too fast for the moment although once we get header rebroadcasting it's probably fine.

@burdges
Copy link
Contributor

burdges commented Sep 17, 2021

It's strange this caused a chain halt since if we think some candidate is not approved then we should eventually check it ourselves. Is there perhaps a second bug somewhere?

@rphmeier
Copy link
Contributor Author

@burdges

  1. The candidate was approved, but we didn't commit that to the metadata database
  2. Since it is approved, there were no further wakeups to schedule
  3. It never got woken up anymore was marked as unapproved, therefore a finality stall

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I3-bug Fails to follow expected behavior.
Projects
None yet
Development

No branches or pull requests

2 participants