[Dynamic Protocol State] Remaining work and ToDos #4649
Adding a suggestion for naming conventions; more details are in the appendix of the notion doc.
In both APIs, the sub-APIs are mutually exclusive. Hence, we would need to change the current API.
All items are either implemented, a PR is up, or we decided that we don't want to implement the item.
High priority
If the protocol state updater (`protocol.StateUpdater`) reports that it encountered an invalid state transition (`InvalidServiceEventError`), we currently just set the `StateUpdater.InvalidStateTransitionAttempted` flag to true and proceed as usual:

flow-go/state/protocol/badger/mutator.go, lines 1124 to 1128 in 0172dd1
However, we don't know that the `updater`'s internal state is a valid Protocol State. There is no atomicity requirement on the `StateUpdater` as far as I can tell, in that it either applies the update entirely or not at all. For all we know, it could be complete garbage. Nevertheless, we just build whatever (potentially broken) interim state the update was aborted at.

Suggestion:
- (keep setting `InvalidStateTransitionAttempted` to `true`)
- Extend the `Updater` to "drop all modifications if `InvalidStateTransitionAttempted`" and document this properly throughout the code base (see the sketch after this list). In a nutshell, within the `Build` method I would check `state.InvalidStateTransitionAttempted` first; in that case we could just return a copy of the `parentState` with only `state.InvalidStateTransitionAttempted` set to true.
- Document this behaviour in the `Updater`, the corresponding interface, and ideally also the places where we call the `Updater`.
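To make the suggestion concrete, here is a minimal sketch of the proposed `Build` behaviour, assuming a simplified `Updater` with the fields discussed above (these are illustrative stand-ins, not the actual flow-go types):

```go
// Package protocolstate sketches the "drop all modifications" convention.
package protocolstate

// ProtocolStateEntry is a simplified stand-in for the real entry type.
type ProtocolStateEntry struct {
	InvalidStateTransitionAttempted bool
	// ... further identity-table fields elided ...
}

// Updater is a simplified stand-in for the real StateUpdater.
type Updater struct {
	parentState *ProtocolStateEntry // valid snapshot the update started from
	state       *ProtocolStateEntry // working copy, possibly half-updated
}

// Build returns the updated protocol state. If an invalid state transition
// was attempted, all interim modifications are dropped: we return a copy of
// the parent state with only the flag set, so the result is always a valid
// Protocol State.
func (u *Updater) Build() *ProtocolStateEntry {
	if u.state.InvalidStateTransitionAttempted {
		fallback := *u.parentState // shallow copy suffices here: no reference fields in this sketch
		fallback.InvalidStateTransitionAttempted = true
		return &fallback
	}
	return u.state
}
```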
Further thoughts: when an operation fails in a way that indicates an invalid state transition, should the updater maybe already set its own `InvalidStateTransitionAttempted` flag? We would still raise the error to signal the failure to the caller. However, the caller would no longer be required to explicitly call `updater.SetInvalidStateTransitionAttempted()` (I'd leave that method, to allow `Updater`-external logic to set this flag for other reasons). (For further details, see this PR comment.)
☝️ implemented in [Dynamic Protocol State] Dedicated protocol state machine for epoch fallback #4931
Explicitly encode a node's epoch lifecycle status (e.g. joining, active, leaving) in the protocol state.
Currently, we detect that a node is authorized to be part of the network but not to actively contribute to extending the chain with the following conditional:
Context:
For consensus participants (consensus nodes, collector nodes), we anyway use their initial weight. The practical benefits are massive, because this enables light clients that don't need to download and locally check every block header.
Leader selection is also pre-determined for an entire epoch based on initial weight.
For the other node roles, Verification [VNs] and Execution [ENs], we currently don't have meaningful ways for weight-based load balancing. ENs anyway need to execute every block, and our chunk-assignment algorithm generates uniform load across all VNs. It is not clear whether weighted protocol variants even exist (and even if such algorithms exist [open research topic], a large amount of complicated software changes would likely be needed).
Therefore, we conclude that at the moment only a binary notion of weight (`zero` vs `positive`) is used, and we don't believe the protocol will benefit from a more fine-grained differentiation in the foreseeable future.

Further reasoning:
- Introduce a `participation status` with possible values of `active` [positive weight] and `joining`, `leaving` [zero weight].
- Treat the `Ejected` status just as one possible value of this enum. If we kept the `participation status` and the `Ejected` flag separate, we could easily differentiate between (`joining`, `ejected`) and (`leaving`, `ejected`). However, I don't think this differentiation is conceptually useful. If a node is ejected, we should just drop it from the protocol state at the next possible occasion. Whatever status it had before (joining, active, leaving) is no longer relevant for the core protocol layer (for the system smart contracts, that might be different).

Suggested revision: a `participation status` with possible values of `active` [positive weight] and `joining`, `leaving`, `ejected` [zero weight] (a minimal enum sketch follows below).
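For illustration, a minimal sketch of the suggested participation status as a Go enum; the identifier names here are assumptions, not necessarily the ones that were eventually merged:

```go
// Package flow sketches the proposed epoch participation status.
package flow

// EpochParticipationStatus encodes a node's epoch lifecycle status.
type EpochParticipationStatus int

const (
	// Joining: staked for the next epoch, zero weight in the current one.
	EpochParticipationStatusJoining EpochParticipationStatus = iota
	// Active: actively participating in the current epoch, positive weight.
	EpochParticipationStatusActive
	// Leaving: participated in the previous epoch, zero weight now.
	EpochParticipationStatusLeaving
	// Ejected: zero weight; dropped from the protocol state at the next occasion.
	EpochParticipationStatusEjected
)

// IsActive reports whether the node contributes with positive weight,
// capturing the binary zero-vs-positive notion of weight discussed above.
func (s EpochParticipationStatus) IsActive() bool {
	return s == EpochParticipationStatusActive
}
```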
For further context, this todo is based on the PR comments:
- `flow.Identity` in `flow.IdentitySkeleton` and `flow.DynamicIdentity` #4545 (comment)
- `EpochStateContainer` stores epoch active identities #4834 (comment)

☝️ implemented in [Dynamic Protocol State] Replace dynamic weight and ejection flag with 'epoch participation status' #5039
At the moment, the `protocol_state.Updater` returns `InvalidServiceEventError` for obviously broken inputs. However, let's take a full pass over the implementation to make sure we correctly differentiate between internal exceptions and external errors in all cases.

Context:
In a sense, all protocol-state-changing events should be considered external inputs. Those inputs can be broken (though they should not be, because they passed verification).
I think the updater should clearly distinguish between these two failure sources (a hedged sketch of the convention follows below). The node internally only maintains protocol state snapshots -- should anything be wrong with them, it is an exception. In contrast, the blocks essentially only represent state transitions. Should something not match up with the state transition, something is wrong with the block or with the process generating the state transitions that are recorded in the block (e.g. the epoch smart contract).
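As an illustration of this convention, here is a hedged Go sketch of sentinel-error handling with simplified stand-in types (the real `InvalidServiceEventError` lives in flow-go's protocol state code; the details below are assumptions):

```go
// Package protocolstate sketches the exception-vs-invalid-input convention.
package protocolstate

import (
	"errors"
	"fmt"
)

// InvalidServiceEventError marks a failure caused by invalid external input,
// i.e. the block (or the process generating its service events) is at fault.
type InvalidServiceEventError struct{ err error }

func (e InvalidServiceEventError) Error() string { return e.err.Error() }
func (e InvalidServiceEventError) Unwrap() error { return e.err }

// NewInvalidServiceEventErrorf wraps a formatted message as an invalid-input error.
func NewInvalidServiceEventErrorf(format string, args ...any) error {
	return InvalidServiceEventError{err: fmt.Errorf(format, args...)}
}

// IsInvalidServiceEventError lets callers branch on the failure source.
func IsInvalidServiceEventError(err error) bool {
	var target InvalidServiceEventError
	return errors.As(err, &target)
}

// applyServiceEvent illustrates the convention: problems attributable to the
// input are wrapped as InvalidServiceEventError; anything else returned from
// here would propagate unwrapped, i.e. as an internal exception.
func applyServiceEvent(counter, expected uint64) error {
	if counter != expected {
		return NewInvalidServiceEventErrorf("wrong epoch counter: got %d, expected %d", counter, expected)
	}
	return nil
}
```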
☝️ implemented in [Dynamic Protocol State] Dedicated protocol state machine for epoch fallback #4931
Take a fresh look at the `InvalidStateTransitionAttempted` flag and its relation to the node-internal `EpochEmergencyFallbackTriggered` event and the related flag in `InstanceParams`. See PR comment here. In the context of this discussion, we will probably also have to consider the option of leaving epoch fallback mode as well.

👉 moved to separate issue [Dynamic Protocol State] passing the epoch commitment deadline without an `EpochCommit` event #5104

Medium priority
Review the precise conditions under which an epoch transition occurs and how that interacts with epoch emergency fallback. See Jordan's comment.

☝️ @jordanschalm: This has changed again since the original comment. We do not explicitly check that `SetupID != nil`, but we do check in both cases that we are in the EpochCommitted phase before transitioning to the next epoch (which we should). I don't think there is a need to change anything here any more.

Update: we concluded that this sanity check might be helpful in case the node has corrupted its own state. However, there are multiple similar checks already in place, so we decided not to overly bloat the code with sanity checks that are most likely never needed.
Following Jordan's comment on PR #4559, rename `ProtocolStateEntry` to `IdentityTableSnapshot` (🎏 does it contain more than the identity table?)

☝️ @jordanschalm: Now, it does contain more than the identity table.

This is based on Jordan's comment on PR #4559 on the data structure `ProtocolStateEntry`, which is mixing conceptually different approaches at the moment. In the following, we are going to refer to the `ProtocolStateEntry` as `IdentityTableSnapshot` (i.e. assume the previous ToDo is already implemented).

Concerns are:
Conceptually, the current approach could become confusing and error-prone when we add slashing/ejection events, which modify one node and leave the rest unchanged (maintenance and extendability concern). Furthermore, we need to separate the state transition logic anyway to implement construction and verification of Identity Table changes in consensus nodes (see issue https://github.com/dapperlabs/flow-go/issues/5517).
Proposal:
- Implement a separate state machine $\mathcal{S}$ for evolving the protocol state.
- Conceptually, each consensus node has knowledge of the parent block's $IdentityTableSnapshot_{N-1}$ (which takes effect after the QC in the child block confirms its validity, so specifically for the child block).
- The child block $N$ might contain state-changing operations $\Delta_N$ (example: $\Delta_N$ might be an `EpochSetup` event, an ejection request, a slashing, ...).
- The proposer of the child block will include the hash of the resulting $IdentityTableSnapshot_{N}$, which is a commitment to the resulting Identity Table after block $N$ is processed. Both the primary as well as the consensus replicas compute / validate $IdentityTableSnapshot_{N}$ by applying the state machine $\mathcal{S}$:

$$IdentityTableSnapshot_{N} = \mathcal{S}\big(IdentityTableSnapshot_{N-1},\, \Delta_N,\, view\big)$$

where $view$ is the view of the block containing $\Delta_N$ (I think this will be very helpful for consistency checks). A hedged sketch of the corresponding validation step follows below.
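The sketch below illustrates how a replica could verify the proposer's commitment by re-executing $\mathcal{S}$; all types and signatures are assumptions for illustration, not flow-go APIs:

```go
// Package protocolstate sketches the commitment-verification step.
package protocolstate

import (
	"bytes"
	"fmt"
)

// IdentityTableSnapshot and ServiceEvent are simplified stand-ins.
type IdentityTableSnapshot struct{ /* identity table fields elided */ }
type ServiceEvent struct{ /* e.g. EpochSetup, ejection request, slashing */ }

// VerifyProtocolStateCommitment re-executes the state machine S on the
// parent snapshot and the block's state-changing operations Δ_N, then
// compares the hash of the result to the commitment included by the proposer.
func VerifyProtocolStateCommitment(
	parent *IdentityTableSnapshot, // IdentityTableSnapshot_{N-1}
	delta []ServiceEvent, // Δ_N carried by block N
	view uint64, // view of the block containing Δ_N
	commitment []byte, // hash of IdentityTableSnapshot_N proposed in block N
	s func(*IdentityTableSnapshot, []ServiceEvent, uint64) (*IdentityTableSnapshot, error),
	hash func(*IdentityTableSnapshot) []byte,
) error {
	result, err := s(parent, delta, view)
	if err != nil {
		return fmt.Errorf("state machine rejected state-changing operations: %w", err)
	}
	if !bytes.Equal(hash(result), commitment) {
		return fmt.Errorf("proposer committed to a different resulting snapshot")
	}
	return nil
}
```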
Extend the unit tests for `RichProtocolStateEntry` by also testing these scenarios:
- `RichProtocolStateEntry` for an epoch setup phase where no prior epoch exists (i.e. the first epoch setup phase after a spork) 👉 ToDo
- `RichProtocolStateEntry` for an epoch commit phase where no prior epoch exists (i.e. the first epoch commit phase after a spork) 👉 ToDo

☝️ implemented in [Dynamic Protocol State] TODOs and refactoring, part 2 #5080
For the `ProtocolState` storage layer abstraction, I would suggest to also include a cache for the secondary index (retrieving the IdentityTable by block ID) 👉 PR comment. A compact sketch of the idea follows below.

☝️ implemented in [Dynamic Protocol State] TODOs and refactoring, part 2 #5080
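A minimal sketch of the two-level lookup, with hypothetical types (the real storage layer has its own generic cache, which is elided here):

```go
// Package storage sketches a secondary-index cache for the protocol state.
package storage

import "sync"

type Identifier [32]byte

type ProtocolStateEntry struct{ /* fields elided */ }

// ProtocolStateCache keeps a primary index (state ID → entry) and the
// suggested secondary index (block ID → state ID).
type ProtocolStateCache struct {
	mu        sync.RWMutex
	byStateID map[Identifier]*ProtocolStateEntry // primary index
	byBlockID map[Identifier]Identifier          // secondary index
}

// ByBlockID resolves the protocol state via the secondary index. On a miss,
// a real implementation would fall back to a database read and backfill
// both indexes.
func (c *ProtocolStateCache) ByBlockID(blockID Identifier) (*ProtocolStateEntry, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	stateID, ok := c.byBlockID[blockID]
	if !ok {
		return nil, false
	}
	entry, ok := c.byStateID[stateID]
	return entry, ok
}
```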
Low priority
I feel the method `Members` of `protocol.Cluster` should return `IdentitySkeleton`s (👉 code). We are doing multiple conversions to `IdentitySkeleton` in code using `protocol.Cluster`, which I feel indicates a design inconsistency.

☝️ implemented in [Dynamic Protocol State] Changing structure of participants in `EpochSetup` #4726

For the `ProtocolState` storage layer abstraction, review the possibility of 'caching the `RichProtocolStateEntry` on store' 👉 PR comment

❌ Won't implement: On the happy path, the Protocol State changes rarely throughout the epoch. The protocol only requires 3 stores per epoch. On the one hand, there is a non-negligible overhead for the database read on the first retrieval. On the other hand, this is extremely rare, and the latency cost, while not great, is reasonable.
Cache `GlobalParams` rather than reading per request and panicking on unexpected errors [Dynamic Protocol State] Read-only interfaces of protocol state #4579 (comment)

For `EpochSetup` and `EpochCommit` events, we have a long list of consistency checks in the `RichProtocolStateEntry` constructor. I originally proposed to introduce the convention of `nil ≡ ZeroID` for `EpochSetup` and `EpochCommit`, but as Yurii pointed out, this would violate the "safety by default" approach. If we have time, let's revisit the `RichProtocolStateEntry` constructor and try to add convenience functions to reduce the amount of boilerplate code that we have to write for such sanity checks (an illustrative sketch follows below). Presumably, we want analogous sanity checks in other parts of the code, and a more compact way of implementing them would benefit code readability. For details, please see this PR discussion: [Dynamic Protocol State] `ProtocolStateEntry` refactoring #4721 (comment)

❌ Won't implement: sanity checks are reasonably consolidated. While they add quite a lot of lines, their explicitness and easy readability are benefits.
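For illustration only (the team ultimately decided the explicit checks are fine), here is a hedged sketch of the kind of convenience helper that was discussed; all names are hypothetical:

```go
// Package protocolstate sketches a compact sanity-check helper.
package protocolstate

import "fmt"

// checkConsistent runs a list of sanity checks and returns the first
// failure, letting a constructor replace repeated if-blocks with a
// compact, declarative list of checks.
func checkConsistent(checks ...func() error) error {
	for _, check := range checks {
		if err := check(); err != nil {
			return fmt.Errorf("inconsistent protocol state entry: %w", err)
		}
	}
	return nil
}

// Hypothetical usage inside a constructor:
//
//	err := checkConsistent(
//		func() error { /* verify EpochSetup counter */ return nil },
//		func() error { /* verify EpochCommit matches the setup */ return nil },
//	)
```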
My gut feeling is that we should try to remove `EpochStatus`, as this is now superseded by `ProtocolStateEntry`. I think our understanding of the implementation backing the `InitialProtocolState` and `DynamicProtocolState` has evolved. Specifically, the `RichProtocolStateEntry` is already available as part of the API, which contains the Epoch Setup and Commit events for the current, next, and previous epochs (if available).

☝️ implemented in [Dynamic Protocol State] Remove `EpochStatus` #5089

I am not sure how much work the following refactoring would be. I think we should only invest a limited amount of time, maybe a week, to get this done (if it takes significantly more time, I think we should leave the code as is).
- Don't expose the `RichProtocolStateEntry` struct through the interface. Instead, I think we should utilize an interface that exposes all relevant data through methods. We already have an interface for all the Epoch information: `EpochQuery`.
- The `InitialProtocolState` interface could be completely implemented by `RichProtocolStateEntry` (removing the need for the `InitialProtocolStateAdapter` wrapper struct).
- The `DynamicProtocolState` could also be completely implemented by `RichProtocolStateEntry` (in most cases, I think we have `GlobalParams` available anyway). Thereby we would remove the need for `DynamicProtocolStateAdapter`, which would further simplify the code.
- `protocol.State` and the `protocol.ProtocolState`
Code structure and clarity
Epoch setup event should only provide `IdentitySkeleton`s

☝️ implemented in [Dynamic Protocol State] Changing structure of participants in `EpochSetup` #4726

I feel it would be most consistent if all initial identities were represented only by `IdentitySkeleton`s. Thereby, we have a clear type distinction between the full `Identity`, which can change from block to block, and the `IdentitySkeleton`, which is fixed for the Epoch.

In particular for the `Cluster`, I would suggest to have `Members()` return an `IdentitySkeletonList`. I hope this wouldn't be a huge refactor. Though, we would probably need to take a closer look at `Cluster.initialClusterMembers`, because the `committees.Cluster` constructs the cluster members at the root block from `initialClusterMembers`:

flow-go/consensus/hotstuff/committees/cluster_committee.go, lines 83 to 89 in c9087fc
`Cluster` is a good place to implement this convention of how we translate an `IdentitySkeleton` to a full `Identity`; a sketch of the idea follows below.
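A minimal sketch of this skeleton-to-identity translation, using simplified stand-ins for the flow-go types (the real `flow.Identity` combines an epoch-static and a dynamic part, but the exact fields and defaults here are assumptions):

```go
// Package flow sketches deriving a full Identity from an IdentitySkeleton.
package flow

// IdentitySkeleton holds the identity attributes that are fixed for an epoch.
type IdentitySkeleton struct {
	NodeID        [32]byte
	InitialWeight uint64
	// networking address, staking/networking keys, role ... elided
}

// DynamicIdentity holds the attributes that can change from block to block.
type DynamicIdentity struct {
	Weight  uint64
	Ejected bool
}

// Identity is the full identity: epoch-static skeleton plus dynamic part.
type Identity struct {
	IdentitySkeleton
	DynamicIdentity
}

// NewEpochIdentity instantiates the full Identity at the start of an epoch:
// the dynamic weight defaults to the initial weight, and no node starts ejected.
func NewEpochIdentity(skel IdentitySkeleton) Identity {
	return Identity{
		IdentitySkeleton: skel,
		DynamicIdentity:  DynamicIdentity{Weight: skel.InitialWeight, Ejected: false},
	}
}
```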
at the very beginning of an epoch.☝️ implemented in [Dynamic Protocol State] Changing structure of participants in
EpochSetup
#4726From my perspective, we are approaching a complexity / maintainability challenge w.r.t. the protocol state. The protocol state has been complex for a long time and we have iteratively refactored it to manage its growing set of functionality. With the addition of the dynamic protocol state, I feel we are at the point again where we should be thinking about further modularization and separation of concerns.
Areas I would suggest to consider (a sketch of the orchestration idea follows after this list):
- `committee.Cluster` and `committee.Consensus` live on top of `protocol.State` (👉 code). As a result, some logic for managing the Identity Table is in `protocol.State` while other logic is in the committees, and we convert back and forth between full `Identity`s and `IdentitySkeleton`s (👉 code), which I feel is prone to inconsistent code modifications.
- `committee.Cluster` and `committee.Consensus`: they manage their own portion of the state in the DB and provide mutator operations that return the necessary database operations.
- `protocol.State` would then become the high-level orchestration layer. On the retrieval end, the protocol state would compose the interfaces from the lower-level components listed above. When adding new blocks, the `protocol.State` would delegate to the respective logic of the lower-level components, collect the resulting database changes, and apply them all in one single transaction.

Not really sure if this is tractable; there will probably be a lot of challenges when we try to implement it. Nevertheless, I feel it would still be beneficial to explore this further, to at least get an idea of how complex it is and whether we can maybe incrementally work on this over time.
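To make the orchestration idea concrete, here is a hedged sketch under assumed interfaces (none of these names are claimed to exist in flow-go): lower-level components return deferred database operations, and the high-level layer applies them all in one transaction.

```go
// Package protocol sketches the proposed orchestration layer.
package protocol

// Transaction is a stand-in for a database transaction handle.
type Transaction interface{}

// DeferredDBUpdate is a database operation produced by a component, to be
// executed later inside the orchestrator's single transaction.
type DeferredDBUpdate func(tx Transaction) error

// StateComponent owns one portion of the protocol state (e.g. a committee).
type StateComponent interface {
	// ExtendState validates the component's view of the new block and
	// returns the database updates required to persist its state changes.
	ExtendState(blockID [32]byte) ([]DeferredDBUpdate, error)
}

// extendBlock shows the orchestration: collect the deferred updates from all
// components, then apply them atomically in a single transaction.
func extendBlock(blockID [32]byte, components []StateComponent, runInTx func(func(Transaction) error) error) error {
	var updates []DeferredDBUpdate
	for _, c := range components {
		ops, err := c.ExtendState(blockID)
		if err != nil {
			return err
		}
		updates = append(updates, ops...)
	}
	return runInTx(func(tx Transaction) error {
		for _, apply := range updates {
			if err := apply(tx); err != nil {
				return err
			}
		}
		return nil
	})
}
```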
Currently, there is a lot of ambiguity in the naming of `protocol.State`, `protocol.Snapshot`, `protocol.ProtocolState`, `protocol.InitialProtocolState`, `protocol.DynamicProtocolState` (see Jordan's PR comment).
- Rename `DynamicProtocolState` to `DynamicProtocolSnapshot`.
- Consolidate `protocol.InitialProtocolState` with the `protocol.Epoch` APIs. Currently there is a lot of overlap.
- `RichProtocolStateEntry`: we are exposing the `Entry` in the protocol state API to access low-level data. This is handy in a few cases, such as bootstrapping, but also dangerous, as mentioned here. Consider removing it from higher-level APIs. This should be handled after having a working solution, since it's hard to say how to perform certain actions that depend on the existence of `Entry`.
.Potentially obsolete comments and suggestions (for completeness)
This is based on (👉 Jordan's PR comment)
Comment:
I think the duplication issue will be addressed in the ToDo further above:
regarding the suggestion to 'track all current identities together':
As discussed in [Proposal] Detailed specification for framework functionality to support future slashing, I believe it is fundamentally necessary to track the identity table for the next and the current epoch separately (see section 'necessity of deferred operations'), which in turn necessitates data duplication.
The concepts of `PreviousEpoch`, `CurrentEpoch`, and `NextEpoch` in the `ProtocolStateEntry` could potentially lead to misunderstandings resulting in subtle security vulnerabilities. Specifically, they need to be updated at the Epoch transition, despite the information technically still being the same. Furthermore, they are imprecise compared to identifying an epoch uniquely via its counter, because `previous`, `current`, `next` are defined with respect to a changing reference frame.

Todo:
- Identify Epochs solely by their epoch counter.
- A `ProtocolStateEntry` would then be an ordered pair for two adjacent epochs (N, N+1), where the earlier Epoch can be `nil` if and only if N+1 is the first epoch of the spork / genesis.
- Similarly, `RichProtocolStateEntry` would also be an ordered pair for two adjacent epochs (N, N+1).

Notes:
With this change, we are removing the previous epoch from the `ProtocolStateEntry`. This is intended, for the following reason: at the beginning of the Epoch (specifically, all the way throughout the staking phase), the pair (N-1, N) will contain information about the last epoch (N-1). With the epoch setup event, the protocol state changes to carrying the pair (N, N+1). Therefore, any light client that trusts a block of Epoch N can trustlessly retrieve information about the prior epoch and the next epoch by querying two different block headers (+ a short sequence of descendants to prove finality). I think the additional networking and CPU consumption is negligible, while the software simplification is notable.

We decided not to do this, for the following reason: the conceptual status of a node changes at epoch boundaries for nodes joining or leaving. Hence, the conceptual complexity remains. We chose to work with a conceptually more intuitive model at the code level, where the identity table changes at epoch switchover. While this complicates the safety and consistency proofs, it moves this complexity to a formal (mathematical) level outside of the code, where we have the drastically more powerful tool of formal proofs at our disposal to guarantee correctness.