Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dynamic Protocol State] Remaining work and ToDos #4649

Closed
12 of 19 tasks
AlexHentschel opened this issue Aug 22, 2023 · 2 comments
Closed
12 of 19 tasks

[Dynamic Protocol State] Remaining work and ToDos #4649

AlexHentschel opened this issue Aug 22, 2023 · 2 comments
Assignees
Labels

Comments

@AlexHentschel
Copy link
Member

AlexHentschel commented Aug 22, 2023

High priority

  • If the protocol state protocol.StateUpdater reports that it encountered an invalid state transition (InvalidServiceEventError), we currently just set the StateUpdater.InvalidStateTransitionAttempted flag to true and proceed as usual:

    if protocol.IsInvalidServiceEventError(err) {
    // we have observed an invalid service event, which triggers epoch fallback mode
    updater.SetInvalidStateTransitionAttempted()
    return dbUpdates, nil
    }

    However, we don't know that the updater's internal state is a valid Protocol State. There is no atomicity requirement on the StateUpdater as far as I can tell, in that it either applies the update entirely or not at all. For all we know, it could be complete garbage. Nevertheless, we just build whatever (potentially broken) interim state that we aborted the update at.

    Suggestion:

    • I think it it is ok to treat all protocol state operations mutations in a block as one atomic state transition. It either succeeds in its entirety or in case of a failure, there is no update to the protocol state (except setting InvalidStateTransitionAttempted to true).
    • I think it is straight forward to extend the protocol state Updater to "drop all modifications if InvalidStateTransitionAttempted" and document this properly throughout the code base. In a nutshell, within the Build method I would check the state.InvalidStateTransitionAttempted first; in that case we could just return a copy of the parentState with only the state.InvalidStateTransitionAttempted set to true.
    • While the code changes are relatively small, we need to properly test this and add detailed documentation to the Updater, the corresponding interface, and ideally also the places where we call the Updater

    Further thoughts:

    • If the updater is signalling that it encountered an invalid state transition via an InvalidStateTransitionAttempted, should the updater maybe already set its own InvalidStateTransitionAttempted flag? We would still raise the error to signal the failure to the caller. However, the caller would no longer be required to explicitly call updater.SetInvalidStateTransitionAttempted() (I'd leave that method, to allow Updater external logic to set this flag for other reasons).

    (for further details, see this PR comment)

    ☝️ implemented in [Dynamic Protocol State] Dedicated protocol state machine for epoch fallback #4931

  • Explicitly encode a node's epoch lifecycle status (e.g. joining, active, leaving) in the protocol state.

    Currently, we detect that a node is authorized to be part of the network but not to actively contribute to extending the block with the following conditional:

    registeredButNotAuthorizedForCurrentEpoch := identity.Weight == 0 && !identity.Ejected
    

    Context:

    • There are two reasons for node X to have 0 weight. Either (1) X has been ejected, or (2) X is authorized for a past or upcoming epoch, but not the current one. We detect (2) implicitly at the moment.
    • Currently, we maintain a dynamic weight but only differentiate between zero weight (not active) and positive (active). It is not clear whether the protocol would benefit from dynamically changing trust weight in the future. Reasoning:
      • For consensus participants (consensus nodes, collector nodes), we anyway use their initial weight. The practical benefits are massive, because this enables light clients that don't need to download and locally check every block header.

        Leader selection is also pre-determined for an entire epoch based on initial weight.

      • For other Verification [VNs] and Execution [ENs] nodes we currently don't have meaningful ways for weight-based load balancing. ENs anyway need to execute every block and our chunk-assignment algorithm generates uniform load across all VNs. It is not clear whether weighted protocol variants even exist (and even if such algorithms exist [open research topic], a large amount of complicated software changes would likely be needed).

    Therefore, we conclude that at the moment only a binary notion of weight (zero vs positive) is used and we don't believe the protocol will benefit from a more fine-grained differentiation in the foreseeable future.

    Further reasoning:

    • For clarity, we could will replace the dynamic weight by an enum representing the participation status with possible values of active [positive weight] and joining, leaving [zero weight]
    • Intuitively, it seems reasonable to include the Ejected status just as one possible value of the this enum. If we kept the participation status and the Ejected flag separately, we could easily differentiate between (joining, ejected) and (leaving, ejected). However, I don't think this differentiation is conceptually useful. If a node is ejected, we should just drop it from the protocol state at the next possible occasion. Whatever state it had before (joining, active, leaving) is no longer relevant for the core protocol layer (For the system smart contracts, that might be different)

    Suggested revision:

    • replace the dynamic weight by an enum representing the participation status with possible values of active [positive weight] and joining, leaving, ejected

    For further context, this todo is based on the PR comments:

    ☝️ implemented in [Dynamic Protocol State] Replace dynamic weight and ejection flag with 'epoch participation status' #5039

  • At the moment, the protocol_state.Updater returns InvalidServiceEventError for obviously broken inputs. However, lets take a full pass over the implementation to make sure we correctly differentiate between internal exceptions and external errors in all cases.

    Context

    • In a sense, all protocol-state-changing events should be considered as external inputs. Those inputs can be broken (though they should not be because they passed verification).

    • I think the updater should clearly distinguish between

      1. an input event representing an invalid state transition (sentinel error)
      2. a failed sanity check or consistency requirement indicating a corrupted protocol state

      The node internally only maintains protocol state snapshots -- should anything be wrong with them, then it is an exception. In contrast, the blocks essentially only represent state transitions. Should something not match up with the state transition, something is wrong with the block or the process generating the state transitions that are recorded in the block (e.g. epoch smart contract).

    ☝️ implemented in [Dynamic Protocol State] Dedicated protocol state machine for epoch fallback #4931

  • Take a fresh look at the InvalidStateTransitionAttempted and its relation to the node-internal EpochEmergencyFallbackTriggered event and the related flag in InstanceParams. See PR comment here. In context of this discussion, we will probably also have to consider the option for leaving epoch fallback mode as well.

    👉 moved to separate issue [Dynamic Protocol State] passing the epoch commitment deadline without an EpochCommit event #5104

Medium priority

  • Review precise conditions under which an epoch transition occurs and how that interacts with epoch emergency fallback. See Jordan's comment

    ☝️ @jordanschalm: This has changed again since the original comment. We do not explicitly check that SetupID != nil, but we do check in both cases that we are in the EpochCommitted phase before transitioning to the next epoch (which we should). I don't think there is a need to change anything here any more.

    Update: we concluded that this sanity check might be helpful in case the node has corrupted its own state. However, there are multiple similar checks already in place. So we decided to not overly bloat the code with sanity checks that are most likely never needed.

  • following Jordan's comment on PR #4559, rename ProtocolStateEntry to IdentityTableSnapshot (🎏 does it contain more than the identity table?)

    ☝️ @jordanschalm: Now, it does contain more than the identity table.

  • This is based on Jordan's comment on PR #4559 on the data structure ProtocolStateEntry, which is mixing conceptually different approaches at the moment:

    1. Links to state-change events, from which can be derived the core useful piece of data (the Identity Table)
    2. The Identity Table itself

    In the following, we are going to refer to the ProtocolStateEntry as IdentityTableSnapshot (i.e. assume the previous ToDo is already implemented)

    Concerns are:

    I'm wondering how this approach will scale when we have more different state-change events. Would we need to put them all in here? Ideally, after a state-change event is processed, we store only the resulting IdentityTable.

    Conceptually, the current approach could become confusing and error-prone when we add slashing/ejection events, which modify one node and leave the rest unchanged (maintenance and extendability concern). Furthermore, we need to separate the state transition logic anyway to implement construction and verification of Identity Table changes in consensus nodes (see issue https://github.com/dapperlabs/flow-go/issues/5517).

    Proposal:

    • Implement a separate state machine $\mathcal{S}$ for evolving the protocol state.

    • Conceptually, each consensus node has knowledge of the parent block's $IdentityTableSnapshot_{N-1}$ (that takes effect, after the QC in the child block confirms its validity, so specifically for the child block)

    • The child block $N$ might contain state changing operations $\Delta_N$ (example: $\Delta_N$ might be an EpochSetup event, ejection request, slashing ...)

    • the proposer of the child block will include the hash of the resulting $IdentityTableSnapshot_{N}$, which is a commitment to the resulting Identity Table after the block $N$ is processed. Both primary as well as consensus replicas compute / validate $IdentityTableSnapshot_{N}$ by applying the state machine $\mathcal{S}$:

      $IdentityTableSnapshot_{N} = \mathcal{S}(IdentityTableSnapshot_{N-1}, \Delta_N, view)$

      where $view$ is the view of the block containing $\Delta_N$ (I think this will be very helpful for consistency checks)

  • Extend the unit tests for RichProtocolStateEntry by also testing scenarios:

    • constructing RichProtocolStateEntry for epoch setup phase where no prior epoch exist (i.e. first epoch setup phase after spork) 👉 ToDo
    • constructing RichProtocolStateEntry for epoch commit phase where no prior epoch exist (i.e. first epoch commit phase after spork) 👉 ToDo

    ☝️ implemented in [Dynamic Protocol State] TODOs and refactoring, part 2 #5080

  • For the ProtocolState storage layer abstraction I would suggest to also include a cache for the secondary index (retrieving IdentityTable by block ID) 👉 PR comment

    ☝️ implemented in [Dynamic Protocol State] TODOs and refactoring, part 2 #5080

Low priority

  • I feel the method Members of protocol.Cluster should return IdentitySkeletons (👉 code) We are doing multiple conversions to IdentitySkeleton in code using protocol.Cluster, which I feel indicate a design inconsistency.

    ☝️ implemented in [Dynamic Protocol State] Changing structure of participants in EpochSetup  #4726

  • For the ProtocolState storage layer abstraction, review possibility for 'caching the RichProtocolStateEntry on store' 👉 PR comment

    ❌ Won't implement: On the happy path, the Protocol State changes rarely throughout the epoch. The protocol only requires 3 stores per epoch. On the one hand, there is a non-negligible overhead for the data base read on the first retrieve. On the other hand, this is extremely rare and the latency cost is not great but reasonable.

  • Cache GlobalParams rather than reading per request and panicking on unexpected errors [Dynamic Protocol State] Read-only interfaces of protocol state #4579 (comment)

  • For EpochSetup and EpochCommit events, we have a long list of consistency checks in the RichProtocolStateEntry constructor

    I originally proposed to introduce the convention of nil ≡ ZeroID for EpochSetup and EpochCommit, but as Yurii pointed out, this would be violating "safety by default" approach:

    some values should always be not nil, otherwise it will crash during the development, we lose this property if we implement ZeroID for nil epoch setup

    If we have time, lets revisit the RichProtocolStateEntry constructor and try to add convenience functions to reduce the amount of boilerplate code that we have to write for such sanity checks. Presumably, we want to have analogous sanity checks in other parts of the code and a more compact way of implementing them would benefit code readability.

    For details, please see this PR discussion: [Dynamic Protocol State] ProtocolStateEntry refactoring #4721 (comment)

    ❌ Won't implement: sanity checks are reasonably consolidated. While they add quite a lot of lines, their explicitness and easy readability are benefits.

  • My gut feeling is that we should try to remove EpochStatus as this is now superseded by ProtocolStateEntry. I think our understanding of implementation backing the InitialProtocolState and DynamicProtocolState has evolved. Specifically, the RichProtocolStateEntry is already available as part of the API, which contains Epoch Setup and Commit events for the current, next and previous epochs (if available).

    ☝️ implemented in [Dynamic Protocol State] Remove EpochStatus #5089

    I am not sure how much work the following refactoring would be. I think we should only invest a limited amount of time, maybe a week to get this done (if it takes significantly more time, I think we should leave the code as is).

    • I would be inclined to not expose the RichProtocolStateEntry struct through the interface. Instead, I think we should utilize an interface that exposes all relevant data through methods. We already have an interface for all the Epoch information: EpochQuery
    • The InitialProtocolState interface could be completely implemented by RichProtocolStateEntry (removing the need for the InitialProtocolStateAdapter wrapper struct)
    • similarly, the DynamicProtocolState could also be completely implemented by RichProtocolStateEntry (in most cases I think we have GlobalParams anyway available). Thereby we would remove the need for DynamicProtocolStateAdapter, which would further simplify the code.
    • I feel this would be one step towards consolidating the protocol.State and the protocol.ProtocolState

Code structure and clarity

  • Epoch setup event should only provide IdentitySkeletons
    ☝️ implemented in [Dynamic Protocol State] Changing structure of participants in EpochSetup  #4726

  • I feel it would be the most consistent if all initial identities would only be represented by IdentitySkeletons. Thereby, we have a clear type distinction between full Identity that can change from block to block and IdentitySkeleton that is fixed for the the Epoch.

  • From my perspective, we are approaching a complexity / maintainability challenge w.r.t. the protocol state. The protocol state has been complex for a long time and we have iteratively refactored it to manage its growing set of functionality. With the addition of the dynamic protocol state, I feel we are at the point again where we should be thinking about further modularization and separation of concerns.

    Areas I would suggest to consider:

    • The the moment, committee.Cluster and committee.Consensus live on top of protocol.State (👉 code). As a result, some logic for managing the Identity Table is in protocol.State while other logic is in the committees. As a result, we convert forth and back between full Identities and IdentitySkeletons (👉 code), which I feel is prone to inconsistent code modifications.
    • I think it would be beneficial to strongly separate the logic on a lower level. We introduce largely independent components for
      1. the Identity Table, including the committee.Cluster and committee.Consensus. They manage their own portion of the state in the DB and provide mutator operations that return the necessary data base operations
      2. The Epoch information
      3. everything related to indexing blocks and their processing status (finality, sealing etc)
    • The protocol.State would then become the high-level orchestration layer. On the retrieval end, the protocol state would compose the interfaces from the lower-level components listed above. When adding new blocks, the protocol.State would delegate to the respective logic of the lower-level components, collect the resulting data base changes and apply them all in one single transaction.

    Dynamic Protocol State

    Not really sure if this is tractable; there will probably be a lot of challenges when we try to implement. Nevertheless, I feel it would still be beneficial to explore this further to at least get an idea how complex it is and whether we can maybe incrementally work on this over time.

  • Currently, there is a lot of ambiguity in the naming of protocol.State, protocol.Snapshot, protocol.ProtocolState, protocol.InitialProtocolState, protocol.DynamicProtocolState (see Jordan's PR comment)

    • Rename DynamicProtocolState to DynamicProtocolSnapshot
    • Consolidate protocol.InitialProtocolState with protocol.Epoch APIs. Currently there is a lot of overlap
    • ~As mentioned here consider implementing interfaces for accessing dynamic protocol state using RichProtocolStateEntry. ~
  • We are exposing Entry in protocol state API to access low-level data, this is handy in few cases such as bootstrapping, but also dangerous as mentioned here. Consider removing it from higher level APIs. This should be handled after having a working solution since it's hard to say how to perform certain actions that depend on existence of Entry.

Potentially obsolete comments and suggestions (for completeness)

  • This is based on (👉 Jordan's PR comment)

    ProtocolStateEntry.NextEpochProtocolState contains a lot of duplicated data from ProtocolStateEntry. What if we replaced the field NextEpochProtocolState with:

    • a field, NextEpochEventIDs
    • a field, NextEpochIdentities
    • a method, NextEpochProtocolState(), which aggregates the next epoch protocol state

    Separately, if we use an explicit state enum (suggestion) rather than using zero-weight, we could always track all current identities together.

    Comment:

    • I think the duplication issue will be addressed in the ToDo further above:

      ProtocolStateEntry would then be an ordered pair for two adjacent epochs (N, N+1), where the earlier Epoch can be nil if and only if N+1 is the first epoch of the spork / genesis.

    • regarding the suggestion to 'track all current identities together':

      As discussed in [Proposal] Detailed specification for framework functionality to support future slashing, I believe is is fundamentally necessary to track the identity table for the next and the current epoch separately (see section 'necessity of deferred operations'), which in turn necessitates data duplication.

  • The concepts of PreviousEpoch and CurrentEpoch and NextEpoch in the ProtocolStateEntry could potentially lead to misunderstandings resulting in subtle security vulnerabilities. Specifically, they need to be updated at the Epoch transition, despite the information technically still being the same. Furthermore, they are imprecise compared to identifying an epoch uniquely via its counter, because previous, current, next are defined with respect to ca changing reference frame.

    Todo:

    • identify Epochs solely by their epoch counter
    • a ProtocolStateEntry would then be an ordered pair for two adjacent epochs (N, N+1), where the earlier Epoch can be nil if and only if N+1 is the first epoch of the spork / genesis.
    • similarly RichProtocolStateEntry, would also be an ordered pair for two adjacent epochs (N, N+1)

    Notes:

    • With this change, we are removing the previous epoch from the ProtocolStateEntry.
    • This is intended for the following reason: At the beginning of the Epoch (specifically all the way throughout the staking phase), the pair (N-1, N) will contain information about the last epoch (N). With the epoch setup event, the protocol state changes to carrying the pair (N, N+1). Therefore, any light client that is trusting a block of Epoch N can retrieve trustlessly information about the prior epoch and the next epoch by querying two different block header (+ short sequence of descendants to prove finality). I think the additional networking and cpu consumption is negligible while the software simplification is notable.

    We decided to not do this, for the following reason. The conceptual status of a node changes at epoch boundaries for nodes joining or leaving. Hence, the conceptual complexity remains. We chose to work with a conceptually more intuitive model at the code level, where the identity table changes at epoch switchover. While this complicates the safety and consistency proofs, it move this complexity to a formal (mathematical) level outside of the code, where we have the drastically more powerful tool of formal proofs at our disposal to guarantee correctness.

@AlexHentschel
Copy link
Member Author

Adding

suggestion for naming convention

more details are here in the appendix of notion doc [Proposal] Detailed specification for framework functionality to support future slashing

  • The current DynamicProtocolState (snapshot of the protocol state at a specific block) could be named as

    type ProtocolStateSnapshot struct {
     EpochStaticProtocolState
     DynamicProtocolState
    }
  • Similarly, the Identity would decompose into

    type Identity struct { 
     EpochStaticIdentity
     DynamicIdentity
    }

In both APIs, the sub-APIs are mutually exclusive. Hence, we would need to change the current API:

@AlexHentschel
Copy link
Member Author

All items are either implemented, a PR is up, or we decided that we don't want to implement the item.
Only exception is last High-Prio point, which has ben separated out into its own issue #5104. This work is only needed for EFM Recovery

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants