| status | flip | authors | sponsor | updated |
|---|---|---|---|---|
| draft | 298 | Alex Hentschel (alex.hentschel@flowfoundation.org) | Jordan Schalm (jordan.schalm@flowfoundation.org), Leo Zhang (leo.zhang@flowfoundation.org), Janez Podhostnik (janez.podhostnik@flowfoundation.org) | 2024-11-01 |
FLIP 296: Utilize Dynamic Protocol State for Version Beacon (coordinating upgrades of the Execution Stack)
For a few years, we have been using a mechanism for upgrading Flow's execution stack. In April 2024, we finished the implementation of the Dynamic Protocol State, which provides a significantly more robust and generally applicable framework for coordinating upgrades (for details, please see the next section).
This FLIP focuses on the following short-term objective, but there is also the goal to foster our long-term development roadmap:
- Direct goal [focus of this FLIP]: transition the existing mechanism for upgrading Flow's execution stack to using Dynamic Protocol State. A more detailed scoping is necessary, but my gut feeling is that there is only a limited amount of work needed. In essence, we want to avoid extending the old upgrade mechanism in a way that is incompatible with the Dynamic Protocol State and would need to be rewritten later.
- Midterm goal: upgrades of the execution stack are by far the most frequent. By utilizing the Dynamic Protocol State for those upgrades, we want to generate learnings on where the new framework should be improved.
- Midterm goal: we are starting to use the Dynamic Protocol State for coordinating upgrades of other parts of the protocol. We want to utilize the same framework as for execution stack upgrades, to reduce code and intellectual complexity that would otherwise result from using significantly different approaches.
- Longer-term goal: Access Nodes [ANs] can decide for which blocks' execution states they can run scripts, across different versions of the execution stack.
A blockchain is a distributed system without a central authority controlling the underlying IT infrastructure. Hence, scenarios must be considered where nodes run different software, node operators potentially don't update in time, and it is very difficult and time-intensive to coordinate upgrades via means outside the protocol. Hence, we desire that the protocol itself provides mechanisms to coordinate and enforce behaviour upgrades of honest nodes whose operators are responsive on reasonable time scales.
Here, we largely focus on the process by which the protocol specifies how a certain component should behave, i.e. the component specification. Over time, the Flow protocol evolves and the specification for some components changes. Hence, we version component specifications for ease of referencing them.
At the moment, we have a mechanism for upgrading Flow's execution stack only. This is by far the most frequently updated part of the node software and updates are frequently time-sensitive security fixes. We have a component called the Version Beacon, which specifies how the implementation should behave and when that "expected behaviour" is supposed to change. In a nutshell, the Version Beacon for the execution stack currently specifies a block height and a "version" for the behaviour that should become active when nodes (that are concerned with execution, i.e. ENs, VNs, ANs) reach this height. The updates themselves are triggered based on block height, and are therefore called Height-Coordinated Upgrades [HCUs].
Current Version Beacon:
- focused solely on the software stack for execution, where many aspects would need to be re-implemented if we wanted to apply the same upgrade process to other parts of the protocol

  Better: some framework that unifies and encapsulates common functionality for broad protocol upgrades (including but not limited to the execution stack)
- requires that nodes have a (potentially long) history (they must have seen the version beacon service event, which is not guaranteed for nodes joining at epoch boundaries)

  Better: each block specifies which component version is to be used for processing it
- based on height and hence not usable for upgrading most protocol-related aspects (such as consensus, processing of information beyond transaction execution)

  Better: using View instead of height for triggering behaviour changes is generally applicable and more robust (view monotonically increases over time, while height might also decrease).
💡 In a nutshell, the Protocol State tracks information about each block, including a mechanism to transfer information from the Execution state to the Protocol State in a BFT manner.
- Flow’s Protocol State tracks and provides simple access to information about each block (such as epoch number, staking phase, staked nodes allowed to participate as of this block, nodes' public keys, etc.) 👉code.
- The Protocol State now also tracks the Component Versions of the most critical consensus component (at the moment: its own version) 👉code.
☑️ The Protocol State already tracks its own Component Version. You can take a look at these places in the code:
- Protocol State reports its own version as part of every block
- mechanism for scheduling version upgrades (at future view) exists
- mechanism enforcing that node supports and uses correct version as specified by the protocol
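The three mechanisms listed above can be pictured with a minimal Go sketch. All names (`VersionState`, `PendingUpgrade`, `Evolve`, `CheckSupported`) are hypothetical and chosen for illustration only; they are not the identifiers used in the Flow code base. The sketch assumes that every block carries a version snapshot and that a scheduled upgrade takes effect at the first block whose view reaches the activation view.

```go
package main

import "fmt"

// PendingUpgrade is a hypothetical record of a scheduled version change.
type PendingUpgrade struct {
	Version        uint64 // version that becomes active
	ActivationView uint64 // first view at which the new version applies
}

// VersionState is a hypothetical per-block snapshot: every block reports
// the active version and, optionally, one scheduled upgrade.
type VersionState struct {
	Version uint64
	Pending *PendingUpgrade
}

// Evolve derives the snapshot of a child block with the given view from the
// parent's snapshot. The scheduled upgrade is applied once its view is reached.
func (s VersionState) Evolve(childView uint64) VersionState {
	if s.Pending != nil && childView >= s.Pending.ActivationView {
		return VersionState{Version: s.Pending.Version}
	}
	return s
}

// CheckSupported returns an error if the node's software does not implement
// the version that the protocol prescribes for this block.
func CheckSupported(s VersionState, supported []uint64) error {
	for _, v := range supported {
		if v == s.Version {
			return nil
		}
	}
	return fmt.Errorf("version %d is not supported by this node", s.Version)
}

func main() {
	state := VersionState{Version: 1, Pending: &PendingUpgrade{Version: 2, ActivationView: 1000}}
	// Views are not necessarily consecutive, hence the jump to 1003.
	for _, view := range []uint64{998, 999, 1000, 1003} {
		state = state.Evolve(view)
		fmt.Println(view, state.Version, CheckSupported(state, []uint64{1}))
	}
}
```

Because the snapshot travels with each block, a node in this sketch needs no historical events to learn which version applies.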
Upgrades without significant downtime are very important:
- For large-scale adoption and a good user and developer experience, the Flow platform must be reliably available.
- However, we have the need to ship security fixes and evolve Flow, which requires software upgrades. The more frequently we can deploy upgrades, the less risk there is for unforeseen problems caused by upgrades. Furthermore, frequent code deployments generally reduce engineering efforts.
See blog post [1] for further details:
Software Version - The version identifier of a binary distribution of Flow Node software. By convention, we use a semver-ish tag in Git and Docker releases.
Software has bugs and is frequently incomplete (e.g. API returning ‘not yet implemented’). The software version is a meaningful reference to describe what the software does in the real world.
However, we also desire a compact identifier [which we will call the ‘Component Version’] of how a Flow node should behave.
Component Version: version identifier for a component of the Flow protocol. It references one specific behaviour of a sub-system (e.g. Execution Stack or HotStuff) of Flow, as prescribed by the protocol.
In a nutshell, for every block there is one and only one correct way to process that block and to evolve the execution state.
For distributed BFT systems, we need this notion of ‘correct behaviour’, which is inherently implementation agnostic.
We want to explicitly express that up to a certain view
- Conceptually, for every block, each component of Flow has one and only one component version.
- A software version can implement multiple Component Versions. E.g. AN supporting script execution across HCU boundaries
❗Don’t couple the software version to the component version! We know there will be scenarios where we want one software to implement multiple Component Versions and at that point, any one-to-one coupling of Software and Component Version will necessarily break. Instead, for each software version, we conceptually have a list of Component Versions that this software supports (even if that list only contains a single element most of the time).
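The decoupling described above can be illustrated with a small Go sketch, in which a software release carries a list of the Component Versions it implements. The type names, the release tags and the use of a plain integer for the Component Version are assumptions made for this illustration.

```go
package main

import "fmt"

// ComponentVersion identifies a behaviour prescribed by the protocol.
// Using a plain integer is an assumption made for this sketch.
type ComponentVersion uint64

// SoftwareRelease is a hypothetical description of one binary distribution.
type SoftwareRelease struct {
	Tag       string             // software version, e.g. a Git or Docker tag
	Supported []ComponentVersion // behaviours that this binary implements
}

// Supports reports whether the release implements the given Component Version.
func (r SoftwareRelease) Supports(v ComponentVersion) bool {
	for _, s := range r.Supported {
		if s == v {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical tags. Most releases would list a single Component Version,
	// while an Access Node serving scripts across an upgrade could list several.
	single := SoftwareRelease{Tag: "v0.1.0", Supported: []ComponentVersion{8}}
	multi := SoftwareRelease{Tag: "v0.2.0", Supported: []ComponentVersion{8, 9}}
	fmt.Println(single.Tag, single.Supports(8), single.Supports(9))
	fmt.Println(multi.Tag, multi.Supports(8), multi.Supports(9))
}
```

The software tag never enters the support check, which is the point of the warning above: the tag describes the binary, the list describes the behaviours.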
Roadmap: Dynamic Protocol State for coordinating Execution Stack upgrades (including Cadence changes)
Biggest change (and possibly only significant change):
- Dynamic Protocol State should ingest the Version Beacon Service Event and track the Execution Stack’s Component Version
Ultimately, we want to solidify the concept of Component Versions in the protocol to make it easier to evolve the protocol (i.e. the notion of how components should behave). There exists a variety of possible versioning conventions that we could employ, each with its own tradeoffs between expressing granularity of version changes, compatibility between versions, and complexity.
At the moment, it is not sufficiently evident that there exists a single versioning convention that is easy and intuitive for all the different components in Flow. Therefore, we recommend that maintainers for a specific component decide what component versioning convention works well for their component's particular upgrade pattern.
In the following, we briefly review two prominent versioning conventions (Semantic Versioning and Integer Versioning) and summarize how these could be used for component versioning. Before we do so, let's look at the notion of downwards compatibility, because many versioning schemes encompass some notion of cross-compatibility, which is much more constrained for blockchains compared to traditional IT systems.
For example, assume we add trigonometric functions (
Let's look at the following example:
- We have announced that trigonometric functions will be available in Cadence as of view 1000 and later.
- Assume that somebody sends a transaction that computes $\texttt{sin}(\pi)$, which makes it into the block with view 999.
- Some nodes have already updated to the new Cadence version, which in principle can compute $\texttt{sin}(\pi)$, while other node operators haven't upgraded yet.
- For a blockchain, we need all honest nodes to produce the same result. As our transaction is in block 999, trigonometric functions should not be available yet and all honest nodes should fail the transaction.
- Specifically, this means that the new Cadence version should have two different modes: one with trigonometric functions disabled and one with them enabled.
So in summary, we need to precisely specify the feature set and the behaviour of each feature. The notion of "at least these features and optionally more" is insufficient for blockchains.
Example:
The most basic convention is to use an integer as the version for a particular component (specification). When using this convention, we only have a notion of breaking changes. For this discussion, it is important to keep in mind that we are versioning component specifications, i.e. how a particular component should behave. We say that two specifications are equivalent, if they are describing the same behaviour - in other words, they are completely interoperable. Equivalent/interoperable component specifications have the same integer version. However, whenever the specified behaviour changes or some new functionality is added (that the prior component specification did not include) we have a new version, because the two specifications are not entirely interoperable anymore. In this case, the Version number is incremented.
Applicability and Limitations
The Integer Versioning Scheme is simple and very intuitive and suits components where we mostly change existing behaviours. This is predominantly the case for the core protocol, where for example we say that consensus votes previously worked in one way and now work differently. For example, the Component Version of the Protocol State is specified by a single integer (👉code).
With the Integer Versioning Scheme we cannot express the notion of cross-compatibility. Nevertheless, Integer Versioning can still be applied to our example of extending Cadence with trigonometric functions. For example:
- The specification of Cadence without the trigonometric functions could be version `8`.
- Extended with trigonometric functions, we have the specification for Cadence `9`.
- Up to and including view 999, the protocol specifies that a block is to be executed with Cadence `8`:
  - The old Cadence implementation would report that it can execute blocks with Cadence versions `[8]` (slice containing only one version value). The FVM would tell the old Cadence implementation to execute the given block with version `8` and since this version is supported, it proceeds. Our transaction fails, because the old Cadence does not know trigonometric functions.
  - The new Cadence implementation would report that it can execute blocks with Cadence versions `[8, 9]`. The difference is that version `8` refuses to execute trigonometric functions while version `9` permits them. The FVM would tell the new Cadence implementation to execute the given block with version `8` and since this version is supported, it proceeds. Our transaction fails, because the new Cadence knows that in Cadence version `8` trigonometric functions were not yet included.
- For blocks with views 1000 and higher, the protocol specifies that those blocks are to be executed with Cadence `9`:
  - The old Cadence implementation reports that it can execute blocks with Cadence versions `[8]`. However, the FVM would request an execution with version `9`, which is not supported, so Cadence errors.
  - The new Cadence implementation reports that it can execute blocks with Cadence versions `[8, 9]` and since the FVM requests the block's execution with a version that is supported, the block execution proceeds.
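The walkthrough above can be sketched in Go. This is a simplified model under stated assumptions, not the actual FVM or Cadence API: `Interpreter`, `Execute` and `VersionForView` are hypothetical names, and a transaction is reduced to a single flag saying whether it calls a trigonometric function.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrUnsupportedVersion means the implementation cannot process the block at all.
	ErrUnsupportedVersion = errors.New("requested Cadence version is not supported")
	// ErrTrigUnavailable means the transaction fails, identically on every honest node.
	ErrTrigUnavailable = errors.New("trigonometric functions are not available in this version")
)

// Interpreter is a hypothetical stand-in for one Cadence implementation.
type Interpreter struct {
	Supported []uint64 // Component Versions reported by the implementation
}

// Execute runs a transaction under the version requested by the caller.
// usesTrig marks a transaction that calls a function such as sin.
func (i Interpreter) Execute(version uint64, usesTrig bool) error {
	supported := false
	for _, v := range i.Supported {
		if v == version {
			supported = true
		}
	}
	if !supported {
		return ErrUnsupportedVersion
	}
	// Version 8 has no trigonometric functions. An old implementation simply
	// lacks them, a new one must refuse them, so both fail the transaction.
	if usesTrig && version < 9 {
		return ErrTrigUnavailable
	}
	return nil
}

// VersionForView mirrors the example: Cadence 8 up to and including
// view 999, Cadence 9 from view 1000 onwards.
func VersionForView(view uint64) uint64 {
	if view >= 1000 {
		return 9
	}
	return 8
}

func main() {
	old := Interpreter{Supported: []uint64{8}}
	upgraded := Interpreter{Supported: []uint64{8, 9}}
	for _, view := range []uint64{999, 1000} {
		v := VersionForView(view)
		fmt.Println(view, "old:", old.Execute(v, true), "| upgraded:", upgraded.Execute(v, true))
	}
}
```

At view 999 both implementations return the same transaction failure, which is the property the protocol needs. At view 1000 only the old implementation reports that it cannot continue.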
A notion of compatibility is at the heart of Semantic Versioning [SemVer]. Though, as discussed before, the notion of downwards compatibility must be carefully analyzed and correctly applied to avoid problems in the blockchain space. Let's revisit our example from before of adding trigonometric functions to Cadence:
- Let's assume that the Cadence specification without trigonometric functions is denoted by version `1.8`. The spec extended by trigonometric functions is version `1.9`.
- We have announced that trigonometric functions will be available in Cadence as of view 1000 and later, but a transaction $T$ computing $\texttt{sin}(\pi)$ was already included in the block with view 999.
- When executing the block with view 999, the FVM would tell the new Cadence implementation to execute the block according to Cadence Component Spec version `1.8`, i.e. without trigonometric functions.
- As you can see from this example, the new Cadence Version should still be able to execute $T$ with trigonometric functions disabled. In other words, it is insufficient for an implementation of Cadence `1.9` to do everything that `1.8` can and some more - it must be able to restrict its functionality to `1.8`. Specifically, this additional requirement to restrict the functionality to a prior version goes beyond the established framework of SemVer, but it is critically required for Flow.
The advantage of SemVer is that it consolidates the concepts of Software Version and Component Version (to some degree): $$ \textnormal{Software Version :}\quad \underbrace{\texttt{major}\,.\,\texttt{minor}\,}_{\textnormal{Component Version}}.\,\texttt{patch} $$
- Software with identical $\texttt{major}.\texttt{minor}$ is cross-compatible irrespective of patch version, because they all implement the same specification (represented by the Component Version). Hence, the $\texttt{patch}$ version is not part of the Component Version, because it does not influence the conceptual behaviour. Though, it represents implementation details, so it is part of the Software Version.
- In addition, we introduce the compatibility requirement from semantic versioning:

  $\textnormal{Component Version :} \quad \texttt{major}\,.\,\texttt{minor}$
  - Protocol specifications with the same $\texttt{major}$ must be fully backwards compatible (in practice, mostly limited to additive functionality)
- Additional requirement beyond traditional SemVer:

  A component implementation supporting $\texttt{major}.\texttt{minor}$ must be able to restrict its feature set to any prior version $\texttt{major}.k$ with $k \leq \texttt{minor}$.
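The additional requirement can be written down as a small predicate. The following Go sketch uses hypothetical names (`SpecVersion`, `MustRestrictTo`) and only encodes the rule stated above: same $\texttt{major}$, and a target minor $k \leq \texttt{minor}$.

```go
package main

import "fmt"

// SpecVersion is a hypothetical major.minor Component Version.
type SpecVersion struct {
	Major uint64
	Minor uint64
}

// MustRestrictTo reports whether an implementation of spec impl is required
// to be able to behave exactly like spec target: the major versions must
// match and the target minor must not exceed the implemented minor.
func MustRestrictTo(impl, target SpecVersion) bool {
	return impl.Major == target.Major && target.Minor <= impl.Minor
}

func main() {
	impl := SpecVersion{Major: 1, Minor: 9}
	// Prior minor version: the implementation must be able to restrict itself.
	fmt.Println(MustRestrictTo(impl, SpecVersion{Major: 1, Minor: 8}))
	// Newer minor version: unknown to this implementation.
	fmt.Println(MustRestrictTo(impl, SpecVersion{Major: 1, Minor: 10}))
	// Different major version: no compatibility is promised.
	fmt.Println(MustRestrictTo(impl, SpecVersion{Major: 2, Minor: 0}))
}
```

Note that the predicate covers every $k$ from 0 up to the implemented minor, which shows how the obligation grows with each additional minor release.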
Applicability and Limitations
The benefits from backwards compatibility are strong in case we want to process inputs of mixed versions over a prolonged period of time.
Though, for blockchain, the protocol typically switches abruptly at a specific view from one behaviour to another and never changes back.
Furthermore, maintaining backwards compatibility throughout many
For Flow, controlling complexity and codifying compatibility is essential. The compatibility expectations, associated risks and additional complexity of SemVer may not be beneficial in various areas of the core protocol. Maintainers considering SemVer for their component version should provide a strong answer why the Integer Versioning Scheme is disadvantageous for their particular component.
I (AlexH) do not (yet) see significant benefits of SemVer over Integer Versioning for block execution. Specifically, I don't think it makes any difference whether the software says that it supports Cadence Specification `[8, 9]` (Integer Versioning) or `[1.8, 1.9]` (SemVer). The only hypothetical scenario would be, if we said that any Cadence implementation `[1.7, 1.8, 1.9]`, then there is little to no difference to just using integer versions, e.g. the new Cadence implementation specifying that it supports Component Versions `[7, 8, 9]`.
The only tangible benefit for SemVer I can see is for the Access Node. Here, a client could say that it wants to execute a script on the block with view 999. Transactions at this view were not allowed to use trigonometric functions. But the client may explicitly request a newer Cadence version for its script that already supports trigonometric functions. Nevertheless, this is a niche scenario with limited practical relevance. It is not clear whether this scenario warrants the additional complexity of SemVer with the associated correctness risks (SemVer assumes downwards compatibility by default, while maintaining downwards compatibility in the implementation is generally additional work, so the default assumption of compatibility induces additional risks for the happy path of block execution - not a good tradeoff in my opinion).
For components with frequent addition of entirely new features, we have discussed the importance of restricting feature sets to match prior versions in order to maintain downwards compatibility. At this point, it is intuitive to think about feature flags. For example, we could have a boolean flag
The following diagram illustrates that Feature Vectors could be represented via Integer versions as well. Here we specifically utilize that the prevalent application pattern for mainnet is that features are progressively added and enabled.
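As a rough illustration of this mapping, the Go sketch below derives a feature vector from an integer version, assuming that features are only ever added and enabled in a fixed order. The feature names and the function `FeatureVector` are hypothetical placeholders.

```go
package main

import "fmt"

// featureOrder lists hypothetical features in the order in which they were
// enabled on the network. Integer version n enables the first n entries.
var featureOrder = []string{"featureA", "featureB", "featureC"}

// FeatureVector returns the on/off flags implied by an integer version.
func FeatureVector(version int) (map[string]bool, error) {
	if version < 0 || version > len(featureOrder) {
		return nil, fmt.Errorf("unknown version %d", version)
	}
	flags := make(map[string]bool, len(featureOrder))
	for i, name := range featureOrder {
		flags[name] = i < version
	}
	return flags, nil
}

func main() {
	for v := 0; v <= len(featureOrder); v++ {
		flags, err := FeatureVector(v)
		fmt.Println(v, flags, err)
	}
}
```

With three flags, a free-form feature vector admits eight combinations, whereas the integer scheme in this sketch commits to only four of them, namely the ones that occur when features are enabled strictly in sequence.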
In my [AlexH] opinion, feature flags are an implementation property. If some software exposes three feature flags, e.g.
If we hypothetically expressed a Component Version via a feature vector (which I am advocating to avoid for most components), we would implicitly commit to allowing any combination of features to be enabled/disabled, which is a very difficult commitment to maintain on the implementation level.
I feel bug fixes bring versioning via Feature Vectors to its limits: once we learn that we need a bug fix, we would need to add a new feature flag. By exposing the bug fix as a feature toggle in the Dynamic Protocol State, we are essentially providing a switch to turn the bugfix on and off repeatedly at runtime. For some features this is probably very useful. However, does it make sense to (figuratively) install a switch in your control board with the sole purpose to turn it on once and never off again? Keep in mind that for every feature flag, we still need to touch the Protocol State, which is reasonably straightforward and contained engineering work, but not entirely free either. Furthermore, all that information needs to go through a smart contract and a specialized service event (this is already the case for the Execution Node's old Version Beacon), all of which would additionally need to support the new feature flag.
I think it would make the most sense in this scenario to bundle changes and roll them out together. In some cases it will be a single bug fix but in many cases it will be multiple changes combined, because software upgrades on mainnet still need to go through node operators on mainnet for decentralization reasons. Feature flags are still a great tool for the implementation to provide downwards compatibility (e.g. "switch off" bugfixes for older blocks). Nevertheless, I think the fact that behaviour changes are frequently bundled and form a strict chronological sequence is important.
- Use semantic versioning for areas where
  - the benefits from backwards compatibility outweigh the implementation and complexity cost
  - we want to maintain backwards compatibility over a longer period of time and across multiple upgrades
- In areas where we can’t easily provide backwards compatibility (e.g. for security or BFT reasons), we should make this explicit by using a single integer for the Component Version.
🚧 section needs further extension
💡Note that this approach is significantly different from Ethereum's. In Ethereum, it is common practice to hard-code the block heights / views at which features change directly into the implementation. For Flow, we have the Dynamic Protocol State as an interim abstraction layer:
- the protocol state translates view to version
- the component implementation utilizes the version to determine what features to enable / disable
Flow's approach is a major advancement over Ethereum's approach in the following regard:
- Ethereum assumes forward compatibility by default. In other words, unless the implementation knows that the protocol specification will change at a certain point in the future, it assumes that everything stays the same. So nodes whose software is not upgraded don't recognize by themselves that they are incompatible with a future version. Hence, it requires manual human action (software upgrade) for the nodes to behave correctly and in the absence of human intervention, nodes will behave incorrectly (continue with the old version). Therefore, it is hard for the Ethereum protocol to evolve, because the default is to continue with the current functionality. Ethereum addressed this challenge with the "Difficulty Bomb", an arguably crude add-on to the protocol with limited effectiveness.
- In contrast, Flow's versioning mechanism through the Dynamic Protocol State works the opposite: the protocol communicates upfront to the implementation that the specified honest behaviour will change at some specific point in the future. This information about upgrades is distributed upfront through the Dynamic Protocol State and therefore also accessible to old implementations that have not yet upgraded. This inverts the default and facilitates protocol evolution: by default, the protocol switches to the new behaviour and nodes not supporting the new behaviour will proactively stop participating (as opposed to continuing by default with the old version).
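The two defaults contrasted above can be modelled in a few lines of Go. This is a deliberately simplified sketch under stated assumptions: `Fork`, `Binary`, `HardCoded` and `Announced` are hypothetical names, and neither function reflects actual Ethereum or Flow code. It only shows what an outdated binary does once the activation view is reached.

```go
package main

import "fmt"

// Fork is a hypothetical schedule entry: Version applies from ActivationView on.
type Fork struct {
	ActivationView uint64
	Version        uint64
}

// versionAt returns the version a schedule prescribes for a view, starting
// from a base version. Entries are assumed to be sorted by ActivationView.
func versionAt(base uint64, schedule []Fork, view uint64) uint64 {
	version := base
	for _, f := range schedule {
		if view >= f.ActivationView {
			version = f.Version
		}
	}
	return version
}

// Binary is a hypothetical node build.
type Binary struct {
	Supported     []uint64 // Component Versions this build implements
	CompiledForks []Fork   // schedule baked into the build (hard-coded style)
}

func (b Binary) supports(v uint64) bool {
	for _, s := range b.Supported {
		if s == v {
			return true
		}
	}
	return false
}

// HardCoded models the default "continue as before": the build only knows
// the forks compiled into it and keeps running with whatever they yield.
func (b Binary) HardCoded(base, view uint64) (version uint64, running bool) {
	return versionAt(base, b.CompiledForks, view), true
}

// Announced models the inverted default: the schedule is read from the
// protocol state, so an outdated build learns about the new version and
// stops participating if it does not implement it.
func (b Binary) Announced(base uint64, announced []Fork, view uint64) (version uint64, running bool) {
	v := versionAt(base, announced, view)
	return v, b.supports(v)
}

func main() {
	outdated := Binary{Supported: []uint64{8}} // never upgraded, knows no forks
	announced := []Fork{{ActivationView: 1000, Version: 9}}
	for _, view := range []uint64{999, 1000} {
		hv, hr := outdated.HardCoded(8, view)
		av, ar := outdated.Announced(8, announced, view)
		fmt.Println(view, "hard-coded:", hv, hr, "| announced:", av, ar)
	}
}
```

In this model the outdated build keeps running version 8 at view 1000 under the hard-coded scheme, while under the announced scheme it sees version 9 and halts.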
Challenge: missing seed for starting engineering work:
- complex topic, spanning three areas of Flow [execution, Protocol, Data Availability]
- everyone is worried they are missing something, but we have limited time / priority to flesh out a holistic vision for versioning each and every aspect of the protocol
- we keep talking about it, but the bigger picture remains hazy, so ICs keep extending our existing but insufficiently general solution (existing Version Beacon, solely based on service events, where tracking and complying with service events is entirely left to the implementation)
Approach:
We work towards transitioning the existing Version Beacon to use the Protocol State.
- Decide now what convention we use for:
  - Component Version format of Execution Stack? (e.g. $\texttt{major}.\texttt{minor}$ or single integer? Currently, the Version Beacon uses semver, but Bastian thinks this might be unnecessarily complex for Cadence)
  - One Component Version for Cadence only?
    or one Component Version for Cadence+FVM combined?
    or two separate Component Versions (one for Cadence and one for FVM)
- Start by including Component Version for Execution Stack into Dynamic Protocol State
Bastian recommended that we specify a component version for the "Cadence Language", which specifies the available language features and their behaviour. This is useful when incrementally adding new features, where it is relatively trivial to have one binary supporting the different feature sets.
In addition, Bastian recommended to consider Component Versions for subcomponents of Cadence/FVM, such as "storage format" (how are objects stored in account storage encoded?), "CCF format" and "JSON format" (how are "external" objects encoded?).
- [1] Jordan’s Protocol Version Upgrade Mechanisms Discussion (flow forum post)
- Core-Protocol WG Meeting on Versioning, May 23, 2024
- [Brainstorming] HCU-style upgrades for all node roles
Answer: For most components, we do not track their version explicitly. Necessary updates are infrequent and not time sensitive, so that we can just bundle all changes across many components and ship them all together as part of a major upgrade (aka Spork).
However, for very few components, upgrades are frequent and time sensitive (e.g. security fixes in Cadence), so that we cannot wait for the major upgrade (aka Spork). In that case, we want to deploy the upgrades into the running network and need to specify what should happen (i.e. the component version) and when that is going to change. Only for those components we want to track their component version via the protocol state.