Skip to content

Latest commit

 

History

History
287 lines (188 loc) · 19.3 KB

nep-0455.md

File metadata and controls

287 lines (188 loc) · 19.3 KB
NEP Title Author Status DiscussionsTo Type Category Created
455
Parameter Compute Costs
Andrei Kashin <andrei.kashin@near.org>, Jakob Meier <jakob@near.org>
Final
Protocol Track
Runtime
26-Jan-2023

Summary

Introduce compute costs decoupled from gas costs for individual parameters to safely limit the compute time it takes to process the chunk while avoiding adding breaking changes for contracts.

Motivation

For NEAR blockchain stability, we need to ensure that blocks are produced regularly and in a timely manner.

The chunk gas limit is used to ensure that the time it takes to validate a chunk is strictly bounded by limiting the total gas cost of operations included in the chunk. This process relies on accurate estimates of gas costs for individual operations.

Underestimating these costs leads to undercharging which can increase the chunk validation time and slow down the chunk production.

As a concrete example, in the past we undercharged contract deployment. The responsible team has implemented a number of optimizations but a gas increase was still necessary. Meta-pool and Sputnik-DAO were affected by this change, among others. Finding all affected parties and reaching out to them before implementing the change took a lot of effort, prolonging the period where the network was exposed to the attack.

Another motivating example is the upcoming incremental deployment of Flat Storage, where during one of the intermediate stages we expect the storage operations to be undercharged. See the explanation in the next section for more details.

Rationale

Separating compute costs from gas costs will allow us to safely limit the compute usage for processing the chunk while still keeping the gas prices the same and thus not breaking existing contracts.

An important challenge with undercharging is that it is not possible to disclose them widely because it could be used to increase the chunk production time thereby impacting the stability of the network. Adjusting the compute cost for undercharged parameter eliminates the security concern and allows us to publicly discuss the ways to solve the undercharging (optimize implementation, smart contract or increasing the gas cost).

This design is easy to implement and simple to reason about and provides a clear way to address existing undercharging issues. If we don't address the undercharging problems, we increase the risks that they will be exploited.

Specifically for Flat Storage deployment, we plan to stop charging TTN (touching trie node) gas costs, however the intermediate implementation (read-only Flat Storage) will still incur these costs during writes introducing undercharging. Setting temporary high compute costs for writes will ensure that this undercharging does not lead to long chunk processing times.

Alternatives

Increase the gas costs for undercharged operations

We could increase the gas costs for the operations that are undercharged to match the computational time it takes to process them according to the rule 1ms = 1TGas.

Pros:

  • Does not require any new code or design work (but still requires a protocol version bump)
  • Security implications are well-understood

Cons:

  • Can break contracts that rely on current gas costs, in particular steeply increasing operating costs for the most active users of the blockchain (aurora and sweat)
  • Doing this safely and responsibly requires prior consent by the affected parties which is hard to do without disclosing security-sensitive information about undercharging in public

In case of flat storage specifically, using this approach will result in a large increase in storage write costs (breaking many contracts) to enable safe deployment of read-only flat storage and later a correction of storage write costs when flat storage for writes is rolled out. With compute costs, we will be able to roll out the read-only flat storage with minimal impact on deployed contracts.

Adjust the gas chunk limit

We could continuously measure the chunk production time in nearcore clients and compare it to the gas burnt. If the chunk producer observes undercharging, it decreases the limit. If there is overcharging, the limit can be increased up to a limit of at most 1000 Tgas. To make such adjustment more predictable under spiky load, we also limit the magnitude of change of gas limit by 0.1% per block.

Pros:

  • Prevents moderate undercharging from stalling the network
  • No protocol change necessary (as this feature is already a part of the protocol), we could easily experiment and revert if it does not work well

Cons:

  • Very broad granularity --- undercharging in one parameter affects all users, even those that never use the undercharged parts
  • Dependence on validator hardware --- someone running overspecced hardware will continuously want to increase the limit, others might run with underspecced hardware and continuously want to decrease the limit
  • Malicious undercharging attacks are unlikely to be prevented by this --- a single 10x undercharged receipt still needs to be processed using the old limit. Adjusting 0.1% per block means 100 chunks can only change by a maximum of 1.1x and 1000 chunks could change up to x2.7
  • Conflicts with transaction and receipt limit --- A transaction or receipt can (today) use up to 300Tgas. The effective limit per chunk is gas_limit + 300Tgas since receipts are added to a chunk until one exceeds the limit and the last receipt is not removed. Thus a gas limit of 0gas only reduces the effective limit from 1300Tgas to 300Tgas, which means a single 10x undercharged receipt can still result in a chunk with compute usage of 3 seconds (equivalent to 3000TGas)

Allow skipping chunks in the chain

Slow chunk production in one shard can introduce additional user-visible latency in all shards as the nodes expect a regular and timely chunk production during normal operation. If processing the chunk takes much longer than 1.3s, it can cause the corresponding block and possibly more consecutive blocks to be skipped.

We could extend the protocol to produce empty chunks for some of the shards within the block (effectively skipping them) when processing the chunk takes longer than expected. This way will still ensure a regular block production, at a cost of lower throughput of the network in that shard. The chunk should still be included in a later block to avoid stalling the affected shard.

Pros:

  • Fast and automatic adaptation to the blockchain workload

Cons:

  • For the purpose of slashing, it is hard to distinguish situations when the honest block producer skips chunk due to slowness from the situations when the block producer is offline or is maliciously stalling the block production. We need some mechanism (e.g. on-chain voting) for nodes to agree that the chunk was skipped legitimately due to slowness as otherwise we introduce new attack vectors to stall the network

Specification

  • Chunk Compute Usage -- total compute time spent on processing the chunk

  • Chunk Compute Limit -- upper-bound for compute time spent on processing the chunk

  • Parameter Compute Cost -- the numeric value in seconds corresponding to compute time that it takes to include an operation into the chunk

Today, gas has two somewhat orthogonal roles:

  1. Gas is money. It is used to avoid spam by charging users
  2. Gas is CPU time. It defines how many transactions fit in a chunk so that validators can apply it within a second

The idea is to decouple these two by introducing parameter compute costs. Each gas parameter still has a gas cost that determines what users have to pay. But when filling a chunk with transactions, parameter compute cost is used to estimate CPU time.

Ideally, all compute costs should match corresponding gas costs. But when we discover undercharging issues, we can set a higher compute cost (this would require a protocol upgrade). The stability concern is then resolved when the compute cost becomes active.

The ratio between compute cost and gas cost can be thought of as an undercharging factor. If a gas cost is 2 times too low to guarantee stability, compute cost will be twice the gas cost. A chunk will be full 2 times faster when gas for this parameter is burned. This deterministically throttles the throughput to match what validators can actually handle.

Compute costs influence the gas price adjustment logic described in https://nomicon.io/Economics/Economic#transaction-fees. Specifically, we're now using compute usage instead of gas usage in the formula to make sure that the gas price increases if chunk processing time is close to the limit.

Compute costs do not count towards the transaction/receipt gas limit of 300TGas, as that might break existing contracts by pushing their method calls over this limit.

Compute costs are static for each protocol version.

Using Compute Costs

Compute costs different from gas costs are only a temporary solution. Whenever we introduce a compute cost, we as the community can discuss this publicly and find a solution to the specific problem together.

For any active compute cost, a tracking GitHub issue in nearcore should be created, tracking work towards resolving the undercharging. The reference to this issue should be added to this NEP.

In the best case, we find technical optimizations that allow us to decrease the compute cost to match the existing gas cost.

In other cases, the only solution is to increase the gas cost. But the dApp developers who are affected by this change should have a chance to voice their opinion, suggest alternatives, and implement necessary changes before the gas cost is increased.

Reference Implementation

The compute cost is a numeric value represented as u64 in time units. Value 1 corresponds to 10^-15 seconds or 1fs (femtosecond) to match the gas costs scale.

By default, the parameter compute cost matches the corresponding gas cost.

Compute costs should be applicable to all gas parameters, specifically including:

Changes necessary to support ExtCosts:

  1. Track compute usage in GasCounter struct
  2. Track compute usage in VMOutcome struct (alongside burnt_gas and used_gas)
  3. Store compute usage in ActionResult and aggregate it across multiple actions by modifying ActionResult::merge
  4. Store compute costs in ExecutionOutcome and aggregate them across all transactions
  5. Enforce the chunk compute limit when the chunk is applied

Additional changes necessary to support ActionCosts:

  1. Return compute costs from total_send_fees
  2. Store aggregate compute cost in TransactionCost struct
  3. Propagate compute costs to VerificationResult

Additionaly, the gas price computation will need to be adjusted in compute_new_gas_price to use compute cost instead of gas cost.

Security Implications

Changes in compute costs will be publicly known and might reveal an undercharging that can be used as a target for the attack. In practice, it is not trivial to exploit the undercharging unless you know the exact shape of the workload that realizes it. Also, after the compute cost is deployed, the undercharging should no longer be a threat for the network stability.

Drawbacks

  • Changing compute costs requires a protocol version bump (and a new binary release), limiting their use to undercharging problems that we're aware of

  • Updating compute costs is a manual process and requires deliberately looking for potential underchargings

  • The compute cost would not have a full effect on the last receipt in the chunk, decreasing its effectiveness to deal with undercharging. This is because 1) a transaction or receipt today can use up to 300TGas and 2) receipts are added to a chunk until one exceeds the limit and the last receipt is not removed. Therefore, a single receipt with 300TGas filled with undercharged operations with a factor of K can lead to overshooting the chunk compute limit by (K - 1) * 300TGas

  • Underchargings can still be exploited to lower the throughput of the network at unfair price and increase the waiting times for other users. This is inevitable for any proposal that doesn't change the gas costs and must be resolved by improving the performance or increasing the gas costs

  • Even without malicious intent, the effective peak throughput of the network will decrease when the chunks include undercharged operations (as the stopping condition based on compute costs for filling the chunk becomes stricter). Most of the time, this is not the problem as the network is operating below the capacity. The effects will also be softened by the fact that undercharged operations comprise only a fraction of the workload. For example, the planned increase for TTN compute cost alongside the Flat Storage MVP is less critical because you cannot fill a receipt with only TTN costs, you will always have other storage costs and ~5Tgas overhead to even start a function call. So even with 10x difference between gas and compute costs, the DoS only becomes 5x cheaper instead of 10x

Unresolved Issues

Future possibilities

We can also think about compute costs smaller than gas costs. For example, we charge gas instead of token balance for extra storage bytes in NEP-448, it would make sense to set the compute cost to 0 for the part that covers on-chain storage if the throttling due to increased gas cost becomes problematic. Otherwise, the throughput would be throttled unnecessarily.

A further option would be to change compute costs dynamically without a protocol upgrade when block production has become too slow. This would be a catch-all, self-healing solution that requires zero intervention from anyone. The network would simply throttle throughput when block time remains too high for long enough. Pursuing this approach would require additional design work:

  • On-chain voting to agree on new values of costs, given that inputs to the adjustment process are not deterministic (measurements of wall clock time it takes to process receipt on particular validator)
  • Ensuring that dynamic adjustment is done in a safe way that does not lead to erratic behavior of costs (and as a result unpredictable network throughput). Having some experience manually operating this mechanism would be valuable before introducing automation

and addressing challenges described in near/nearcore#8032 (comment).

The idea of introducing a chunk limit for compute resource usage naturally extends to other resource types, for example RAM usage, Disk IOPS, Background CPU Usage. This would allow us to align the pricing model with cloud offerings familiar to many users, while still using gas as a common denominator to simplify UX.

Changelog

1.0.0 - Initial Version

This NEP was approved by Protocol Working Group members on March 16, 2023 (meeting recording):

1.0.1 - Storage Related Compute Costs

Add five compute cost values for protocol version 61 and above.

  • wasm_touching_trie_node
  • wasm_storage_write_base
  • wasm_storage_remove_base
  • wasm_storage_read_base
  • wasm_storage_has_key_base

For the exact values, please refer to the table at the bottom.

The intention behind these increased compute costs is to address the issue of storage accesses taking longer than the allocated gas costs, particularly in cases where RocksDB, the underlying storage system, is too slow. These values have been chosen to ensure that validators with recommended hardware can meet the required timing constraints. (Analysis Report)

The protocol team at Pagoda is actively working on optimizing the nearcore client storage implementation. This should eventually allow to lower the compute costs parameters again.

Progress on this work is tracked here: near/nearcore#8938.

Benefits

  • Among the alternatives, this is the easiest to implement.
  • It allows us to able to publicly discuss undercharging issues before they are fixed.

Concerns

No concerns that need to be addressed. The drawbacks listed in this NEP are minor compared to the benefits that it will bring. And implementing this NEP is strictly better than what we have today.

Copyright

Copyright and related rights waived via CC0.

References

Live Compute Costs Tracking

Parameter Name Compute / Gas factor First version Last version Tracking issue
wasm_touching_trie_node 6.83 61 TBD nearcore#8938
wasm_storage_write_base 3.12 61 TBD nearcore#8938
wasm_storage_remove_base 3.74 61 TBD nearcore#8938
wasm_storage_read_base 3.55 61 TBD nearcore#8938
wasm_storage_has_key_base 3.70 61 TBD nearcore#8938