diff --git a/EIPS/eip-5843.md b/EIPS/eip-5843.md new file mode 100644 index 00000000000000..90755f382d3c4a --- /dev/null +++ b/EIPS/eip-5843.md @@ -0,0 +1,450 @@ +--- +eip: 5843 +title: EVM Modular Arithmetic Extensions +status: Draft +type: standards track +author: Jared Wasinger <@jwasinger>, Alex Beregszaszi (@axic) +discussions-to: https://ethereum-magicians.org/t/eip-5843-evm-modular-arithmetic-extensions/12425 +category: Core +created: 2022-10-26 +requires: 4750, 3670 +--- + +## Abstract + +This EIP introduces EVMMAX (EVM Modular Arithmetic Extensions) - a set of opcodes for performing efficient modular addition, subtraction and Montgomery modular multiplication at varying bitwidths. + +## Motivation +Additional crypto precompiles have long been desired for the EVM. Among these, operations for various elliptic curves have been proposed and not yet adopted ([BLS12-381](./eip-2537.md), [BW6-761](./eip-3026.md), [MNT4](./eip-1895.md), [Baby Jubjub](./eip-2494.md), [secp256k1](https://github.com/ethereum/EIPs/issues/603), [BLS12-377](./eip-2539.md), [BN256 HashToCurve](./eip-3068.md)). + +Crypto precompiles can be problematic for several reasons: +* They introduce complicated algorithms into the Ethereum protocol which are not understood by most client developers, and often lack good explanatory resources beyond academic papers. + * aside from cryptography: input deserialization and gas-charging logic are other sources of added protocol surface that can potentially introduce consensus bugs. +* It takes significant development/testing time and effort to ensure performance and correctness. And even this effort is often not sufficient as precompiles have been the source of previous consensus bugs (e.g.: bn128 ecmul not doing subgroup checks, ecrecover bug from 2016, Nimbus modexp input padding bug, others?). +* The skillset required to implement performant crypto software is very specialized. This makes developing and maintaining multiple production-grade implementations of the same specification difficult. + * This can have the effect that multiple clients will default to use the same implementation. It can introduce a potential single point of failure for the network by effectively enshrining specific implementations into the spec if the pricing model is aggressive. +* New research developments can make existing precompiles outdated and necessitate new ones. Removing old precompiles completely will likely be impossible (the current situation regarding `SELFDESTRUCT` removal illustrates the difficulties of removing old functionality). + +To reduce pressure for new crypto precompiles, we can identify and optimize operations which form a common bottleneck for many desired use-cases in order to make them efficiently-implementable as EVM contracts. Above-mentioned elliptic curve precompiles share a common performance bottleneck: modular arithmetic (and specifically: modular multiplication) using a fixed, prime modulus. + +Currently in the EVM, modular addition and multiplication are performed with the `ADDMOD` and `MULMOD` opcodes. Inputs and a modulus are passed as stack items and a result is returned via the stack. The opcodes are permissive: the modulus can be any number and inputs are not required to be less than the modulus. + +`ADDMOD` and `MULMOD` have a lot of utility. However, they are restricted to values that fit within 256 bits. They also cannot take advantage of optimizations that can greatly improve performance in many cases. + +If we add the requirement that inputs must be less than the modulus, modular addition and subtraction can be implemented efficiently with each requiring one extended-precision addition and one extended-precision subtraction. Normal modular multiplication is slower and requires an extended-precision multiplication to compute a double-width product followed by a modular reduction. + +To improve the performance of modular multiplication, values can be expressed in Montgomery form. This allows for the use of Montgomery modular multiplication which replaces the division by the modulus with a bit-shift and single subtraction in the reduction step. + +Operating in Montgomery form requires precomputation of a modulus-specific constant as well as the cost of converting values in canonical form to/from Montgomery form. However, these costs are greatly offset by the savings gained when performing many modular multiplications. + +## Specification + +#### Context Variables + +| Name | Type | Meaning | +| ---- | ------- | --- | +| `evmmax_state` | `EVMMAXState` | a variable representing ephemeral state which exists for the duration of the current call and in the scope of the current call | + +``` +class EVMMAXState(): + def __init__(self, mod: int, r_squared: int, mod_inv: int, value_size: int, mem_start_offset: int): + self.mod = mod + self.r_squared = r_squared + self.mod_inv = mod_inv + self.value_size = value_size + self.mem_start_offset = mem_start_offset +``` + +#### Conventions + +| Syntax | Meaning | +| ------ | ------ | +| `x === y % m` | equality of residue classes: `x % m == y % m` | +| `pow(x, -1, m)` | modular multiplicative inverse of `x` with respect to modulus `m`: `x * pow(x, -1, m) === -1 % m` | +| `gcd(x,y)` | greatest common denominator of `x` and `y` | + +#### Definitions + +| Word | Meaning | +| ---- | ------- | +| word | a value whose size is the amount of data a client CPU's internal data registers can hold and process at one time (64bits on most machines)| +| bigint | an arbitrary-precision unsigned integer represented by a list of word-sized digits in descending order of significance| + +#### Constants: + +| Name | Value | +| ---- | ------- | +| `MAX_VALUE_SIZE` | 16 | + +Introduce new opcodes: + +| Name | Opcode | Stack In | Stack out | Immediate Size (bytes) | +| ----- | ---- | -------- | --------- | -------------- | +| SETMODX | 0x21 | 3 | 0 | 0 | +| ADDMODX | 0x22 | 0 | 0 | 3 | +| SUBMODX | 0x23 | 0 | 0 | 3 | +| MULMONTX | 0x24 | 0 | 0 | 3 | +| TOMONTX | 0x25 | 2 | 0 | 0 | + +`ADDMODX`, `SUBMODX` and `MULMONTX` take inputs encoded in a required 3-byte immediate value appended to the opcode. + +**Note**: Gas models are based on worst-case input benchmarks of the Go implementation for the arithmetic on consumer hardware (i7-6500U CPU @ 2.50GHz). They assume a target gas rate of 25 nanoseconds of execution time per gas. + +### `SETMODX` + +`SETMODX` takes three stack inputs: `(top of stack) mod_offset, value_size, evmmax_mem_start_offset` where `mod_offset` refers to an EVM memory offset where a modulus occupying `value_size * 8` bytes with big-endian ordering is expected. If `value_size == 0` or `value_size > MAX_VALUE_SIZE` or `(mod_offset + value_size * 8) - 1` falls outside the bounds of allocated memory, consume all call gas and return to the caller in an exceptional state. Charge `gas_setmodx(value_size)` return to the caller in an exceptional state if there was insufficient gas remaining: +``` +SETMODX_GAS_A = 3.8 +SETMODX_GAS_B = 75.0 + +def gas_setmodx(value_size): + return round(SETMODX_GAS_A * value_size + SETMODX_GAS_B) +``` + +Load the modulus and check that the modulus is: +* odd +* greater than 1 +* greater than or equal to `(1<<((value_size - 1) * 64)`: the modulus representation occupies the most significant 64 bits. + +If any check fails, consume all call gas and return to the caller in an exceptional state. Otherwise, compute `evmmax_state` (overwriting the previous value if `evmmax_state != None`) +``` +r_val = 1 << (value_size * 64) +aux_mod = 1 << SYSTEM_WORD_SIZE_BITS +mod_inv = pow(-mod, -1, aux_mod) +r_squared = (r_val ** 2) % mod + +evmmax_state = EVMMAXState(mod, r_squared, mod_inv, value_size, evmmax_mem_start_offset) +``` + +### Arithmetic Opcodes + +The immediate bytes appended after `ADDMODX`, `SUBMODX` and `MULMONTX` are interpreted as 1-byte values `out_slot`,`x_slot`,`y_slot`. Each slot corresponds to a range of the memory space: +``` +def map_slot_to_mem_range(evmmax_state, slot): + return (evmmax_state.mem_start + slot * evmmax_state.value_size * 8, evmmax_state.mem_start + (slot + 1) * value_size * 8 - 1) +``` + +If `evmmax_state` is not set with the call context or the maximum byte of the ranges specified in the inputs is outside the bounds of allocated memory, consume all call gas and return to the caller in an exceptional state. Otherwise, charge a gas cost based on the value size (`gas_addmodx`, `gas_submodx` or `gas_mulmontx`), returning to the caller in an exceptional state if there was insufficient remaining gas. + +``` +MULMONTX_GAS_A = 0.1 +MULMONTX_GAS_B = 0.7 + +ADDMODX_GAS_A = 0.2 +ADDMODX_GAS_B = 0.6 + +def gas_mulmontx(value_size): + cost = round(MULMONTX_GAS_A * (value_size ** 2) + MULMONTX_GAS_B) + if cost == 0: + cost = 1 + return cost + +def gas_addmodx(value_size): + cost = round(ADDMODX_GAS_A * value_size + ADDMODX_GAS_B) + if cost == 0: + cost = 1 + return cost + +def gas_submodx(value_size): + return gas_addmodx(value_size) + +def gas_tomontx(value_size): + return gas_mulmontx(value_size) +``` + +##### Gas Cost Table + +| Operation | 64bit | 128bit | 192bit | 256bit | 320bit | 384bit | 448bit | 512bit | 576bit | 640bit | 704bit | 768bit | 832bit | 896bit | 960bit | +| --------- | ---------- | ---- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- | +|MULMONTX/TOMONTX|1|1|2|2|3|4|6|7|9|11|13|15|18|20|23|26| +|ADDMODX/SUBMODX|1|1|1|1|2|2|2|2|2|3|3|3|3|3|4|4| +|SETMODX|79|83|86|90|94|98|102|105|109|113|117|121|124|128|132|136| + +Interpret the memory ranges associated with `x_slot`, `y_slot` as big-endian values `x`,`y`. Assert that `x` and `y` are less than the modulus, returning to the caller in an exceptional state if they are not. Then, each arithmetic opcode computes a result, expressing it as a big-endian value padded to be `evmmax_state.value_size * 8` bytes long, and placing it into the memory range referenced by `out_slot`. + +`ADDMODX` and `SUBMODX` compute normal modular addition/subtraction: +``` +# define some helpers ---------------------------------------------------- + +# note: WORD_BITS = 64, BASE = 1<<64 on a 64-bit system + +# extract two words of a double-width value into hi, low words +def hi_lo(valDouble: int) -> (int, int): + assert valDouble < 1 << (WORD_BITS * 2), "val must fit in two words" + return (valDouble >> WORD_BITS) % BASE, valDouble % BASE + +# single-word subtraction with borrow-in/out +# compute (x - y - b). if negative return (1, abs(x - y - b) % BASE), else return (0, (x - y - b) % BASE) +# b (borrow-in) must be 1 or 0 +def sub_with_borrow(x: int, y: int, b: int) -> (int, int): + assert b == 0 or b == 1, "borrow in must be zero or one" + res = x - y - b + b_out = 0 + if res < 0: + res = BASE - abs(res) + b_out = 1 + + return b_out, res + +# modular addition +# input +# x, y, mod - bigint inputs +# requires: +# * x < mod and y < mod +# * len(x) == len(y) == len(mod) +# returns +# (x + y) % mod expressed as a bigint +def addmodx_arith(x: [int], y: [int], mod: [int]) -> [int]: + assert len(x) == len(y) and len(y) == len(mod), "bignum inputs must have same length" + word_count = len(mod) + tmp = [0] * word_count + z = [0] * word_count + c, c1 = 0, 0 + + for i in reversed(range(word_count)): + c, tmp[i] = hi_lo(x[i] + y[i] + c) + + for i in reversed(range(word_count)): + c1, z[i] = sub_with_borrow(tmp[i], mod[i], c1) + + if c == 0 and c1 != 0: + z[:] = tmp[:] + + return z + +# modular subtraction +# input +# x, y, mod - bigint inputs +# requires: +# * x < mod and y < mod +# * len(x) == len(y) == len(mod) +# returns +# (x - y) % mod expressed as a bigint +def submodx_arith(x: [int], y: [int], mod: [int]) -> [int]: + assert len(x) == len(y) and len(y) == len(mod), "bignum inputs must have same length" + word_count = len(mod) + tmp = [0] * word_count + z = [0] * word_count + c, c1 = 0, 0 + + for i in reversed(range(word_count)): + c, tmp[i] = sub_with_borrow(x[i], y[i], c) + + for i in reversed(range(word_count)): + c1, z[i] = hi_lo(tmp[i] + mod[i] + c1) + + if c == 0: + z[:] = tmp[:] + + return z +``` + +`MULMONTX` computes the Montgomery modular multiplication of two inputs with respect to a modulus (described below). + +#### Montgomery Multiplication + +A value `x` expressed in Montgomery form is is `(x * R) % mod` where `R > mod` and `gcd(R, mod) = 1`. We choose `R = 1 << (evmmax_state.value_size * 64)`. + +Modular addition and subtraction of values in Montgomery form is computed the same as normally. + +``` +x * R % mod + y * R % mod == (x + y) * R % mod +x * R % mod + y * R % mod == (x - y) * R % mod +``` + +However modular multiplication in Montgomery form works somewhat differently. The modular multiplication `x_mont * y_mont == x * R % mod * y * R % mod == x * y * R * R % mod` is not the expected value in Montgomery form: `x * y * R % mod`. The extra factor of `R` must be removed. Therefore Montgomery multiplication computes computes `x_mont * y_mont * pow(R, -1, mod) % mod`. + +``` +# input: +# x < mod +# y < mod +# mod - an odd modulus +# R - a power of two, and greater than mod +# mod_inv_full - pow(-mod, -1, R) +# output: +# (x * y * pow(R, -1, mod)) % mod +# +def mulmont(x: int, y: int, mod: int, mod_inv_full: int, R: int) -> int: + T = x * y # 1 + m = ((T % R) * mod_inv_full) % R # 2 + T1 = T + m * mod # 3 + T1 /= R # 4 + if T1 >= mod: + T1 -= mod # 5 + return T1 +``` + +To show that `mulmont` is correct, we can prove a few properties of the algorithm: +* `T1 % R == 0` before the division in step 4: + * `T1 == ((T % R) * mod_inv_full % R) * mod + T === ((-1) * T + T) % R === 0 % R` +* Before the subtraction in step 5, `T1 === x * y * r_inv % mod` where `r_inv = pow(R, -1, mod)`: + * `T1 === (T + m*mod)*r_inv % mod === (T * r_inv + m * r_inv * mod) % mod === T * r_inv % mod === x * y * r_inv % mod`. +* `0 < T1 < 2 * mod` after the division in step 4: + * After step 3, `m` is between 0 and `R*mod - 1` so `0 < T1 < (mod**2 - 1) + (R - 1) * mod < (R*mod - 1) + (R - 1)*mod < 2*R*mod`. Division by `R` makes the result smaller than `2 * mod`. + +Note that the choice of `R` as a power of two enables the division in step 4 to be implemented as a bit-shift. + +`mulmont` is the generic algorithm to perform Montgomery Multiplication. There are other algorithms which achieve better performance at small bit-widths. These use `evmmax_state.mod_inv` for their precomputed Montgomery constant. A reference implementation of the most widely-used algorithm is provided in the appendix. + +#### Conversion between Montgomery and Canonical forms + +Conversion of a value to Montgomery form can be computed as the montgomery multiplication of the value with `evmmax_state.r_squared`. + +`TOMONTX` takes two stack items as input `(top of stack) input, output`. The least significant byte of each input is an EVMMAX slot (`input_slot`/`output_slot`). If `evmmax_state == None` or the offset of the highest byte of the ranges referenced by the inputs is outside the bounds of allocated memory, consume all call gas and return to the caller in an exceptional state. Interpret the memory range associated with `input_slot` as big-endian value `input`, compute and place the result in the memory range referenced by `output_slot`. + +Conversion of a Montgomery form value to canonical form can be computed as the montgomery multiplication of the value and `1`. + +## Rationale + +#### Why only addition/subtraction/multiplication and not other operations? + +Most commonly seen operation other than addition/subtraction/multiplication is modular inverse. However, when the modulus is prime, the inverse can be computed using only `MULMONTX` as `(val ** (modulus - 2)) % modulus` (Fermat's Little Theorem). + +Similarly, it appears modular square root can be computed using mostly exponentiation looking at the [source code](https://github.com/kilic/bls12-381/blob/2c8d680e434d6e5316bba2a79a43b751d8922d4f/fp.go#L252-L261) in the Kilic BLS12381 library (TODO: It's not clear to me what this algorithm is and what moduli it works for). + +In addition, it would likely be economical in various cases to compute inverses/square-roots off-chain, pass them as calldata, and cheaply verify them in the EVM. + +#### Gas Model + +* `ADDMODX` and `SUBMODX` have the same, linear complexity based on the value size. +* `MULMONTX` and `TOMONTX` have quadratic complexity based on the value size. +* `SETMODX` has linear complexity: the operation is dominated by two modular reductions by the modulus and a word-sized modular inverse. Modular reduction should have `O(N**2)` complexity. But for the benchmarks I've ran, I saw linear behavior (TODO: better justification here). + +#### Choice of Arithmetic Opcode Format + +Use of immediate values for arguments instead of the stack reduces gas cost and contract size footprint. + +The value size cap and restriction of each argument in the immediate to 1 byte keeps the amount of memory addressable to EVMMAX arithmetic opcodes (`256*16*64` bits ~= 260 kb) within the L1/L2 CPU caches of most client machines. + +This is important because we don't want to have bad worst-cases caused by exploiting cache-misses: Currently in the EVM, the cost of a memory access is 6 gas to access a 256bit value (1 push + 1 mload). EVMMAX arithmetic opcodes access 3 memory values of varying size dependent on the configured value size. + +#### EOF-Dependencies + +This EIP depends on [EIP-4750](./eip-4750.md) and [EIP-3670](./eip-3670.md): +* EIP-4750 removes dynamic jumps. This EIP could conceivably be implemented without requiring 4750 by modifying jumpdest analysis to disallow dynamic jumps into EVMMAX arithmetic opcode immediates. +* EIP-3670 disallows deployment of invalidly-formed EVM bytecode (PUSH without immediate, unused opcodes). This EIP would modify 3670 to introduce the EVMMAX opcodes as valid EVM opcodes and explicitly require presence of the 3-byte immediate values for EVMMAX arithmetic opcodes upon contract deployment. + +#### Necessity of TOMONTX Opcode + +There may be use-cases (e.g. an EVM contract implementation of the modexp precompile) where the modulus is not hardcoded. It would be difficult/expensive to compute the constant needed for conversion in the EVM. + +## Security Considerations + +### Validity of inputs + +Because `R` and `mod` are coprime, the set of `(q * R ) % mod` where `q < mod` is a permutation of the set `q % mod`: all representable values less than `mod` are Montgomery form representations of other values less than `mod`. + +## Implementation + +Geth: https://github.com/jwasinger/go-ethereum/tree/evmmax + +Executable reference code from this EIP: https://github.com/jwasinger/py-evmmax + +## Test Cases + +TODO + +## Open Questions + +### Value Size Cap + +`MAX_VALUE_SIZE` is tentatively set to 16. Use cases such as 2048/4096bit RSA verification are not possible to implement with using EVMMAX unless the limit is raised. + +This hinges on three issues: +1) Benchmarks which measure worst-case memory latency for higher bit-widths. +2) Altered gas models to account for added memory latency overhead at higher bit-widths. +2) Altered algorithms and gas-model for Montgomery Multiplication at higher bit-widths: Algorithms like CIOS (described in the appendix) have `O(N**2)` computational complexity. The implementation and gas model for higher bit-widths could use the generic `mulmont` algorithm where the constituent multiplications use a multiplication method with better complexity (like Karatsuba with `O(N**1.58)`) + +### Do we want to Provide Cheaper Pricing for Moduli of Certain Forms? + +Example: [Gnark](https://hackmd.io/@gnark/modular_multiplication) Montgomery multiplication optimization when the most significant bit of the modulus is 0 which reduced runtime by 18% for 256-bit width. + +### Aggressive Pricing for Certain Value Sizes + +Do we want to tune the costs of arithmetic operations at certain value sizes to the performance of hand-optimized assembly? + +## Appendix + +### 1 - Fast Montgomery multiplication for small bitwidths + +Montgomery multiplication algorithms which are most performant at smaller bit-widths opt to interleave the multiplication and reduction steps, removing the need to compute the full double-width product of the inputs. + +The CIOS (coarsely integrated operand scanning) Montgomery multiplication algorithm is an example, which is noted to be the most performant out of several surveyed in [Tolga Acar's thesis](https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf). + +``` +# return x >= y +def bigint_gte(x, y) -> bool: + assert len(x) == len(y), "x and y should have same number of limbs" + + for (x_word, y_word) in list(zip(x,y)): + if x_word > y_word: + return True + elif x_word < y_word: + return False + + return True + +# return x - y (omitting borrow-out) +def bigint_sub(x: [int], y: [int]) -> [int]: + assert len(x) == len(y), "num_limbs must be equal" + num_words = len(x) + res = [0] * num_words + c = 0 + + for i in reversed(range(num_words)): + c, res[i] = sub_with_borrow(x[i], y[i], c) + + return res[:] + + +# CIOS Montgomery multiplication algorithm implementation adapted from section 2.3.2 in https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf +# +# input: +# * x, y, mod - bigint inputs +# * mod_inv - pow(-mod, -1, 1< [int]: + assert len(x) == len(y) and len(y) == len(mod), "{}, {}, {}".format(x, y, mod) + assert mod[0] != 0, "modulus must occupy all limbs" + + word_count = len(mod) + + t = [0] * (word_count + 2) + + for i in reversed(range(word_count)): + # first inner-loop multiply x * y[i] + c = 0 + for j in reversed(range(word_count)): + c, t[j + 2] = hi_lo(t[j + 2] + x[j] * y[i] + c) + + t[0], t[1] = hi_lo(t[1] + c) + + m = (mod_inv * t[-1]) % BASE + c, _ = hi_lo(m * mod[-1] + t[-1]) + + # second inner-loop: reduction. + for j in reversed(range(1, word_count)): + c, t[j + 2] = hi_lo(t[j + 1] + mod[j - 1] * m + c) + + hi, t[2] = hi_lo(t[1] + c) + t[1] = t[0] + hi + + t = t[1:] + if t[0] != 0: + return bigint_sub(t, [0] + mod)[1:] + elif bigint_gte(t[1:], mod): + return bigint_sub(t[1:], mod) + else: + return t[1:] +``` + +TODO: demonstrate that `mulmont_cios` has the same properties as `mulmont`. + +## Copyright + +Copyright and related rights waived via [CC0](../LICENSE.md).