Faster modular inversion #1479

Open
randombit opened this issue Mar 8, 2018 · 9 comments
Labels
enhancement, help wanted, performance

Comments

@randombit
Owner

At this point modular inversion is one of the major bottlenecks in ECDSA signature generation, at about 30% of the total runtime. Two inversions are required, one for the nonce and the other to convert the point to affine.

Niels Möller sent me an email in which he contended that the fastest approach for const-time inversion at ECC sizes is Fermat's little theorem. However, in Botan with P-521, the const-time inversion algorithm (which Niels invented) is over twice as fast as the Fermat approach. So maybe the issue is (also) that our modular exponentiation is too slow.

I ran some quick checks: OpenSSL seems to take ~210K cycles for const-time modular inversion, vs 360K cycles in ct_inverse_mod_odd_modulus. Simply matching that would improve ECDSA signature performance by 15%.
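
For context, inversion via Fermat's little theorem is just a modular exponentiation by p - 2 (a minimal Python sketch; only valid for a prime modulus, and the helper name is illustrative):

def fermat_inverse(x, p):
    # For prime p and gcd(x, p) == 1: x^(p-1) = 1 (mod p), so x^(p-2) = x^-1 (mod p)
    return pow(x, p - 2, p)

p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1   # NIST P-256 field prime
assert fermat_inverse(12345, p256) * 12345 % p256 == 1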

randombit added the enhancement and help wanted labels Mar 8, 2018
@randombit
Owner Author

WRT using Fermat's little theorem, it would be beneficial to use the known addition chains for computing x^(p-2) % p. https://briansmith.org/ecc-inversion-addition-chains-01#p256_scalar_inversion has chains for inversion mod the order and mod p for both P-256 and P-384.
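
Those chains are expressed as sequences of "square k times, then multiply by a previously computed power" steps. A minimal Python sketch of that primitive (the name is illustrative, not Botan's API):

def sqr_n_mul(acc, n, x, p):
    # Square acc n times mod p, then multiply by x: returns acc^(2^n) * x mod p
    for _ in range(n):
        acc = acc * acc % p
    return acc * x % p

# Example step: given x3 = x^(2^2 - 1) = x^3, sqr_n_mul(x3, 2, x3, p) gives x^(2^4 - 1) = x^15.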

@randombit
Owner Author

randombit commented Mar 12, 2018

https://eprint.iacr.org/2014/852.pdf has an addition chain for P-521 inversion.
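
For intuition, here is a minimal Python sketch of a (not particularly optimized) chain for the P-521 field inversion. It only relies on the facts that p = 2^521 - 1 and that exponents of the form 2^k - 1 compose cheaply; the helper names are illustrative, not the chain from the paper or Botan's API.

P521 = 2**521 - 1

def pow_2k_minus_1(x, k, cache, p):
    # Compute x^(2^k - 1) mod p, using
    #   x^(2^(a+b) - 1) = (x^(2^a - 1))^(2^b) * x^(2^b - 1)
    if k in cache:
        return cache[k]
    a = k // 2
    b = k - a
    xa = pow_2k_minus_1(x, a, cache, p)   # x^(2^a - 1)
    xb = pow_2k_minus_1(x, b, cache, p)   # x^(2^b - 1)
    r = pow(xa, 2**b, p) * xb % p         # x^(2^(a+b) - 1)
    cache[k] = r
    return r

def p521_field_inverse(x, p=P521):
    # p - 2 = 2^521 - 3 = (2^519 - 1) * 4 + 1
    t = pow_2k_minus_1(x, 519, cache={1: x % p}, p=p)   # x^(2^519 - 1)
    return pow(t, 4, p) * x % p                         # x^(p - 2)

assert p521_field_inverse(0xABCDEF) * 0xABCDEF % P521 == 1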

randombit added a commit that referenced this issue Apr 17, 2018
Centralizing this logic allows curve-specific implementations, such as
using a precomputed ladder for exponentiating by p - 2.

GH #1479
randombit added a commit that referenced this issue Apr 18, 2018
Could be slightly more clever here but this is pretty decent.

GH #1479
randombit added a commit that referenced this issue Apr 20, 2018
Cuts about 100K cycles from the inversion, improving ECDSA sign by 10%
and ECDH by ~2%

Addition chain from https://briansmith.org/ecc-inversion-addition-chains-01

GH #1479
@randombit
Owner Author

With #1546 and #1547 we have faster field inversions for P-256, P-384, and P-521, all of which improved ECDSA signature performance by ~10-15%. ECDSA verification and ECDH also improved, but not as much (percentage-wise) because getting the affine coordinate is a smaller share of the total runtime there.

We may also want specialized inversions modulo the curve order. But this doesn't really seem worthwhile because the only algorithm that benefits is ECDSA. It would be better at this point to work on improving the performance of the generic const time modular inversion, which would have a lot of benefits across the codebase.

randombit added the performance label Apr 30, 2018
@kriskwiatkowski
Collaborator

kriskwiatkowski commented Jun 25, 2019

Have you looked at https://eprint.iacr.org/2019/266.pdf already? It seems to me there's a chance it could be useful. Any opinion?

@randombit
Owner Author

@henrydcase I have seen that paper but currently don't understand at all how the algorithm works. It seems much faster: e.g. they report 30K Skylake cycles for inversion modulo the 511-bit M-511 prime, while Botan's const-time algorithm takes 200K+ Skylake cycles for a randomly chosen 512-bit prime. And the paper claims "Our advantage is also larger in applications that use “random” primes rather than special primes" [vs Fermat]. So overall it's certainly promising. Even if we assume a 5x slowdown going from DJB's hand-coded asm to C++, that's still a decent speedup.

Also, our current const-time algorithm only works for an odd modulus, which means the generation of d during RSA keygen is not protected.

That paper also has (Figure 1.2) a fast algorithm for gcd which looks simple to make const-time; that would be useful for replacing our current "mostly" const-time gcd.

@mratsim

mratsim commented Mar 20, 2020

> @henrydcase I have seen that paper but currently don't understand at all how the algorithm works. It seems much faster: e.g. they report 30K Skylake cycles for inversion modulo the 511-bit M-511 prime, while Botan's const-time algorithm takes 200K+ Skylake cycles for a randomly chosen 512-bit prime. And the paper claims "Our advantage is also larger in applications that use “random” primes rather than special primes" [vs Fermat]. So overall it's certainly promising. Even if we assume a 5x slowdown going from DJB's hand-coded asm to C++, that's still a decent speedup.
>
> Also, our current const-time algorithm only works for an odd modulus, which means the generation of d during RSA keygen is not protected.
>
> That paper also has (Figure 1.2) a fast algorithm for gcd which looks simple to make const-time; that would be useful for replacing our current "mostly" const-time gcd.

The Bernstein-Yang algorithm also only works for odd moduli.

Note: I'm not a mathematician and I haven't tried to implement it yet because it's still over my head, but here is a start on what I understood, the implementation challenges, and a naive analysis of what the speed of the algorithm could be.

  1. The algorithm as described in the paper works on ℤ × ℤ₂* × ℤ₂ → ℤ × ℤ₂* × ℤ₂
    • with ℤ including the negative numbers,
    • ℤ₂ the 2-adic integers; in practice the denominators introduced are only powers of 2, as we do repeated halving, so at the very least this can be represented by a counter of delayed halvings,
    • ℤ₂*, the same as ℤ₂ but restricted to odd (invertible) elements.
  2. Each step applies a transition matrix that swaps/halves/adds/negates, as in the Sage code below:
# Sage code (ZZ, QQ, MatrixSpace, Integers and sign are Sage builtins)

def truncate(f, t):
    # Keep the t lowest bits of f, interpreted as a signed (two's complement) value
    if t == 0: return 0
    twot = 1 << (t - 1)
    return ((f + twot) & (2 * twot - 1)) - twot

def divsteps2(n, t, delta, f, g):
    # Apply n division steps to (delta, f, g), accumulating the transition
    # matrix in (u, v, q, r); the matrix entries pick up halvings as rationals
    assert t >= n and n >= 0
    f, g = truncate(f, t), truncate(g, t)
    u, v, q, r = 1, 0, 0, 1
    while n > 0:
        f = truncate(f, t)
        if delta > 0 and g & 1:
            # conditional swap and negation when delta > 0 and g is odd
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        # if g is odd, add f (also odd) so the sum is even, then halve
        delta, g, q, r = 1 + delta, (g + g0 * f) / 2, (q + g0 * u) / 2, (r + g0 * v) / 2
        n, t = n - 1, t - 1
        g = truncate(ZZ(g), t)
    M2Q = MatrixSpace(QQ, 2)
    return delta, f, g, M2Q((u, v, q, r))

def iterations(d):
    # iteration bound from the paper, roughly (49/17) * d
    return (49 * d + 80) // 17 if d < 46 else (49 * d + 57) // 17

def recip2(f, g):
    ## Compute g^-1 mod f: f MUST be odd
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    print(f'm: {m}')
    precomp = Integers(f)((f + 1) / 2) ^ (m - 1)  # 2^-(m-1) mod f
    print(f'precomp: {precomp}')
    delta, fm, gm, P = divsteps2(m, m + 1, 1, f, g)
    print(f'P[0][1]: {P[0][1]}')
    V = sign(fm) * ZZ(P[0][1] * 2 ^ (m - 1))
    return ZZ(V * precomp)

Implementation challenges

As it stands, the paper is hard to naively implement in a cryptographic library:

  • There are no negative numbers in the usual bignum representations. This can be worked around by keeping a sign flag; the sign tells whether we need to add or subtract, which can be made constant-time by conditional negation (see the sketch after this list).
    Or we use a two's complement representation; then the main issue is that the most negative value (0b1000_0000_..._0000) cannot be negated. We need a proof that this never occurs, or an extra word of temporary working space.
  • The Sage implementation delays the halvings, so the denominators get quite large (2 raised to a power in the thousands). We cannot use delayed halvings (similar to the k parameter of the Almost Montgomery Inverse in Kaliski's algorithm), as the step (g + g0*f)/2 would require scaling g0*f by the delayed denominator of g. However, we can eagerly compute g/2 + g0*f/2, avoiding the overflow we might get from computing g + f directly.
    The shift needs to be an arithmetic shift, i.e. it must restore the sign bit of the operands, as they can be negative.
  • The multiplication by 2^(m-1) can be done via a Newton-Raphson / Dumas iteration (p. 11 of the Koc paper, https://eprint.iacr.org/2017/411), and if using the delayed halvings you can shave off a lot of iterations.
  • The rest should only need conditional copies.
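
A minimal Python sketch of the sign-handling primitives mentioned above: constant-time conditional negation and an arithmetic right shift on W-bit two's-complement words. W and the helper names are illustrative, not from any particular library.

W = 64
MASK = (1 << W) - 1

def cneg(x, flag):
    # Constant-time conditional negation: -x mod 2^W if flag == 1, else x
    m = (-flag) & MASK           # all-ones mask when flag == 1, zero otherwise
    return ((x ^ m) + flag) & MASK

def asr1(x):
    # Arithmetic shift right by 1: the sign bit is replicated, as required above
    sign = x >> (W - 1)
    return (x >> 1) | (sign << (W - 1))

# Example: -6 is represented as 2^64 - 6; halving it arithmetically gives -3.
minus6 = cneg(6, 1)
assert asr1(minus6) == cneg(3, 1)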

Analysis

This is my naive analysis; unfortunately I couldn't find an optimized implementation.
It is contrasted with the algorithm from Niels Möller that you use. (Note: I've corrected the algorithm from the paper, which incorrectly initializes v to 0.)

Input: integer x, odd integer n, x < n
Output: x−1 (mod n)
1:   function ModInv(x, n)
2:   (a, b, u, v) ← (x, n, 1, 0)
3:   ℓ ← ⌊log2 n⌋ + 1            ⮚ number of bits in n
4:   for i ← 0 to 2ℓ − 1 do
5:     odd ← a & 1
6:     if odd and a ≥ b then
7:       a ← a − b
8:     else if odd and a < b then
9:       (a, b, u, v) ← (b − a, a, v, u)
10:    a ← a >> 1
11:    if odd then u ← u − v
12:    if u < 0 then u ← u + n
13:    if u & 1 then u ← u + n
14:    u ← u >> 1
15:  return v
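
For reference, a direct Python transcription of the pseudocode above (assuming n odd, 0 < x < n, and gcd(x, n) == 1; branches are written out for clarity, a constant-time implementation would replace them with masked selects):

def moller_modinv(x, n):
    assert n & 1 and 0 < x < n
    a, b, u, v = x, n, 1, 0
    ell = n.bit_length()          # floor(log2 n) + 1
    for _ in range(2 * ell):
        odd = a & 1
        if odd and a >= b:
            a = a - b
        elif odd and a < b:
            a, b, u, v = b - a, a, v, u
        a >>= 1
        if odd:
            u = u - v
        if u < 0:
            u = u + n
        if u & 1:
            u = u + n
        u >>= 1
    return v

assert moller_modinv(3, 7) == 5   # 3 * 5 = 15 = 1 (mod 7)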

The first thing that jumps out at me is that the number of iterations is about 49/17 * the number of bits, a factor of roughly 2.88, while Niels' algorithm (which seems to be an adaptation of Stein's binary GCD to constant time) uses a fixed 2 * the number of bits.

The ZZ(P[0][1] * 2^(m - 1)) multiplication seems quite expensive if taken literally.

Alternative implementations

  1. Did you see the implementation by Aranha in Relic? https://github.com/relic-toolkit/relic/blob/5cdabd57/src/fp/relic_fp_inv.c#L402.

  2. There is also a formally verified implementation in Fiat Crypto at Inversion (WIP) mit-plv/fiat-crypto#670; here is the code generated for the BLS12-381 prime: https://github.com/relic-toolkit/relic/blob/5cdabd57/src/low/x64-fiat-381/bls12_381_q_64.c#L2623 and the wrapper https://github.com/relic-toolkit/relic/blob/5cdabd57/src/low/x64-fiat-381/relic_fp_inv_low.c#L52-L78

Other algorithms

Besides the Bernstein-Yang paper, the most recent papers on constant-time inversion are:

Both cost 2 * the number of bits for the main loop, but the intermediate steps seem much more complex than in Möller's algorithm.

@randombit
Owner Author

randombit commented Mar 21, 2020

@mratsim Very helpful, thank you. I had not seen either of the other implementations you reference.

I had earlier read Bos' paper on const-time Montgomery inversion, but it does not seem to me a promising approach: the (not constant-time) implementation of this algorithm in almost_montgomery_inverse/normalized_montgomery_inverse (https://github.com/randombit/botan/blob/master/src/lib/math/numbertheory/mod_inv.cpp#L15) runs at about (IIRC) half the performance of the constant-time code, and Bos reports almost an 8x slowdown when moving to constant time (Table 1, page 6). Compared to Möller, the loop iteration count is somewhat less than 2*bits (it seems to be about 1.4*bits), but the operation count in each loop is easily twice as high.

@mratsim

mratsim commented Mar 22, 2020

Regarding speed, you are probably aware of this, but GCC is absolutely horrible at handling multiprecision arithmetic.

For my elliptic curve library, this is the speed I get on field operations with Clang (inversion using Möller's algorithm) on Ethereum/blockchain-related elliptic curves.

A 0 means that the compiler optimized the operation away (I tried some volatile reads/writes, but I don't want to slow down the benchmark either, and I'm mostly interested in multiplication/squaring/inversion anyway).

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

All benchmarks are using constant-time implementations to protect against side-channel attacks.

Compiled with Clang
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz
! Overclocked all-core turbo @4.1GHz -> CPU clock is not nominal clock, comparisons only valid on this CPU.


--------------------------------------------------------------------------------
Addition        Fp[BN254]               0 ns         0 cycles
Substraction    Fp[BN254]               0 ns         0 cycles
Negation        Fp[BN254]               0 ns         0 cycles
Multiplication  Fp[BN254]              22 ns        67 cycles
Squaring        Fp[BN254]              18 ns        55 cycles
Inversion       Fp[BN254]            6254 ns     18763 cycles
--------------------------------------------------------------------------------
Addition        Fp[Secp256k1]           0 ns         0 cycles
Substraction    Fp[Secp256k1]           0 ns         0 cycles
Negation        Fp[Secp256k1]           0 ns         0 cycles
Multiplication  Fp[Secp256k1]          22 ns        68 cycles
Squaring        Fp[Secp256k1]          20 ns        61 cycles
Inversion       Fp[Secp256k1]        6303 ns     18910 cycles
--------------------------------------------------------------------------------
Addition        Fp[BLS12_381]           0 ns         0 cycles
Substraction    Fp[BLS12_381]           0 ns         0 cycles
Negation        Fp[BLS12_381]           0 ns         0 cycles
Multiplication  Fp[BLS12_381]          46 ns       138 cycles
Squaring        Fp[BLS12_381]          39 ns       118 cycles
Inversion       Fp[BLS12_381]       15659 ns     46979 cycles
--------------------------------------------------------------------------------

And GCC

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

All benchmarks are using constant-time implementations to protect against side-channel attacks.

Compiled with GCC
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

--------------------------------------------------------------------------------
Addition        Fp[BN254]               5 ns        17 cycles
Substraction    Fp[BN254]               3 ns        10 cycles
Negation        Fp[BN254]               2 ns         6 cycles
Multiplication  Fp[BN254]              32 ns        98 cycles
Squaring        Fp[BN254]              29 ns        88 cycles
Inversion       Fp[BN254]            7893 ns     23679 cycles
--------------------------------------------------------------------------------
Addition        Fp[Secp256k1]           5 ns        17 cycles
Substraction    Fp[Secp256k1]           3 ns        10 cycles
Negation        Fp[Secp256k1]           2 ns         6 cycles
Multiplication  Fp[Secp256k1]          34 ns       104 cycles
Squaring        Fp[Secp256k1]          33 ns        99 cycles
Inversion       Fp[Secp256k1]        7954 ns     23862 cycles
--------------------------------------------------------------------------------
Addition        Fp[BLS12_381]           9 ns        27 cycles
Substraction    Fp[BLS12_381]           5 ns        16 cycles
Negation        Fp[BLS12_381]           3 ns        10 cycles
Multiplication  Fp[BLS12_381]          62 ns       188 cycles
Squaring        Fp[BLS12_381]          57 ns       172 cycles
Inversion       Fp[BLS12_381]       28863 ns     86591 cycles
--------------------------------------------------------------------------------

For inversion, GCC is up to nearly 2x slower than Clang (e.g. 86591 vs 46979 cycles for BLS12-381 above).

This is not unique to my library; I reported the same to the fiat-crypto project:
mit-plv/fiat-crypto#691

Carries

In particular, GCC does not handle carries properly (see https://gcc.godbolt.org/z/2h768y), even when using the real _addcarry_u64 intrinsic:

#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

@gmaxwell

gmaxwell commented Apr 16, 2020

> one for the nonce and the other to convert the point to affine.

When an inverse is extremely expensive compared to a field multiply, one thing to consider is that modular inversions are extremely easy to blind perfectly instead of making them constant time. At worst, you simply do a batch inversion with a random value in the batch, though sometimes, as for the projection from Jacobian to affine, you can blind even more efficiently.

It may well be that the cost of blinding, including generating a 'random number' (e.g. a hash of the nonce) for it, is greater than the cost of just using a sufficiently good(tm) constant-time inversion. But that should probably be the comparison point, especially given that a fast constant-time inversion is complicated to implement and validate, while blinding is less so.
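
A minimal Python sketch of the blinding idea (assuming the modulus p is public and a is invertible; the function name is illustrative):

import secrets

def blinded_inverse(a, p):
    # Mask a with a uniformly random nonzero r, invert the masked value with
    # any (possibly variable-time) algorithm, then unmask.  The timing of the
    # inversion now depends on a*r mod p, not on the secret a.
    r = 0
    while r == 0:
        r = secrets.randbelow(p)
    masked_inv = pow(a * r % p, -1, p)   # stand-in for whatever inversion is cheapest
    return masked_inv * r % p            # (a*r)^-1 * r = a^-1 (mod p)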

> ECDSA verification and ECDH also improved, but not as much (percentage-wise) because getting the affine coordinate is a smaller share of the total runtime there.

ECDSA verification can be done without projecting back to affine. Instead you can project the affine R provided by the signature, which doesn't require any inversion. Some care is required to correctly handle the modular reduction of r by the signer.

The only inversion needed in ECDSA validation is the scalar inversion of the incoming s, because the standards foolishly don't have the signer do it for you. :)
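
A hedged sketch of the "project r" comparison described above, assuming Jacobian coordinates (X, Y, Z) with affine x = X / Z^2 mod p and that R is not the point at infinity; names are illustrative:

def ecdsa_check_r(r, X, Z, p, n):
    # Accept iff the affine x-coordinate X/Z^2 mod p, reduced mod n, equals r.
    # Instead of inverting Z, multiply the candidate by Z^2 and compare.  Because
    # the signer reduced the x-coordinate mod n, the affine x may be r or r + n.
    z2 = Z * Z % p
    for candidate in (r, r + n):
        if candidate < p and candidate * z2 % p == X % p:
            return True
    return False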

randombit added a commit that referenced this issue Jul 14, 2024
…ality check

@gmaxwell pointed out in a really great comment on #1479 that you
don't need to actually perform a projective->affine conversion in
ECDSA verification, since instead you can project the r value.

However, in the current setup that's not possible, since the function is
defined as returning the value and the comparison then happens in the
pubkey code. Instead, have the expected value be passed down and all
that comes back is a boolean accept or reject. This allows the
project-r optimization.

This also avoids some back and forth with the various type wrappers,
which is a small win on its own.
randombit added a commit that referenced this issue Jul 15, 2024
Saves approximately 11K cycles for P-384 and 36K cycles for P-521,
improving ECDSA signing for both curves by ~4%.

GH #4027 #1479