# Faster modular inversion #1479
WRT using Fermat's little theorem, it would be beneficial to use the known addition chains for computing the exponentiation.
https://eprint.iacr.org/2014/852.pdf has an addition chain for P-521 inversion.
Centralizing this logic allows curve-specific implementations, such as using a precomputed ladder for exponentiating by p - 2. GH #1479
Could be slightly more clever here but this is pretty decent. GH #1479
Cuts about 100K cycles from the inversion, improving ECDSA sign by 10% and ECDH by ~2%. Addition chain from https://briansmith.org/ecc-inversion-addition-chains-01 GH #1479
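The Fermat approach used by these commits computes a^-1 as a^(p-2) mod p; the curve-specific addition chains are just cheaper ways to evaluate that fixed exponent. A hedged sketch in plain Python (generic square-and-multiply via `pow` rather than an addition chain; `fermat_invert` is an illustrative name, not Botan's API):

```python
# Sketch of Fermat inversion: for prime p and a not divisible by p,
# a^(p-2) mod p is the modular inverse of a (Fermat's little theorem).
# A curve-specific addition chain evaluates this fixed exponent with
# fewer multiplications than the generic ladder pow() uses here.
def fermat_invert(a, p):
    assert a % p != 0
    return pow(a, p - 2, p)

# Example with the P-256 field prime; any prime modulus works.
p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
a = 123456789
assert (a * fermat_invert(a, p256)) % p256 == 1
```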
With #1546 and #1547 we have faster field inversions for P-256, P-384, and P-521, all of which improved ECDSA signature performance by ~10-15%. ECDSA verification and ECDH also improved, but not as much (percentage-wise) because getting the affine coordinate is less of the total runtime there. We may also want specialized inversions modulo the curve order, but this doesn't really seem worthwhile because the only algorithm that benefits is ECDSA. It would be better at this point to work on improving the performance of the generic const-time modular inversion, which would have a lot of benefits across the codebase.
Have you looked at https://eprint.iacr.org/2019/266.pdf already? It seems to me there may be a chance it can be useful. Any opinion?
@henrydcase I have seen that paper but currently don't understand at all how the algorithm works. It seems much faster: e.g. they report 30K Skylake cycles for inversion modulo the 511-bit M-511 prime, while Botan's const-time algorithm takes 200K+ Skylake cycles for a randomly chosen 512-bit prime. And the paper claims "Our advantage is also larger in applications that use 'random' primes rather than special primes" [vs Fermat]. So overall certainly promising. Even if we assume a 5x slowdown going from DJB hand-coded asm to C++, that's still a decent speedup. Also, our current const-time algorithm only works for an odd modulus. That paper also has (Figure 1.2) a fast algorithm for gcd which looks simple to make const-time, and which would be useful to replace our current "mostly" const-time gcd.
The Bernstein-Yang algorithm also only works for odd moduli. Note: I'm not a mathematician and I haven't tried to implement it yet because it's still over my head, but here is a start of what I understood, the challenges, and a naive analysis of what the speed of the algorithm could be.
```python
# Sage script from the Bernstein-Yang paper (https://eprint.iacr.org/2019/266.pdf)
def truncate(f, t):
    if t == 0: return 0
    twot = 1 << (t - 1)
    return ((f + twot) & (2 * twot - 1)) - twot

def divsteps2(n, t, delta, f, g):
    assert t >= n and n >= 0
    f, g = truncate(f, t), truncate(g, t)
    u, v, q, r = 1, 0, 0, 1
    while n > 0:
        f = truncate(f, t)
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        delta, g, q, r = 1 + delta, (g + g0 * f) / 2, (q + g0 * u) / 2, (r + g0 * v) / 2
        n, t = n - 1, t - 1
        g = truncate(ZZ(g), t)
    M2Q = MatrixSpace(QQ, 2)
    return delta, f, g, M2Q((u, v, q, r))

def iterations(d):
    return (49 * d + 80) // 17 if d < 46 else (49 * d + 57) // 17

def recip2(f, g):
    ## Compute g^-1 mod f: f MUST be odd
    assert f & 1
    d = max(f.nbits(), g.nbits())
    m = iterations(d)
    print(f'm: {m}')
    precomp = Integers(f)((f + 1) / 2) ^ (m - 1)
    print(f'precomp: {precomp}')
    delta, fm, gm, P = divsteps2(m, m + 1, 1, f, g)
    print(f'P[0][1]: {P[0][1]}')
    V = sign(fm) * ZZ(P[0][1] * 2 ^ (m - 1))
    return ZZ(V * precomp)
```

### Implementation challenges

As it stands the paper is hard to naively implement in a cryptographic library:
### Analysis

This is my naive analysis; unfortunately I couldn't find an optimized implementation.
The first thing that jumps out at me is that the number of iterations is scaled by roughly 49/17 ≈ 2.88 per input bit (the `iterations` bound above).

### Alternative implementations
### Other algorithms

Besides the Bernstein-Yang paper, the most recent papers on constant-time inversion are:
Both cost
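For reference, the Sage listing above can be run in plain Python by swapping the Sage types for stdlib ones (`Fraction` for the rationals, `int` for `ZZ`). This is a hedged readability sketch of the Bernstein-Yang recipe, not a constant-time or optimized implementation, and the function names are mine:

```python
from fractions import Fraction

def truncate(f, t):
    # Keep the low t bits of f, interpreted as a signed value.
    if t == 0:
        return 0
    twot = 1 << (t - 1)
    return ((f + twot) & (2 * twot - 1)) - twot

def divsteps2(n, t, delta, f, g):
    # n division steps on the low t bits of (f, g); u, v, q, r track the
    # transition matrix as exact dyadic rationals (Fraction stands in for QQ).
    f, g = truncate(f, t), truncate(g, t)
    u, v, q, r = Fraction(1), Fraction(0), Fraction(0), Fraction(1)
    while n > 0:
        f = truncate(f, t)
        if delta > 0 and g & 1:
            delta, f, g, u, v, q, r = -delta, g, -f, q, r, -u, -v
        g0 = g & 1
        delta, g = 1 + delta, (g + g0 * f) // 2   # the sum is always even
        q, r = (q + g0 * u) / 2, (r + g0 * v) / 2
        n, t = n - 1, t - 1
        g = truncate(g, t)
    return delta, f, g, (u, v, q, r)

def iterations(d):
    # Iteration bound from the paper for d-bit inputs.
    return (49 * d + 80) // 17 if d < 46 else (49 * d + 57) // 17

def invert_mod(g, f):
    # g^-1 mod f for odd f, following the paper's recip2() recipe.
    assert f & 1
    d = max(f.bit_length(), g.bit_length())
    m = iterations(d)
    precomp = pow((f + 1) // 2, m - 1, f)   # ((f+1)/2)^(m-1) mod f
    _, fm, _, (u, v, q, r) = divsteps2(m, m + 1, 1, f, g)
    V = (1 if fm > 0 else -1) * int(v * (1 << (m - 1)))
    return (V * precomp) % f
```

For example `invert_mod(3, 7)` returns 5. On 255-bit inputs this runs `iterations(255) = 738` one-bit divsteps with ever-growing `Fraction` denominators; a serious implementation batches word-sized chunks of steps into integer 2x2 transition matrices instead.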
@mratsim Very helpful, thank you. I had not seen either of the other implementations you reference. I had earlier read Bos' paper on const-time Montgomery inversion, but it does not seem to me a promising approach, as the (not constant-time) implementation of this algorithm in
Regarding speed, you are probably aware of this, but GCC is absolutely horrible at handling multiprecision arithmetic. For my elliptic curve library, this is the speed I get on field operations with Clang (inversion using Möller's algorithm) on Ethereum/blockchain-related elliptic curves. A result of 0 means the compiler optimized the operation away (I tried some volatile reads/writes, but I don't want to slow down the benchmark either, and I'm mostly interested in multiplication/squaring/inversion anyway).
And with GCC:
For inversion GCC is 2x slower than Clang. This is not unique to my library; I reported the same to the fiat-crypto project.

### Carries

In particular, GCC does not handle carries properly (see https://gcc.godbolt.org/z/2h768y), even when using the real `_addcarry_u64` intrinsic:

```c
#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
    uint8_t carry = 0;
    for (int i = 0; i < 4; ++i)
        carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}
```

GCC `add256`:

```asm
add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret
```

Clang `add256`:

```asm
add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq
```
When an inverse is extremely expensive compared to a field multiply: one thing to consider is that modular inversions are extremely easy to blind perfectly, instead of making them constant time. At worst you simply do a batch inversion with a random value in the batch, though sometimes, as in the projection from Jacobian to affine, you can blind even more efficiently. It may well be that the cost of blinding, including generating a 'random number' (e.g. a hash of the nonce) for it, is greater than the cost of just using a sufficiently good(tm) constant-time inversion. But that should probably be the comparison point, especially given that a fast constant-time inversion is complicated to implement and validate, while blinding is less so.
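The simplest single-value form of this blinding can be sketched as follows (hedged: plain Python, `blinded_invert` is an illustrative name; a real implementation would draw the mask from the library's existing RNG and might fold it into a batch inversion):

```python
import secrets

# Sketch of blinded inversion: multiply the secret a by a random mask r,
# invert the product with any (possibly variable-time) algorithm, then
# unmask. The variable-time step only ever sees a*r mod p, which is
# uniformly random and independent of a.
def blinded_invert(a, p):
    while True:
        r = secrets.randbelow(p)
        if r != 0:
            break
    masked = (a * r) % p
    masked_inv = pow(masked, -1, p)   # may be variable-time; input is blinded
    return (masked_inv * r) % p       # (a*r)^-1 * r = a^-1 mod p
```

For example, with p prime, `(a * blinded_invert(a, p)) % p == 1` for any nonzero a.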
ECDSA verification can be done without projecting back to affine. Instead you can project the affine R provided by the signature, which doesn't require any inversion. Some care is required to correctly handle the modular reduction of r by the signer. The only inversion needed in ECDSA validation is the scalar inversion of the incoming s, because the standards foolishly don't have the signer do it for you. :)
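For Jacobian coordinates (where affine x = X/Z², so r matches exactly when X ≡ r·Z² mod p) the comparison can be sketched like this; the helper name and the handling of the single r+n corner case are my illustration with toy parameters, not Botan's code:

```python
# Hedged sketch: check whether the signature value r equals the affine
# x-coordinate of a Jacobian point (X : Y : Z) without inverting Z.
# Affine x = X / Z^2 mod p, so instead test X == r * Z^2 mod p.
def r_matches_jacobian_x(r, X, Z, p, n):
    z2 = pow(Z, 2, p)
    if (r * z2) % p == X % p:
        return True
    # The signer reduced x mod the group order n; when p > n the true
    # x-coordinate may have been r + n before that reduction.
    return r + n < p and ((r + n) * z2) % p == X % p

# Toy numbers (not a real curve): p = 23, n = 19, Z = 3, affine x = 5,
# so X = 5 * 3^2 mod 23 = 22.
assert r_matches_jacobian_x(5, 22, 3, 23, 19)
assert not r_matches_jacobian_x(6, 22, 3, 23, 19)
```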
…ality check @gmaxwell pointed out in a really great comment on #1479 that you don't need to actually perform a projective->affine conversion in ECDSA verification, since instead you can project the r value. However in the current setup that's not possible, since the function is defined as returning the value and then the comparison happens in the pubkey code. Instead, have the expected value be passed down, and all that comes back is a boolean accept or reject. This allows the project-r optimization. This also avoids some back and forth with the various type wrappers, which is a small win on its own.
At this point modular inversion is one of the major bottlenecks in ECDSA signature generation, at about 30% of the total runtime. Two inversions are required, one for the nonce and the other to convert the point to affine.
Niels Möller sent me an email in which he contended that the fastest approach to const-time inversion at ECC sizes is Fermat's little theorem. However, in Botan with P-521, the const-time inversion algorithm (which Niels invented) is over twice as fast. So maybe the issue is (also) that our modular exponentiation algorithm is too slow.
I ran some quick checks: OpenSSL seems to take ~210K cycles for const-time modular inversion, vs 360K cycles in `ct_inverse_mod_odd_modulus`. Simply matching that would improve ECDSA signature performance by 15%.