
Implement the modular inverse using unsigned 256-bit integer addition and shifts #1073

Closed
wants to merge 13 commits into from

Conversation

k06a

@k06a k06a commented Feb 7, 2022

Original algorithm borrowed from this paper (LS3):
https://www.researchgate.net/publication/304417579_Modular_Inverse_Algorithms_Without_Multiplications_for_Cryptographic_Applications

[image: LS3 algorithm pseudocode from the paper]

Was improved by Anton Bukov (@k06a) and Mikhail Melnik (@ZumZoom) to use unsigned 256-bit integers: https://gist.github.com/k06a/b990b7c7dda766d4f661e653d6804a53

This would allow avoiding the 62-bit signed representation and computing modinv without 256-bit multiplications and divisions.
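To illustrate the idea, here is a minimal sketch (not the PR's actual code, and shown on 64-bit words rather than 256-bit limbs) of a binary modular inverse in the LS3 spirit: it uses only additions, subtractions, comparisons, and shifts, with no multiplications or divisions. It assumes an odd modulus p < 2^63 and gcd(a, p) == 1.

```c
#include <stdint.h>

/* Binary (shift-and-add) modular inverse sketch.
 * Invariants maintained throughout: x1*a == u (mod p), x2*a == v (mod p).
 * When u (or v) reaches 1, the matching x is the inverse of a mod p. */
static uint64_t modinv_binary(uint64_t a, uint64_t p) {
    uint64_t u = a, v = p;
    uint64_t x1 = 1, x2 = 0;
    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {
            u >>= 1;
            /* Halve x1 mod p: if x1 is odd, compute (x1 + p) / 2 without overflow. */
            x1 = (x1 & 1) ? (x1 >> 1) + (p >> 1) + 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {
            v >>= 1;
            x2 = (x2 & 1) ? (x2 >> 1) + (p >> 1) + 1 : x2 >> 1;
        }
        /* Subtract the smaller from the larger; mirror the step on x1/x2 mod p. */
        if (u >= v) { u -= v; x1 = (x1 >= x2) ? x1 - x2 : x1 + p - x2; }
        else        { v -= u; x2 = (x2 >= x1) ? x2 - x1 : x2 + p - x1; }
    }
    return (u == 1) ? x1 : x2;
}
```

The halving trick relies on p being odd: when x1 is odd, x1 + p is even, so (x1 >> 1) + (p >> 1) + 1 equals (x1 + p) / 2 exactly.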

The code passes all the tests, but we injected some code to benchmark the new implementation as well; it still needs to be moved to the bench target. Maybe someone could help with this?

Running tests on ARM64 (Apple M1 Max) gives a 20% improvement for the modinv() method:

t1 = clock();
for (i = 0; i < 1000000; i++) {
    secp256k1_scalar_to_signed62(&s, x);
    secp256k1_modinv64(&s, &secp256k1_const_modinfo_scalar);
    secp256k1_scalar_from_signed62(r, &s);
    CHECK(secp256k1_scalar_eq(r, &rrr));
}
t2 = clock();
for (i = 0; i < 1000000; i++) {    
    secp256k1_modinv64_scalar(r, x, &secp256k1_const_mod_scalar);
    CHECK(secp256k1_scalar_eq(r, &rrr));
}
t3 = clock();
printf("Time old: %f\n", ((double)(t2 - t1)) / CLOCKS_PER_SEC);
printf("Time new: %f\n", ((double)(t3 - t2)) / CLOCKS_PER_SEC);
printf("Improvement: %.2f%%\n", (1 - (double)(t3 - t2) / (double)(t2 - t1)) * 100);

=>

Time old: 3.284568
Time new: 2.639177
Improvement: 19.65%

Please help test this on x86 and x86-64 architectures; we tested only on ARM64.

@k06a k06a marked this pull request as ready for review February 7, 2022 20:50
@sipa
Contributor

sipa commented Feb 7, 2022

Benchmarks on AMD Ryzen 9 5950x:

On master (85b00a1):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     1.28      ,     1.29      ,     1.44   
scalar_inverse_var            ,     0.873     ,     0.875     ,     0.881  

This PR (879ecba):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     1.29      ,     1.31      ,     1.43   
scalar_inverse_var            ,     4.69      ,     4.71      ,     4.72   

So this looks like a ~5x slowdown. Is your benchmark realistic? Is the number being inverted perhaps highly structured in your benchmark?

@sipa
Contributor

sipa commented Feb 7, 2022

On Developerbox 1GHz 24-core Cortex-A53 (ARM64):

On master (85b00a1):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
scalar_inverse_var            ,     7.54      ,     7.54      ,     7.55   

This PR (2737a61):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
scalar_inverse_var            ,    36.1       ,    36.1       ,    36.1    

Also a ~5x slowdown.

@k06a
Author

k06a commented Feb 7, 2022

@sipa running the benchmark somehow gave me the same numbers on both branches; please try to run the tests on a 64-bit architecture and uncomment the following lines:

  1. scalar_4x64_impl.h#L981: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L981
  2. scalar_4x64_impl.h#L985-L1006: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L985-L1006

@k06a
Author

k06a commented Feb 7, 2022

Ah, I see you ran the scalar_inverse_var benchmark directly!

@k06a
Author

k06a commented Feb 7, 2022

On my laptop (M1 Max CPU) the results are the following:

On master:

./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse                ,     1.87      ,     1.91      ,     1.96   
scalar_inverse_var            ,     4.46      ,     4.49      ,     4.52   
field_inverse                 ,     1.87      ,     1.90      ,     1.91   
field_inverse_var             ,     1.15      ,     1.17      ,     1.26   

On this PR:

./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse                ,     1.89      ,     1.97      ,     2.32   
scalar_inverse_var            ,     4.49      ,     4.54      ,     4.60   
field_inverse                 ,     1.91      ,     1.93      ,     1.96   
field_inverse_var             ,     1.16      ,     1.17      ,     1.19 

@sipa
Contributor

sipa commented Feb 7, 2022

@k06a Are you sure you recompiled between running the benchmarks? The vartime scalar inverse should be faster than the constant-time one, yet in your benchmark it appears slower on both master and your PR.

@k06a
Author

k06a commented Feb 7, 2022

make && ./bench_internal inverse

On master:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse_var            ,     1.16      ,     1.17      ,     1.17   

On this PR:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse_var            ,     4.45      ,     4.52      ,     4.62   

@k06a
Author

k06a commented Feb 7, 2022

Curious why we saw a 20% improvement running the tests; maybe different compilation options, or the compiler optimized away some iterations...

@sipa
Contributor

sipa commented Feb 7, 2022

@k06a My guess is that you were benchmarking inversion of the number 1 or so, or some other highly structured number.

It makes no sense to me why an algorithm like this would be faster than what we have. It seems designed for hardware in which multiplication is expensive, which is increasingly not the case in modern CPUs.

@k06a
Author

k06a commented Feb 7, 2022

@sipa it was a 256-bit value, actually the value from the first test.

@k06a
Author

k06a commented Feb 7, 2022

Now we at least know how to run benchmarks properly :)

@sipa
Contributor

sipa commented Feb 7, 2022

@k06a It's also possible that you're training the CPU's branch predictor non-representatively by inverting the same number over and over again (or alternating between the two same numbers). The scalar_inverse_var benchmark in bench_internal avoids that by adding a constant randomish term after every inversion.
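The pattern sipa describes can be sketched as follows. This is a hypothetical harness, not the actual bench_internal code: the operation under test (bench_op here is just a placeholder workload) is fed a different input every iteration by folding a fixed "randomish" constant into the value, so the branch predictor cannot be trained on one specific bit pattern.

```c
#include <stdint.h>
#include <time.h>

/* Placeholder for the operation being measured (stand-in workload). */
static uint64_t bench_op(uint64_t x) { return x * 0x9E3779B97F4A7C15ULL; }

/* Time one million iterations, perturbing the input after each call
 * so no single bit pattern dominates; returns elapsed seconds and
 * writes the final value to *out so the loop cannot be optimized away. */
static double bench_perturbed(uint64_t *out) {
    uint64_t x = 0x0123456789ABCDEFULL;
    clock_t t0 = clock();
    for (int i = 0; i < 1000000; i++) {
        x = bench_op(x);
        x += 0xD1B54A32D192ED03ULL; /* constant "randomish" term after every op */
    }
    *out = x;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Benchmarking the same input over and over instead would measure a best case the predictor has fully learned, which is exactly the distortion discussed above.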

@k06a
Author

k06a commented Feb 7, 2022

@sipa we saw strange performance fluctuations while adding some changes; very likely we became victims of branch prediction.

@mratsim

mratsim commented Feb 9, 2022

Your algorithm is similar to the classic Euclid algorithm. You can easily use 256-bit full-width numbers, see the (constant-time) algorithm from GMP by Niels Moller.

[image: Niels Moller's invmod algorithm]

It's even simpler than your algorithm because there is no need for absolute values: there is no check for less than 0, it only operates on the relative difference of unsigned u and v.
You might note a u < 0 on line 12, but this can be tested on the previous line by taking the borrow from u - v and then conditionally adding the modulus, so it's just regular unsigned modular addition.
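That borrow trick can be sketched like this (a hypothetical single-word illustration; the real code would propagate the borrow across 256-bit limbs):

```c
#include <stdint.h>

/* Compute (u - v) mod p without ever representing a negative number:
 * capture the borrow of u - v as an all-ones mask and conditionally
 * add the modulus p using that mask instead of a signed test. */
static uint64_t sub_mod(uint64_t u, uint64_t v, uint64_t p) {
    uint64_t d = u - v;                       /* wraps around if u < v */
    uint64_t borrow = (uint64_t)0 - (u < v);  /* all-ones mask iff we borrowed */
    return d + (p & borrow);                  /* add p only when we borrowed */
}
```

With a mask instead of a branch, this step also compiles to straight-line code, which matters for the constant-time variant.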

Performance

Compared to algorithms based on divsteps/transition matrices (Bernstein-Yang and Pornin's), both yours and Moller's use full-width operations at each iteration, while divsteps only uses full-width operations once every 62 iterations. Even if the full-width operations are 4~5x more costly (applying a transition matrix to 4 bigints), that's still a theoretical 12~15x speedup. This leaves a lot of time for "book-keeping" operations like the change of base to 2^62.

See more details here #767 (comment) (warning: rabbit hole).

Besides, conversion to/from the 62-bit representation is essentially free: x86-64 can issue at least 2 shifts per cycle, and there are 10 shifts to issue, for a total of at most 5 cycles.
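For concreteness, the 4x64 to 5x62 limb conversion is just shifts, ORs, and masks. This sketch mirrors the idea of secp256k1's to_signed62 conversion, not its exact code:

```c
#include <stdint.h>

/* Repack a 256-bit value from four 64-bit limbs into five 62-bit limbs
 * (the last limb holds the top 8 bits): pure shift/OR/mask work. */
static void to_62(const uint64_t a[4], uint64_t r[5]) {
    const uint64_t M62 = 0x3FFFFFFFFFFFFFFFULL; /* low 62 bits */
    r[0] =   a[0]                        & M62;  /* bits   0..61  */
    r[1] = ((a[0] >> 62) | (a[1] << 2)) & M62;  /* bits  62..123 */
    r[2] = ((a[1] >> 60) | (a[2] << 4)) & M62;  /* bits 124..185 */
    r[3] = ((a[2] >> 58) | (a[3] << 6)) & M62;  /* bits 186..247 */
    r[4] =   a[3] >> 56;                        /* bits 248..255 */
}
```

Counting the operations above gives roughly the shift budget mratsim mentions, so the repacking cost is negligible next to the inversion itself.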

@k06a
Author

k06a commented Feb 9, 2022

@mratsim thx for the detailed explanation!

@sipa sipa closed this Apr 16, 2022