
Implement the modular inverse using unsigned 256-bit integer addition and shifts #1073

Closed
wants to merge 13 commits into from

Conversation

k06a

@k06a k06a commented Feb 7, 2022

Original algorithm borrowed from this paper (LS3):
https://www.researchgate.net/publication/304417579_Modular_Inverse_Algorithms_Without_Multiplications_for_Cryptographic_Applications

[image: LS3 algorithm pseudocode from the paper]

Was improved by Anton Bukov (@k06a) and Mikhail Melnik (@ZumZoom) to use unsigned 256-bit integers: https://gist.github.com/k06a/b990b7c7dda766d4f661e653d6804a53

This would allow avoiding the 62-bit signed representation and computing modinv without 256-bit multiplications and divisions.
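To illustrate the idea, here is a minimal sketch (not the PR's actual code, and shown on 64-bit words rather than 256-bit limbs) of a binary modular inverse in the LS3 spirit: it uses only additions, subtractions, comparisons, and shifts, with no multiplications or divisions. It assumes an odd modulus p < 2^63 and gcd(a, p) == 1.

```c
#include <stdint.h>

/* Binary (shift-and-add) modular inverse sketch.
 * Invariants maintained throughout: x1*a == u (mod p), x2*a == v (mod p).
 * When u (or v) reaches 1, the matching x is the inverse of a mod p. */
static uint64_t modinv_binary(uint64_t a, uint64_t p) {
    uint64_t u = a, v = p;
    uint64_t x1 = 1, x2 = 0;
    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {
            u >>= 1;
            /* Halve x1 mod p: if x1 is odd, compute (x1 + p) / 2 without overflow. */
            x1 = (x1 & 1) ? (x1 >> 1) + (p >> 1) + 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {
            v >>= 1;
            x2 = (x2 & 1) ? (x2 >> 1) + (p >> 1) + 1 : x2 >> 1;
        }
        /* Subtract the smaller from the larger; mirror the step on x1/x2 mod p. */
        if (u >= v) { u -= v; x1 = (x1 >= x2) ? x1 - x2 : x1 + p - x2; }
        else        { v -= u; x2 = (x2 >= x1) ? x2 - x1 : x2 + p - x1; }
    }
    return (u == 1) ? x1 : x2;
}
```

The halving trick relies on p being odd: when x1 is odd, x1 + p is even, so (x1 >> 1) + (p >> 1) + 1 equals (x1 + p) / 2 exactly.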

The code passes all the tests, but we injected some code to benchmark the new implementation as well; it still needs to be moved to the bench target. Maybe someone could help with this?

Running tests on ARM64 (Apple M1 Max) gives a 20% improvement for the modinv() method:

t1 = clock();
for (i = 0; i < 1000000; i++) {
    secp256k1_scalar_to_signed62(&s, x);
    secp256k1_modinv64(&s, &secp256k1_const_modinfo_scalar);
    secp256k1_scalar_from_signed62(r, &s);
    CHECK(secp256k1_scalar_eq(r, &rrr));
}
t2 = clock();
for (i = 0; i < 1000000; i++) {    
    secp256k1_modinv64_scalar(r, x, &secp256k1_const_mod_scalar);
    CHECK(secp256k1_scalar_eq(r, &rrr));
}
t3 = clock();
printf("Time old: %f\n", ((double)(t2 - t1)) / CLOCKS_PER_SEC);
printf("Time new: %f\n", ((double)(t3 - t2)) / CLOCKS_PER_SEC);
printf("Improvement: %.2f%%\n", (1 - (double)(t3 - t2) / (double)(t2 - t1)) * 100);

=>

Time old: 3.284568
Time new: 2.639177
Improvement: 19.65%

Please help test this on x86 and x86-64 architectures; we tested only on ARM64.

@k06a k06a marked this pull request as ready for review February 7, 2022 20:50
@sipa
Contributor

sipa commented Feb 7, 2022

Benchmarks on AMD Ryzen 9 5950x:

On master (85b00a1):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     1.28      ,     1.29      ,     1.44   
scalar_inverse_var            ,     0.873     ,     0.875     ,     0.881  

This PR (879ecba):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,     1.29      ,     1.31      ,     1.43   
scalar_inverse_var            ,     4.69      ,     4.71      ,     4.72   

So this looks like a ~5x slowdown. Is your benchmark realistic? Is the number being inverted perhaps highly structured in your benchmark?

@sipa
Contributor

sipa commented Feb 7, 2022

On Developerbox 1GHz 24-core Cortex-A53 (ARM64):

On master (85b00a1):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
scalar_inverse_var            ,     7.54      ,     7.54      ,     7.55   

This PR (2737a61):

$ ./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
scalar_inverse_var            ,    36.1       ,    36.1       ,    36.1    

Also a ~5x slowdown.

@k06a
Author

k06a commented Feb 7, 2022

@sipa running the benchmark somehow gave me the same numbers on both branches; please try to run the tests on a 64-bit architecture and uncomment the following lines:

  1. scalar_4x64_impl.h#L981: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L981
  2. scalar_4x64_impl.h#L985-L1006: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L985-L1006

@k06a
Author

k06a commented Feb 7, 2022

Ah, I see you ran the scalar_inverse_var benchmark directly!

@k06a
Author

k06a commented Feb 7, 2022

On my laptop (M1 Max CPU) the results are the following:

On master:

./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse                ,     1.87      ,     1.91      ,     1.96   
scalar_inverse_var            ,     4.46      ,     4.49      ,     4.52   
field_inverse                 ,     1.87      ,     1.90      ,     1.91   
field_inverse_var             ,     1.15      ,     1.17      ,     1.26   

On this PR:

./bench_internal inverse
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse                ,     1.89      ,     1.97      ,     2.32   
scalar_inverse_var            ,     4.49      ,     4.54      ,     4.60   
field_inverse                 ,     1.91      ,     1.93      ,     1.96   
field_inverse_var             ,     1.16      ,     1.17      ,     1.19 

@sipa
Contributor

sipa commented Feb 7, 2022

@k06a Are you sure you recompiled between running the benchmarks? The vartime scalar inverse should be faster than the constant-time one, yet in your benchmark it appears slower on both master and your PR.

@k06a
Author

k06a commented Feb 7, 2022

make && ./bench_internal inverse

On master:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse_var            ,     1.16      ,     1.17      ,     1.17   

On this PR:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
scalar_inverse_var            ,     4.45      ,     4.52      ,     4.62   

@k06a
Author

k06a commented Feb 7, 2022

Curious why we saw a 20% improvement running the tests; maybe different compilation options, or the compiler optimized away some iterations...

@sipa
Contributor

sipa commented Feb 7, 2022

@k06a My guess is that you were benchmarking inversion of the number 1 or so, or some other highly structured number.

It makes no sense to me why an algorithm like this would be faster than what we have. It seems designed for hardware in which multiplication is expensive, which is increasingly not the case in modern CPUs.

@k06a
Author

k06a commented Feb 7, 2022

@sipa it was a 256-bit value, actually the value from the first test.

@k06a
Author

k06a commented Feb 7, 2022

Now we at least know how to run benchmarks properly :)

@sipa
Contributor

sipa commented Feb 7, 2022

@k06a It's also possible that you're training the CPU's branch predictor non-representatively by inverting the same number over and over again (or alternating between the two same numbers). The scalar_inverse_var benchmark in bench_internal avoids that by adding a constant randomish term after every inversion.
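The pattern sipa describes can be sketched as follows. This is a hypothetical harness, not the actual bench_internal code: the operation under test (bench_op here is just a placeholder workload) is fed a different input every iteration by folding a fixed "randomish" constant into the value, so the branch predictor cannot be trained on one specific bit pattern.

```c
#include <stdint.h>
#include <time.h>

/* Placeholder for the operation being measured (stand-in workload). */
static uint64_t bench_op(uint64_t x) { return x * 0x9E3779B97F4A7C15ULL; }

/* Time one million iterations, perturbing the input after each call
 * so no single bit pattern dominates; returns elapsed seconds and
 * writes the final value to *out so the loop cannot be optimized away. */
static double bench_perturbed(uint64_t *out) {
    uint64_t x = 0x0123456789ABCDEFULL;
    clock_t t0 = clock();
    for (int i = 0; i < 1000000; i++) {
        x = bench_op(x);
        x += 0xD1B54A32D192ED03ULL; /* constant "randomish" term after every op */
    }
    *out = x;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Benchmarking the same input over and over instead would measure a best case the predictor has fully learned, which is exactly the distortion discussed above.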

@k06a
Author

k06a commented Feb 7, 2022

@sipa we saw strange performance fluctuations while adding some changes; very likely we became victims of branch prediction.

@mratsim

mratsim commented Feb 9, 2022

Your algorithm is similar to the classic Euclid algorithm. You can easily use 256-bit full-width numbers, see the (constant-time) algorithm from GMP by Niels Moller.

[image: Niels Moller's invmod algorithm]

It's even simpler than your algorithm because there is no need for absolute values: there is no check for less than 0, it only operates on the relative difference of unsigned u and v.
You might note a u < 0 on line 12, but this can be tested on the previous line by taking the borrow from u - v and then conditionally adding the modulus, so it's just regular unsigned modular addition.
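That borrow trick can be sketched like this (a hypothetical single-word illustration; the real code would propagate the borrow across 256-bit limbs):

```c
#include <stdint.h>

/* Compute (u - v) mod p without ever representing a negative number:
 * capture the borrow of u - v as an all-ones mask and conditionally
 * add the modulus p using that mask instead of a signed test. */
static uint64_t sub_mod(uint64_t u, uint64_t v, uint64_t p) {
    uint64_t d = u - v;                       /* wraps around if u < v */
    uint64_t borrow = (uint64_t)0 - (u < v);  /* all-ones mask iff we borrowed */
    return d + (p & borrow);                  /* add p only when we borrowed */
}
```

With a mask instead of a branch, this step also compiles to straight-line code, which matters for the constant-time variant.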

Performance

Compared to algorithms based on divsteps/transition matrices (Bernstein-Yang and Pornin's), both yours and Moller's use full-width operations at each iteration, while divsteps only uses full-width operations once every 62 iterations. Even if the full-width operations are 4~5x more costly (applying a transition matrix to 4 bigints), that's still a theoretical 12~15x speedup. This leaves a lot of time for "book-keeping" operations like the change of base to 2^62.

See more details here #767 (comment) (warning: rabbit hole).

Besides, conversion to/from the 62-bit representation is essentially free: x86-64 can issue at least 2 shifts per cycle, and there are 10 shifts to issue, for a total of at most 5 cycles.
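For concreteness, the 4x64 to 5x62 limb conversion is just shifts, ORs, and masks. This sketch mirrors the idea of secp256k1's to_signed62 conversion, not its exact code:

```c
#include <stdint.h>

/* Repack a 256-bit value from four 64-bit limbs into five 62-bit limbs
 * (the last limb holds the top 8 bits): pure shift/OR/mask work. */
static void to_62(const uint64_t a[4], uint64_t r[5]) {
    const uint64_t M62 = 0x3FFFFFFFFFFFFFFFULL; /* low 62 bits */
    r[0] =   a[0]                        & M62;  /* bits   0..61  */
    r[1] = ((a[0] >> 62) | (a[1] << 2)) & M62;  /* bits  62..123 */
    r[2] = ((a[1] >> 60) | (a[2] << 4)) & M62;  /* bits 124..185 */
    r[3] = ((a[2] >> 58) | (a[3] << 6)) & M62;  /* bits 186..247 */
    r[4] =   a[3] >> 56;                        /* bits 248..255 */
}
```

Counting the operations above gives roughly the shift budget mratsim mentions, so the repacking cost is negligible next to the inversion itself.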

@k06a
Author

k06a commented Feb 9, 2022

@mratsim thx for the detailed explanation!

@sipa sipa closed this Apr 16, 2022