Implement the modular inverse using unsigned 256-bit integer addition and shifts #1073
Conversation
Benchmarks on AMD Ryzen 9 5950X: On master (85b00a1):
This PR (879ecba):
So this looks like a ~5x slowdown. Is your benchmark realistic? Is the number being inverted perhaps highly structured in your benchmark?
On Developerbox 1GHz 24-core Cortex-A53 (ARM64): On master (85b00a1):
This PR (2737a61):
Also a ~5x slowdown.
@sipa Running the benchmark gave me the same numbers in both branches somehow; please try to run the tests under a 64-bit architecture and uncomment the following lines:
Ah, I see you ran the scalar_inverse_var benchmark directly!
For my laptop (M1 Max CPU), the results are the following: On master:
On this PR:
@k06a Are you sure you recompiled in between running the benchmarks? The variable-time scalar inverse should be faster than the constant-time one, yet in your benchmark it appears slower on both master and your PR.
On master:
On this PR:
Curious why we saw a 20% improvement when running the tests; maybe different compilation options, or the compiler optimized some iterations away...
@k06a My guess is that you were benchmarking the inversion of the number 1, or some other highly structured number. It makes no sense to me why an algorithm like this would be faster than what we have: it seems designed for hardware on which multiplication is expensive, which is increasingly not the case for modern CPUs.
@sipa It was a 256-bit value, actually the value from the first test.
Now we at least know how to run the benchmarks properly :)
@k06a It's also possible that you're training the CPU's branch predictor non-representatively by inverting the same number over and over again (or alternating between the same two numbers). The scalar_inverse_var benchmark in bench_internal avoids that by adding a constant randomish term after every inversion.
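As a sketch of that pattern (the names here are hypothetical, simplified from bench_internal, not the library's actual API): after each inversion, add a fixed randomish offset so every iteration inverts a different value.

```c
#include <stdint.h>

typedef struct { uint64_t d[4]; } scalar_t;                     /* hypothetical 256-bit scalar */
extern void scalar_inverse_var(scalar_t *r, const scalar_t *a); /* hypothetical */
extern void scalar_add(scalar_t *r, const scalar_t *a, const scalar_t *b); /* hypothetical */

/* Benchmark loop that avoids training the branch predictor on one value. */
void bench_inverse_var(scalar_t *x, const scalar_t *offset, int iters) {
    int i;
    for (i = 0; i < iters; i++) {
        scalar_inverse_var(x, x); /* the operation under measurement */
        scalar_add(x, x, offset); /* perturb the input for the next round */
    }
}
```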
@sipa We saw strange performance fluctuations while adding some changes; very likely we became victims of branch prediction.
Your algorithm is similar to the classic Euclidean algorithm. You can easily use 256-bit full-width numbers; see the (constant-time) algorithm from GMP by Niels Moller. It's even simpler than your algorithm because there is no need for absolute values.

Performance

Compared to algorithms based on divsteps/transition matrices (Bernstein-Yang and Pornin's), both yours and Moller's use full-width operations at each iteration, while divsteps use full-width operations only once every 62 iterations. Even if those full-width operations are 4~5x more costly (applying a transition matrix to 4 bigints), that's still a theoretical 12~15x speedup. This leaves a lot of time for "book-keeping" operations like the change of base to 2^62. See more details here: #767 (comment) (warning: rabbit hole). Besides, conversion to/from the 62-bit representation is nearly free: x86-64 can issue at least 2 shifts per cycle and there are 10 shifts to issue, for a total of at most 5 cycles.
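To make the contrast concrete, here is a structural sketch in C of one divsteps batch in the style of the safegcd algorithms referenced above; the function name and the scaled-matrix convention are illustrative choices, not the library's actual code. The 62 inner iterations touch only machine words; the accumulated matrix t is applied to the full 256-bit f and g once per batch.

```c
#include <stdint.h>

/* One batch of 62 division steps. Afterwards the caller applies
 * [f; g] <- t * [f; g] / 2^62 once to the full-width numbers, so bigint
 * work happens only once per 62 iterations. Assumes arithmetic right
 * shift on signed integers (true on mainstream compilers). */
static int64_t divsteps_62(int64_t delta, int64_t f, int64_t g, int64_t t[4]) {
    int64_t u = 1, v = 0, q = 0, r = 1; /* t = [u v; q r], starts as the identity */
    int i;
    for (i = 0; i < 62; i++) {
        if (delta > 0 && (g & 1)) {
            /* delta > 0 and g odd: (f, g) <- (g, (g - f)/2) */
            int64_t of = f, ou = u, ov = v;
            delta = 1 - delta;
            f = g;          g = (g - of) >> 1;
            u = 2 * q;      v = 2 * r;
            q = q - ou;     r = r - ov;
        } else if (g & 1) {
            /* g odd: g <- (g + f)/2 */
            delta = 1 + delta;
            g = (g + f) >> 1;
            q = q + u;      r = r + v;
            u = 2 * u;      v = 2 * v;
        } else {
            /* g even: g <- g/2 */
            delta = 1 + delta;
            g >>= 1;
            u = 2 * u;      v = 2 * v;
        }
    }
    t[0] = u; t[1] = v; t[2] = q; t[3] = r;
    return delta;
}
```

The matrix entries are doubled on steps that halve g, so after i steps they represent the transition scaled by 2^i; with i = 62 they still fit comfortably in int64_t.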
@mratsim Thanks for the detailed explanation!
The original algorithm is borrowed from this paper (LS3):
https://www.researchgate.net/publication/304417579_Modular_Inverse_Algorithms_Without_Multiplications_for_Cryptographic_Applications
It was improved by Anton Bukov (@k06a) and Mikhail Melnik (@ZumZoom) to use unsigned 256-bit integers: https://gist.github.com/k06a/b990b7c7dda766d4f661e653d6804a53
This would allow avoiding the 62-bit signed representation and computing modinv without 256-bit multiplications or divisions.
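For illustration, here is a toy sketch of the addition-and-shift approach (a binary extended Euclidean inverse), with uint64_t standing in for the unsigned 256-bit type; it shows the general technique the description refers to, not the exact LS3 variant from the paper. It assumes an odd modulus m and gcd(a, m) = 1, so the loop below always terminates.

```c
#include <stdint.h>

/* Toy binary extended Euclidean inverse: only shifts, adds and subtracts.
 * Requires m odd, 0 < a < m, gcd(a, m) = 1. In the PR, the same structure
 * operates on the unsigned 256-bit type, so each halving and subtraction
 * is a full-width shift or add. */
static uint64_t modinv_binary(uint64_t a, uint64_t m) {
    uint64_t u = a, v = m;
    uint64_t x1 = 1, x2 = 0; /* invariants: a*x1 == u (mod m), a*x2 == v (mod m) */
    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {
            u >>= 1;
            /* halve x1 mod m: (x1 + m)/2 if x1 is odd, written to avoid overflow */
            x1 = (x1 & 1) ? (x1 >> 1) + (m >> 1) + 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {
            v >>= 1;
            x2 = (x2 & 1) ? (x2 >> 1) + (m >> 1) + 1 : x2 >> 1;
        }
        if (u >= v) {
            u -= v;
            x1 = (x1 >= x2) ? x1 - x2 : x1 + (m - x2); /* subtraction mod m */
        } else {
            v -= u;
            x2 = (x2 >= x1) ? x2 - x1 : x2 + (m - x1);
        }
    }
    return (u == 1) ? x1 : x2;
}
```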
The code passes all the tests, but we also injected some code to benchmark the new implementation; it still needs to be moved to the bench target, maybe someone could help with this?
Running the tests on ARM64 (Apple M1 Max) gives a 20% improvement for the modinv() method:
Please help test it on x86 and x64 architectures; we tested only ARM64.