[Optim] SIMD for Pornin's GCD inverse #62
Comments
Take care not to do too many iterations here. If you do t iterations in the inner loop, you need to work with words of at least 2t+2 bits (t "low bits" and t+2 "high bits"). My code with 64-bit words thus uses 31 iterations, for a 33/31 split. The two extra "high" bits are needed for the approximation to remain valid. It is conceivable that a single extra bit would be enough, but the proof in the paper does not cover that. It is known that with no extra bit, there are "bad cases" in which the algorithm does not work. With SSE2, the following are tempting:
Unfortunately, there is no conditional move in SSE2 (or even AVX2); you need to emulate it with boolean operations. Compounding the effect is that there is no unsigned comparison opcode. You can make a sort-of unsigned 64-bit comparison by adding an offset (i.e. to compare x with y, both unsigned over 64 bits, you can do a signed comparison between x-2^63 and y-2^63), but that leads to extra additions at some point in the algorithm, which is usually not worth it. I tried it for a few hours and could not make the inner loop as fast as the version with integer registers, but that does not mean it is not possible, only that it is likely to require some effort. Also, I tried only on Skylake CPUs; things may be different on other kinds of cores (and especially on ARM/Neon).
Note that the ctq and arm modules do use wider approximations. It's all about a delicate balance between how fast you can make the inner loop spin versus how fast or slow multiplications are in comparison. The inner loop in turn is mostly about the instruction-level parallelism the target processor can achieve, which is also quite a delicate matter. Thomas's inner loop happens to achieve close to the maximum, and in combination with fast multiplications it turns out to be the best choice for mulx-capable processors. You won't do better in SIMD, because the instruction count won't be significantly lower than in scalar code (the instruction capabilities are not as rich), while the maximum achievable ILP would be lower, especially on most Cortex cores. Well, one can mix scalar and vector, but then you introduce additional dependencies (and instructions) as you "move" flag values to the vector register. [Also, mixed instruction scheduling tends to be more problematic for the out-of-order execution logic on x86, which manifests as slightly lower than expected throughput.]
Thank you both for the detailed analysis. Closing then.
The current implementation of inversion follows @pornin's paper closely and uses an inner loop with four 31-bit integers stored in two 64-bit registers:
blst/src/asm/ctx_inverse_mod_384-x86_64.pl, lines 904 to 912 at 575c3dd
https://github.com/pornin/bingcd/blob/700cc4d/src/gf25519.c#L1098-L1145
Instead, it's possible to store four 62-bit integers in two 128-bit SIMD registers, which are available on x86-64 (which implies SSE2) and ARM Neon.
This would allow the inner loop to run 62 iterations instead of 31, and should remove the need for two distinct loops.
The main benefits would be:
However, I do not know whether CMOV and conditional swap can be emulated efficiently in pure SSE2.
In terms of use-cases: