Faster pcurves reductions for P-256 and P-384 #4147
Is that some standard approach that warrants a reference, perhaps?
With some templates the code could be quite a bit more compact, and IMHO also easier to understand; i.e. self-contained functions vs. classes; arrays vs. long lists of variables. See here: 18dc508
And here's some Godbolt to play with this implementation: https://godbolt.org/z/4WGGzzYn8
sum.accum(S7);
const auto S = sum.final_carry(S8);

CT::unpoison(S);
This seems to be a stray. Didn't see any corresponding CT::poison(). Same for secp384r1. Or am I missing something?
The input is (when running under valgrind) already poisoned. As a result S is also poisoned, being itself derived from a value that valgrind is treating as undefined. This makes the assert trigger an error, since the assert branches directly on the value of S. However, the code here is quite resilient should the offset be wrong, so I replaced this with a debug assert.
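To illustrate the point about poisoned values, here is a minimal sketch, not Botan's code. `sketch_poison`, `sketch_unpoison`, `ct_cond_sub` and `reduce_checked` are hypothetical names; the only real API used is valgrind's `VALGRIND_MAKE_MEM_UNDEFINED` / `VALGRIND_MAKE_MEM_DEFINED` client requests, and only when `<valgrind/memcheck.h>` is available. The assumption is that a value computed from "undefined" memory is itself treated as undefined, so branching on it is reported until it is marked defined again.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__has_include)
   #if __has_include(<valgrind/memcheck.h>)
      #include <valgrind/memcheck.h>
      #define SKETCH_HAS_MEMCHECK 1
   #endif
#endif

// Hypothetical stand-in for CT::poison: mark v as undefined for memcheck
template <typename T>
void sketch_poison(const T& v) {
#if defined(SKETCH_HAS_MEMCHECK)
   static_cast<void>(VALGRIND_MAKE_MEM_UNDEFINED(&v, sizeof(T)));
#else
   static_cast<void>(v);  // no-op when the valgrind header is missing
#endif
}

// Hypothetical stand-in for CT::unpoison: mark v as defined again
template <typename T>
void sketch_unpoison(const T& v) {
#if defined(SKETCH_HAS_MEMCHECK)
   static_cast<void>(VALGRIND_MAKE_MEM_DEFINED(&v, sizeof(T)));
#else
   static_cast<void>(v);
#endif
}

// Branch-free "x - p if x >= p, else x"; safe to run on poisoned data
constexpr uint32_t ct_cond_sub(uint32_t x, uint32_t p) {
   // borrow is 1 exactly when x < p
   const uint64_t borrow = (static_cast<uint64_t>(x) - p) >> 63;
   const uint32_t keep = 0u - static_cast<uint32_t>(borrow);  // all-ones if x < p
   return (x & keep) | ((x - p) & ~keep);
}

// Assumes secret_x < 2*p so that one conditional subtraction suffices
inline uint32_t reduce_checked(uint32_t secret_x, uint32_t p) {
   sketch_poison(secret_x);                      // caller's input is secret
   const uint32_t r = ct_cond_sub(secret_x, p);  // r inherits the poison
   // An assert on r at this point would be a jump on undefined data
   sketch_unpoison(r);
   assert(r < p);  // fine now: r has been declared public
   return r;
}
```

Under valgrind, moving the `assert` above `sketch_unpoison(r)` should produce the kind of report described in the comment above; outside valgrind the helpers do nothing.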
static constexpr auto P256_4 =
   hex_to_words<uint32_t>("0x3fffffffc00000004000000000000000000000003fffffffffffffffffffffffc");
If you decide to adopt my suggestion, it would be worthwhile to replace this by a constexpr multiplication (bigmul(4, Params::P)) to save a magic string/number, if we have that somewhere already.
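As a rough sketch of what such a compile-time multiplication could look like (C++17 or later): `mul_small` below is a hypothetical helper, not an existing Botan function and not the `bigmul` mentioned above. It assumes little-endian 32-bit words and returns one extra word for the carry.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiply a little-endian word array by a small
// constant at compile time; the result has one extra word for the carry
template <std::size_t N>
constexpr std::array<uint32_t, N + 1> mul_small(const std::array<uint32_t, N>& x, uint32_t k) {
   std::array<uint32_t, N + 1> r{};
   uint64_t carry = 0;
   for(std::size_t i = 0; i != N; ++i) {
      const uint64_t t = static_cast<uint64_t>(x[i]) * k + carry;
      r[i] = static_cast<uint32_t>(t);  // low 32 bits
      carry = t >> 32;                  // high bits move to the next word
   }
   r[N] = static_cast<uint32_t>(carry);
   return r;
}

// The P-256 prime as little-endian 32-bit words
constexpr std::array<uint32_t, 8> P256 = {
   0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000,
   0x00000000, 0x00000000, 0x00000001, 0xFFFFFFFF};

// 4*p derived from p instead of being spelled out as a hex string
constexpr auto P256_TIMES_4 = mul_small(P256, 4);

static_assert(P256_TIMES_4[0] == 0xFFFFFFFCu && P256_TIMES_4[8] == 3u,
              "lowest and highest words of 4*p");
```

The `static_assert` keeps the derived constant honest without a second magic number of full length.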
For 32-bit x86, this reduction results in point arithmetic operations that are 25-35% faster than when using Montgomery. Sadly for 64-bit x86 it is at best about even with using Montgomery, and for Clang 64-bit it's even somewhat slower.
This is about 20-30% faster on both 32 and 64 bit systems
dd0b2b1 to fa71e70
Yes this is the standard Solinas reduction (SP 800-186) just done in columns instead of forming each integer directly. I added some references.
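For readers unfamiliar with the column view, here is a small sketch for P-256 with 32-bit words (C++17 or later). It is not the code from this PR: all names are hypothetical, it is not constant time, and the final normalization simply adds or subtracts p in a loop. The column sums are the NIST word formula T + 2·S1 + 2·S2 + S3 + S4 - D1 - D2 - D3 - D4 regrouped by output word.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Words8 = std::array<uint32_t, 8>;

// P-256 prime, little-endian 32-bit words
constexpr Words8 P256 = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0, 0, 0, 1, 0xFFFFFFFF};
constexpr int64_t B32 = int64_t(1) << 32;

// Store t mod 2^32 in lo and return floor(t / 2^32) as a signed carry
constexpr int64_t carry_out(int64_t t, uint32_t& lo) {
   int64_t m = t % B32;
   if(m < 0) { m += B32; }
   lo = static_cast<uint32_t>(m);
   return (t - m) / B32;
}

// r += sign * p (sign is +1 or -1); returns the carry out of the top word
constexpr int64_t add_mul_p(Words8& r, int64_t sign) {
   int64_t c = 0;
   for(std::size_t i = 0; i != 8; ++i) {
      c = carry_out(int64_t(r[i]) + sign * int64_t(P256[i]) + c, r[i]);
   }
   return c;
}

constexpr bool ge_p(const Words8& r) {
   for(std::size_t i = 8; i-- > 0;) {
      if(r[i] != P256[i]) { return r[i] > P256[i]; }
   }
   return true;
}

constexpr bool words_equal(const Words8& a, const Words8& b) {
   for(std::size_t i = 0; i != 8; ++i) {
      if(a[i] != b[i]) { return false; }
   }
   return true;
}

// Reduce a 512-bit value (16 little-endian words) modulo the P-256 prime
constexpr Words8 reduce_p256(const std::array<uint32_t, 16>& x) {
   int64_t A[16] = {};
   for(std::size_t i = 0; i != 16; ++i) { A[i] = x[i]; }

   // One signed sum per output word instead of nine 256-bit integers
   const int64_t col[8] = {
      A[0] + A[8] + A[9] - A[11] - A[12] - A[13] - A[14],
      A[1] + A[9] + A[10] - A[12] - A[13] - A[14] - A[15],
      A[2] + A[10] + A[11] - A[13] - A[14] - A[15],
      A[3] + 2 * (A[11] + A[12]) + A[13] - A[15] - A[8] - A[9],
      A[4] + 2 * (A[12] + A[13]) + A[14] - A[9] - A[10],
      A[5] + 2 * (A[13] + A[14]) + A[15] - A[10] - A[11],
      A[6] + A[13] + 3 * A[14] + 2 * A[15] - A[8] - A[9],
      A[7] + A[8] + 3 * A[15] - A[10] - A[11] - A[12] - A[13]};

   Words8 r{};
   int64_t top = 0;  // signed carry; ends up as a small multiple of 2^256
   for(std::size_t i = 0; i != 8; ++i) { top = carry_out(col[i] + top, r[i]); }

   // Variable-time normalization into [0, p); a real implementation
   // would do this step without secret-dependent branches
   while(top > 0) { top += add_mul_p(r, -1); }
   while(top < 0) { top += add_mul_p(r, +1); }
   if(ge_p(r)) { add_mul_p(r, -1); }
   return r;
}

// Usage: 2^256 mod p = 2^224 - 2^192 - 2^96 + 1
constexpr std::array<uint32_t, 16> TWO_256 = {0, 0, 0, 0, 0, 0, 0, 0, 1};
static_assert(words_equal(reduce_p256(TWO_256),
                          Words8{1, 0, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFE, 0}),
              "2^256 mod p");
```

Each column fits comfortably in a 64-bit signed integer, which is what makes the per-word accumulation cheap on 32-bit targets.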
I'll admit this is cleaner but it's also slower than the original Montgomery code, at least on my machine. :)
Interesting! On my machine (M2 MacBook Air, Xcode 15.3) there's no difference to your implementation.
Setup
./configure.py \
   --build-tool=ninja \
   --build-targets=static,cli \
   --compiler-cache=ccache \
   --enable-experimental-features \
   --disable-modules=pcurves_secp256r1,pcurves_secp521r1,pcurves_brainpool256r1,pcurves_brainpool384r1,pcurves_brainpool512r1
ninja cli
./botan speed --msec=1000 pcurves
Results
GCC 14.1.1, Linux, i5-2520M
It is what it is then. The world doesn't run on clang alone. :(
I just checked with LLVM Clang on my machine and there I see similar results between the two approaches. Kind of depressing that GCC regresses so badly.
Indeed. I think I should have another look at the performance of #4024 w/ gcc. 😨
FTR: Seems to be fine.