Faster pcurves reductions for P-256 and P-384 #4147

randombit · 2024-06-24T02:32:57Z

No description provided.

coveralls · 2024-06-24T02:58:55Z

coverage: 91.743% (+0.005%) from 91.738%
when pulling dd0b2b1 on jack/faster-nist-redc
into d24c2c3 on master.

reneme

Is that some standard approach that warrants a reference, perhaps?

With some templates the code could be quite a bit more compact, and IMHO also easier to understand; i.e. self-contained functions vs. classes; arrays vs. long lists of variables. See here: 18dc508

And here's some Godbolt to play with this implementation: https://godbolt.org/z/4WGGzzYn8

reneme · 2024-06-24T16:34:08Z

src/lib/math/pcurves/pcurves_secp256r1/pcurves_secp256r1.cpp

+         sum.accum(S7);
+         const auto S = sum.final_carry(S8);
+
+         CT::unpoison(S);


This seems to be a stray. Didn't see any corresponding CT::poison(). Same for secp384r1. Or am I missing something?

When running under valgrind, the input is already poisoned. As a result S is poisoned too, since it is derived from a value that valgrind treats as undefined. That makes the assert trigger an error, because it branches directly on the value. However, the code here is quite resilient should the offset be wrong, so I replaced this with a debug assert.
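For readers unfamiliar with the poison/unpoison pattern, here is a minimal sketch of the effect described above. The helper names `sketch_poison` and `sketch_unpoison` and the function `add_with_checked_carry` are hypothetical and are not Botan's API; the sketch only assumes the Memcheck client-request macros `VALGRIND_MAKE_MEM_UNDEFINED` and `VALGRIND_MAKE_MEM_DEFINED`, and falls back to no-ops when the valgrind header is not installed.

```cpp
#include <cassert>
#include <cstdint>

#if __has_include(<valgrind/memcheck.h>)
   #include <valgrind/memcheck.h>
   #define SKETCH_HAS_MEMCHECK 1
#endif

// Hypothetical helper: tell Memcheck to treat `v` as undefined ("secret").
template <typename T>
void sketch_poison(const T& v) {
#if defined(SKETCH_HAS_MEMCHECK)
   VALGRIND_MAKE_MEM_UNDEFINED(&v, sizeof(T));
#else
   (void)v;  // no-op without the valgrind headers
#endif
}

// Hypothetical helper: tell Memcheck that `v` is defined ("public") again.
template <typename T>
void sketch_unpoison(const T& v) {
#if defined(SKETCH_HAS_MEMCHECK)
   VALGRIND_MAKE_MEM_DEFINED(&v, sizeof(T));
#else
   (void)v;  // no-op without the valgrind headers
#endif
}

// Returns the carry out of a + b (0 or 1) and sanity-checks it.
inline uint32_t add_with_checked_carry(uint32_t a, uint32_t b) {
   sketch_poison(a);  // pretend `a` is secret input

   const uint64_t sum = static_cast<uint64_t>(a) + b;
   const uint32_t carry = static_cast<uint32_t>(sum >> 32);

   // `carry` is derived from `a`, so Memcheck also considers it undefined.
   // Branching on it (which the assert does) would be reported as an error
   // unless the value is explicitly declared public first.
   sketch_unpoison(carry);
   assert(carry <= 1);

   return carry;
}
```

Outside of valgrind the macros do nothing, so the function behaves like a plain carry computation; under Memcheck, removing the `sketch_unpoison(carry)` call should produce a "conditional jump depends on uninitialised value" style report at the assert.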

reneme · 2024-06-24T16:38:14Z

src/lib/math/pcurves/pcurves_secp256r1/pcurves_secp256r1.cpp

+      static constexpr auto P256_4 =
+         hex_to_words<uint32_t>("0x3fffffffc00000004000000000000000000000003fffffffffffffffffffffffc");


If you decide to adopt my suggestion, it would be worthwhile to replace this with a constexpr multiplication (bigmul(4, Params::P)) to avoid a magic string/number, if we already have such a helper somewhere.
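As a rough illustration of that suggestion, a compile-time multiplication by a small constant could look like the sketch below. The name `mul_small` is made up for this example (the comment above only speculates that a `bigmul` helper might exist), and the word layout is assumed to be little-endian 32-bit words; C++17 or later is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiply a little-endian multi-word integer by a
// single 32-bit word at compile time. The result has one extra word.
template <std::size_t N>
constexpr std::array<uint32_t, N + 1> mul_small(const std::array<uint32_t, N>& x, uint32_t k) {
   std::array<uint32_t, N + 1> out{};
   uint64_t carry = 0;
   for(std::size_t i = 0; i != N; ++i) {
      // (2^32-1)^2 + (2^32-1) < 2^64, so this cannot overflow
      const uint64_t t = static_cast<uint64_t>(x[i]) * k + carry;
      out[i] = static_cast<uint32_t>(t);
      carry = t >> 32;
   }
   out[N] = static_cast<uint32_t>(carry);
   return out;
}

// P-256 prime 2^256 - 2^224 + 2^192 + 2^96 - 1, least significant word first
constexpr std::array<uint32_t, 8> P256 = {
   0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000,
   0x00000000, 0x00000000, 0x00000001, 0xFFFFFFFF};

// 4*P derived from P instead of being spelled out as a hex string
constexpr auto P256_4 = mul_small(P256, 4);

static_assert(P256_4[0] == 0xFFFFFFFC && P256_4[8] == 0x3,
              "low and high words of 4*P");
```

The derived constant then stays in sync with the curve parameters, at the cost of one more helper to maintain.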

For 32-bit x86, this reduction results in point arithmetic operations that are 25-35% faster than when using Montgomery. Sadly for 64-bit x86 it is at best about even with using Montgomery, and for Clang 64-bit it's even somewhat slower.

This is about 20-30% faster on both 32 and 64 bit systems

randombit · 2024-06-24T21:44:31Z

Is that some standard approach that warrants a reference, perhaps?

Yes, this is the standard Solinas reduction (SP 800-186), just done in columns instead of forming each integer directly. I added some references.
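To show the idea behind a Solinas reduction in a few lines, the sketch below uses a much smaller prime of the same shape, p = 2^64 - 2^32 + 1, rather than the P-256 or P-384 primes. It is not the code from this PR: it is not constant-time, it does not work column by column, and the function name `solinas_reduce` is made up for this example. It also relies on the `__int128` extension of GCC and Clang. The point is only that powers of two above the modulus size can be folded back using 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p), so no division is needed.

```cpp
#include <cassert>
#include <cstdint>

// Toy Solinas prime p = 2^64 - 2^32 + 1 (not a NIST curve prime)
constexpr uint64_t P = 0xFFFFFFFF00000001ULL;

// Reduce a 128-bit value modulo P without a division.
// Write x = lo + 2^64 * hi_lo + 2^96 * hi_hi, then use
//   2^64 = 2^32 - 1 (mod P)   and   2^96 = -1 (mod P).
// Illustrative only: this version branches on the data.
inline uint64_t solinas_reduce(unsigned __int128 x) {
   const uint64_t lo = static_cast<uint64_t>(x);
   const uint64_t hi = static_cast<uint64_t>(x >> 64);
   const uint64_t hi_lo = hi & 0xFFFFFFFFULL;  // coefficient of 2^64
   const uint64_t hi_hi = hi >> 32;            // coefficient of 2^96

   // Signed accumulator; the sum lies strictly between -2^32 and 2*P
   __int128 t = static_cast<__int128>(lo);
   t += static_cast<__int128>(hi_lo) * 0xFFFFFFFFLL;
   t -= static_cast<__int128>(hi_hi);

   // One conditional correction in each direction is sufficient
   if(t < 0) {
      t += P;
   }
   if(t >= P) {
      t -= P;
   }

   const uint64_t r = static_cast<uint64_t>(t);
   assert(r < P);
   return r;
}
```

The NIST primes follow the same principle with more terms, which is where the per-column sums (`S7`, `S8`, ...) in the diff come from.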

With some templates the code could be quite a bit more compact, and IMHO also easier to understand; i.e. self-contained functions vs. classes; arrays vs. long lists of variables. See here: 18dc508

I'll admit this is cleaner but it's also slower than the original Montgomery code, at least on my machine. :)

coveralls · 2024-06-24T22:09:55Z

coverage: 91.745% (+0.007%) from 91.738%
when pulling fa71e70 on jack/faster-nist-redc
into d24c2c3 on master.

reneme · 2024-06-25T06:38:36Z

... but it's also slower than the original Montgomery code, at least on my machine. :)

Interesting! On my machine (M2 MacBook Air, Xcode 15.3) there's no difference to your implementation.

Setup

./configure.py \
   --build-tool=ninja \
   --build-targets=static,cli \
   --compiler-cache=ccache \
   --enable-experimental-features \
   --disable-modules=pcurves_secp256r1,pcurves_secp521r1,pcurves_brainpool256r1,pcurves_brainpool384r1,pcurves_brainpool512r1
ninja cli
./botan speed --msec=1000 pcurves

Results

Exercise	René	Jack	master
secp384r1 base mul	5950/sec	5997/sec	4828/sec
secp384r1 var mul	1517/sec	1514/sec	1227/sec
secp384r1 mul2 setup	2034/sec	2029/sec	1594/sec
secp384r1 mul2	2217/sec	2250/sec	1815/sec
secp384r1 proj->affine	26603/sec	25764/sec	20005/sec
secp384r1 scalar invert	19650/sec	19973/sec	20544/sec

randombit · 2024-06-25T11:22:41Z

GCC 14.1.1, Linux, i5-2520M

Exercise	René	Jack	master
secp384r1 base mul	3350/sec	5149/sec	4159/sec
secp384r1 var mul	879/sec	1366/sec	1117/sec
secp384r1 mul2 setup	1136/sec	1880/sec	1441/sec
secp384r1 mul2	1307/sec	2053/sec	1636/sec
secp384r1 proj->affine	15319/sec	26978/sec	17389/sec
secp384r1 scalar invert	21183/sec	21183/sec	21078/sec
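To put the GCC numbers above in relative terms, a small helper like the following can be used. The function names are made up for this example, and the inputs are simply the "base mul" figures copied from the table.

```cpp
#include <cassert>
#include <cstdio>

// Percentage change of `candidate` relative to `baseline` (both in ops/sec).
constexpr double percent_change(double candidate, double baseline) {
   return (candidate / baseline - 1.0) * 100.0;
}

// Prints the relative change for the secp384r1 base mul row (GCC 14.1.1).
inline void print_gcc_base_mul() {
   std::printf("PR branch vs master:         %+.1f%%\n", percent_change(5149, 4159));
   std::printf("templated variant vs master: %+.1f%%\n", percent_change(3350, 4159));
}
```

On that row the PR branch comes out a little under 24% faster than master, while the templated variant is roughly 19% slower, which matches the discussion that follows.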

reneme · 2024-06-25T11:37:08Z

It is what it is, then. The world doesn't run on Clang alone. :(

randombit · 2024-06-25T11:39:47Z

I just checked with LLVM Clang on my machine and there I see similar results between the two approaches. Kind of depressing that GCC regresses so badly.

reneme · 2024-06-25T11:41:12Z

I just checked with LLVM Clang on my machine and there I see similar results between the two approaches. Kind of depressing that GCC regresses so badly.

Indeed. I think I should have another look at the performance of #4024 w/ gcc. 😨

reneme · 2024-06-25T12:14:01Z

I think I should have another look at the performance of #4024 w/ gcc

FTR: Seems to be fine.

randombit added this to the Botan 3.6.0 milestone Jun 24, 2024

randombit requested review from reneme and FAlbertDev June 24, 2024 02:32

randombit mentioned this pull request Jun 24, 2024

Add specialized reduction for P256 in pcurves #4146

Closed

randombit mentioned this pull request Jun 24, 2024

Replace BigInt based elliptic curve library #4027

Open

37 tasks

reneme approved these changes Jun 24, 2024

reneme reviewed Jun 24, 2024

randombit added 2 commits June 24, 2024 17:41

Add specialized reduction for P256 in pcurves

6380743

For 32-bit x86, this reduction results in point arithmetic operations that are 25-35% faster than when using Montgomery. Sadly for 64-bit x86 it is at best about even with using Montgomery, and for Clang 64-bit it's even somewhat slower.

Add specialized reduction for P-384

fa71e70

This is about 20-30% faster on both 32 and 64 bit systems

randombit force-pushed the jack/faster-nist-redc branch from dd0b2b1 to fa71e70 Compare June 24, 2024 21:43

randombit mentioned this pull request Jul 9, 2024

Add P-192 to pcurves #4190

Merged

randombit merged commit 7fad1d2 into master Jul 10, 2024
42 checks passed

randombit deleted the jack/faster-nist-redc branch July 10, 2024 07:40
