-
Notifications
You must be signed in to change notification settings - Fork 7.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(gcm): shrink Shoup table and tune GCM loop (IDFGH-13409) #14314
Conversation
👋 Hello bryghtlabs-richard, we appreciate your contribution to this project! 📘 Please review the project's Contributions Guide for key guidelines on code, documentation, testing, and more. 🖊️ Please also make sure you have read and signed the Contributor License Agreement for this project. Click to see more instructions ...
Review and merge process you can expect ...
|
93cfd23
to
1abf631
Compare
I see some recent improvements in the upstream code too: Mbed-TLS/mbedtls@0767fda. We will check, might as well align to the upstream version. Just fyi. |
It's certainly worth testing the upstream approach. It seems upstream assumes unaligned access is possible, but for ESP32 it is not, so we'll spend more time doing xor, but I haven't measured it. |
New MbedTLS version is slower. Each with IRAM_ATTR, last4 with DRAM_ATTR, counted with xthal_get_ccount():
Edit: added Mbed's New, LargeTable approach, same test setup. Function runtimes depend slightly on caller, and slightly on instruction alignment in memory. |
@bryghtlabs-richard do you mind sharing the benchmark setup? |
I should also include the large table mbedtls approach |
@KaeLL , I've put my cycle-counting code into https://github.com/bryghtlabs-richard/esp-gcm-bench @mahavirj , I've also tested the mbedTLS new large-table function, but it's worse than the old mbedTLS / current ESP-IDF approach. My preshift-unroll approach still seems to be the best for Xtensa. |
Hi @bryghtlabs-richard Thanks for the PR. The changes look good to me. |
@bryghtlabs-richard Can you please squash all the commits into one commit. |
1abf631
to
9b6dab9
Compare
@AdityaHPatwardhan done. I think #14317 should go in first though. |
Okay, #14317 has been merged in the internal code-base, the GitHub PR should be updated once the code is available on GitHub. |
sha=9b6dab9edb71290061e7f718ba48d76a0dd93e13 |
1) pre-shift GCM last4 to use 32-bit shift On 32-bit architectures like Aarch32, RV32, Xtensa, shifting a 64-bit variable by 32-bits is free, since it changes the register representing half of the 64-bit var. Pre-shift the last4 array to take advantage of this. 2) unroll first GCM iteration The first loop of gcm_mult() is different from the others. By unrolling it separately from the others, the other iterations may take advantage of the zero-overhead loop construct, in addition to saving a conditional branch in the loop.
9b6dab9
to
1bb9db8
Compare
sha=1bb9db875896da2605cf96bc0fd29b0111af2283 |
Profiling showed a lot of time in gcm_mult() during downloads.
Tune GCM loop for pure 32-bit processors like Xtensa and RV32.
With ESP32-S3-GCC 12.2.0, -O2: