Initial Intel HEXL integration #312

Merged
merged 3 commits into from
Apr 5, 2021

fboemer
Contributor

@fboemer fboemer commented Mar 29, 2021

Initial integration with Intel HEXL (https://github.com/intel/hexl)

Co-authored-by: Gelila Seifu gelila.seifu@intel.com
Co-authored-by: Jeremy Bottleson jeremy.bottleson@intel.com

@fionser
Contributor

fionser commented Mar 30, 2021

@fboemer Great boost. HEXL saves ~50% computation time for my program :).
But my project uses SEAL as a submodule, and it failed to build with HEXL in a submodule.
I need to build and install SEAL separately, and then build my own project.

Another question: is it possible to reuse HEXL's twiddle factor table via SEAL's NTTTable object?
That seems like it could save a fair amount of memory.

@WeiDaiWD
Contributor

WeiDaiWD commented Mar 30, 2021

Great boost. HEXL saves ~50% computation time for my program :).

Can you share your CPU spec? Does it have AVX512IFMA?

@fionser
Contributor

fionser commented Mar 30, 2021

  • Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
  • Does not have AVX512IFMA
  • gcc version 7.2.1 20180104

CPU-Flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch arat invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1

@WeiDaiWD
Contributor

Cool! It's remarkably faster even with avx512dq.

@WeiDaiWD
Contributor

@fboemer This is superb. Thanks for making SEAL faster. :)

A few things to add to this PR:

  • reinterpret_cast<const uint64_t *> and reinterpret_cast<uint64_t *> in polyarithsmallmod.h and polyarithsmallmod.cpp are not necessary, since we defined using CoeffIter = PtrIter<std::uint64_t *> in iterator.h.
  • Two CMake options, BUILD_PIC and BUILD_TESTING, are propagated into SEAL from Intel HEXL. It's not obvious to me how they end up there by eyeballing HEXL's CMakeLists.txt. Could you please figure it out and remove or hide them?
  • Could you please provide descriptions (of functionality and important arguments) to the three functions in intel_seal_ext.h?
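
A minimal sketch of why such casts can be dropped, assuming the iterator type offers an implicit conversion to its underlying pointer. `SimplePtrIter`, `SimpleCoeffIter`, and `sum_coeffs` are hypothetical stand-ins written for illustration; they are not SEAL's `PtrIter` or `CoeffIter`, and the conversion operator is an assumption about how those classes behave.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an iterator that wraps a raw pointer.
// Assumption: the real iterator exposes a similar implicit conversion.
template <typename T>
class SimplePtrIter {
public:
    explicit SimplePtrIter(T *ptr) noexcept : ptr_(ptr) {}

    // Implicit conversion back to the wrapped pointer.
    operator T *() const noexcept { return ptr_; }

private:
    T *ptr_;
};

// Hypothetical alias mirroring the shape of a coefficient iterator.
using SimpleCoeffIter = SimplePtrIter<std::uint64_t>;

// A routine that, like a low-level kernel, takes a plain pointer.
std::uint64_t sum_coeffs(const std::uint64_t *values, std::size_t count) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// The iterator converts to std::uint64_t*, which then converts to
// const std::uint64_t* without any reinterpret_cast.
std::uint64_t sum_via_iter(SimpleCoeffIter iter, std::size_t count) {
    return sum_coeffs(iter, count);
}
```

Because the conversion yields a pointer of the correct type, a `reinterpret_cast` would add nothing and would only hide a real type mismatch if one were introduced later.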

Questions:

  • SEAL_ALIGNED_ALLOC calls malloc if size is not a multiple of alignment. What's the effect on performance here? My only worry is that a user can be falsely convinced that all allocations are aligned to 64 as long as SEAL_USE_ALIGN_64 is ON.
  • In HEXL's README, it says "Intel HEXL targets integer arithmetic with word-sized primes, typically 40-60 bits." BFV auxiliary primes are 61-bit. Does AVX512IFMA only work on 54-bit or less? What will happen if a prime is larger than the bound (60 or 54)?
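
On the alignment question above, one way to make every allocation 64-byte aligned is to round the requested size up to a multiple of the alignment before calling `std::aligned_alloc`. This is a sketch under stated assumptions, not the behavior of `SEAL_ALIGNED_ALLOC`: the helper names `round_up_to_multiple`, `alloc_aligned_64`, and `is_aligned` are hypothetical, and `std::aligned_alloc` requires C++17 and is not available on every toolchain.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Round size up to the next multiple of alignment (alignment must be nonzero).
constexpr std::size_t round_up_to_multiple(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: always returns 64-byte aligned memory by padding
// the size so that it satisfies std::aligned_alloc's size requirement.
// The caller releases the memory with std::free.
void *alloc_aligned_64(std::size_t size) {
    constexpr std::size_t alignment = 64;
    return std::aligned_alloc(alignment, round_up_to_multiple(size, alignment));
}

// Check whether a pointer is aligned to the given boundary.
bool is_aligned(const void *ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

The trade-off is up to 63 bytes of padding per allocation in exchange for a guarantee that does not depend on the requested size.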

Recommendations to Intel HEXL:

  • As fionser said, would it make sense for Intel HEXL to use the powers of root (specifically their allocation) generated by SEAL, or disable SEAL's NTT precomputation if SEAL_USE_INTEL_HEXL=ON? The former choice removes precomputation in the first call to intel::seal_ext::get_ntt.
  • When compiling your branch in debug mode (several warnings enabled), I get a long list of warnings. They fall into the following categories:
    • sign conversion
    • implicit int float conversion
    • shorten 64 to 32
    • c++14 binary literal: maybe Intel HEXL should use C++14 instead of C++11

fboemer added 2 commits April 2, 2021 12:28
Co-authored-by: Gelila Seifu <gelila.seifu@intel.com>
Co-authored-by: Jeremy Bottleson <jeremy.bottleson@intel.com>

Update to new HEXL

Remove unnecessary casts

Log options
@fboemer fboemer force-pushed the fboemer/hexl branch 2 times, most recently from 9ade672 to 3f5f8a9 Compare April 2, 2021 20:16
@fboemer
Contributor Author

fboemer commented Apr 2, 2021

Thanks for the feedback, @WeiDaiWD and @fionser.

A few notes:

  • @fionser , do you mind trying the submodule approach again and reporting the errors? We're happy to support this workflow if possible
  • We've removed the unnecessary reinterpret_casts. Thanks for pointing this out; it's a clean approach!
  • BUILD_PIC and BUILD_TESTING should no longer be leaking (they were stemming from some 3rd-party dependencies)
  • Within Intel HEXL, I've measured the 64-byte aligned allocations to yield ~5-7% speedup on the NTT. It looks like whenever we allocate memory for cipher/plaintexts, the allocation sizes will be a multiple of 64 bytes. So the current implementation should be efficient. If preferred, for other-sized memory allocations, we could use an approach similar to Boost, which will allocate extra memory.
  • We've added documentation for the intel_seal_ext.h
  • AVX512IFMA52 performs 52-bit integer arithmetic. We need a few extra bits in the NTT, so choosing coefficient moduli < 50 bits should suffice for best performance. For large primes, e.g. the auxiliary prime, HEXL will choose an AVX512DQ implementation, which still yields some speedup (as @fionser observed), but less than the IFMA52 approach. See Tables 1-4 in our arXiv paper for more information
  • Regarding NTT pre-computation: Intel HEXL uses two forms of pre-computation. One is based on Barrett factors floor(2^64/modulus) and today happens to have the same bit-scrambled order as SEAL's pre-computation. A second pre-computation vector is based on floor(2^52/modulus). In general, the pre-computation is not part of HEXL's public API. If it changes down the road, we don't want to break the SEAL integration, hence why we didn't use SEAL's pre-computed factors. Yes, we could omit the SEAL NTT tables pre-computation if you like. We thought the current implementation would be cleanest and safest (not breaking any programs that may rely on SEAL's pre-computed NTT tables.)
  • Thanks for pointing out the warnings. We've updated Intel HEXL v1.0.0, which should have resolved them. Just a note that we plan to update the HEXL v1.0.0 tag a few more times for minor README changes, but will freeze it once this PR is approved.
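
The two pre-computation forms mentioned above, floor(2^64/modulus) and floor(2^52/modulus), can be sketched as follows. This is an illustration only and not HEXL's code: the function names are hypothetical, the per-operand scaling (operand times a power of two) is an assumption borrowed from the common Shoup/Harvey-style technique rather than something this thread confirms, and `unsigned __int128` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// GCC/Clang extension; used here for the 128-bit intermediate values.
__extension__ typedef unsigned __int128 uint128_t;

// Returns floor(operand * 2^shift / modulus).
// Requires operand < modulus and shift <= 64 so the result fits in 64 bits.
// With operand == 1 this is floor(2^shift / modulus).
std::uint64_t precompute_factor(std::uint64_t operand, std::uint64_t modulus, unsigned shift) {
    assert(modulus != 0 && operand < modulus && shift <= 64);
    return static_cast<std::uint64_t>((static_cast<uint128_t>(operand) << shift) / modulus);
}

// Computes (x * operand) mod modulus using factor64 = floor(operand * 2^64 / modulus).
// Requires modulus < 2^63 so the intermediate remainder stays below 2^64.
std::uint64_t mul_mod_precomputed(std::uint64_t x, std::uint64_t operand,
                                  std::uint64_t factor64, std::uint64_t modulus) {
    assert(modulus < (std::uint64_t{1} << 63));
    // Estimated quotient: high 64 bits of x * factor64 (at most one too small).
    const std::uint64_t q =
        static_cast<std::uint64_t>((static_cast<uint128_t>(x) * factor64) >> 64);
    // Remainder in [0, 2 * modulus); unsigned wrap-around is intentional.
    const std::uint64_t r = x * operand - q * modulus;
    return r >= modulus ? r - modulus : r;
}
```

The division happens once per operand during pre-computation; each later multiplication then needs only multiplies, a subtraction, and one conditional correction.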

Also a note that this updated PR makes two more changes:

  • Run pre-commit on all the files. Maybe my system is different, but this led to a few minor changes
  • Adds STATUS messages for the CMake options. This makes it easier to tell what options are enabled when building from command line.

@WeiDaiWD
Contributor

WeiDaiWD commented Apr 3, 2021

@fionser Does your program use BGV or BFV?

@WeiDaiWD
Contributor

WeiDaiWD commented Apr 3, 2021

@fboemer Everything looks good now. Would you let me know when you think 1.0.0 is stable enough to freeze the tag? I'm ready to merge this into SEAL at any time, so I might as well wait for your upcoming commit hashes.

Side note:
I did some experiments with BFV. I set the prime bit limits to 49 and 50 and deleted two default parameters to avoid errors. Compared to your branch, BFV / EvaluateMulCt gets faster in my branch for larger parameters. I'm testing on an Intel® Xeon(R) Silver 4108 @ 1.80 GHz with GCC 7.5.0. You should try it on an IceLake processor. If it helps to make BFV faster, I can try to make those prime size limits easier to configure (without errors).
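
A rough sketch of how the prime size could decide which code path is used, based on the description earlier in this thread (IFMA52 for small primes, AVX512DQ for larger ones). Everything here is hypothetical: `NttPath`, `significant_bits`, and `choose_ntt_path` are illustration names, the 50-bit default is an assumption derived from the "moduli < 50 bits" remark (whether 50-bit primes themselves qualify is not settled in this thread), and the scalar fallback is assumed.

```cpp
#include <cstdint>

// Hypothetical labels for the implementations discussed in this thread.
enum class NttPath { Ifma52, Avx512dq, Scalar };

// Number of significant bits in value (0 for value == 0).
int significant_bits(std::uint64_t value) {
    int bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}

// Pick a path from the modulus size and the available CPU features.
// max_ifma_bits is an assumed cut-off, not a documented HEXL constant.
NttPath choose_ntt_path(std::uint64_t modulus, bool has_avx512ifma, bool has_avx512dq,
                        int max_ifma_bits = 50) {
    if (has_avx512ifma && significant_bits(modulus) <= max_ifma_bits) {
        return NttPath::Ifma52;   // small prime: 52-bit multiply-add path
    }
    if (has_avx512dq) {
        return NttPath::Avx512dq; // larger prime or no IFMA: 64-bit vector path
    }
    return NttPath::Scalar;       // assumed generic fallback
}
```

Under this model, a 61-bit auxiliary prime lands on the AVX512DQ path even on an IFMA-capable CPU, which is why lowering the prime bit limits can change BFV timings.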

@fboemer
Contributor Author

fboemer commented Apr 5, 2021

@WeiDaiWD , we updated the tag to v1.0.1 after all. No more changes will be made to this tag, so feel free to do any final testing and merge.

Thanks for trying out the prime bit-width changes. I'll try them out on an IceLake processor and report my findings here.

Edit: findings below. Still need to investigate if the AVX512IFMA52 instructions are being called or not with the smaller primes.

| Benchmark | HEXL=OFF | HEXL=ON default | HEXL=ON smaller primes |
|---|---|---|---|
| n=1024 / log(q)=27 / BFV / EvaluateMulCt/iterations:1000 | 489 | 327 | 292 |
| n=1024 / log(q)=27 / CKKS / EvaluateMulCt/iterations:1000 | 18.5 | 3.38 | 3.35 |
| n=4096 / log(q)=109 / BFV / EvaluateMulCt/iterations:1000 | 3384 | 2488 | 2322 |
| n=4096 / log(q)=109 / CKKS / EvaluateMulCt/iterations:1000 | 145 | 55.1 | 55.7 |
| n=8192 / log(q)=218 / BFV / EvaluateMulCt/iterations:1000 | 13198 | 9328 | 8601 |
| n=8192 / log(q)=218 / CKKS / EvaluateMulCt/iterations:1000 | 631 | 217 | 211 |
| n=16384 / log(q)=438 / BFV / EvaluateMulCt/iterations:1000 | 56597 | 40629 | 38734 |
| n=16384 / log(q)=438 / CKKS / EvaluateMulCt/iterations:1000 | 2557 | 902 | 901 |
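
To compare the columns, the relative speedup is the HEXL=OFF time divided by the HEXL=ON time, and the fraction of time saved is one minus the inverse of that ratio. A small sketch of the arithmetic; the helper names `speedup` and `time_saved_fraction` are hypothetical and not part of SEAL or HEXL.

```cpp
// Ratio of baseline time to optimized time (larger is better).
// Both arguments must use the same unit and be positive.
double speedup(double baseline_time, double optimized_time) {
    return baseline_time / optimized_time;
}

// Fraction of the baseline time that the optimized version saves.
double time_saved_fraction(double baseline_time, double optimized_time) {
    return 1.0 - optimized_time / baseline_time;
}
```

For the n=8192 rows, `speedup(631, 217)` is about 2.9 for CKKS, while `speedup(13198, 9328)` is about 1.4 for BFV.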

@WeiDaiWD
Contributor

WeiDaiWD commented Apr 5, 2021

Cool. I'll just merge this. Let me know if you think it is important to update the tag/commit before the next SEAL release. Thanks!

@WeiDaiWD WeiDaiWD merged commit aa476c7 into microsoft:contrib Apr 5, 2021
@WeiDaiWD
Contributor

WeiDaiWD commented Apr 5, 2021

One little suggestion: when I build it on a Core i7-10700K that does not have AVX512, I have the following warning:
warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]

Maybe you want to explain to users what it means.

@fionser
Contributor

fionser commented Apr 7, 2021

@fionser Does your program use BGV or BFV?

CKKS only

@fionser
Contributor

fionser commented May 10, 2021

@WeiDaiWD @fboemer

Interestingly, when HEXL is used, BFV decryption slows down quite significantly.

/
| Encryption parameters :
|   scheme: BFV
|   poly_modulus_degree: 4096
|   #moduli: 2
|   #special_primes: 1
|   coeff_modulus size: 109 (59 + 50) bits
|   plain_modulus: 4194304 (23) bits
\

BFV decryption took 3.1 ms with HEXL versus 0.4 ms without HEXL on my machine (gcc version 7.2.1, Red Hat 7.2.0-5, Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz).

I tried turning off HEXL in polyarithsmallmod.cpp, but it did not help.
Do you know what is going on?

@fboemer
Contributor Author

fboemer commented May 11, 2021

@fionser, SEAL performs NTT pre-computations during configuration, while HEXL performs NTT pre-computations during the first use of the NTT. So the first run of BFV decryption may be slower, but I would expect repeated runs (e.g. in the benchmark suite using 1000 iterations) using HEXL to be similar to or faster than the SEAL implementation. The default iteration count of 10 seems small enough that a slow first run with HEXL may skew the average runtime.
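
One way to keep a slow first call out of the average is to run the operation once, untimed, before measuring. A minimal sketch of such a timing loop; `mean_microseconds` is a hypothetical helper and is not part of SEAL's benchmark suite.

```cpp
#include <chrono>
#include <cstddef>

// Time `op` over `iterations` calls and return the mean in microseconds.
// When warm_up is true, one extra untimed call runs first so that any
// lazy first-use setup does not count toward the average.
template <typename F>
double mean_microseconds(F &&op, std::size_t iterations, bool warm_up) {
    if (iterations == 0) {
        return 0.0; // nothing to measure
    }
    if (warm_up) {
        op(); // untimed call that absorbs one-time initialization
    }
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Comparing the result with and without the warm-up call for a small iteration count such as 10 shows how much of the average comes from the first run.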

I just tested on a similar machine (avx512dq, but not avx512ifma) (with gcc-9) and don't see this degradation.

Removing the SEAL_USE_INTEL_HEXL throughout is a good way to debug the slowdown. You could try removing the NTT HEXL integration to nail down where slowdown is coming from.

As another note, I've intermittently seen some very strange slowdowns in the past, similar to https://stackoverflow.com/questions/42358211/adding-a-print-statement-speeds-up-code-by-an-order-of-magnitude. If this problem still persists, perhaps try compiling SEAL with -march=native?

By the way, you may wish to see if the degradation you observe persists in the latest version of HEXL: #332

@fionser
Contributor

fionser commented May 12, 2021

@fboemer Nice, thank you for the information.

@fboemer fboemer deleted the fboemer/hexl branch November 3, 2021 18:24