Make byte[] vector comparisons faster! (if possible) #12621
Comments
the type conversions are what make it slow. for the float case it is the equivalent of a plain multiply-accumulate loop over floats; for the binary (byte) case it is the equivalent of the same loop with each byte widened to an int before the multiply (see the sketch below).
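A minimal sketch of the kind of scalar equivalents being contrasted here, assuming a plain dot product (the method names below are illustrative, not Lucene's exact code):

```java
// Float case: multiply-accumulate, no conversions needed.
static float dotProduct(float[] a, float[] b) {
  float total = 0f;
  for (int i = 0; i < a.length; i++) {
    total += a[i] * b[i];
  }
  return total;
}

// Binary (byte) case: each byte is implicitly sign-extended (widened) to an
// int before the multiply, and the accumulation happens in int. Those
// widening conversions are the extra work being referred to.
static int dotProduct(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; i++) {
    total += a[i] * b[i]; // a[i] and b[i] are widened to int here
  }
  return total;
}
```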
You can understand the limitations at the hardware level better by reading https://en.wikichip.org/wiki/x86/avx512_vnni
Also their suggested replacement of 3 instructions for the
I can tell you this is also not what is happening. We have no ability to write AVX-512-specific code and currently have to support ARM, machines with only AVX-256, etc.
As far as the ARM goes, the fact it has only 128-bit SIMD is the limiting factor. For e.g. AVX-256, we use 64-bit vector of 8 byte values -> 128-bit vector of 8 short values -> 256-bit vector of 8 int values. For ARM/NEON with only 128-bit, we can't do this as we don't have 256-bit vectors. So instead we use 64-bit vector of 8 byte values -> 128-bit vector of 8 short values -> 2 128-bit vectors of 4 int values each (see the sketch below). It requires splitting the vector in half; it is just all we can do. If you want it to be faster, get an ARM with SVE SIMD, which has bigger vectors than NEON.
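A rough sketch of that 128-bit widening chain using the Panama Vector API (jdk.incubator.vector). This is an illustration of the conversion steps described above, not the actual Lucene implementation, and the method name is made up:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorOperators;

// 8 bytes -> 8 shorts -> two halves of 4 ints each, accumulated in int lanes.
static int dotProductBytes128(byte[] a, byte[] b) {
  IntVector acc = IntVector.zero(IntVector.SPECIES_128);
  int i = 0;
  int bound = ByteVector.SPECIES_64.loopBound(a.length);
  for (; i < bound; i += ByteVector.SPECIES_64.length()) {
    // 64-bit vector of 8 byte values
    ByteVector va = ByteVector.fromArray(ByteVector.SPECIES_64, a, i);
    ByteVector vb = ByteVector.fromArray(ByteVector.SPECIES_64, b, i);
    // -> 128-bit vector of 8 short values (a byte*byte product always fits in a short)
    ShortVector sa = (ShortVector) va.convertShape(VectorOperators.B2S, ShortVector.SPECIES_128, 0);
    ShortVector sb = (ShortVector) vb.convertShape(VectorOperators.B2S, ShortVector.SPECIES_128, 0);
    ShortVector prod = sa.mul(sb);
    // -> 2 128-bit vectors of 4 int values each: the split-in-half step
    IntVector lo = (IntVector) prod.convertShape(VectorOperators.S2I, IntVector.SPECIES_128, 0);
    IntVector hi = (IntVector) prod.convertShape(VectorOperators.S2I, IntVector.SPECIES_128, 1);
    acc = acc.add(lo).add(hi);
  }
  int sum = acc.reduceLanes(VectorOperators.ADD);
  for (; i < a.length; i++) {
    sum += a[i] * b[i]; // scalar tail
  }
  return sum;
}
```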
My recommendation: stop messing around with
Actually it is worse: Java 20 introduced conversion between short/float, but we got neither a native |
See openjdk/jdk#9422 (Java 20)
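For reference, the scalar float/half-float conversions that did arrive in Java 20 via the change linked above are `Float.floatToFloat16` and `Float.float16ToFloat`; a tiny usage sketch (as of JDK 21 there is no corresponding half-float lane type in the Panama Vector API, which is the gap being lamented here):

```java
// Scalar binary16 conversions added in Java 20 (see openjdk/jdk#9422).
// Vectorized half-float math still has to round-trip through float.
static short toHalfFloat(float value) {
  return Float.floatToFloat16(value);  // float -> IEEE 754 binary16 bits
}

static float fromHalfFloat(short bits) {
  return Float.float16ToFloat(bits);   // IEEE 754 binary16 bits -> float
}
```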
We should at least fix the field we have in the sandbox to use it and start playing with the possibilities/performance? https://github.com/apache/lucene/blob/main/lucene/sandbox/src/java/org/apache/lucene/sandbox/document/HalfFloatPoint.java
@benwtrent I looked into this more and eked a bit more out: #12632
Thank you @rmuir && @ChrisHegarty for digging into this! The current Panama Vector API makes doing this kind of thing frustrating. Thank y'all for wrestling with it to make us faster.
@benwtrent I think a big source of confusion is that while the data might be
From my analysis, the code being generated is correct. I recommend exploring half-float instead for better performance and space trade-offs.
Description
While testing and digging around, I noticed that our float comparisons are way faster than our byte comparisons on my MacBook (M1), and pretty much the same as our byte comparisons on a GCP Intel Sapphire Rapids CPU.
This seems counter-intuitive to me. I would expect Panama to be able to do more byte operations per cycle than float. My guess is the intrinsics are weird? Panama Vector just doesn't support or detect the required operations?

Here are two benchmark results using @rmuir's helpful vectorbench project:
MacBook (Apple Silicon [128bits], JDK21):
GCP (Intel Sapphire Rapids [avx512], JDK21):