Make ForUtil Vectorized #12396
I think it would be helpful if you could present the results with fewer "significant" digits; as it is, they are hard to interpret. And if you are considering changing the index format for higher speed, you should also indicate the relative size change. I expect the SIMD-optimized version you are testing doesn't do the same kind of compression that Lucene does? That would have a significant impact on a whole system due to increased disk storage, memory usage, and paging.
As of now, I am striving to maintain the compression format that was previously established, and once I'm done, I'll present the results in a clearer, easier-to-understand manner.
Hi, this is a good idea: ForUtil is, next to PackedInts, a good candidate for SIMD. If you have something implemented, open a PR and we will think of the best way to include it in what is currently there. Ideally all of this should be implemented in VectorUtilProvider, but we may rename that class to be more generic. I would really keep only one provider interface to make it more easily pluggable.
In my opinion, the best would be to proceed here without integrating into Lucene yet, and I can have a look at the code. The provider interfaces in Lucene are still package private, so there's a possibility to make some changes. As said before, for all vector operations there should only be one provider class with basically two implementations in Lucene (VectorUtilDefaultProvider and VectorUtilPanamaProvider, one for each Java version). Actually, if the code is nicely isolated from the rest of Lucene's code, it should not be too complicated to include it. I will take care of that.
Thanks for looking into this! For reference, I've been separately looking into whether we could vectorize prefix sums, which is one bottleneck of postings decoding today: we managed to get auto-vectorization to help with bit unpacking, but not with computing a prefix sum of doc IDs: https://github.com/jpountz/vectorized-prefix-sum. Unfortunately, Java's vector API doesn't offer a way to perform a shift across lanes efficiently.
Crazy question: do we really need a vectorized prefix sum for the postings list? Could we just decode the deltas, and lazily defer computation of the accumulated docid sum until it's needed, e.g. in the PostingsEnum? I guess I'm just unsure why we need to delta-decode the entire block eagerly.
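The two alternatives discussed above can be sketched in plain scalar Java. This is a hedged illustration, not Lucene's actual `PostingsEnum`: the class and method names below are made up for the example. The eager variant materializes all doc IDs up front; the lazy cursor accumulates only as far as the consumer advances.

```java
// Hypothetical sketch of eager vs. lazy delta decoding of doc IDs.
public class DeltaDecode {
    // Eager: one pass turns a block of deltas into absolute doc IDs.
    public static int[] prefixSum(int[] deltas, int base) {
        int[] docs = new int[deltas.length];
        int acc = base;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i];   // running sum is the absolute doc ID
            docs[i] = acc;
        }
        return docs;
    }

    // Lazy: defer accumulation until each doc ID is actually requested.
    public static final class Cursor {
        private final int[] deltas;
        private int i = 0;
        private int doc;
        public Cursor(int[] deltas, int base) { this.deltas = deltas; this.doc = base; }
        public int nextDoc() { return doc += deltas[i++]; }
    }
}
```

The trade-off is that the lazy version avoids up-front work for consumers that skip ahead, but puts a data dependency on every `nextDoc()` call, which is exactly what the vectorized prefix sum tries to break.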
I have successfully implemented all encode methods in ForUtil while keeping the compression format unchanged. Here are the results.
While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress 64-bit long values. To address this, it may be worthwhile to check whether the user's host supports 256-bit vectors; if not, scalar code can be used for compression. Secondly, the code can become quite large due to the need for specific code for each bitsPerValue. As a result, it is comparable to writing assembly code, with approximately 6000+ lines required. I seek counsel from others regarding these two drawbacks. If they are deemed acceptable, I intend to submit a draft PR this weekend.
Hi, the class header says // This file has been automatically generated, DO NOT EDIT. Actually the class is generated by a Python script and placed in several folders (also for backwards codecs): https://github.com/apache/lucene/blob/main/gradle/generation/forUtil.gradle. So there should be a similar way to regenerate the vectorized variants (also possibly for multiple Java versions like 20, 21, 22, if those would differ). Of course it is a bit hard to implement this autogen setup with the current multi-release JAR setup, so there's a bit more work needed. I am happy to help or take over here, but give me a few days to think about it. I don't want to rush anything. Also, the backwards-codecs JAR file has no support for Multi-Release and the whole Panama lookup won't work there, but I would suggest:
Both impls look the same, just in different packages, so I tend to move them to Lucene Core and include them in VectorUtil. But maybe there's a difference with byte order. I have some ideas, but we can hide it from public code using the provider interfaces we already have to look up the instances. If the implementation in 9.0 and the one for 8.4 backwards really differ, I would only implement a generator for the 9.0 version.
Actually, if you could write some Python 3 for the generation, that would help. You can take the current Python script as a "basis" and work from there.
|
To implement more classes like this with vectorization, I will try to add an additional abstraction layer (move the current abstraction to VectorProvider) and allow that one to return instances of several interfaces (VectorUtilProvider, ForUtilProvider, XxxxProvider). I still need to find a way for the package-private ForUtil to stay at its current place in the codecs. But there are solutions; I have to think a bit more. I will also check if the 8.4 and the 9.0 versions differ at all.
The above would be a separate PR to clean up the ad-hoc internal implementation of the Panama integration a bit. The implementation developed here could then be added to the new Lucene framework in a separate PR.
But this is just a current hack, from my understanding: a hack to try to provoke autovectorization. The things we are decoding are really 32-bit integers.
Robert is right, let's not try to keep the current format, it's a hack. If you can make incompatible changes that perform better, we can work out the backward compatibility layer afterwards. |
If I'm not mistaken, the specific interface required here is one that abstracts out ForUtil.
This interface does not need to be shared outside of its package; both the default and Panama implementations can reside in the same package. Maybe we need this interface for different codec versions? If so, then it could be copied; otherwise we need to find a "sharable" package. We currently have a…
Hey, please for now let me do the integration. I have a plan already, because with the current version it won't work. I have a half-baked version ready; I am just waiting for input. I tend to split the classes a bit. When you implemented VectorUtilProvider, it was not really a provider, it was the implementation. My idea is: have a VectorizationProvider with the current lookup code that just has a getter for each implementation. If we want to have several impls, we can do that. At the moment I am thinking of having all that code in a subpackage of util. We have no secrets at the moment, so the implementation can only be inside the util package. If there are different versions per codec, we would only implement the latest version with vectorization. The older codecs won't get vectorized.
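The split described above might look roughly like the following. This is a hedged sketch only: the interface names, method signatures, and the trivial default implementations are assumptions for illustration, not Lucene's actual classes (the real setup landed later in #12410).

```java
// Hypothetical sketch: one lookup class handing out per-feature provider instances.
public class ProviderSketch {
    // Each vectorizable area gets its own small interface...
    interface VectorUtilProvider { float dotProduct(float[] a, float[] b); }
    interface ForUtilProvider { long[] decode(int bitsPerValue, long[] packed); }

    // ...and a single VectorizationProvider decides (e.g. by Java version /
    // Panama availability) which implementations to return.
    static final class VectorizationProvider {
        static VectorizationProvider getInstance() {
            // Real code would probe for the Panama vector module here.
            return new VectorizationProvider();
        }
        VectorUtilProvider getVectorUtilProvider() {
            // Scalar default implementation as a placeholder.
            return (a, b) -> {
                float sum = 0;
                for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
                return sum;
            };
        }
        ForUtilProvider getForUtilProvider() {
            // Placeholder: identity "decoding" stands in for a real implementation.
            return (bitsPerValue, packed) -> packed.clone();
        }
    }
}
```

The point of the design is that callers depend only on the small interfaces, while the lookup logic for the Panama vs. default implementation lives in exactly one place.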
Once we have some code ready here, I will submit a PR with the refactoring as preparation.
I've had some things to take care of in the past few days 😅, so I haven't made much progress. I'll be providing specific implementation in the next few days. |
@uschindler Ok, thanks. Sounds like you have this under control. And what you say makes sense. Did the small sketch of the ForUtil interface I provided above make sense? Is that the level where we’re thinking that this will be abstracted? This is key to understanding how much flexibility we have in the underlying implementations. |
I like the idea to put everything under internal; we have that package already. But I would keep the implementations all under one package and let codecs use them (a specific impl per codec is ok, just different class names/getters in the provider interface). Let me provide the general setup later this weekend or the beginning of next week.
Thanks @uschindler, no great urgency on this part. As you have said earlier, work on the actual vector implementation can proceed independently of the general infrastructure - we can just copy the bits that we need for now, then refactor them out later. |
@tang-hi thanks for the update. I have some time later this weekend, I’ll see if I can get this started, and take a look at your code. |
I had some ideas in mind that the general lookup part just makes sure there is a Panama impl available and then delegates to the Panama provider to decide which implementation to use. I plan to open an issue on the JDK for 22 to have some way to detect which features are available (like FMA). Then we can also let the Panama provider decide, based on (future) feature reports, to return implementations accordingly (like a full FMA impl of our current vector util), based on what the JDK reports back. We should really start to work on asking the Panama OpenJDK community to provide some information about which features are available. The current state of not knowing anything is far from useful.
@uschindler |
Great! I am in weekend mode, but I have a rewrite of the provider lookup almost ready. I will check how to include it. |
I've updated this Python file, added some comments, and made it more comprehensible. Feel free to raise any questions if you have any. 😊
Here is my PR with the general setup to easily plug in different interfaces to support vectorization in other places (not only VectorUtil). It uses a non-exported package and extra checks to prevent misuse from downstream code: #12410. Before working on this issue, I would like that one to be merged first.
P.S.: I checked: ForUtil in 9.x is different from that in 8.x because of the byte order and some other slight changes. So I would only vectorize the core one (9.x) and leave out the backwards-compatibility impl.
I agree with that. The problem is that we need the Panama-vectorized implementation and the scalar one to behave identically, otherwise indexes would not be compatible. In the future, when we have vectorized everything, hopefully with Java 42 (the answer to all questions), we can change the format :-)
What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get slightly worse performance compared to today, but users who care about performance would most likely be able to enable it?
Perhaps I could attempt to implement both scalar and vectorized versions of different compression formats in order to assess the extent of performance degradation in the scalar version and performance enhancement in the vectorized version. We could then make a decision based on the results of the experiment. |
The framework to plug in custom formats. Hints:
I can mock it up for ForUtil (it looks simple, just a bit of copy-paste and defining an interface) and replacing the current implementation.
++ Let's use the format that vectorizes best. And yes, we will need a scalar implementation of that too, but to get started let's assume that the format is changeable.
We could start by going back to the original Lucene 5.0 format. Plain and simple... It is still in backwards-codecs.
I am also not sure if we really need to unroll all those loops and have different impls for each size.
One small suggestion: please stay with the classical switch statement, as backporting the Python code and Java code to Java 11 is getting harder (please remember: the vectorized code is also compiled with the Java 11 compiler in the 9.x branch!). I just noticed that in @tang-hi's code.
Thank you for your suggestion! I will take note of it. |
I believe that the Lucene 5.0 format may not be appropriate due to its rounding up of bitsPerValue. For example, it uses 8 bits to compress a 3-bit value, resulting in larger index files. However, I have already implemented a vectorized version of a different compression format that maintains the same size. The outcome appears promising.
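The size cost of rounding up is easy to quantify. Assuming a 128-value block (as ForUtil uses), exact packing at 3 bits per value needs 48 bytes, while rounding up to 8 bits needs 128 bytes, roughly 2.7x larger on disk:

```java
// Back-of-the-envelope size comparison for a 128-value block.
public class SizeMath {
    // Bytes needed to pack numValues values of bitsPerValue bits each, rounded
    // up to whole bytes.
    public static int packedBytes(int numValues, int bitsPerValue) {
        return (numValues * bitsPerValue + 7) / 8;
    }
}
```

For a block of 128 values: `packedBytes(128, 3)` is 48, while byte-aligned storage, `packedBytes(128, 8)`, is 128.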
The code is straightforward, as shown below. (LANE4_SPECIES is a 4-lane LongVector species and numEncodeLength is the number of lanes stored per write; both are fields of the enclosing class.)

public void encode(long[] values, int bitsPerValue, long[] output) {
    // Use a long mask: an int cast would corrupt the mask for bitsPerValue >= 32.
    long MASK = (1L << bitsPerValue) - 1;
    int bitsRemaining = 64;
    int upto = 0;
    int totalCompressedLine = 2 * bitsPerValue;
    int next = 0;
    LongVector input = LongVector.zero(LANE4_SPECIES);
    while (next < 128) {
        if (bitsRemaining >= bitsPerValue) {
            // The next 4 values still fit into the current accumulator lanes.
            input = input.or(LongVector.fromArray(LANE4_SPECIES, values, next).and(MASK)
                .lanewise(VectorOperators.LSHL, bitsRemaining - bitsPerValue));
            bitsRemaining -= bitsPerValue;
        } else {
            // Values straddle a 64-bit boundary: flush and carry the low bits over.
            LongVector valueVector = LongVector.fromArray(LANE4_SPECIES, values, next).and(MASK);
            input = input.or(valueVector.lanewise(VectorOperators.LSHR, bitsPerValue - bitsRemaining));
            input.intoArray(output, upto);
            upto += numEncodeLength;
            input = valueVector.lanewise(VectorOperators.LSHL, 64 - bitsPerValue + bitsRemaining);
            bitsRemaining -= bitsPerValue;
            bitsRemaining += 64;
        }
        if (bitsRemaining == 0) {
            input.intoArray(output, upto);
            upto += numEncodeLength;
            input = LongVector.zero(LANE4_SPECIES);
            bitsRemaining = 64;
        }
        next += 4;
    }
    if (totalCompressedLine % 4 != 0) {
        // Fold the two leftover half-words into the tail of the output.
        input.intoArray(output, upto);
        output[totalCompressedLine - 2] |= (output[totalCompressedLine] >>> 32);
        output[totalCompressedLine - 1] |= (output[totalCompressedLine + 1] >>> 32);
    }
}
public void decode(int bitsPerValue, long[] input, long[] output) {
    // LOW and HIGH are class constants masking the low and high 32 bits of a long.
    // No int cast on the mask: it would sign-extend for bitsPerValue >= 32.
    long MASK = (1L << bitsPerValue) - 1;
    int upto = 0;
    int next = 0;
    int totalCompressedLine = 2 * bitsPerValue;
    int bitsRemaining = 64;
    LongVector inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
    next += 4;
    if (totalCompressedLine % 4 != 0) {
        // Undo the tail folding done by encode.
        input[totalCompressedLine] = (input[totalCompressedLine - 2] & LOW) << 32;
        input[totalCompressedLine + 1] = (input[totalCompressedLine - 1] & LOW) << 32;
        input[totalCompressedLine - 2] &= HIGH;
        input[totalCompressedLine - 1] &= HIGH;
    }
    while (upto < 128) {
        if (bitsRemaining >= bitsPerValue) {
            // The next 4 values are fully contained in the current lanes.
            LongVector res = inputVector.and(MASK << (bitsRemaining - bitsPerValue))
                .lanewise(VectorOperators.LSHR, bitsRemaining - bitsPerValue);
            bitsRemaining -= bitsPerValue;
            res.intoArray(output, upto);
            upto += 4;
        } else {
            // Values straddle a 64-bit boundary: combine bits from two input vectors.
            int bitDiff = bitsPerValue - bitsRemaining;
            LongVector res = inputVector.and(MASK >> (bitsPerValue - bitsRemaining))
                .lanewise(VectorOperators.LSHL, bitDiff);
            inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
            next += 4;
            var temp = inputVector.and((MASK >> bitsRemaining) << (64 - bitDiff));
            res = res.or(temp.lanewise(VectorOperators.LSHR, 64 - bitDiff));
            res.intoArray(output, upto);
            upto += 4;
            bitsRemaining -= bitsPerValue;
            bitsRemaining += 64;
        }
        if (bitsRemaining == 0) {
            inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
            next += 4;
            bitsRemaining = 64;
        }
    }
}

I will proceed with testing the scalar version and decoding. Once everything is prepared, I will submit a pull request this week.
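For readers who want to follow the packing logic without the Vector API, here is a scalar round-trip sketch of the same idea: values of bitsPerValue bits packed high-bits-first into longs, with values allowed to straddle 64-bit word boundaries. This is an illustration only, not Lucene's ForUtil or the vectorized code above, and the class name is made up.

```java
// Minimal scalar FOR-style bit packing of a 128-value block (bitsPerValue < 64).
public class ScalarPack {
    public static long[] pack(long[] values, int bitsPerValue) {
        long[] out = new long[(128 * bitsPerValue + 63) / 64];
        long mask = (1L << bitsPerValue) - 1;
        int bitPos = 0; // absolute bit position in the output stream
        for (long v : values) {
            v &= mask;
            int word = bitPos / 64, bitsLeft = 64 - (bitPos % 64);
            if (bitsLeft >= bitsPerValue) {
                out[word] |= v << (bitsLeft - bitsPerValue);
            } else {
                // Value straddles a word boundary: split it across two longs.
                int spill = bitsPerValue - bitsLeft;
                out[word] |= v >>> spill;
                out[word + 1] |= v << (64 - spill);
            }
            bitPos += bitsPerValue;
        }
        return out;
    }

    public static long[] unpack(long[] packed, int bitsPerValue) {
        long[] out = new long[128];
        long mask = (1L << bitsPerValue) - 1;
        int bitPos = 0;
        for (int i = 0; i < 128; i++) {
            int word = bitPos / 64, bitsLeft = 64 - (bitPos % 64);
            if (bitsLeft >= bitsPerValue) {
                out[i] = (packed[word] >>> (bitsLeft - bitsPerValue)) & mask;
            } else {
                int spill = bitsPerValue - bitsLeft;
                out[i] = ((packed[word] << spill) | (packed[word + 1] >>> (64 - spill))) & mask;
            }
            bitPos += bitsPerValue;
        }
        return out;
    }
}
```

Note that at bitsPerValue = 3 the 128 values occupy six longs, which is exactly the "cannot be evenly divided by four lanes" case discussed later in this thread.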
Hi, I just noticed that forUtil.gradle has a bug in Lucene 9. It does not regenerate anything because the actual Python script is never executed. I will provide a separate fix in a minute. Interestingly, the script does not execute correctly:
That's not good!
Here is the bugfix for the generator: #12411. This bug caused a lot of trouble here because I did not understand why the script wasn't executed at all.
Hi, what I am wondering about: the generated PanamaForUtil is immensely huge (258 KiB), and it looks like only encoding is vectorized, while decoding looks like a very old version of the code that explodes somehow. Where did you get that code from? The vectorized encode seems to work; the indexes and tests are working. I also noticed a small problem with the new framework regarding package-private caller classes. I will think about a good solution. Anyway, I will make the branch available as a draft PR so you can see how it was done. It was a bit more work because ForUtil was used in many places and each call to PForUtil needs a ForUtil instance. I refactored this and was able to make it work.
I have currently only provided a vectorized version for encoding, while keeping decoding unchanged. This is because I wanted to ensure that the encoding part is working before implementing a vectorized version of decoding. However, based on @jpountz's response, I am now trying to use a more concise version (with a different compression format but the same space usage). I will update the Python script with this version later.
…9.0 codec (only encode!). It should serve as an example of how it works; this is not ready for production, although all tests pass.
Here is an example of what I did with your code: #12412. It should demonstrate how I included it. Most interesting is the cleanup inside the lucene90 codec. I also defined the interface very quickly, but actually I did not check that all of it is used there. I defined some of the constants in the interface (…).
Thank you so much! I will take a look at this PR tomorrow. It's getting late for me now, so I need to go to sleep. 😊 |
I would like to pause a little, double-check where we're going, and reset if needed. It's become clear to me that we're not quite aligned. The main issue I see is with the interface. All the code I've seen so far operates on long values, but what we want is an int-based interface, since the things we are decoding are really 32-bit integers. If we're in agreement with the above, then I'd like to propose that we go back to basics: implement all the various bitsPerValue variants for the Int128Vector species. This will effectively be a port of the non-masking pack/unpack from https://github.com/lemire/simdcomp/blob/master/src/simdbitpacking.c. We can then compare these to the existing ForUtil implementations and tweak the actual code as necessary. I've started this in this repo: https://github.com/ChrisHegarty/bitpacking (a pure port for now; we can analyse the benchmarks and tweak the code once complete). Clearly, there is a lot more to do: a scalar implementation of the new format will be required, among other things. Have I missed something? What do you think?
I agree. There are more complications: DataInput does not have a read method for int[], only ones for float[] and long[]. So changing this is a bigger task. I tend to think that we should take some time, and possibly also delay this until Lucene 10.0. This was the reason why I said: let's keep the index format as is and reimplement the current format with vectors, but also invent a new one, with new int-based methods in the IO layer, for later.
Switching from LongVector to IntVector is feasible, especially for 128-bit, as only a few modifications are needed. Special handling may be required for 256-bit and 512-bit, but it shouldn't be a big issue. Additionally, when the output is also int[], the handling will be simpler. For example, if it were long[] and bpv were 3, we would need to store the block as six longs, which cannot be evenly divided by four lanes. With int[], this problem doesn't exist. Currently, I am more concerned about the scalar code. Yesterday, I tested it and found that the original scalar decode code was amazingly fast, almost on par with SIMD speed at 50 ops/us. On the other hand, the scalar code using the new compression format only achieved around 10+ ops/us. I would like to know if we can tolerate some performance loss in the scalar code if we switch the compression format. What is the minimum performance threshold we can accept?
To me, the numbers that matter are more the numbers for end-to-end search latency rather than pure decoding speed, e.g. the same benchmark as @mikemccand's nightlies. I'd be good with this change if there is a noticeable speedup with the new vector code and the scalar code only degrades performance by a few percent. (It would be very easy to test this by plugging this new ForUtil into a new codec; I can help with this if needed.)
I have attempted to implement an Int version of the scalar and vectorized forutil. I have submitted a draft PR as a simple starting point for those interested in this issue. Even if it is not ultimately merged, it can still be useful. Here is my current progress:
However, I am not sure why many of the tests are failing, even though the tests for PForUtil and ForUtil are passing. I will take a closer look at the specific reasons when I have time. Could changing certain members from long to int affect correctness? I am not completely familiar with the index structure of Lucene, so please let me know if you have any discoveries. It is also possible that there are issues with my code implementation.
When using the int type, there is a significant performance improvement compared to the long type, approximately 2-3 times. You can refer to the link for more details. The performance of decoding and encoding is similar, except that the original scalar decoding code has a speed of around 50 ops/us, while the vectorized version is more than twice as fast.
I just noticed: We have |
This could be because of different memory sizes. Lucene calculates the memory usage of the codecs' on-heap data. Maybe that is now a bit wrong, as you have not adapted the ramBytesUsed() methods in some of the classes. Without knowing which tests failed (I will try later), I can't give a good guess. I will check later.
Description
Since the introduction of the Vector API into Lucene via #12311, I have found it to be an interesting tool. As a result, I have attempted to use it to rewrite the ForUtil.java file. Initially, I attempted to port simdcomp and implement pack and unpack for bitsPerValue values less than or equal to 8. The results of my efforts can be seen below.
However, I noticed that the compression format used in ForUtil.java was different. It employed some tricks, such as SIMD-friendly layouts, to speed up the process. Therefore, I attempted to vectorize it while maintaining the compression format. The results can be seen below.
I have only implemented bitsPerValue values of 1, 2, 4, and 8. I am curious whether it is possible to change the compression format. Additionally, do you have any best practices for integrating vectorized code into Lucene? Any suggestions would be appreciated.
Currently, I am working on my own repo. However, the code is still in a rough state and lacks documentation.