
Make ForUtil Vectorized #12396

Open · tang-hi opened this issue Jun 25, 2023 · 53 comments · May be fixed by #12417

Comments

@tang-hi
Contributor

tang-hi commented Jun 25, 2023

Description

Since the introduction of the Vector API into Lucene via #12311, I have found it to be an interesting tool, so I attempted to use it to rewrite ForUtil.java. Initially, I ported simdcomp and implemented pack and unpack for bitsPerValue values less than or equal to 8. The results of my efforts can be seen below.
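
As a rough illustration of what such a port looks like, here is a minimal sketch of the simdcomp-style "vertical" pack for bitsPerValue = 4 using 128-bit int lanes. It assumes the values already fit in 4 bits (simdcomp's non-masking variant) and is not the exact code I benchmarked:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

static void pack4(int[] values, int[] packed) {
  int pOff = 0;
  for (int i = 0; i < 128; ) {                    // 128 values -> 16 packed ints
    IntVector acc = IntVector.zero(SPECIES);
    for (int shift = 0; shift < 32; shift += 4) { // 8 values stacked per 32-bit lane
      acc = acc.or(IntVector.fromArray(SPECIES, values, i)
          .lanewise(VectorOperators.LSHL, shift));
      i += SPECIES.length();
    }
    acc.intoArray(packed, pOff);
    pOff += SPECIES.length();
  }
}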

Benchmark Mode Cnt Score Error Units
VectorizeBenchmark.decode1 thrpt 15 58293945.447 ± 1524549.587 ops/s
VectorizeBenchmark.decode2 thrpt 15 55598229.538 ± 4516920.135 ops/s
VectorizeBenchmark.decode3 thrpt 15 57163871.965 ± 1566246.490 ops/s
VectorizeBenchmark.decode4 thrpt 15 55128874.528 ± 4752397.170 ops/s
VectorizeBenchmark.decode5 thrpt 15 53822335.729 ± 4599217.489 ops/s
VectorizeBenchmark.decode6 thrpt 15 48155246.120 ± 7519360.551 ops/s
VectorizeBenchmark.decode7 thrpt 15 50253799.192 ± 820075.648 ops/s
VectorizeBenchmark.decode8 thrpt 15 68849728.856 ± 1818468.973 ops/s
VectorizeBenchmark.encode1 thrpt 15 33998510.772 ± 2924618.992 ops/s
VectorizeBenchmark.encode2 thrpt 15 43238190.552 ± 810373.966 ops/s
VectorizeBenchmark.encode3 thrpt 15 36613553.485 ± 483115.838 ops/s
VectorizeBenchmark.encode4 thrpt 15 45675726.831 ± 1081153.655 ops/s
VectorizeBenchmark.encode5 thrpt 15 33591855.278 ± 1084009.112 ops/s
VectorizeBenchmark.encode6 thrpt 15 36110726.127 ± 767075.709 ops/s
VectorizeBenchmark.encode7 thrpt 15 34754339.379 ± 275025.123 ops/s
VectorizeBenchmark.encode8 thrpt 15 55075742.358 ± 991165.320 ops/s
VectorizeBenchmark.vectorizedDecode1 thrpt 15 43878020.796 ± 7148545.623 ops/s
VectorizeBenchmark.vectorizedDecode2 thrpt 15 103091446.773 ± 44115190.011 ops/s
VectorizeBenchmark.vectorizedDecode3 thrpt 15 83168059.373 ± 24930903.852 ops/s
VectorizeBenchmark.vectorizedDecode4 thrpt 15 63156089.355 ± 15039408.293 ops/s
VectorizeBenchmark.vectorizedDecode5 thrpt 15 96567546.695 ± 37142784.493 ops/s
VectorizeBenchmark.vectorizedDecode6 thrpt 15 73897063.180 ± 11549757.437 ops/s
VectorizeBenchmark.vectorizedDecode7 thrpt 15 79716185.567 ± 29990852.039 ops/s
VectorizeBenchmark.vectorizedDecode8 thrpt 15 92621676.617 ± 29702056.667 ops/s
VectorizeBenchmark.vectorizedEncode1 thrpt 15 51140300.852 ± 139758.385 ops/s
VectorizeBenchmark.vectorizedEncode2 thrpt 15 82646100.574 ± 1289600.954 ops/s
VectorizeBenchmark.vectorizedEncode3 thrpt 15 88124485.953 ± 742170.198 ops/s
VectorizeBenchmark.vectorizedEncode4 thrpt 15 91029285.467 ± 5594858.437 ops/s
VectorizeBenchmark.vectorizedEncode5 thrpt 15 96843051.648 ± 8024430.836 ops/s
VectorizeBenchmark.vectorizedEncode6 thrpt 15 98596724.128 ± 10068466.227 ops/s
VectorizeBenchmark.vectorizedEncode7 thrpt 15 85885746.715 ± 6031740.563 ops/s
VectorizeBenchmark.vectorizedEncode8 thrpt 15 117139889.194 ± 8721517.095 ops/s

However, I noticed that the compression format used in ForUtil.java is different: it employs some tricks to speed up the process, such as a layout designed to trigger SIMD auto-vectorization. Therefore, I attempted to vectorize it while maintaining the compression format. The results can be seen below.

Benchmark Mode Cnt Score Error Units
Benchmark.encode1 thrpt 15 38017254.040 ± 3905466.628 ops/s
Benchmark.encode2 thrpt 15 45170109.395 ± 1539203.478 ops/s
Benchmark.encode3 thrpt 15 38757256.653 ± 1044709.221 ops/s
Benchmark.encode4 thrpt 15 49307206.891 ± 799168.007 ops/s
Benchmark.encode5 thrpt 15 35130626.548 ± 792210.817 ops/s
Benchmark.encode6 thrpt 15 38326892.073 ± 981865.963 ops/s
Benchmark.encode7 thrpt 15 37372342.721 ± 1177478.683 ops/s
Benchmark.encode8 thrpt 15 60757390.416 ± 458876.638 ops/s
Benchmark.vectorizedEncode1 thrpt 15 56413094.655 ± 435917.201 ops/s
Benchmark.vectorizedEncode2 thrpt 15 88770400.646 ± 11183716.176 ops/s
Benchmark.vectorizedEncode3 thrpt 15 39932842.378 ± 2366921.921 ops/s
Benchmark.vectorizedEncode4 thrpt 15 85888128.739 ± 5499354.172 ops/s
Benchmark.vectorizedEncode5 thrpt 15 34402027.732 ± 1414839.159 ops/s
Benchmark.vectorizedEncode6 thrpt 15 35794303.501 ± 782940.005 ops/s
Benchmark.vectorizedEncode7 thrpt 15 33845690.180 ± 2586648.353 ops/s
Benchmark.vectorizedEncode8 thrpt 15 97914288.675 ± 8971857.035 ops/s

I have only implemented bitsPerValue values of 1, 2, 4, and 8. I am curious whether it is possible to change the compression format. Additionally, do you have any best practices for integrating vectorized code into Lucene? Any suggestions would be appreciated.

Currently, I am working in my own repo. However, the code is still in a rough state and lacks documentation.

@msokolov
Contributor

I think it would be helpful if you could present the results with fewer "significant" digits; as it is, they are hard to interpret. And if you are considering changing the index format for higher speed, you should also indicate the relative size change. I expect the SIMD-optimized version you are testing doesn't do the same kind of compression that Lucene does? And this would have a significant impact on a whole system due to increased disk storage, memory usage, and paging.

@tang-hi
Contributor Author

tang-hi commented Jun 26, 2023

I think it would be helpful if you could present the results with fewer "significant" digits; as it is, they are hard to interpret. And if you are considering changing the index format for higher speed, you should also indicate the relative size change. I expect the SIMD-optimized version you are testing doesn't do the same kind of compression that Lucene does? And this would have a significant impact on a whole system due to increased disk storage, memory usage, and paging.

As of now, I am striving to maintain the previously established compression format. Once I'm done, I'll present the results in a clearer, easier-to-understand manner.

@uschindler
Contributor

Hi, this is a good idea; next to PackedInts, ForUtil is a good candidate for SIMD.

If you have something implemented, open a PR, and we will think of the best way to include it with what is currently there. Ideally everything should be implemented in VectorUtilProvider, but we may rename this class to be more generic. I would really keep only one provider interface to make it more easily pluggable.

@uschindler
Contributor

In my opinion, the best approach would be to proceed here without integrating into Lucene, and I can have a look at the code. The provider interfaces in Lucene are still package-private, so there's the possibility to make some changes. As said before, for all vector operations there should only be one provider class with basically two implementations in Lucene (VectorUtilDefaultProvider and VectorUtilPanamaProvider, one for each Java version).

Actually, if the code is nicely isolated from the rest of Lucene's code, it should not be too complicated to include it. I will take care of that.

@jpountz
Contributor

jpountz commented Jun 28, 2023

Thanks for looking into this! For reference, I've been separately looking into whether we could vectorize prefix sums, which are one bottleneck of postings decoding today: we managed to get auto-vectorization to help with bit unpacking, but not with computing a prefix sum of doc IDs: https://github.com/jpountz/vectorized-prefix-sum. Unfortunately, Java's vector API doesn't offer a way to perform a shift across lanes efficiently (e.g. IntVector#unslice doesn't compile to vpslldq with SPECIES_128), which is the traditional way prefix sums are vectorized. I've looked into other approaches, but they're all slower than a scalar prefix sum. @ChrisHegarty looked into it and had a few interesting observations; in particular, vectorized prefix sums suddenly become several times faster when bounds checks are disabled: https://github.com/jpountz/vectorized-prefix-sum/issues. So there might be room for improvement in the JDK that would eventually allow us to take advantage of vectorization for prefix sums?
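
For reference, the "shift across lanes" idiom mentioned above looks roughly like the following sketch with the Panama API (names are illustrative; this is the very pattern that currently fails to compile down to vpslldq, so it benchmarks slower than it should):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

/** In-place inclusive prefix sum over deltas[0..len), len a multiple of 4. */
static void prefixSum(int[] deltas, int len) {
  int carry = 0; // running sum of all previous blocks
  for (int i = 0; i < len; i += SPECIES.length()) {
    IntVector v = IntVector.fromArray(SPECIES, deltas, i);
    // log2(lanes) steps; unslice(k) shifts lanes up by k positions, zero-filling.
    v = v.add(v.unslice(1));
    v = v.add(v.unslice(2));
    v = v.add(carry);
    v.intoArray(deltas, i);
    carry = v.lane(SPECIES.length() - 1); // carry into the next block
  }
}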

@rmuir
Member

rmuir commented Jun 28, 2023

crazy question: do we really need a vectorized prefix sum for the postings list? Could we just decode the deltas, and lazily defer computation of the accumulated doc ID sum until it's needed, e.g. in the PostingsEnum's next() method?

I guess I'm just unsure why we need to delta-decode the entire int[] up-front, when the access is all DAAT. Maybe it's a stupid idea, but it's something we could try out independently of any vectorization.
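
A toy sketch of that lazy idea (illustrative names only, not Lucene's actual PostingsEnum internals):

final class LazyDeltaBlock {
  private final int[] deltas; // bit-unpacked deltas, prefix sum not applied
  private int upto;
  private int doc;

  LazyDeltaBlock(int[] deltas, int docBase) {
    this.deltas = deltas;
    this.doc = docBase;
  }

  int nextDoc() {
    return doc += deltas[upto++]; // one add per call instead of 128 up front
  }
}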

@tang-hi
Contributor Author

tang-hi commented Jun 28, 2023

I have successfully implemented all encode methods in ForUtil while keeping the compression format unchanged. Here are the results.

Benchmark Mode Cnt Score (ops/s) Error (ops/s)
Encode1 thrpt 15 30.81M 7.76M
vectorizedEncode1 thrpt 15 52.11M 9.15M
Encode2 thrpt 15 44.97M 972.53K
vectorizedEncode2 thrpt 15 84.76M 11.56M
Encode3 thrpt 15 39.16M 389.41K
vectorizedEncode3 thrpt 15 49.50M 319.99K
Encode4 thrpt 15 49.35M 895.58K
vectorizedEncode4 thrpt 15 84.05M 1.10M
Encode5 thrpt 15 35.39M 1.76M
vectorizedEncode5 thrpt 15 39.75M 2.49M
Encode6 thrpt 15 38.26M 1.16M
vectorizedEncode6 thrpt 15 52.07M 2.29M
Encode7 thrpt 15 37.20M 556.95K
vectorizedEncode7 thrpt 15 48.99M 1.34M
Encode8 thrpt 15 59.05M 1.74M
vectorizedEncode8 thrpt 15 95.48M 7.30M
Encode9 thrpt 15 24.64M 135.16K
vectorizedEncode9 thrpt 15 27.45M 1.29M
Encode10 thrpt 15 24.94M 358.51K
vectorizedEncode10 thrpt 15 27.22M 2.40M
Encode11 thrpt 15 25.11M 819.30K
vectorizedEncode11 thrpt 15 23.89M 1.28M
Encode12 thrpt 15 29.45M 490.67K
vectorizedEncode12 thrpt 15 41.90M 2.30M
Encode13 thrpt 15 24.38M 226.59K
vectorizedEncode13 thrpt 15 24.83M 1.39M
Encode14 thrpt 15 27.44M 282.45K
vectorizedEncode14 thrpt 15 36.28M 3.62M
Encode15 thrpt 15 27.64M 504.77K
vectorizedEncode15 thrpt 15 39.15M 2.88M
Encode16 thrpt 15 55.31M 4.87M
vectorizedEncode16 thrpt 15 84.89M 7.14M
Encode17 thrpt 15 16.27M 152.66K
vectorizedEncode17 thrpt 15 15.04M 740.80K
Encode18 thrpt 15 15.45M 69.44K
vectorizedEncode18 thrpt 15 15.49M 654.85K
Encode19 thrpt 15 12.67M 119.34K
vectorizedEncode19 thrpt 15 13.17M 539.82K
Encode20 thrpt 15 15.90M 184.34K
vectorizedEncode20 thrpt 15 16.03M 1.29M
Encode21 thrpt 15 15.95M 87.33K
vectorizedEncode21 thrpt 15 15.01M 274.69K
Encode22 thrpt 15 16.04M 127.17K
vectorizedEncode22 thrpt 15 15.34M 404.39K
Encode23 thrpt 15 15.62M 180.99K
vectorizedEncode23 thrpt 15 15.11M 615.33K
Encode24 thrpt 15 17.79M 673.54K
vectorizedEncode24 thrpt 15 22.53M 2.13M
Encode25 thrpt 15 15.42M 203.42K
vectorizedEncode25 thrpt 15 13.45M 1.34M
Encode26 thrpt 15 15.20M 187.19K
vectorizedEncode26 thrpt 15 14.91M 402.92K
Encode27 thrpt 15 14.98M 81.71K
vectorizedEncode27 thrpt 15 14.71M 740.17K
Encode28 thrpt 15 17.84M 166.66K
vectorizedEncode28 thrpt 15 20.75M 836.14K
Encode29 thrpt 15 14.66M 177.74K
vectorizedEncode29 thrpt 15 13.89M 392.35
Encode30 thrpt 15 17.43M 462.43K
vectorizedEncode30 thrpt 15 23.42M 2.45M
Encode31 thrpt 15 18.49M 280.37K
vectorizedEncode31 thrpt 15 26.03M 1.40M
Encode32 thrpt 15 50.39M 1.99M
vectorizedEncode32 thrpt 15 38.64M 51.39M

While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress long values (64-bit), leaving only two lanes per vector. To address this, it may be worthwhile to check whether the user's host supports 256-bit vectors; if not, scalar code can be used for compression. Secondly, the code can become quite large due to the need for specific code for each bitsPerValue. As a result, it is comparable to writing assembly code, requiring approximately 6000+ lines.
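
The runtime check could be as simple as this sketch (SPECIES_PREFERRED and vectorBitSize() are real Panama API; the field name is illustrative):

import jdk.incubator.vector.LongVector;

static final boolean USE_VECTORIZED_FORUTIL =
    LongVector.SPECIES_PREFERRED.vectorBitSize() >= 256; // fall back to scalar below 256 bits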

I would welcome opinions on these two drawbacks. If they are deemed acceptable, I intend to submit a draft PR this weekend.

@uschindler
Contributor

uschindler commented Jun 28, 2023

Hi,
if you look at the first line of ForUtil.java, you will see the following comment:

// This file has been automatically generated, DO NOT EDIT

Actually, the class is generated by a Python script and placed in several folders (also for the backwards codecs): https://github.com/apache/lucene/blob/main/gradle/generation/forUtil.gradle

So there should be a similar way to regenerate the vectorized variants (possibly also for multiple Java versions like 20, 21, and 22, if those differ). Of course, it is a bit hard to implement this autogen setup with the current multi-release JAR setup, so there's a bit more work needed. I am happy to help or take over here, but give me a few days to think about it. I don't want to rush anything.

Also, the backwards-codecs JAR file has no multi-release support and the whole Panama lookup won't work there, so I would suggest one of the following:

  • For the backwards compatibility codec we keep the code as is.
  • Or move ForUtil to Lucene core's util package and make it public.

Both impls look the same, just in different packages, so I tend to move them to Lucene Core and include them in VectorUtil. But maybe there's a difference in byte order. I have some ideas, but we can hide it from public code using the provider interfaces we already have to look up the instances.

If the implementation in 9.0 and the one for the 8.4 backwards compatibility really differ, I would only implement a generator for the 9.0 version.

@uschindler
Contributor

Actually, if you could write a python3 script like gen_vectorized20_ForUtil.py (for Java 20) that generates your VectorizedForUtil, I could check how to include it with the current build system.

@uschindler
Contributor

You can take the current python script as "basis" and work from there.

@tang-hi
Contributor Author

tang-hi commented Jun 28, 2023

You can take the current python script as "basis" and work from there.
Great, I will give it a try! 😄

@uschindler
Contributor

To implement more classes like this with vectorization, I will try to add an additional abstraction layer (moving the current abstraction to VectorProvider) and allow it to return instances of several interfaces (VectorUtilProvider, ForUtilProvider, XxxxProvider).

I still need to find a way around the package-private ForUtil so it can stay at its current place in the codecs. But there are solutions; I have to think a bit more. I will also check whether the 8.4 and 9.0 versions differ at all.

@uschindler
Contributor

The above would be a separate PR to clean up the ad-hoc internal implementation of the Panama integration a bit. The implementation developed here could then be added to the new Lucene framework in a separate PR.

@rmuir
Member

rmuir commented Jun 28, 2023

While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress long values (64-bit).

But this is just a current hack, from my understanding: a hack to try to provoke auto-vectorization. The things we are decoding are really 32-bit integers.

@jpountz
Contributor

jpountz commented Jun 28, 2023

Robert is right; let's not try to keep the current format, it's a hack. If you can make incompatible changes that perform better, we can work out the backward-compatibility layer afterwards.

@ChrisHegarty
Contributor

If I'm not mistaken, the specific interface required here to abstract out ForUtil would look something like this?

interface ForUtilXXX {
  /** Number of bytes required to encode 128 integers of {@code bitsPerValue} bits per value. */
  int numBytes(int bitsPerValue);

  /** Encodes integers from {@code values} into {@code out}. */
  void encode(int[] values, int bitsPerValue, DataOutput out) throws IOException;

  /** Decodes integers into {@code values}. */
  void decode(int bitsPerValue, DataInput in, int[] values) throws IOException;
}

This interface does not need to be shared outside of its package; both the default and Panama implementations can reside in the same package. Maybe we need this interface for different codec versions? If so, it could be copied; otherwise we need to find a "sharable" package.

We currently have TestSecrets, which supports cross-package access, but even a more general, non-test-specific o.a.l.internal.SharedSecrets is not sufficient here, since we need to share both the vector lookup code and (maybe?) the ForUtilXXX interface. The most straightforward, but less defensive, way I see is to double down on org.apache.lucene.internal as the location to house such things that are Lucene-internal and not to be used by external developers. While not a very novel approach, it is well understood, commonly used, and adds the least amount of friction to the code.

@uschindler
Contributor

uschindler commented Jul 1, 2023

Hey, for now please let me do the integration. I already have a plan, because it won't work with the current version. I have a half-baked version ready; I am just waiting for input. I tend to split the classes a bit. When you implemented VectorUtilProvider, it was not really a provider; it was the implementation.

My idea is: have a VectorizationProvider with the current code that just has a getter, VectorUtilImpl getVectorUtilImpl(), plus more getters for the other classes. The provider interface could also be used to get instances of classes with state (like a full FOR decoder).

If we want to have several impls, we can add VectorizationProvider.getLucene96ForUtil(), returning a version for that codec version. But in reality I would remove the vectorized ones from the backwards codecs.

At the moment I am thinking of having all that code in a subpackage of util: org.apache.lucene.util.vectorization (you had something similar at the beginning). We can hide that package in module-info, but classpath applications would see it, so the interfaces need javadocs.

We have no secrets at the moment, so the implementation can only be inside the utils package. If there are different versions per codec, we would only implement the latest version with vectorization. The older codecs won't get vectorized.

@uschindler
Contributor

Once we have some code ready here, I will submit a PR with the refactoring as preparation.

@tang-hi
Contributor Author

tang-hi commented Jul 1, 2023

Once we have some code ready here, I will submit a PR with the refactoring as preparation.

I've had some things to take care of over the past few days 😅, so I haven't made much progress. I'll provide a specific implementation in the next few days.

@ChrisHegarty
Contributor

@uschindler Ok, thanks. Sounds like you have this under control. And what you say makes sense.

Did the small sketch of the ForUtil interface I provided above make sense? Is that the level where we’re thinking that this will be abstracted? This is key to understanding how much flexibility we have in the underlying implementations.

@uschindler
Contributor

I like the idea of putting everything under internal; we have that package already.

But I would keep the implementations all under one package and let codecs use them (a specific impl per codec is OK, just different class names/getters in the provider interface).

Let me provide the general setup later this weekend or at the beginning of the week.

@ChrisHegarty
Contributor

Thanks @uschindler, no great urgency on this part. As you have said earlier, work on the actual vector implementation can proceed independently of the general infrastructure - we can just copy the bits that we need for now, then refactor them out later.

@ChrisHegarty
Contributor

@tang-hi thanks for the update. I have some time later this weekend; I'll see if I can get this started and take a look at your code.

@uschindler
Contributor

I had some ideas in mind where the general lookup part just makes sure there is a Panama impl available and then delegates to the Panama provider to decide which implementation to use. I plan to open an issue on the JDK for 22 to have some way to detect which features are available (like FMA). That way we can also let the Panama provider decide, based on (future) feature detection and what the JDK reports back, which implementation to return (like a full FMA impl of our current vector util).

We should really start working on asking the Panama OpenJDK community for some information about which features are available. The current state of not knowing anything is far from useful.

@tang-hi
Contributor Author

tang-hi commented Jul 2, 2023

@uschindler
Here is the draft version of gen_vectorized20_ForUtil.py. I plan to enhance the code quality and add some comments in the near future.

@uschindler
Contributor

uschindler commented Jul 2, 2023

Great! I am in weekend mode, but I have a rewrite of the provider lookup almost ready.

I will check how to include it.

@tang-hi
Contributor Author

tang-hi commented Jul 2, 2023

@uschindler Here is the draft version of gen_vectorized20_ForUtil.py. I plan to enhance the code quality and add some comments in the near future.

I've updated this Python file, added some comments, and made it more comprehensible. Feel free to raise any questions you have. 😊

@uschindler
Contributor

Here is my PR with the general setup to easily plug in different interfaces supporting vectorization in other places (not only VectorUtil). It uses a non-exported package and extra checks to prevent misuse from downstream code: #12410

Before working on this, I would like that PR to be merged first.

@uschindler
Contributor

P.S.: I checked: ForUtil in 9.x is different from 8.x because of the byte order and some other slight changes. So I would only vectorize the core one (9.x) and leave out the backwards-compatibility impl.

@uschindler
Contributor

While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress long values (64-bit).

But this is just a current hack, from my understanding: a hack to try to provoke auto-vectorization. The things we are decoding are really 32-bit integers.

I agree with that. The problem is that we need the Panama-vectorized implementation and the scalar one to behave identically, otherwise indexes would not be compatible.

In the future, when we have vectorized everything, hopefully with Java 42, the answer to all questions, we can change the format :-)

@jpountz
Contributor

jpountz commented Jul 3, 2023

What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get somewhat worse performance compared to today, but users who care about performance would most likely be able to enable it?

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get somewhat worse performance compared to today, but users who care about performance would most likely be able to enable it?

Perhaps I could implement both scalar and vectorized versions of different compression formats in order to assess the extent of the performance degradation in the scalar version and the performance enhancement in the vectorized version. We could then make a decision based on the results of the experiment.

@uschindler
Contributor

uschindler commented Jul 3, 2023

The framework to plug custom formats into org.apache.lucene.internal.vectorization was merged, so the stage is set to implement something.

Hints:

  • Both the scalar and the vectorized implementation need to be moved to that package and share a common interface.
  • A getter needs to be added to VectorizationProvider to return a singleton (if stateless) or an instance (if stateful like ForUtil, like newForUtil98() for the Lucene 9.8 codec).
  • The calling class that gets the provider needs to be added for caller sensitive checks to prevent misuse.
  • A test comparing both implementations with random data, extending BaseVectorizationTestCase, needs to be added. See TestVectorUtilSupport for an example.

I can mock it up for ForUtil (it looks simple, just a bit of copy-paste and defining an interface): replace new ForUtil() by defining a constant somewhere in the codec, like private static final VectorizationProvider PROVIDER = VectorizationProvider.getInstance(), and then call PROVIDER.newForUtil98().
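
In code, that wiring would look roughly like this (VectorizationProvider.getInstance() is the API described above; newForUtil98() and the class name are hypothetical until the PR exists):

final class Lucene98PostingsFormatSketch {
  private static final VectorizationProvider PROVIDER = VectorizationProvider.getInstance();

  // ForUtil is stateful, so each reader/writer gets its own instance.
  private final ForUtil forUtil = PROVIDER.newForUtil98();
}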

@ChrisHegarty
Contributor

++ Let's use the format that vectorizes best. And yes, we will need a scalar implementation of that too, but to get started let's assume that the format is changeable.

@uschindler
Contributor

What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get somewhat worse performance compared to today, but users who care about performance would most likely be able to enable it?

We could start by going back to the original Lucene 5.0 format. Plain and simple... It is still in backwards-codecs.

@uschindler
Contributor

I am also not sure that we really need to unroll all those loops and have different impls for each size.

@uschindler
Contributor

One small suggestion: please stay with the classical switch statement, as backporting the Python code and Java code to Java 11 is getting harder (please remember: the vectorized code is also compiled with the Java 11 compiler in the 9.x branch!!!). I just noticed this in @tang-hi's code.
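
For illustration, the dispatch would stay in the classical statement form, which older javac releases accept (the decodeN names are hypothetical stand-ins for the generated per-bpv methods):

static void decodeAny(int bitsPerValue, long[] in, long[] out) {
  switch (bitsPerValue) { // classical switch statement, no arrow-form expression
    case 1:
      decode1(in, out);
      break;
    case 2:
      decode2(in, out);
      break;
    default:
      decodeSlow(bitsPerValue, in, out);
      break;
  }
}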

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

Thank you for your suggestion! I will take note of it.

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

I believe the Lucene 5.0 format may not be appropriate because it rounds up bitsPerValue; for example, it uses 8 bits to compress a 3-bit value, resulting in larger index files. However, I have already implemented a vectorized version of a different compression format that maintains the same size. The outcome appears promising:

Benchmark Mode Cnt Score (ops/us) Error
Encode1 thrpt 15 37.023 5.060
VectorizedEncode1 thrpt 15 53.261 0.850
Encode2 thrpt 15 45.017 0.887
VectorizedEncode2 thrpt 15 56.111 2.234
Encode3 thrpt 15 38.750 0.521
VectorizedEncode3 thrpt 15 49.589 2.518
Encode4 thrpt 15 48.074 1.390
VectorizedEncode4 thrpt 15 55.654 2.580
Encode5 thrpt 15 35.517 0.815
VectorizedEncode5 thrpt 15 50.508 0.887
Encode6 thrpt 15 38.698 0.301
VectorizedEncode6 thrpt 15 48.876 0.926
Encode7 thrpt 15 36.697 0.942
VectorizedEncode7 thrpt 15 50.016 2.425
Encode8 thrpt 15 59.180 2.208
VectorizedEncode8 thrpt 15 54.895 0.497
Encode9 thrpt 15 24.215 0.670
VectorizedEncode9 thrpt 15 49.292 0.616
Encode10 thrpt 15 25.509 0.274
VectorizedEncode10 thrpt 15 46.777 0.762
Encode11 thrpt 15 25.165 0.635
VectorizedEncode11 thrpt 15 46.798 2.554
Encode12 thrpt 15 29.170 0.671
VectorizedEncode12 thrpt 15 47.331 0.994
Encode13 thrpt 15 23.749 1.126
VectorizedEncode13 thrpt 15 46.587 2.468
Encode14 thrpt 15 27.283 0.235
VectorizedEncode14 thrpt 15 44.704 0.805
Encode15 thrpt 15 27.459 1.035
VectorizedEncode15 thrpt 15 45.335 3.178
Encode16 thrpt 15 58.192 0.557
VectorizedEncode16 thrpt 15 52.698 0.918
Encode17 thrpt 15 16.265 0.168
VectorizedEncode17 thrpt 15 45.757 2.126
Encode18 thrpt 15 15.261 0.167
VectorizedEncode18 thrpt 15 44.386 0.807
Encode19 thrpt 15 12.531 0.138
VectorizedEncode19 thrpt 15 45.403 0.854
Encode20 thrpt 15 15.863 0.351
VectorizedEncode20 thrpt 15 42.607 3.698
Encode21 thrpt 15 15.772 0.154
VectorizedEncode21 thrpt 15 45.122 0.777
Encode22 thrpt 15 15.863 0.210
VectorizedEncode22 thrpt 15 42.802 1.240
Encode23 thrpt 15 15.638 0.095
VectorizedEncode23 thrpt 15 44.411 0.536
Encode24 thrpt 15 17.091 0.713
VectorizedEncode24 thrpt 15 42.151 2.151
Encode25 thrpt 15 15.206 0.163
VectorizedEncode25 thrpt 15 43.440 2.078
Encode26 thrpt 15 15.110 0.188
VectorizedEncode26 thrpt 15 40.758 0.416
Encode27 thrpt 15 14.794 0.192
VectorizedEncode27 thrpt 15 43.261 0.494
Encode28 thrpt 15 17.531 0.393
VectorizedEncode28 thrpt 15 41.578 0.838
Encode29 thrpt 15 14.423 0.173
VectorizedEncode29 thrpt 15 36.044 10.191
Encode30 thrpt 15 17.426 0.297
VectorizedEncode30 thrpt 15 40.087 0.791
Encode31 thrpt 15 18.489 0.180
VectorizedEncode31 thrpt 15 42.166 0.625
Encode32 thrpt 15 47.742 4.446
VectorizedEncode32 thrpt 15 54.260 1.806

The code is straightforward, as shown below:

import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;

// Assumed definitions from my repo, not shown in this snippet:
//   LANE4_SPECIES   = LongVector.SPECIES_256 (4 long lanes)
//   numEncodeLength = 4 (one vector's worth of longs)
//   LOW / HIGH      = 0x00000000FFFFFFFFL / 0xFFFFFFFF00000000L
// For odd bitsPerValue, both arrays need two longs of scratch space past the
// compressed data (indices totalCompressedLine and totalCompressedLine + 1).

public void encode(long[] values, int bitsPerValue, long[] output) {
    // Must be a long: an int mask silently breaks once bitsPerValue reaches 32.
    long MASK = (1L << bitsPerValue) - 1;

    int bitsRemaining = 64;                     // free bits left in each accumulator lane
    int upto = 0;                               // write position in output
    int totalCompressedLine = 2 * bitsPerValue; // 128 * bitsPerValue / 64 longs of output
    int next = 0;                               // read position in values

    LongVector input = LongVector.zero(LANE4_SPECIES);
    while (next < 128) {
        if (bitsRemaining >= bitsPerValue) {
            // The next 4 values fit entirely into the current accumulator lanes.
            input = input.or(LongVector.fromArray(LANE4_SPECIES, values, next).and(MASK)
                    .lanewise(VectorOperators.LSHL, bitsRemaining - bitsPerValue));
            bitsRemaining -= bitsPerValue;
        } else {
            // The values straddle a lane boundary: flush the filled lanes and
            // carry the spilled low bits into a fresh accumulator.
            LongVector valueVector = LongVector.fromArray(LANE4_SPECIES, values, next).and(MASK);
            input = input.or(valueVector.lanewise(VectorOperators.LSHR, bitsPerValue - bitsRemaining));
            input.intoArray(output, upto);
            upto += numEncodeLength;
            input = valueVector.lanewise(VectorOperators.LSHL, 64 - bitsPerValue + bitsRemaining);
            bitsRemaining += 64 - bitsPerValue;
        }

        if (bitsRemaining == 0) {
            input.intoArray(output, upto);
            upto += numEncodeLength;
            input = LongVector.zero(LANE4_SPECIES);
            bitsRemaining = 64;
        }
        next += 4;
    }

    if (totalCompressedLine % 4 != 0) {
        // The last partial vector spilled past the compressed length; fold the
        // two overflow longs back into the tail of the real output.
        input.intoArray(output, upto);
        output[totalCompressedLine - 2] |= (output[totalCompressedLine] >>> 32);
        output[totalCompressedLine - 1] |= (output[totalCompressedLine + 1] >>> 32);
    }
}

public void decode(int bitsPerValue, long[] input, long[] output) {
    long MASK = (1L << bitsPerValue) - 1; // long for the same reason as in encode

    int upto = 0;                               // write position in output
    int next = 0;                               // read position in input
    int totalCompressedLine = 2 * bitsPerValue;
    int bitsRemaining = 64;
    LongVector inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
    next += 4;

    if (totalCompressedLine % 4 != 0) {
        // Undo the tail folding done by encode: expand the packed halves back
        // into the two scratch longs past the compressed data.
        input[totalCompressedLine] = (input[totalCompressedLine - 2] & LOW) << 32;
        input[totalCompressedLine + 1] = (input[totalCompressedLine - 1] & LOW) << 32;
        input[totalCompressedLine - 2] &= HIGH;
        input[totalCompressedLine - 1] &= HIGH;
    }

    while (upto < 128) {
        if (bitsRemaining >= bitsPerValue) {
            // The next 4 values sit entirely within the current lanes.
            LongVector res = inputVector.and(MASK << (bitsRemaining - bitsPerValue))
                    .lanewise(VectorOperators.LSHR, bitsRemaining - bitsPerValue);
            bitsRemaining -= bitsPerValue;
            res.intoArray(output, upto);
            upto += 4;
        } else {
            // The values straddle a lane boundary: combine the low bits of the
            // current vector with the high bits of the next one.
            int bitDiff = bitsPerValue - bitsRemaining;
            LongVector res = inputVector.and(MASK >> bitDiff)
                    .lanewise(VectorOperators.LSHL, bitDiff);

            inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
            next += 4;
            LongVector temp = inputVector.and((MASK >> bitsRemaining) << (64 - bitDiff));
            res = res.or(temp.lanewise(VectorOperators.LSHR, 64 - bitDiff));
            res.intoArray(output, upto);
            upto += 4;
            bitsRemaining += 64 - bitsPerValue;
        }

        if (bitsRemaining == 0 && upto < 128) { // guard: don't read past the last block
            inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
            next += 4;
            bitsRemaining = 64;
        }
    }
}

I will proceed with testing the scalar version and decoding. Once everything is prepared, I will submit a pull request this week.

@uschindler
Contributor

Hi, I just noticed that forUtil.gradle has a bug in Lucene 9: it does not regenerate anything, because the actual Python script is never executed.

I will provide a separate fix in a minute. Interestingly, the script does not execute correctly:

> Task :lucene:core:generateForUtilInternal FAILED
Caching disabled for task ':lucene:core:generateForUtilInternal' because:
  Build cache is disabled
Task ':lucene:core:generateForUtilInternal' is not up-to-date because:
  Task has failed previously.
External tool 'python3' resolved to: C:/cygwin/bin/python3.6m.exe
Starting process 'command 'C:/cygwin/bin/python3.6m.exe''. Working directory: C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90 Command: C:/cygwin/bin/python3.6m.exe C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py
Successfully started process 'command 'C:/cygwin/bin/python3.6m.exe''
Traceback (most recent call last):
  File "C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py", line 438, in <module>
    writeDecode(i, f)
  File "C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py", line 362, in writeDecode
    writeRemainder(bpv, next_primitive, shift + bpv, o, 128/num_values_per_long - o, f)
  File "C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py", line 320, in writeRemainder
    for i in range(num_values):
TypeError: 'float' object cannot be interpreted as an integer

That's not good!

@uschindler
Contributor

uschindler commented Jul 3, 2023

Here is the bugfix for the generator: #12411

This bug caused a lot of trouble here because I did not understand why the script wasn't executed at all.

@uschindler
Contributor

Hi,
the bug in the generator is fixed. I was able to include the ForUtil generator in a local branch and get everything working.

What I am wondering about: the generated PanamaForUtil is immensely huge (258 KiB), and it looks like only encoding is vectorized, while decoding looks like a very old version of the code; it explodes somehow. Where did you get that code from? The vectorized encode seems to work; the indexes and tests are working.

I also noticed a small problem with the new framework regarding package-private caller classes. I will think about a good solution.

Anyway, I will make the branch available as a draft PR so you can see how it was done. It was a bit more work because ForUtil is used in many places and each call to PForUtil needs a ForUtil instance. I refactored this and was able to make it work.

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

What I am wondering about: the generated PanamaForUtil is immensely huge (258 KiB), and it looks like only encoding is vectorized, while decoding looks like a very old version of the code; it explodes somehow. Where did you get that code from? The vectorized encode seems to work; the indexes and tests are working.

I have currently only provided a vectorized version for encoding, while keeping decoding unchanged, because I wanted to make sure the encoding part works before implementing a compressed version for decoding. However, based on @jpountz's response, I am now trying a more concise version (with a different compression format but the same space usage). I will update the Python script with this version later.

uschindler added a commit to uschindler/lucene that referenced this issue Jul 3, 2023
…9.0 codec (only encode!). It should show as example how it works, this is not ready for productions, although all tests pass
@uschindler
Contributor

Here is an example of what I did with your code: #12412

It should demonstrate how I included it. Most interesting is the cleanup inside the lucene90 codec. I also defined the interface very quickly, but I did not actually check that everything in it is used. I defined some of the constants in the interface (BLOCK_SIZE), too.

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

Thank you so much! I will take a look at this PR tomorrow. It's getting late for me now, so I need to go to sleep. 😊

@ChrisHegarty
Contributor

ChrisHegarty commented Jul 4, 2023

I would like to pause a little, double check where we're going, and reset if needed. It's become clear to me that we're not quite aligned.

The main issue I see is with the interface. All the code I've seen so far is using long[], and consequently LongVector in the Panama implementation. After chatting a little with @jpountz offline, and re-reading previous comments, it's clear that this is not what we want.

What we want is int[] and IntVector. Using longs in the new implementation is wasteful, in terms of memory but also (and more importantly) vector lanes. My initial experiments with ints are showing even bigger speedups, approx 4x (but I've not implemented it all yet).

If we're in agreement with the above, then I'd like to propose that we go back to basics: implement all the various bitsPerValue variants for the Int128Vector species. This will effectively be a port of the non-masking pack/unpack from https://github.com/lemire/simdcomp/blob/master/src/simdbitpacking.c. We can then compare these to the existing ForUtil implementations and tweak the actual code as necessary. I've started this in this repo: https://github.com/ChrisHegarty/bitpacking (a pure port for now; we can analyse the benchmarks and tweak the code once complete).
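
The unpack side of such a port looks roughly like this sketch for bitsPerValue = 4 (assuming simdcomp's vertical layout, where 8 values are stacked per 32-bit lane; not the actual bitpacking repo code):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

static void unpack4(int[] packed, int[] out) {
  int outOff = 0;
  int mask = (1 << 4) - 1;
  for (int i = 0; i < 16; i += SPECIES.length()) { // 128 values * 4 bits = 16 ints
    IntVector in = IntVector.fromArray(SPECIES, packed, i);
    for (int shift = 0; shift < 32; shift += 4) {  // peel 8 nibbles off each lane
      in.lanewise(VectorOperators.LSHR, shift).and(mask).intoArray(out, outOff);
      outOff += SPECIES.length();
    }
  }
}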

Clearly, there is a lot more to do: a scalar implementation of the new format will be required (also with int[]), and we'll need to benchmark it too; the calling code, including PForUtil, etc., will need to be updated to use int[]; we should provide 256- and 512-bit species variants; and we should consider what (if any) source generation makes sense. But that should be relatively straightforward, so let's put it aside until we get the full 1-32 encode/decode implemented for int[] and Int128Vector (which I'll continue to do in bitpacking).

Have I missed something? What do you think?

@uschindler
Copy link
Contributor

uschindler commented Jul 4, 2023

I agree. There are more complications: DataInput does not have a read method for int[], only ones for float[] and long[]. So changing this is a bigger task. I tend to think that we should take some time to try it, and possibly delay it until Lucene 10.0.

This was the reason why I said: let's keep the index format as is and reimplement the current format with vectors, but also invent a new one, with new int-based methods in the IO layer, for later.

@tang-hi
Contributor Author

tang-hi commented Jul 5, 2023

Switching from LongVector to IntVector is feasible, especially for 128-bit, as only a few modifications are needed. Special handling may be required for 256-bit and 512-bit, but it shouldn't be a big issue. Additionally, when the output is also int[], the handling becomes simpler. For example, with long[] and bpv = 3, we would need to store the values in six longs, which cannot be evenly divided by four; with int[], this problem doesn't exist.

Currently, I am more concerned about the scalar code. Yesterday, I tested it and found that the original scalar decode code was amazingly fast, almost on par with the SIMD speed at 50 ops/us. On the other hand, the scalar code using the new compression format only achieved around 10+ ops/us. I would like to know if we can tolerate some performance loss in the scalar code if we switch the compression format. What is the minimum performance threshold we can accept?

@jpountz
Contributor

jpountz commented Jul 5, 2023

I would like to know if we can tolerate some performance loss in the scalar code if we switch the compression format. What is the minimum performance threshold we can accept?

To me, the numbers that matter are the end-to-end search latencies rather than pure decoding speed, e.g. the same benchmark as @mikemccand's nightlies. I'd be good with this change if there is a noticeable speedup with the new vector code and the scalar code only degrades performance by a few percent. (It would be very easy to test this by plugging the new ForUtil into a new codec; I can help with this if needed.)

@tang-hi tang-hi linked a pull request Jul 5, 2023 that will close this issue
@tang-hi
Contributor Author

tang-hi commented Jul 5, 2023

I have implemented an int version of the scalar and vectorized ForUtil and submitted a draft PR as a simple starting point for those interested in this issue. Even if it is not ultimately merged, it can still be useful.

Here is my current progress:

  1. Implemented the decode and encode functions for vectorization.
  2. Implemented the decode and encode functions for scalars, but they are not well implemented.
  3. Changed the function interface to int[].

However, I am not sure why many of the tests are failing, even though the tests for PForUtil and ForUtil are passing. I will take a closer look at the specific reasons when I have time. Could changing certain members from long to int affect correctness? I am not completely familiar with the index structure of Lucene, so please let me know if you discover anything. It is also possible that there are issues with my code implementation.
#12417

@tang-hi
Contributor Author

tang-hi commented Jul 5, 2023

When using the int type, there is a significant performance improvement compared to the long type, approximately 2-3 times. You can refer to the link for more details. The performance of decoding and encoding is similar, except that the original scalar decoding code already runs at around 50 ops/us, while the vectorized version is more than twice as fast.

@uschindler
Contributor

uschindler commented Jul 5, 2023

I agree. There are more complications: DataInput does not have a read method for int[], only ones for float[] and long[]. So changing this is a bigger task.

I just noticed: we already have readInts() in DataInput, and it is also implemented in MemorySegmentIndexInput.
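
So a decoder can bulk-read straight into an int[]; a minimal sketch (assuming the signature mirrors readLongs()/readFloats()):

import java.io.IOException;
import org.apache.lucene.store.DataInput;

static int[] readBlock(DataInput in) throws IOException {
  int[] block = new int[128];
  in.readInts(block, 0, block.length); // no long[] detour
  return block;
}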

@uschindler
Contributor

However, I am not sure why many of the tests are failing, even though the tests for PForUtil and ForUtil are passing. I will take a closer look at the specific reasons when I have time. Could changing certain members from long to int affect correctness?

This could be because of different memory sizes: Lucene calculates the memory usage of the codecs' on-heap data, and that may now be a bit wrong, as you have not adapted the ramBytesUsed() methods in some of the classes.
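
A hypothetical illustration of that accounting concern (RamUsageEstimator is Lucene's real utility; the class is made up):

import org.apache.lucene.util.RamUsageEstimator;

final class BlockBuffer {
  private final int[] buffer = new int[128]; // was long[128] before the int switch

  long ramBytesUsed() {
    return RamUsageEstimator.sizeOf(buffer); // ~528 bytes now vs ~1040 for long[128]
  }
}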

Without knowing which tests failed, I can't give a good guess. I will check it out later.

@uschindler uschindler linked a pull request Jul 5, 2023 that will close this issue