
Make ForUtil Vectorized #12396

Open · tang-hi opened this issue Jun 25, 2023 · 53 comments · May be fixed by #12417

Comments

@tang-hi
Contributor

tang-hi commented Jun 25, 2023

Description

Since the introduction of the Vector API into Lucene via #12311, I have found it to be an interesting tool, so I attempted to use it to rewrite ForUtil.java. Initially, I ported simdcomp and implemented pack and unpack for bitsPerValue values less than or equal to 8. The results of my efforts can be seen below.
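
As a rough illustration of what such a port looks like, here is a minimal sketch of the simdcomp-style "vertical" pack for bitsPerValue = 4 using 128-bit int lanes. It assumes the values already fit in 4 bits (simdcomp's non-masking variant) and is not the exact code I benchmarked:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

static void pack4(int[] values, int[] packed) {
  int pOff = 0;
  for (int i = 0; i < 128; ) {                    // 128 values -> 16 packed ints
    IntVector acc = IntVector.zero(SPECIES);
    for (int shift = 0; shift < 32; shift += 4) { // 8 values stacked per 32-bit lane
      acc = acc.or(IntVector.fromArray(SPECIES, values, i)
          .lanewise(VectorOperators.LSHL, shift));
      i += SPECIES.length();
    }
    acc.intoArray(packed, pOff);
    pOff += SPECIES.length();
  }
}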

Benchmark Mode Cnt Score Error Units
VectorizeBenchmark.decode1 thrpt 15 58293945.447 ± 1524549.587 ops/s
VectorizeBenchmark.decode2 thrpt 15 55598229.538 ± 4516920.135 ops/s
VectorizeBenchmark.decode3 thrpt 15 57163871.965 ± 1566246.490 ops/s
VectorizeBenchmark.decode4 thrpt 15 55128874.528 ± 4752397.170 ops/s
VectorizeBenchmark.decode5 thrpt 15 53822335.729 ± 4599217.489 ops/s
VectorizeBenchmark.decode6 thrpt 15 48155246.120 ± 7519360.551 ops/s
VectorizeBenchmark.decode7 thrpt 15 50253799.192 ± 820075.648 ops/s
VectorizeBenchmark.decode8 thrpt 15 68849728.856 ± 1818468.973 ops/s
VectorizeBenchmark.encode1 thrpt 15 33998510.772 ± 2924618.992 ops/s
VectorizeBenchmark.encode2 thrpt 15 43238190.552 ± 810373.966 ops/s
VectorizeBenchmark.encode3 thrpt 15 36613553.485 ± 483115.838 ops/s
VectorizeBenchmark.encode4 thrpt 15 45675726.831 ± 1081153.655 ops/s
VectorizeBenchmark.encode5 thrpt 15 33591855.278 ± 1084009.112 ops/s
VectorizeBenchmark.encode6 thrpt 15 36110726.127 ± 767075.709 ops/s
VectorizeBenchmark.encode7 thrpt 15 34754339.379 ± 275025.123 ops/s
VectorizeBenchmark.encode8 thrpt 15 55075742.358 ± 991165.320 ops/s
VectorizeBenchmark.vectorizedDecode1 thrpt 15 43878020.796 ± 7148545.623 ops/s
VectorizeBenchmark.vectorizedDecode2 thrpt 15 103091446.773 ± 44115190.011 ops/s
VectorizeBenchmark.vectorizedDecode3 thrpt 15 83168059.373 ± 24930903.852 ops/s
VectorizeBenchmark.vectorizedDecode4 thrpt 15 63156089.355 ± 15039408.293 ops/s
VectorizeBenchmark.vectorizedDecode5 thrpt 15 96567546.695 ± 37142784.493 ops/s
VectorizeBenchmark.vectorizedDecode6 thrpt 15 73897063.180 ± 11549757.437 ops/s
VectorizeBenchmark.vectorizedDecode7 thrpt 15 79716185.567 ± 29990852.039 ops/s
VectorizeBenchmark.vectorizedDecode8 thrpt 15 92621676.617 ± 29702056.667 ops/s
VectorizeBenchmark.vectorizedEncode1 thrpt 15 51140300.852 ± 139758.385 ops/s
VectorizeBenchmark.vectorizedEncode2 thrpt 15 82646100.574 ± 1289600.954 ops/s
VectorizeBenchmark.vectorizedEncode3 thrpt 15 88124485.953 ± 742170.198 ops/s
VectorizeBenchmark.vectorizedEncode4 thrpt 15 91029285.467 ± 5594858.437 ops/s
VectorizeBenchmark.vectorizedEncode5 thrpt 15 96843051.648 ± 8024430.836 ops/s
VectorizeBenchmark.vectorizedEncode6 thrpt 15 98596724.128 ± 10068466.227 ops/s
VectorizeBenchmark.vectorizedEncode7 thrpt 15 85885746.715 ± 6031740.563 ops/s
VectorizeBenchmark.vectorizedEncode8 thrpt 15 117139889.194 ± 8721517.095 ops/s

However, I noticed that the compression format used in ForUtil.java is different: it employs some tricks to speed up the process, such as a layout designed to trigger SIMD auto-vectorization. Therefore, I attempted to vectorize it while maintaining the compression format. The results can be seen below.

Benchmark Mode Cnt Score Error Units
Benchmark.encode1 thrpt 15 38017254.040 ± 3905466.628 ops/s
Benchmark.encode2 thrpt 15 45170109.395 ± 1539203.478 ops/s
Benchmark.encode3 thrpt 15 38757256.653 ± 1044709.221 ops/s
Benchmark.encode4 thrpt 15 49307206.891 ± 799168.007 ops/s
Benchmark.encode5 thrpt 15 35130626.548 ± 792210.817 ops/s
Benchmark.encode6 thrpt 15 38326892.073 ± 981865.963 ops/s
Benchmark.encode7 thrpt 15 37372342.721 ± 1177478.683 ops/s
Benchmark.encode8 thrpt 15 60757390.416 ± 458876.638 ops/s
Benchmark.vectorizedEncode1 thrpt 15 56413094.655 ± 435917.201 ops/s
Benchmark.vectorizedEncode2 thrpt 15 88770400.646 ± 11183716.176 ops/s
Benchmark.vectorizedEncode3 thrpt 15 39932842.378 ± 2366921.921 ops/s
Benchmark.vectorizedEncode4 thrpt 15 85888128.739 ± 5499354.172 ops/s
Benchmark.vectorizedEncode5 thrpt 15 34402027.732 ± 1414839.159 ops/s
Benchmark.vectorizedEncode6 thrpt 15 35794303.501 ± 782940.005 ops/s
Benchmark.vectorizedEncode7 thrpt 15 33845690.180 ± 2586648.353 ops/s
Benchmark.vectorizedEncode8 thrpt 15 97914288.675 ± 8971857.035 ops/s

I have only implemented bitsPerValue values of 1, 2, 4, and 8. I am curious whether it is possible to change the compression format. Additionally, do you have any best practices for integrating vectorized code into Lucene? Any suggestions would be appreciated.

Currently, I am working in my own repo. However, the code is still in a rough state and lacks documentation.

@msokolov
Contributor

I think it would be helpful if you could present the results with fewer "significant" digits; as it is, they are hard to interpret. And if you are considering changing the index format for higher speed, you should also indicate the relative size change. I expect the SIMD-optimized version you are testing doesn't do the same kind of compression that Lucene does? And this would have a significant impact on a whole system due to increased disk storage, memory usage, and paging.

@tang-hi
Contributor Author

tang-hi commented Jun 26, 2023

I think it would be helpful if you could present the results with fewer "significant" digits; as it is, they are hard to interpret. And if you are considering changing the index format for higher speed, you should also indicate the relative size change. I expect the SIMD-optimized version you are testing doesn't do the same kind of compression that Lucene does? And this would have a significant impact on a whole system due to increased disk storage, memory usage, and paging.

As of now, I am striving to maintain the previously established compression format. Once I'm done, I'll present the results in a clearer, easier-to-understand manner.

@uschindler
Contributor

Hi, this is a good idea; next to PackedInts, ForUtil is a good candidate for SIMD.

If you have something implemented, open a PR, and we will think of the best way to include it with what is currently there. Ideally everything should be implemented in VectorUtilProvider, but we may rename this class to be more generic. I would really keep only one provider interface to make it more easily pluggable.

@uschindler
Contributor

In my opinion, the best approach would be to proceed here without integrating into Lucene, and I can have a look at the code. The provider interfaces in Lucene are still package-private, so there's the possibility to make some changes. As said before, for all vector operations there should only be one provider class with basically two implementations in Lucene (VectorUtilDefaultProvider and VectorUtilPanamaProvider, one for each Java version).

Actually, if the code is nicely isolated from the rest of Lucene's code, it should not be too complicated to include it. I will take care of that.

@jpountz
Contributor

jpountz commented Jun 28, 2023

Thanks for looking into this! For reference, I've been separately looking into whether we could vectorize prefix sums, which are one bottleneck of postings decoding today: we managed to get auto-vectorization to help with bit unpacking, but not with computing a prefix sum of doc IDs: https://github.com/jpountz/vectorized-prefix-sum. Unfortunately, Java's vector API doesn't offer a way to perform a shift across lanes efficiently (e.g. IntVector#unslice doesn't compile to vpslldq with SPECIES_128), which is the traditional way prefix sums are vectorized. I've looked into other approaches, but they're all slower than a scalar prefix sum. @ChrisHegarty looked into it and had a few interesting observations; in particular, vectorized prefix sums suddenly become several times faster when bounds checks are disabled: https://github.com/jpountz/vectorized-prefix-sum/issues. So there might be room for improvement in the JDK that would eventually allow us to take advantage of vectorization for prefix sums?
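
For reference, the "shift across lanes" idiom mentioned above looks roughly like the following sketch with the Panama API (names are illustrative; this is the very pattern that currently fails to compile down to vpslldq, so it benchmarks slower than it should):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

/** In-place inclusive prefix sum over deltas[0..len), len a multiple of 4. */
static void prefixSum(int[] deltas, int len) {
  int carry = 0; // running sum of all previous blocks
  for (int i = 0; i < len; i += SPECIES.length()) {
    IntVector v = IntVector.fromArray(SPECIES, deltas, i);
    // log2(lanes) steps; unslice(k) shifts lanes up by k positions, zero-filling.
    v = v.add(v.unslice(1));
    v = v.add(v.unslice(2));
    v = v.add(carry);
    v.intoArray(deltas, i);
    carry = v.lane(SPECIES.length() - 1); // carry into the next block
  }
}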

@rmuir
Member

rmuir commented Jun 28, 2023

crazy question: do we really need a vectorized prefix sum for the postings list? Could we just decode the deltas, and lazily defer computation of the accumulated doc ID sum until it's needed, e.g. in the PostingsEnum's next() method?

I guess I'm just unsure why we need to delta-decode the entire int[] up-front, when the access is all DAAT. Maybe it's a stupid idea, but it's something we could try out independently of any vectorization.
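
A toy sketch of that lazy idea (illustrative names only, not Lucene's actual PostingsEnum internals):

final class LazyDeltaBlock {
  private final int[] deltas; // bit-unpacked deltas, prefix sum not applied
  private int upto;
  private int doc;

  LazyDeltaBlock(int[] deltas, int docBase) {
    this.deltas = deltas;
    this.doc = docBase;
  }

  int nextDoc() {
    return doc += deltas[upto++]; // one add per call instead of 128 up front
  }
}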

@tang-hi
Contributor Author

tang-hi commented Jun 28, 2023

I have successfully implemented all encode methods in ForUtil while keeping the compression format unchanged. Here are the results.

Benchmark Mode Cnt Score (ops/s) Error (ops/s)
Encode1 thrpt 15 30.81M 7.76M
vectorizedEncode1 thrpt 15 52.11M 9.15M
Encode2 thrpt 15 44.97M 972.53K
vectorizedEncode2 thrpt 15 84.76M 11.56M
Encode3 thrpt 15 39.16M 389.41K
vectorizedEncode3 thrpt 15 49.50M 319.99K
Encode4 thrpt 15 49.35M 895.58K
vectorizedEncode4 thrpt 15 84.05M 1.10M
Encode5 thrpt 15 35.39M 1.76M
vectorizedEncode5 thrpt 15 39.75M 2.49M
Encode6 thrpt 15 38.26M 1.16M
vectorizedEncode6 thrpt 15 52.07M 2.29M
Encode7 thrpt 15 37.20M 556.95K
vectorizedEncode7 thrpt 15 48.99M 1.34M
Encode8 thrpt 15 59.05M 1.74M
vectorizedEncode8 thrpt 15 95.48M 7.30M
Encode9 thrpt 15 24.64M 135.16K
vectorizedEncode9 thrpt 15 27.45M 1.29M
Encode10 thrpt 15 24.94M 358.51K
vectorizedEncode10 thrpt 15 27.22M 2.40M
Encode11 thrpt 15 25.11M 819.30K
vectorizedEncode11 thrpt 15 23.89M 1.28M
Encode12 thrpt 15 29.45M 490.67K
vectorizedEncode12 thrpt 15 41.90M 2.30M
Encode13 thrpt 15 24.38M 226.59K
vectorizedEncode13 thrpt 15 24.83M 1.39M
Encode14 thrpt 15 27.44M 282.45K
vectorizedEncode14 thrpt 15 36.28M 3.62M
Encode15 thrpt 15 27.64M 504.77K
vectorizedEncode15 thrpt 15 39.15M 2.88M
Encode16 thrpt 15 55.31M 4.87M
vectorizedEncode16 thrpt 15 84.89M 7.14M
Encode17 thrpt 15 16.27M 152.66K
vectorizedEncode17 thrpt 15 15.04M 740.80K
Encode18 thrpt 15 15.45M 69.44K
vectorizedEncode18 thrpt 15 15.49M 654.85K
Encode19 thrpt 15 12.67M 119.34K
vectorizedEncode19 thrpt 15 13.17M 539.82K
Encode20 thrpt 15 15.90M 184.34K
vectorizedEncode20 thrpt 15 16.03M 1.29M
Encode21 thrpt 15 15.95M 87.33K
vectorizedEncode21 thrpt 15 15.01M 274.69K
Encode22 thrpt 15 16.04M 127.17K
vectorizedEncode22 thrpt 15 15.34M 404.39K
Encode23 thrpt 15 15.62M 180.99K
vectorizedEncode23 thrpt 15 15.11M 615.33K
Encode24 thrpt 15 17.79M 673.54K
vectorizedEncode24 thrpt 15 22.53M 2.13M
Encode25 thrpt 15 15.42M 203.42K
vectorizedEncode25 thrpt 15 13.45M 1.34M
Encode26 thrpt 15 15.20M 187.19K
vectorizedEncode26 thrpt 15 14.91M 402.92K
Encode27 thrpt 15 14.98M 81.71K
vectorizedEncode27 thrpt 15 14.71M 740.17K
Encode28 thrpt 15 17.84M 166.66K
vectorizedEncode28 thrpt 15 20.75M 836.14K
Encode29 thrpt 15 14.66M 177.74K
vectorizedEncode29 thrpt 15 13.89M 392.35
Encode30 thrpt 15 17.43M 462.43K
vectorizedEncode30 thrpt 15 23.42M 2.45M
Encode31 thrpt 15 18.49M 280.37K
vectorizedEncode31 thrpt 15 26.03M 1.40M
Encode32 thrpt 15 50.39M 1.99M
vectorizedEncode32 thrpt 15 38.64M 51.39M

While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress long values (64-bit), leaving only two lanes per vector. To address this, it may be worthwhile to check whether the user's host supports 256-bit vectors; if not, scalar code can be used for compression. Secondly, the code can become quite large due to the need for specific code for each bitsPerValue. As a result, it is comparable to writing assembly code, requiring approximately 6000+ lines.
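
The runtime check could be as simple as this sketch (SPECIES_PREFERRED and vectorBitSize() are real Panama API; the field name is illustrative):

import jdk.incubator.vector.LongVector;

static final boolean USE_VECTORIZED_FORUTIL =
    LongVector.SPECIES_PREFERRED.vectorBitSize() >= 256; // fall back to scalar below 256 bits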

I would welcome opinions on these two drawbacks. If they are deemed acceptable, I intend to submit a draft PR this weekend.

@uschindler
Contributor

uschindler commented Jun 28, 2023

Hi,
if you look at the first line of ForUtil.java, you will see the following comment:

// This file has been automatically generated, DO NOT EDIT

Actually, the class is generated by a Python script and placed in several folders (also for the backwards codecs): https://github.com/apache/lucene/blob/main/gradle/generation/forUtil.gradle

So there should be a similar way to regenerate the vectorized variants (possibly also for multiple Java versions like 20, 21, and 22, if those differ). Of course, it is a bit hard to implement this autogen setup with the current multi-release JAR setup, so there's a bit more work needed. I am happy to help or take over here, but give me a few days to think about it. I don't want to rush anything.

Also, the backwards-codecs JAR file has no multi-release support and the whole Panama lookup won't work there, so I would suggest one of the following:

  • For the backwards compatibility codec we keep the code as is.
  • Or move ForUtil to Lucene core's util package and make it public.

Both impls look the same, just in different packages, so I tend to move them to Lucene Core and include them in VectorUtil. But maybe there's a difference in byte order. I have some ideas, but we can hide it from public code using the provider interfaces we already have to look up the instances.

If the implementation in 9.0 and the one for the 8.4 backwards compatibility really differ, I would only implement a generator for the 9.0 version.

@uschindler
Contributor

Actually, if you could write a python3 script like gen_vectorized20_ForUtil.py (for Java 20) that generates your VectorizedForUtil, I could check how to include it with the current build system.

@uschindler
Contributor

You can take the current python script as "basis" and work from there.

@tang-hi
Contributor Author

tang-hi commented Jun 28, 2023

You can take the current python script as "basis" and work from there.
Great, I will give it a try! 😄

@uschindler
Contributor

To implement more classes like this with vectorization, I will try to add an additional abstraction layer (moving the current abstraction to VectorProvider) and allow it to return instances of several interfaces (VectorUtilProvider, ForUtilProvider, XxxxProvider).

I still need to find a way around the package-private ForUtil so it can stay at its current place in the codecs. But there are solutions; I have to think a bit more. I will also check whether the 8.4 and 9.0 versions differ at all.

@uschindler
Contributor

The above would be a separate PR to clean up the ad-hoc internal implementation of the Panama integration a bit. The implementation developed here could then be added to the new Lucene framework in a separate PR.

@rmuir
Member

rmuir commented Jun 28, 2023

While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress long values (64-bit).

But this is just a current hack, from my understanding: a hack to try to provoke auto-vectorization. The things we are decoding are really 32-bit integers.

@jpountz
Contributor

jpountz commented Jun 28, 2023

Robert is right; let's not try to keep the current format, it's a hack. If you can make incompatible changes that perform better, we can work out the backward-compatibility layer afterwards.

@ChrisHegarty
Contributor

If I'm not mistaken, the specific interface required here to abstract out ForUtil would look something like this?

interface ForUtilXXX {
  /** Number of bytes required to encode 128 integers of {@code bitsPerValue} bits per value. */
  int numBytes(int bitsPerValue);

  /** Encodes integers from {@code values} into {@code out}. */
  void encode(int[] values, int bitsPerValue, DataOutput out) throws IOException;

  /** Decodes integers into {@code values}. */
  void decode(int bitsPerValue, DataInput in, int[] values) throws IOException;
}

This interface does not need to be shared outside of its package; both the default and Panama implementations can reside in the same package. Maybe we need this interface for different codec versions? If so, it could be copied; otherwise we need to find a "sharable" package.

We currently have TestSecrets, which supports cross-package access, but even a more general, non-test-specific o.a.l.internal.SharedSecrets is not sufficient here, since we need to share both the vector lookup code and (maybe?) the ForUtilXXX interface. The most straightforward, but less defensive, way I see is to double down on org.apache.lucene.internal as the location to house such things that are Lucene-internal and not to be used by external developers. While not a very novel approach, it is well understood, commonly used, and adds the least amount of friction to the code.

@uschindler
Contributor

uschindler commented Jul 1, 2023

Hey, for now please let me do the integration. I already have a plan, because it won't work with the current version. I have a half-baked version ready; I am just waiting for input. I tend to split the classes a bit. When you implemented VectorUtilProvider, it was not really a provider; it was the implementation.

My idea is: have a VectorizationProvider with the current code that just has a getter, VectorUtilImpl getVectorUtilImpl(), plus more getters for the other classes. The provider interface could also be used to get instances of classes with state (like a full FOR decoder).

If we want to have several impls, we can add VectorizationProvider.getLucene96ForUtil(), returning a version for that codec version. But in reality I would remove the vectorized ones from the backwards codecs.

At the moment I am thinking of having all that code in a subpackage of util: org.apache.lucene.util.vectorization (you had something similar at the beginning). We can hide that package in module-info, but classpath applications would see it, so the interfaces need javadocs.

We have no secrets at the moment, so the implementation can only be inside the utils package. If there are different versions per codec, we would only implement the latest version with vectorization. The older codecs won't get vectorized.

@uschindler
Contributor

Once we have some code ready here, I will submit a PR with the refactoring as preparation.

@tang-hi
Contributor Author

tang-hi commented Jul 1, 2023

Once we have some code ready here, I will submit a PR with the refactoring as preparation.

I've had some things to take care of over the past few days 😅, so I haven't made much progress. I'll provide a specific implementation in the next few days.

@ChrisHegarty
Contributor

@uschindler Ok, thanks. Sounds like you have this under control. And what you say makes sense.

Did the small sketch of the ForUtil interface I provided above make sense? Is that the level where we’re thinking that this will be abstracted? This is key to understanding how much flexibility we have in the underlying implementations.

@uschindler
Contributor

I like the idea of putting everything under internal; we have that package already.

But I would keep the implementations all under one package and let codecs use them (a specific impl per codec is OK, just different class names/getters in the provider interface).

Let me provide the general setup later this weekend or at the beginning of the week.

@ChrisHegarty
Contributor

Thanks @uschindler, no great urgency on this part. As you have said earlier, work on the actual vector implementation can proceed independently of the general infrastructure - we can just copy the bits that we need for now, then refactor them out later.

@ChrisHegarty
Contributor

@tang-hi thanks for the update. I have some time later this weekend; I'll see if I can get this started and take a look at your code.

@uschindler
Contributor

I had some ideas in mind where the general lookup part just makes sure there is a Panama impl available and then delegates to the Panama provider to decide which implementation to use. I plan to open an issue on the JDK for 22 to have some way to detect which features are available (like FMA). That way we can also let the Panama provider decide, based on (future) feature detection and what the JDK reports back, which implementation to return (like a full FMA impl of our current vector util).

We should really start working on asking the Panama OpenJDK community for some information about which features are available. The current state of not knowing anything is far from useful.

@tang-hi
Contributor Author

tang-hi commented Jul 2, 2023

@uschindler
Here is the draft version of gen_vectorized20_ForUtil.py. I plan to enhance the code quality and add some comments in the near future.

@uschindler
Contributor

uschindler commented Jul 2, 2023

Great! I am in weekend mode, but I have a rewrite of the provider lookup almost ready.

I will check how to include it.

@tang-hi
Contributor Author

tang-hi commented Jul 2, 2023

@uschindler Here is the draft version of gen_vectorized20_ForUtil.py. I plan to enhance the code quality and add some comments in the near future.

I've updated this Python file, added some comments, and made it more comprehensible. Feel free to raise any questions you have. 😊

@uschindler
Contributor

Here is my PR with the general setup to easily plug in different interfaces supporting vectorization in other places (not only VectorUtil). It uses a non-exported package and extra checks to prevent misuse from downstream code: #12410

Before working on this, I would like that PR to be merged first.

@uschindler
Contributor

P.S.: I checked: ForUtil in 9.x is different from 8.x because of the byte order and some other slight changes. So I would only vectorize the core one (9.x) and leave out the backwards-compatibility impl.

@uschindler
Contributor

While vectorized code can be advantageous, there are a few drawbacks to consider. Firstly, if the vector bit size is 128 bits, it may be slower than scalar code because we compress long values (64-bit).

But this is just a current hack, from my understanding: a hack to try to provoke auto-vectorization. The things we are decoding are really 32-bit integers.

I agree with that. The problem is that we need the Panama-vectorized implementation and the scalar one to behave identically, otherwise indexes would not be compatible.

In the future, when we have vectorized everything, hopefully with Java 42, the answer to all questions, we can change the format :-)

@jpountz
Contributor

jpountz commented Jul 3, 2023

What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get somewhat worse performance compared to today, but users who care about performance would most likely be able to enable it?

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get somewhat worse performance compared to today, but users who care about performance would most likely be able to enable it?

Perhaps I could implement both scalar and vectorized versions of different compression formats in order to assess the extent of the performance degradation in the scalar version and the performance enhancement in the vectorized version. We could then make a decision based on the results of the experiment.

@uschindler
Contributor

uschindler commented Jul 3, 2023

The framework to plug custom formats into org.apache.lucene.internal.vectorization was merged, so the stage is set to implement something.

Hints:

  • Both the scalar and the vectorized implementation need to be moved to that package and share a common interface.
  • A getter needs to be added to VectorizationProvider to return a singleton (if stateless) or an instance (if stateful like ForUtil, like newForUtil98() for the Lucene 9.8 codec).
  • The calling class that gets the provider needs to be added for caller sensitive checks to prevent misuse.
  • A test comparing both implementations with random data, extending BaseVectorizationTestCase, needs to be added. See TestVectorUtilSupport for an example.

I can mock it up for ForUtil (it looks simple, just a bit of copy-paste and defining an interface): replace new ForUtil() by defining a constant somewhere in the codec, like private static final VectorizationProvider PROVIDER = VectorizationProvider.getInstance(), and then call PROVIDER.newForUtil98().
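
In code, that wiring would look roughly like this (VectorizationProvider.getInstance() is the API described above; newForUtil98() and the class name are hypothetical until the PR exists):

final class Lucene98PostingsFormatSketch {
  private static final VectorizationProvider PROVIDER = VectorizationProvider.getInstance();

  // ForUtil is stateful, so each reader/writer gets its own instance.
  private final ForUtil forUtil = PROVIDER.newForUtil98();
}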

@ChrisHegarty
Contributor

++ Let's use the format that vectorizes best. And yes, we will need a scalar implementation of that too, but to get started let's assume that the format is changeable.

@uschindler
Contributor

What about changing the format now, as part of this change? This would mean that users who do not enable the Panama vector API would get somewhat worse performance compared to today, but users who care about performance would most likely be able to enable it?

We could start by going back to the original Lucene 5.0 format. Plain and simple... It is still in backwards-codecs.

@uschindler
Contributor

I am also not sure that we really need to unroll all those loops and have different impls for each size.

@uschindler
Contributor

One small suggestion: please stay with the classical switch statement, as backporting the Python code and Java code to Java 11 is getting harder (please remember: the vectorized code is also compiled with the Java 11 compiler in the 9.x branch!!!). I just noticed this in @tang-hi's code.
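
For illustration, the dispatch would stay in the classical statement form, which older javac releases accept (the decodeN names are hypothetical stand-ins for the generated per-bpv methods):

static void decodeAny(int bitsPerValue, long[] in, long[] out) {
  switch (bitsPerValue) { // classical switch statement, no arrow-form expression
    case 1:
      decode1(in, out);
      break;
    case 2:
      decode2(in, out);
      break;
    default:
      decodeSlow(bitsPerValue, in, out);
      break;
  }
}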

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

Thank you for your suggestion! I will take note of it.

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

I believe the Lucene 5.0 format may not be appropriate because it rounds up bitsPerValue; for example, it uses 8 bits to compress a 3-bit value, resulting in larger index files. However, I have already implemented a vectorized version of a different compression format that maintains the same size. The outcome appears promising:

Benchmark Mode Cnt Score (ops/us) Error
Encode1 thrpt 15 37.023 5.060
VectorizedEncode1 thrpt 15 53.261 0.850
Encode2 thrpt 15 45.017 0.887
VectorizedEncode2 thrpt 15 56.111 2.234
Encode3 thrpt 15 38.750 0.521
VectorizedEncode3 thrpt 15 49.589 2.518
Encode4 thrpt 15 48.074 1.390
VectorizedEncode4 thrpt 15 55.654 2.580
Encode5 thrpt 15 35.517 0.815
VectorizedEncode5 thrpt 15 50.508 0.887
Encode6 thrpt 15 38.698 0.301
VectorizedEncode6 thrpt 15 48.876 0.926
Encode7 thrpt 15 36.697 0.942
VectorizedEncode7 thrpt 15 50.016 2.425
Encode8 thrpt 15 59.180 2.208
VectorizedEncode8 thrpt 15 54.895 0.497
Encode9 thrpt 15 24.215 0.670
VectorizedEncode9 thrpt 15 49.292 0.616
Encode10 thrpt 15 25.509 0.274
VectorizedEncode10 thrpt 15 46.777 0.762
Encode11 thrpt 15 25.165 0.635
VectorizedEncode11 thrpt 15 46.798 2.554
Encode12 thrpt 15 29.170 0.671
VectorizedEncode12 thrpt 15 47.331 0.994
Encode13 thrpt 15 23.749 1.126
VectorizedEncode13 thrpt 15 46.587 2.468
Encode14 thrpt 15 27.283 0.235
VectorizedEncode14 thrpt 15 44.704 0.805
Encode15 thrpt 15 27.459 1.035
VectorizedEncode15 thrpt 15 45.335 3.178
Encode16 thrpt 15 58.192 0.557
VectorizedEncode16 thrpt 15 52.698 0.918
Encode17 thrpt 15 16.265 0.168
VectorizedEncode17 thrpt 15 45.757 2.126
Encode18 thrpt 15 15.261 0.167
VectorizedEncode18 thrpt 15 44.386 0.807
Encode19 thrpt 15 12.531 0.138
VectorizedEncode19 thrpt 15 45.403 0.854
Encode20 thrpt 15 15.863 0.351
VectorizedEncode20 thrpt 15 42.607 3.698
Encode21 thrpt 15 15.772 0.154
VectorizedEncode21 thrpt 15 45.122 0.777
Encode22 thrpt 15 15.863 0.210
VectorizedEncode22 thrpt 15 42.802 1.240
Encode23 thrpt 15 15.638 0.095
VectorizedEncode23 thrpt 15 44.411 0.536
Encode24 thrpt 15 17.091 0.713
VectorizedEncode24 thrpt 15 42.151 2.151
Encode25 thrpt 15 15.206 0.163
VectorizedEncode25 thrpt 15 43.440 2.078
Encode26 thrpt 15 15.110 0.188
VectorizedEncode26 thrpt 15 40.758 0.416
Encode27 thrpt 15 14.794 0.192
VectorizedEncode27 thrpt 15 43.261 0.494
Encode28 thrpt 15 17.531 0.393
VectorizedEncode28 thrpt 15 41.578 0.838
Encode29 thrpt 15 14.423 0.173
VectorizedEncode29 thrpt 15 36.044 10.191
Encode30 thrpt 15 17.426 0.297
VectorizedEncode30 thrpt 15 40.087 0.791
Encode31 thrpt 15 18.489 0.180
VectorizedEncode31 thrpt 15 42.166 0.625
Encode32 thrpt 15 47.742 4.446
VectorizedEncode32 thrpt 15 54.260 1.806

The code is straightforward, as shown below:

import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;

// Assumed definitions from my repo, not shown in this snippet:
//   LANE4_SPECIES   = LongVector.SPECIES_256 (4 long lanes)
//   numEncodeLength = 4 (one vector's worth of longs)
//   LOW / HIGH      = 0x00000000FFFFFFFFL / 0xFFFFFFFF00000000L
// For odd bitsPerValue, both arrays need two longs of scratch space past the
// compressed data (indices totalCompressedLine and totalCompressedLine + 1).

public void encode(long[] values, int bitsPerValue, long[] output) {
    // Must be a long: an int mask silently breaks once bitsPerValue reaches 32.
    long MASK = (1L << bitsPerValue) - 1;

    int bitsRemaining = 64;                     // free bits left in each accumulator lane
    int upto = 0;                               // write position in output
    int totalCompressedLine = 2 * bitsPerValue; // 128 * bitsPerValue / 64 longs of output
    int next = 0;                               // read position in values

    LongVector input = LongVector.zero(LANE4_SPECIES);
    while (next < 128) {
        if (bitsRemaining >= bitsPerValue) {
            // The next 4 values fit entirely into the current accumulator lanes.
            input = input.or(LongVector.fromArray(LANE4_SPECIES, values, next).and(MASK)
                    .lanewise(VectorOperators.LSHL, bitsRemaining - bitsPerValue));
            bitsRemaining -= bitsPerValue;
        } else {
            // The values straddle a lane boundary: flush the filled lanes and
            // carry the spilled low bits into a fresh accumulator.
            LongVector valueVector = LongVector.fromArray(LANE4_SPECIES, values, next).and(MASK);
            input = input.or(valueVector.lanewise(VectorOperators.LSHR, bitsPerValue - bitsRemaining));
            input.intoArray(output, upto);
            upto += numEncodeLength;
            input = valueVector.lanewise(VectorOperators.LSHL, 64 - bitsPerValue + bitsRemaining);
            bitsRemaining += 64 - bitsPerValue;
        }

        if (bitsRemaining == 0) {
            input.intoArray(output, upto);
            upto += numEncodeLength;
            input = LongVector.zero(LANE4_SPECIES);
            bitsRemaining = 64;
        }
        next += 4;
    }

    if (totalCompressedLine % 4 != 0) {
        // The last partial vector spilled past the compressed length; fold the
        // two overflow longs back into the tail of the real output.
        input.intoArray(output, upto);
        output[totalCompressedLine - 2] |= (output[totalCompressedLine] >>> 32);
        output[totalCompressedLine - 1] |= (output[totalCompressedLine + 1] >>> 32);
    }
}

public void decode(int bitsPerValue, long[] input, long[] output) {
    long MASK = (1L << bitsPerValue) - 1; // long for the same reason as in encode

    int upto = 0;                               // write position in output
    int next = 0;                               // read position in input
    int totalCompressedLine = 2 * bitsPerValue;
    int bitsRemaining = 64;
    LongVector inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
    next += 4;

    if (totalCompressedLine % 4 != 0) {
        // Undo the tail folding done by encode: expand the packed halves back
        // into the two scratch longs past the compressed data.
        input[totalCompressedLine] = (input[totalCompressedLine - 2] & LOW) << 32;
        input[totalCompressedLine + 1] = (input[totalCompressedLine - 1] & LOW) << 32;
        input[totalCompressedLine - 2] &= HIGH;
        input[totalCompressedLine - 1] &= HIGH;
    }

    while (upto < 128) {
        if (bitsRemaining >= bitsPerValue) {
            // The next 4 values sit entirely within the current lanes.
            LongVector res = inputVector.and(MASK << (bitsRemaining - bitsPerValue))
                    .lanewise(VectorOperators.LSHR, bitsRemaining - bitsPerValue);
            bitsRemaining -= bitsPerValue;
            res.intoArray(output, upto);
            upto += 4;
        } else {
            // The values straddle a lane boundary: combine the low bits of the
            // current vector with the high bits of the next one.
            int bitDiff = bitsPerValue - bitsRemaining;
            LongVector res = inputVector.and(MASK >> bitDiff)
                    .lanewise(VectorOperators.LSHL, bitDiff);

            inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
            next += 4;
            LongVector temp = inputVector.and((MASK >> bitsRemaining) << (64 - bitDiff));
            res = res.or(temp.lanewise(VectorOperators.LSHR, 64 - bitDiff));
            res.intoArray(output, upto);
            upto += 4;
            bitsRemaining += 64 - bitsPerValue;
        }

        if (bitsRemaining == 0 && upto < 128) { // guard: don't read past the last block
            inputVector = LongVector.fromArray(LANE4_SPECIES, input, next);
            next += 4;
            bitsRemaining = 64;
        }
    }
}

I will proceed with testing the scalar version and decoding. Once everything is prepared, I will submit a pull request this week.

@uschindler
Contributor

Hi, I just noticed that forUtil.gradle has a bug in Lucene 9: it does not regenerate anything, because the actual Python script is never executed.

I will provide a separate fix in a minute. Interestingly, the script does not execute correctly:

> Task :lucene:core:generateForUtilInternal FAILED
Caching disabled for task ':lucene:core:generateForUtilInternal' because:
  Build cache is disabled
Task ':lucene:core:generateForUtilInternal' is not up-to-date because:
  Task has failed previously.
External tool 'python3' resolved to: C:/cygwin/bin/python3.6m.exe
Starting process 'command 'C:/cygwin/bin/python3.6m.exe''. Working directory: C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90 Command: C:/cygwin/bin/python3.6m.exe C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py
Successfully started process 'command 'C:/cygwin/bin/python3.6m.exe''
Traceback (most recent call last):
  File "C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py", line 438, in <module>
    writeDecode(i, f)
  File "C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py", line 362, in writeDecode
    writeRemainder(bpv, next_primitive, shift + bpv, o, 128/num_values_per_long - o, f)
  File "C:\Users\Uwe Schindler\Projects\lucene\lucene\lucene\core\src\java\org\apache\lucene\codecs\lucene90\gen_ForUtil.py", line 320, in writeRemainder
    for i in range(num_values):
TypeError: 'float' object cannot be interpreted as an integer

That's not good!

@uschindler
Contributor

uschindler commented Jul 3, 2023

Here is the bugfix for the generator: #12411

This bug caused a lot of trouble here because I did not understand why the script wasn't executed at all.

@uschindler
Contributor

Hi,
the bug in the generator is fixed. I was able to include the ForUtil generator in a local branch and get everything working.

What I am wondering about: the generated PanamaForUtil is immensely huge (258 KiB), and it looks like only encoding is vectorized, while decoding looks like a very old version of the code; it explodes somehow. Where did you get that code from? The vectorized encode seems to work; the indexes and tests are working.

I also noticed a small problem with the new framework regarding package-private caller classes. I will think about a good solution.

Anyway, I will make the branch available as a draft PR so you can see how it was done. It was a bit more work because ForUtil is used in many places and each call to PForUtil needs a ForUtil instance. I refactored this and was able to make it work.

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

What I am wondering about: the generated PanamaForUtil is immensely huge (258 KiB), and it looks like only encoding is vectorized, while decoding looks like a very old version of the code; it explodes somehow. Where did you get that code from? The vectorized encode seems to work; the indexes and tests are working.

I have currently only provided a vectorized version for encoding, while keeping decoding unchanged, because I wanted to make sure the encoding part works before implementing a compressed version for decoding. However, based on @jpountz's response, I am now trying a more concise version (with a different compression format but the same space usage). I will update the Python script with this version later.

uschindler added a commit to uschindler/lucene that referenced this issue Jul 3, 2023
…9.0 codec (only encode!). It should show as example how it works, this is not ready for productions, although all tests pass
@uschindler
Contributor

Here is an example of what I did with your code: #12412

It should demonstrate how I included it. Most interesting is the cleanup inside the lucene90 codec. I also defined the interface very quickly, but I did not actually check that everything in it is used. I defined some of the constants in the interface (BLOCK_SIZE), too.

@tang-hi
Contributor Author

tang-hi commented Jul 3, 2023

Thank you so much! I will take a look at this PR tomorrow. It's getting late for me now, so I need to go to sleep. 😊

@ChrisHegarty
Contributor

ChrisHegarty commented Jul 4, 2023

I would like to pause a little, double check where we're going, and reset if needed. It's become clear to me that we're not quite aligned.

The main issue I see is with the interface. All the code I've seen so far is using long[], and consequently LongVector in the Panama implementation. After chatting a little with @jpountz offline, and re-reading previous comments, it's clear that this is not what we want.

What we want is int[] and IntVector. Using longs in the new implementation is wasteful, in terms of memory but also (and more importantly) vector lanes. My initial experiments with ints are showing even bigger speedups, approx 4x (but I've not implemented it all yet).

If we're in agreement with the above, then I'd like to propose that we go back to basics: implement all the various bitsPerValue variants for the Int128Vector species. This will effectively be a port of the non-masking pack/unpack from https://github.com/lemire/simdcomp/blob/master/src/simdbitpacking.c. We can then compare these to the existing ForUtil implementations and tweak the actual code as necessary. I've started this in this repo: https://github.com/ChrisHegarty/bitpacking (a pure port for now; we can analyse the benchmarks and tweak the code once complete).
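
The unpack side of such a port looks roughly like this sketch for bitsPerValue = 4 (assuming simdcomp's vertical layout, where 8 values are stacked per 32-bit lane; not the actual bitpacking repo code):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

static void unpack4(int[] packed, int[] out) {
  int outOff = 0;
  int mask = (1 << 4) - 1;
  for (int i = 0; i < 16; i += SPECIES.length()) { // 128 values * 4 bits = 16 ints
    IntVector in = IntVector.fromArray(SPECIES, packed, i);
    for (int shift = 0; shift < 32; shift += 4) {  // peel 8 nibbles off each lane
      in.lanewise(VectorOperators.LSHR, shift).and(mask).intoArray(out, outOff);
      outOff += SPECIES.length();
    }
  }
}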

Clearly, there is a lot more to do: a scalar implementation of the new format will be required (also with int[]), and we'll need to benchmark it too; the calling code, including PForUtil, etc., will need to be updated to use int[]; we should provide 256- and 512-bit species variants; and we should consider what (if any) source generation makes sense. But that should be relatively straightforward, so let's put it aside until we get the full 1-32 encode/decode implemented for int[] and Int128Vector (which I'll continue to do in bitpacking).

Have I missed something? What do you think?

@uschindler
Copy link
Contributor

uschindler commented Jul 4, 2023

I agree. There are more complications: DataInput does not have a read method for int[], only ones for float[] and long[]. So changing this is a bigger task. I tend to think that we should take some time to try it, and possibly delay it until Lucene 10.0.

This was the reason why I said: let's keep the index format as is and reimplement the current format with vectors, but also invent a new one, with new int-based methods in the IO layer, for later.

@tang-hi
Contributor Author

tang-hi commented Jul 5, 2023

Switching from LongVector to IntVector is feasible, especially for 128-bit, as only a few modifications are needed. Special handling may be required for 256-bit and 512-bit, but it shouldn't be a big issue. Additionally, when the output is also int[], the handling becomes simpler. For example, with long[] and bpv = 3, we would need to store the values in six longs, which cannot be evenly divided by four; with int[], this problem doesn't exist.

Currently, I am more concerned about the scalar code. Yesterday, I tested it and found that the original scalar decode code was amazingly fast, almost on par with the SIMD speed at 50 ops/us. On the other hand, the scalar code using the new compression format only achieved around 10+ ops/us. I would like to know if we can tolerate some performance loss in the scalar code if we switch the compression format. What is the minimum performance threshold we can accept?

@jpountz
Contributor

jpountz commented Jul 5, 2023

I would like to know if we can tolerate some performance loss in the scalar code if we switch the compression format. What is the minimum performance threshold we can accept?

To me, the numbers that matter are the end-to-end search latencies rather than pure decoding speed, e.g. the same benchmark as @mikemccand's nightlies. I'd be good with this change if there is a noticeable speedup with the new vector code and the scalar code only degrades performance by a few percent. (It would be very easy to test this by plugging the new ForUtil into a new codec; I can help with this if needed.)

@tang-hi tang-hi linked a pull request Jul 5, 2023 that will close this issue
@tang-hi
Contributor Author

tang-hi commented Jul 5, 2023

I have implemented an int version of the scalar and vectorized ForUtil and submitted a draft PR as a simple starting point for those interested in this issue. Even if it is not ultimately merged, it can still be useful.

Here is my current progress:

  1. Implemented the decode and encode functions for vectorization.
  2. Implemented the decode and encode functions for scalars, but they are not well implemented.
  3. Changed the function interface to int[].

However, I am not sure why many of the tests are failing, even though the tests for PForUtil and ForUtil are passing. I will take a closer look at the specific reasons when I have time. Could changing certain members from long to int affect correctness? I am not completely familiar with the index structure of Lucene, so please let me know if you discover anything. It is also possible that there are issues with my code implementation.
#12417

@tang-hi
Contributor Author

tang-hi commented Jul 5, 2023

When using the int type, there is a significant performance improvement compared to the long type, approximately 2-3 times. You can refer to the link for more details. The performance of decoding and encoding is similar, except that the original scalar decoding code already runs at around 50 ops/us, while the vectorized version is more than twice as fast.

@uschindler
Contributor

uschindler commented Jul 5, 2023

I agree. There are more complications: DataInput does not have a read method for int[], only ones for float[] and long[]. So changing this is a bigger task.

I just noticed: we already have readInts() in DataInput, and it is also implemented in MemorySegmentIndexInput.
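
So a decoder can bulk-read straight into an int[]; a minimal sketch (assuming the signature mirrors readLongs()/readFloats()):

import java.io.IOException;
import org.apache.lucene.store.DataInput;

static int[] readBlock(DataInput in) throws IOException {
  int[] block = new int[128];
  in.readInts(block, 0, block.length); // no long[] detour
  return block;
}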

@uschindler
Contributor

However, I am not sure why many of the tests are failing, even though the tests for PForUtil and ForUtil are passing. I will take a closer look at the specific reasons when I have time. Could changing certain members from long to int affect correctness?

This could be because of different memory sizes: Lucene calculates the memory usage of the codecs' on-heap data, and that may now be a bit wrong, as you have not adapted the ramBytesUsed() methods in some of the classes.
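
A hypothetical illustration of that accounting concern (RamUsageEstimator is Lucene's real utility; the class is made up):

import org.apache.lucene.util.RamUsageEstimator;

final class BlockBuffer {
  private final int[] buffer = new int[128]; // was long[128] before the int switch

  long ramBytesUsed() {
    return RamUsageEstimator.sizeOf(buffer); // ~528 bytes now vs ~1040 for long[128]
  }
}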

Without knowing which tests failed, I can't give a good guess. I will check it out later.

@uschindler uschindler linked a pull request Jul 5, 2023 that will close this issue