Leverage accelerated vector hardware instructions in Vector Search #96370

ChrisHegarty · 2023-05-26T08:14:56Z

The upcoming Lucene 9.7.0 release has support for SIMD vectorized implementations of the low-level primitives used by Vector Search. The vectorized implementations use the currently incubating Panama Vector API; see apache/lucene#12311 and apache/lucene#12327. The Lucene changelog notes say it all [1].

We should evaluate the impact of enabling this in Elasticsearch. Specifically,

- Refactor similar usages in Elasticsearch to use the Lucene VectorUtil functions, since they are much faster, e.g. DenseVectorFieldMapper.java can use VectorUtil::dotProduct (rather than its own slower scalar implementation). #96617
- Merge Lucene 9.7.0 without enabling the new Panamaized vectorized implementations. We want to validate and baseline the upgrade to 9.7.0 independently of this change. Allow, say, 24+ hours to get at least one nightly benchmark run.
  - Upgrade to 9.7.0 snapshot
  - Evaluate nightly benchmarks
- Add --add-modules jdk.incubator.vector to the Elasticsearch startup; this will enable the faster Lucene VectorUtil implementation. https://github.com/elastic/elasticsearch/blob/main/distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/ServerProcess.java#L222. This will raise a warning at startup; document that this is ok (similar to the security manager warning - yes, we know, it is ok!)
  - Conditionally add the module, only for JDK 20+: Enable the Panama Vector module #96453
  - Check JVM flags (see later comment), test environments, etc.
  - Check log output contains the expected Vector bit width: Enable the Panama Vector module #96453 (comment)
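
The "conditionally add the module" step could look roughly like the sketch below. It is only an illustration: `VectorModuleFlag` and `jvmOptions` are hypothetical names, not the code in ServerProcess.java or #96453, and the check is against exactly Java 20 because that is what the Lucene 9.7.0 changelog states.

```java
import java.util.List;

final class VectorModuleFlag {
    // Hypothetical helper: extra JVM options for a given Java feature version.
    static List<String> jvmOptions(int javaFeatureVersion) {
        // Per the Lucene 9.7.0 changelog, the Panama implementation is used on exactly Java 20.
        if (javaFeatureVersion == 20) {
            return List.of("--add-modules", "jdk.incubator.vector");
        }
        // Any other version: start without the incubator module.
        return List.of();
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is the feature release of the running JVM, e.g. 20.
        System.out.println(jvmOptions(Runtime.version().feature()));
    }
}
```

Note that a launcher would need the version of the JVM it is about to start, which is not necessarily the one it is running on; the `main` method above only checks the current JVM.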

[1] GITHUB#12302, GITHUB#12311: Add vectorized implementations of VectorUtil.dotProduct(), squareDistance(), cosine() with Java 20 jdk.incubator.vector APIs. Applications started with command line parameter "java --add-modules jdk.incubator.vector" on exactly Java 20 will automatically use the new vectorized implementations if running on a supported platform (x86 AVX2 or later, ARM SVE or later). This is an opt-in feature and requires explicit Java command line flag! When enabled, Lucene logs a notice using java.util.logging. Please test thoroughly and report bugs/slowness to Lucene's mailing list.
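
For readers unfamiliar with the three primitives named in the changelog, here is a plain scalar sketch of what they compute. This is not Lucene's code; `ScalarVectorOps` is a hypothetical class written for illustration, and the vectorized versions are expected to produce the same values up to floating-point summation order.

```java
final class ScalarVectorOps {
    // Sum of element-wise products.
    static float dotProduct(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Squared Euclidean distance (no square root taken).
    static float squareDistance(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Cosine similarity: dot product divided by the product of the two norms.
    static float cosine(float[] a, float[] b) {
        checkSameLength(a, b);
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) normA * (double) normB));
    }

    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
        }
    }
}
```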


elasticsearchmachine · 2023-05-26T08:15:19Z

Pinging @elastic/es-search (Team:Search)

ChrisHegarty · 2023-05-26T08:29:25Z

Sample microbenchmark results here to give a sense of the potential performance impact from this change.

Model name: 11th Gen Intel(R) Core(TM) i9-11900F @ 2.50GHz : AVX-512

java -jar target/vectorbench.jar -p size=1024
...
Benchmark                                (size)   Mode  Cnt   Score   Error   Units
BinaryCosineBenchmark.cosineDistanceNew    1024  thrpt    5  10.637 ± 0.068  ops/us
BinaryCosineBenchmark.cosineDistanceOld    1024  thrpt    5   1.115 ± 0.008  ops/us
BinaryDotProductBenchmark.dotProductNew    1024  thrpt    5  22.050 ± 0.007  ops/us
BinaryDotProductBenchmark.dotProductOld    1024  thrpt    5   3.349 ± 0.041  ops/us
BinarySquareBenchmark.squareDistanceNew    1024  thrpt    5  16.215 ± 0.129  ops/us
BinarySquareBenchmark.squareDistanceOld    1024  thrpt    5   2.479 ± 0.032  ops/us
FloatCosineBenchmark.cosineNew             1024  thrpt    5   9.394 ± 0.048  ops/us
FloatCosineBenchmark.cosineOld             1024  thrpt    5   0.750 ± 0.002  ops/us
FloatDotProductBenchmark.dotProductNew     1024  thrpt    5  25.657 ± 2.105  ops/us
FloatDotProductBenchmark.dotProductOld     1024  thrpt    5   3.320 ± 0.079  ops/us
FloatSquareBenchmark.squareNew             1024  thrpt    5  19.437 ± 0.122  ops/us
FloatSquareBenchmark.squareOld             1024  thrpt    5   2.355 ± 0.003  ops/us
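
To read the table as speedups, divide each New score by the matching Old score. The sketch below does that with the numbers copied from the table above (`BenchmarkSpeedup` is a hypothetical helper); the ratios come out between roughly 6.5x and 12.5x for size=1024 on this machine.

```java
final class BenchmarkSpeedup {
    // Ratio of new throughput to old throughput, both in ops/us.
    static double speedup(double newOpsPerUs, double oldOpsPerUs) {
        return newOpsPerUs / oldOpsPerUs;
    }

    public static void main(String[] args) {
        // Scores copied from the JMH output above.
        String[] names = {
            "Binary cosine", "Binary dotProduct", "Binary squareDistance",
            "Float cosine", "Float dotProduct", "Float square"
        };
        double[] newScores = {10.637, 22.050, 16.215, 9.394, 25.657, 19.437};
        double[] oldScores = {1.115, 3.349, 2.479, 0.750, 3.320, 2.355};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-24s %.1fx%n", names[i], speedup(newScores[i], oldScores[i]));
        }
    }
}
```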

javanna · 2023-05-26T08:39:11Z

Thanks @ChrisHegarty for the great work and for opening this issue. I think the first step is to upgrade Elasticsearch to a Lucene 9.7 snapshot. I will open a PR.

uschindler · 2023-05-26T12:37:23Z

I need to mention: At the moment this ONLY works with Java 20. Support for Java 21 is coming a bit later when Java 21 goes into RC phase and all APIs are finalized (this is the same for the new memory segment MMapDirectory support).

We won't backport to Java 19. It does not hurt to enable the module also on other Java versions, but it is best done conditionally.

As with MMapDirectory, the code uses java.util.logging to report status and warnings. Those warnings should be fed to the normal Elasticsearch logs with a wrapper (@ChrisHegarty: did you add the glue code log4j-jul?) to give users feedback that all is sane - don't hide the messages! There are also some combinations of JVM settings that automatically disable the vector support (like missing or disabled AVX2 on x86) or non-tiered compilation (e.g., C2 disabled).

The startup warning about the incubator module will be printed to stderr and is unrelated to Lucene and cannot be disabled.
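
As a JDK-only illustration of the "wrapper" idea for the java.util.logging messages mentioned above, the sketch below attaches a `Handler` that forwards records to an arbitrary sink. It is not the Elasticsearch log4j-jul glue: `JulForwarder`, `install`, and the `Consumer<String>` sink are hypothetical stand-ins for whatever the real logging backend is.

```java
import java.util.function.Consumer;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class JulForwarder extends Handler {
    private final Consumer<String> sink;

    JulForwarder(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void publish(LogRecord record) {
        // isLoggable applies this handler's level and filter (level defaults to ALL).
        if (isLoggable(record)) {
            // getMessage() is the raw message; parameter substitution is skipped in this sketch.
            sink.accept(record.getLevel() + " " + record.getLoggerName() + ": " + record.getMessage());
        }
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}

    // Attaches a forwarder to the named logger. The caller should keep the returned
    // reference, because java.util.logging only holds loggers weakly.
    static Logger install(String loggerName, Consumer<String> sink) {
        Logger logger = Logger.getLogger(loggerName);
        logger.addHandler(new JulForwarder(sink));
        return logger;
    }
}
```

With the default logging configuration, records at INFO and above reach the handler, so a status message logged under the installed logger name ends up in the sink as well as on the console.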

uschindler · 2023-05-26T12:45:19Z

In addition, when doing performance tests, use large indexes and many queries, and let benchmarks run with appropriate warmup for a longer time. The vector features take a bit longer until they enable the optimizations in the JVM. The first few queries will be way slower than with the legacy Lucene vector implementation (we measured 40 times slower dotProduct until the optimizations kick in, and then it gets up to 2-5 times faster than the classical Lucene code, depending on hardware capabilities and vector size).
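
A rough way to see the warmup effect is to time the same operation before and after a number of untimed rounds. The sketch below is a hand-rolled harness using a scalar dot product as a stand-in; `WarmupSketch` and its methods are hypothetical, the iteration counts are arbitrary, and a real measurement should use a benchmark framework such as JMH, as the numbers earlier in this thread do.

```java
import java.util.Arrays;

final class WarmupSketch {
    // Written after each measurement so the JIT cannot discard the computed sums.
    static volatile float sink;

    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Times `iterations` calls and returns throughput in operations per microsecond.
    // Note: overwrites a[0] to keep the call from being loop-invariant.
    static double opsPerMicrosecond(float[] a, float[] b, int iterations) {
        float acc = 0f;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            a[0] = i & 7;
            acc += dotProduct(a, b);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        sink = acc;
        return iterations / (elapsedNanos / 1_000.0);
    }

    public static void main(String[] args) {
        float[] a = new float[1024];
        float[] b = new float[1024];
        Arrays.fill(a, 1f);
        Arrays.fill(b, 2f);

        System.out.printf("cold: %.3f ops/us%n", opsPerMicrosecond(a, b, 1_000));
        for (int round = 0; round < 20; round++) {
            opsPerMicrosecond(a, b, 100_000); // warmup rounds, results ignored
        }
        System.out.printf("warm: %.3f ops/us%n", opsPerMicrosecond(a, b, 100_000));
    }
}
```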

Also keep in mind to update the documentation that there should be vector sizes fitting the hardware infrastructure. Vector dimensions of 100 are bad, 128 is perfect!

We have some formulas depending on the hardware. As a good default: for AVX2 CPUs with 256-bit vector support, the calculation (4 bytes per float, 4 lanes executed in parallel) means the dimension of float vectors should be a multiple of 32 (so 32, 64, 96, 128, ... are good sizes). With AVX-512 it is multiples of 64. For byte/binary vectors it is a bit different. So possibly tell users to have vector dimensions that are multiples of 64 to catch all variants.
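
The arithmetic above can be written out as follows. This is a sketch under one assumption: the "4 lanes" are read here as four independent vector operations per loop iteration (an unroll factor of 4), which reproduces the 32 and 64 figures. `DimensionAdvice` and its method names are hypothetical.

```java
final class DimensionAdvice {
    // Floats consumed per loop iteration: floats per hardware vector times the unroll factor.
    static int floatsPerIteration(int vectorBits, int unroll) {
        return vectorBits / Float.SIZE * unroll; // Float.SIZE is 32 bits
    }

    // Smallest multiple of `multiple` that is >= dims.
    static int roundUpToMultiple(int dims, int multiple) {
        return (dims + multiple - 1) / multiple * multiple;
    }

    public static void main(String[] args) {
        System.out.println("256-bit: " + floatsPerIteration(256, 4));
        System.out.println("512-bit: " + floatsPerIteration(512, 4));
        System.out.println("100 dims rounded up for 64: " + roundUpToMultiple(100, 64));
    }
}
```

Under that reading, 256-bit hardware gives 8 floats per vector times 4, i.e. 32, and 512-bit hardware gives 16 times 4, i.e. 64; a 100-dimension vector would be padded up to 128 to hit a multiple of 64.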

ChrisHegarty · 2023-05-26T13:26:14Z

> I need to mention: At the moment this ONLY works with Java 20.

Yeah, and this is fine. Elasticsearch currently bundles JDK 20.0.1 - and will bundle 20.0.2 when it comes out. You can set your own JDK, but ... too bad if you set it to something older! ;-)

> Support for Java 21 is coming a bit later when Java 21 goes into RC phase and all APIs are finalized (this is the same for the new memory segment MMapDirectory support).

Yeah, we should certainly do this soon, as you say. 21 RDP 1 is early June, and we wanna make sure that all this works well.

> We won't backport to Java 19. It does not hurt to enable the module also on other Java versions, but it is best done conditionally.

Good point. I had overlooked this.

> As with MMapDirectory, the code uses java.util.logging to report status and warnings. Those warnings should be fed to the normal Elasticsearch logs with a wrapper (@ChrisHegarty: did you add the glue code log4j-jul?) to give users feedback that all is sane - don't hide the messages! There are also some combinations of JVM settings that automatically disable the vector support (like missing or disabled AVX2 on x86) or non-tiered compilation (e.g., C2 disabled).

The log messages for MMap do end up in the ES logs. I assume it will be the same for Vector, but I'll need to double check. I think that our usage of JVM flags is ok, but this could be worth checking, especially in different test environments.

> The startup warning about the incubator module will be printed to stderr and is unrelated to Lucene and cannot be disabled.

I am sorry that this is not suppressible - yes, it is my fault, I added it to the JDK. In general, I think that these kinda warnings should be suppressible from the command line. But that is another discussion, best left for another day.

ChrisHegarty · 2023-06-02T10:31:42Z

A note on vector dimension sizes.

There are some dimensions that perform better than others, but of course all dimension sizes behave correctly. E.g. 1536 will perform better than 1535.

The Lucene VectorUtilPanamaProvider (where the JDK Vector API is leveraged) has support for 128, 256, and 512 preferred bit sizes. For best alignment on hardware that supports:

- 128 bit sizes (e.g. Mac M1, NEON), we pack 4 individual values of element_type float or 16 of element_type byte;
- 256 bit sizes (e.g. Skylake, AVX2), we pack 8 individual values of element_type float or 32 of element_type byte;
- 512 bit sizes (e.g. Rocket Lake, AVX-512), we pack 16 individual values of element_type float or 64 of element_type byte.

Given this, vector dimensions that are multiples of the number of per element_type packed values, for the given hardware preferred bit size, perform best. We've not tried to over optimise for "downsizing" on tail cases, rather preferring to keep the code simple.
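
The packing rule above reduces to a division and a remainder. A small sketch, with `LaneMath` as a hypothetical helper (not part of Lucene): the number of packed values is the preferred bit size divided by the element width, and the remainder of the dimension by that count is the tail that does not fill a whole hardware vector.

```java
final class LaneMath {
    // Number of elements packed into one hardware vector.
    static int lanes(int preferredBitSize, int elementBits) {
        return preferredBitSize / elementBits;
    }

    // Elements left over after the last full hardware vector; 0 means perfectly aligned.
    static int tail(int dims, int lanes) {
        return dims % lanes;
    }

    public static void main(String[] args) {
        int floatLanes512 = lanes(512, Float.SIZE); // Float.SIZE is 32 bits
        System.out.println("float lanes at 512 bits: " + floatLanes512);
        System.out.println("tail for 1536 dims: " + tail(1536, floatLanes512));
        System.out.println("tail for 1535 dims: " + tail(1535, floatLanes512));
    }
}
```

For the example given earlier, 1536 float dimensions leave no tail at any of the three bit sizes, while 1535 leaves a tail of 15 elements on 512-bit hardware.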

The logs contain a message noting the preferred bit size that is in use for the hardware that we're running on. E.g. running on my mac I can see the following in the Elasticsearch logs:

[..][WARN ][stderr ..] [y..] Jun 01, 2023 4:01:26 PM org.apache.lucene.util.VectorUtilPanamaProvider <init>
[..][WARN ][stderr ..] [y..] INFO: Java vector incubator API enabled; uses preferredBitSize=128
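
For the "check log output contains the expected Vector bit width" task, a test could extract the value from a line like the one above. A minimal sketch, assuming the message keeps the `preferredBitSize=<n>` form shown in this log excerpt; `PreferredBitSizeCheck` is a hypothetical helper.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreferredBitSizeCheck {
    // Matches the key=value fragment seen in the log excerpt above.
    private static final Pattern BIT_SIZE = Pattern.compile("preferredBitSize=(\\d+)");

    // Returns the bit size found in the line, or empty if the line does not mention one.
    static OptionalInt parse(String logLine) {
        Matcher m = BIT_SIZE.matcher(logLine);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String line = "INFO: Java vector incubator API enabled; uses preferredBitSize=128";
        System.out.println(parse(line));
    }
}
```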

ChrisHegarty · 2023-06-02T11:19:40Z

A note on existing / current Elasticsearch Vector Search benchmarks.

The SO Vector benchmark track tests vector search performance with vectors (of element_type float) with 768 dimensions.

 "titleVector": {
   "type": "dense_vector",
   "dims" : 768,
   "index" : true,
   "similarity": "dot_product"
 }

The Dense Vector benchmark track tests vector search performance with vectors (of element_type float) with 96 dimensions.

 "vector": {
   "type": "dense_vector",
   "dims" : 96,
   "index" : true,
   "similarity": "dot_product",
   "index_options": {
     "type": "hnsw",
     "m": 32,
     "ef_construction": 100
   }
 }

ChrisHegarty · 2023-06-05T09:13:05Z

The ES Nightly benchmark updates the jvm.options file; it needed a change to enable the Vector API, see elastic/rally-teams#79

ChrisHegarty · 2023-06-06T08:31:50Z

(This benchmark was run on an N2 Ice Lake VM instance on GCP)

Effects on the SO Vector benchmark:

indexing-throughput improved by 30%
merge-time decreased by 40%
script-score-query-java-latency improved by 40%

mayya-sharipova · 2023-06-06T11:41:11Z

@ChrisHegarty Impressive results!

…6617) Lucene has integrated hardware accelerated vector calculations. Meaning, calculations like `dot_product` can be much faster when using the Lucene defined functions. When a `dense_vector` is indexed, we already support this. However, when `index: false` we store float vectors as binary fields in Lucene and decode them ourselves. Meaning, we don't use the underlying Lucene structures or functions.

To take advantage of the large performance boost, this PR refactors the binary vector values in the following way:
- Eagerly decode the binary blobs when iterated
- Call the Lucene defined VectorUtil functions when possible

related to: #96370
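
The "eagerly decode, then call a shared function" idea can be sketched as below. The binary layout is an assumption made for illustration (consecutive big-endian 4-byte floats), not necessarily what Elasticsearch stores, and the local `dotProduct` is a stand-in for the Lucene `VectorUtil` call so that the example stays self-contained. `BinaryVectorDecode` is a hypothetical class.

```java
import java.nio.ByteBuffer;

final class BinaryVectorDecode {
    // Decodes `dims` floats from the blob in one pass (ByteBuffer is big-endian by default).
    static float[] decode(byte[] blob, int dims) {
        float[] out = new float[dims];
        ByteBuffer.wrap(blob).asFloatBuffer().get(out);
        return out;
    }

    // Inverse of decode, used here only to build test input.
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float f : vector) {
            buf.putFloat(f);
        }
        return buf.array();
    }

    // Stand-in for the shared dot product; the real code would call the Lucene function.
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {1f, 2f, 3f};
        byte[] stored = encode(new float[] {4f, 5f, 6f});
        // Decode once up front, then hand plain float[] values to the shared function.
        System.out.println(dotProduct(query, decode(stored, 3)));
    }
}
```

Decoding into a `float[]` first means the scoring code works on the same array type as the indexed path, rather than reading floats one at a time from the byte blob.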

javanna · 2023-06-09T07:59:47Z

@ChrisHegarty this is completed, right? Can we close this issue?

ChrisHegarty added the :Search/Search and Team:Search labels May 26, 2023

ChrisHegarty self-assigned this May 26, 2023

javanna added the >feature label May 30, 2023

This was referenced May 30, 2023

Upgrade Lucene to a 9.7.0 snapshot #96433

Merged

Enable the Panama Vector module #96453

Merged

benwtrent mentioned this issue Jun 6, 2023

Improve brute force vector search speed by using Lucene functions #96617

Merged

giladgal added the release highlight label Jun 7, 2023

ChrisHegarty closed this as completed Jun 12, 2023
