Optimize HNSW diversity calculation #12235

zhaih · 2023-04-19T03:31:50Z

Description

Currently when we pop out the worst diverse node we go from the most distant node and calculate diversity one by one. But actually when we add the first batch of neighbours, we have already checked the diversity and rejected the non-diverse node. So the only nodes that are unchecked are only the ones that inserted by neighbours (using insertSorted), we should be able to only check those "unchecked" nodes but not every node.

I rely on current unit test to check the correctness but if that's not enough I'm pretty happy to write some test as well.

KNNGraphTester bench based on mikemccand/luceneutil#218:

This patch

recall	latency	nDoc	fanout	maxConn	beamWidth visited index ms
0.802	13.30	1000000	20	32	250	  120	  630614	1.00	post-filter
0.907	13.77	1000000	250	32	250	  350	  621157	1.00	post-filter
0.830	13.95	1000000	20	32	500	  120	  1242756	1.00	post-filter
0.926	14.33	1000000	250	32	500	  350	  1246090	1.00	post-filter
0.804	12.73	1000000	20	96	250	  120	  672480	1.00	post-filter
0.910	13.75	1000000	250	96	250	  350	  671602	1.00	post-filter
0.832	14.01	1000000	20	96	500	  120	  1376617	1.00	post-filter
0.928	14.36	1000000	250	96	500	  350	  1381215	1.00	post-filter

Mainline

recall	latency	nDoc	fanout	maxConn	beamWidth visited index ms
0.802	12.82	1000000	20	32	250	  120	  686167	1.00	post-filter
0.907	13.58	1000000	250	32	250	  350	  633083	1.00	post-filter
0.831	13.73	1000000	20	32	500	  120	  1296331	1.00	post-filter
0.927	14.59	1000000	250	32	500	  350	  1275096	1.00	post-filter
0.804	12.79	1000000	20	96	250	  120	  672598	1.00	post-filter
0.910	13.27	1000000	250	96	250	  350	  670174	1.00	post-filter
0.832	13.76	1000000	20	96	500	  120	  1381401	1.00	post-filter
0.928	14.71	1000000	250	96	500	  350	  1393253	1.00	post-filter

msokolov

I made some small comments about the internal APIs, but my main questions here are about performance tradeoffs. This will use more RAM, maybe even a lot, and it should improve indexing time. This could be a good tradeoff, but we want to know how much of each, and what the bounds if any on RAM growth would be.

Before we commit this, we definitely need to run some perf tests. I have done this in the past using KnnGraphTester. It's designed to measure latency/recall but it also does report on graph build time and I think we could instrument to see how RAM is growing? To test you need a set of vectors that is large enough to be realistic. up to 1M is probably enough. There are vectors available via luceneutil.

One thought I had is whether we can limit the RAM growth by using a bitset tied to node index in the array rather than an array of int node ids? This would require reshuffling the bits when nodes are sorted in the NeighborArray. Could be a good choice to limit memory

lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java

zhaih · 2023-04-21T23:01:30Z

@msokolov Thanks for your feedback. Indeed the memory consumption is a concern and after I thought it more I found that it might be better to lazily sort the nodes and figure out which nodes are unchecked all at once.
In this way we still able to avoid the extra computation at the cost of 1 extra int, which I believe is more or less negligible now.

For sure I will do the KnnGraphTester (and thank you for pointing me to there, TBH I haven't done any test with KNN yet...) still to just see whether the performance gain is observable, but let me know whether the latter approach addresses your concern.
Thank you!

zhaih · 2023-04-24T03:34:28Z

I ran the KnnGraphTester and pasted the result to the description, basically when max-conn is low (there's high possiblility of popping out a node) the improvement seems to be obvious (2-5% on indexing time), when max-conn is high then the indexing time is similar or only slightly better, which is also expected.

msokolov · 2023-04-24T12:34:24Z

I see; well there seems like some benefit. Also, the impact is going to vary depending on the nature of the data. I guess we'd expect that vectors with a high degree of "clustering" would tend to have more ejections due to diversity checks.Let me take one more pass and see if there's any more polishing to do, otherwise it seems good to me.

MarcusSorealheis · 2023-04-24T18:40:12Z

lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java

@@ -306,7 +308,7 @@ private void addDiverseNeighbors(int level, int node, NeighborQueue candidates)
    for (int i = 0; i < size; i++) {
      int nbr = neighbors.node[i];
      NeighborArray nbrNbr = hnsw.getNeighbors(level, nbr);
-      nbrNbr.insertSorted(node, neighbors.score[i]);
+      nbrNbr.addOutOfOrder(node, neighbors.score[i]);


Nit: I know we use nbrNbr in a few more places, but as you have reminded me, the nondescript names could make it more difficult for others to understand the code and participate.

Is there another way to express this to make it easier for people to read/scan? neighborArray to be most explicit, candidateNeighbors or something else? You would only need to change two other files to preserve consistency if you elected to move from nbrNbr

Yeah this is not quite descriptive but I'm not sure whether I can come up with a better one.. Maybe nbrsOfNbr? neighborArray is not exactly right since it is actually "the neighbor array of the neighbor" LOL

I think neighborArray would be an improvement but not completely accurate. neighborsArray would be implicitly good, too, yet implicit is not good enough and it's not clear about the relationship with the neighbor in focus.

nbrsOfNbr is even more clear. I'd recommend going with your suggestion.

Sure, changed!

Is the change unneeded in the Lucene90 file if it's changed here? I was thinking we could rename nbrNbr wherever it is used because it is confusing.

I can do it for the other files because there are not many of them, however I don't usually do a text search for the whole repo to see whether the rename can be applied even to the files I haven't touched originally...

As a principal, that totally makes sense. For this specific project, it makes even more sense because it's pretty complicated in some areas.

The other two files are codec files with identical logic in this section, so it probably would not pose much of a problem. Although, I'm sure others could weigh in or -1 if adamant about nbrNbr. Most of the time, I favor readability over anything else. That could be because I am lazy. 😛 nbrsOfNbr is really clear while also being few characters to type. Still, no pressure.

zhaih · 2023-04-26T01:06:01Z

The test error seems due to a flaw introduced in #12169, since it seems they're anyway gonna rework on it for thread safety reason I left a comment there.

msokolov

I think this is ready? Is there anything you were still planning to improve / test? Oh I see there is a precommit test failing ...

msokolov · 2023-04-27T13:45:22Z

lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java

+   * @return indexes of newly sorted (unchecked) nodes, in ascending order, or null if the array is
+   *     already fully sorted
+   */
+  public int[] sort() {


LGTM. After reading I think I see what's going on. The name is deceptively simple; I wonder if it shouldn't hint that there is something tricky going on in here? Like sortAndReturnPreviouslyUnsorted() :) but it's fine to leave as is - the javadoc anyway explains it

Haha I'll just leave it as is. It claims itself is "sort" and it does the sort, but just return something different :)

zhaih · 2023-04-28T06:04:51Z

Oh I see there is a precommit test failing...

Yeah I want to wait for that fix a bit before I merge this one... Since anyway this won't make into Lucene 9.6 LOL

benwtrent · 2023-05-12T16:54:12Z

@zhaih I verified your change doesn't hurt recall. Is this change good to continue working on? I tried replicating your performance improvements, but did not see that big of a change. My search time latency might be drowning out the index build latency.

zhaih · 2023-05-13T05:50:46Z

@benwtrent I was originally waiting for #12246 to be merged first so that this PR won't break their test, but seems that one is blocked somehow by other PR so I'll just fix the problem here.

I think if this time all the checks passed I'll just merge this PR

…arches (#12235) We observed this change was not ported previously from main in an old cherry-pick

…dc8ca633e8bcf`) (#20) * Add next minor version 9.7.0 * Fix SynonymQuery equals implementation (apache#12260) The term member of TermAndBoost used to be a Term instance and became a BytesRef with apache#11941, which means its equals impl won't take the field name into account. The SynonymQuery equals impl needs to be updated accordingly to take the field into account as well, otherwise synonym queries with same term and boost across different fields are equal which is a bug. * Fix MMapDirectory documentation for Java 20 (apache#12265) * Don't generate stacktrace in CollectionTerminatedException (apache#12270) CollectionTerminatedException is always caught and never exposed to users so there's no point in filling in a stack-trace for it. * add missing changelog entry for apache#12260 * Add missing author to changelog entry for apache#12220 * Make query timeout members final in ExitableDirectoryReader (apache#12274) There's a couple of places in the Exitable wrapper classes where queryTimeout is set within the constructor and never modified. This commit makes such members final. * Update javadocs for QueryTimeout (apache#12272) QueryTimeout was introduced together with ExitableDirectoryReader but is now also optionally set to the IndexSearcher to wrap the bulk scorer with a TimeLimitingBulkScorer. Its javadocs needs updating. * Make TimeExceededException members final (apache#12271) TimeExceededException has three members that are set within its constructor and never modified. They can be made final. * DOAP changes for release 9.6.0 * Add back-compat indices for 9.6.0 * `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283) * `ToParentBlockJoinQuery` Explain Support Score Mode --------- Co-authored-by: Marcus <marcuseagan@gmail.com> * Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285) The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden. * [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197) * GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840) Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher) Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com> Co-authored-by: Adrien Grand <jpountz@gmail.com> Backport of apache#11840 Changes from original: - Query keeps `rewrite(IndexReader)`, but it is now deprecated - VirtualMethod is used to correct delegate to the overridden methods - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. ------------------------------ ### Description Issue: apache#11838 #### Updated Proposal * Change signature of rewrite to `rewrite(IndexSearcher)` * How did I migrate the usage: * Use Intellij to do preliminary refactoring for me * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)` * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`) * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher. * Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288) * Concurrent rewrite for KnnVectorQuery (apache#12160) - Reduce overhead of non-concurrent search by preserving original execution - Improve readability by factoring into separate functions --------- Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> * adjusting for backport --------- Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com> Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> * toposort use iterator to avoid stackoverflow (apache#12286) Co-authored-by: tangdonghai <tangdonghai@meituan.com> # Conflicts: # lucene/CHANGES.txt * Fix test to compile with Java 11 after backport of apache#12286 * Update Javadoc for topoSortStates method after apache#12286 (apache#12292) * Optimize HNSW diversity calculation (apache#12235) * Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305) * GITHUB-12291: Skip blank lines from stopwords list. (apache#12299) * Wrap Query rewrite backwards layer with AccessController (apache#12308) * Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315) * Add multi-thread searchability to OnHeapHnswGraph (apache#12257) * Fix backport error * [MINOR] Update javadoc in Query class (apache#12233) - add a few missing full stops - update wording in the description of Query#equals method * [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327) Leverage accelerated vector hardware instructions in Vector Search. Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously). Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21. --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Robert Muir <rmuir@apache.org> * Parallelize knn query rewrite across slices rather than segments (apache#12325) The concurrent query rewrite for knn vectory query introduced with apache#12160 requests one thread per segment to the executor. To align this with the IndexSearcher parallel behaviour, we should rather parallelize across slices. Also, we can reuse the same slice executor instance that the index searcher already holds, in that way we are using a QueueSizeBasedExecutor when a thread pool executor is provided. * Optimize ConjunctionDISI.createConjunction (apache#12328) This method is showing up as a little hot when profiling some queries. Almost all the time spent in this method is just burnt on ceremony around stream indirections that don't inline. Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min cost) makes this method far cheaper and easier to read. * Update changes to be correct with ARM (it is called NEON there) * GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332) Preparing to reduce visibility of this class in a future release * add BitSet.clear() (apache#12268) # Conflicts: # lucene/CHANGES.txt * Clenaup and update changes and synchronize with 9.x * Update TestVectorUtilProviders.java (apache#12338) * Don't generate stacktrace for TimeExceededException (apache#12335) The exception is package private and never rethrown, we can avoid generating a stacktrace for it. * Introduced the Word2VecSynonymFilter (apache#12169) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> * Word2VecSynonymFilter constructor null check (apache#12169) * Use thread-safe search version of HnswGraphSearcher (apache#12246) Addressing comment received in the PR apache#12246 * Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235) We observed this change was not ported previously from main in an old cherry-pick * Fix searchafter high latency when after value is out of range for segment (apache#12334) * Make memory fence in `ByteBufferGuard` explicit (apache#12290) * Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320) * Add updateDocuments API which accept a query (reopen) (apache#12346) * GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions * [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353) # Conflicts: # lucene/CHANGES.txt * feat: soft delete optimize (apache#12339) * Better paging when random reads go backwards (apache#12357) When reading data from outside the buffer, BufferedIndexInput always resets its buffer to start at the new read position. If we are reading backwards (for example, using an OffHeapFSTStore for a terms dictionary) then this can have the effect of re-reading the same data over and over again. This commit changes BufferedIndexInput to use paging when reading backwards, so that if we ask for a byte that is before the current buffer, we read a block of data of bufferSize that ends at the previous buffer start. Fixes apache#12356 * Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362) * Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249) * Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294) Backport incl JDK21 apijar file with java.util.Objects regenerated * remove relic in apijar folder caused by vector additions * Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324) * Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281) --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> * Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365) This commit enables the Panama Vector API for Java 21. The version of VectorUtilPanamaProvider for Java 21 is identical to that of Java 20. As such, there is no specific 21 version - the Java 20 version will be loaded from the MRJAR. * Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368) Signed-off-by: gashutos <gashutos@amazon.com> * Move TermAndBoost back to its original location. (apache#12366) PR apache#12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location. * GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> * hunspell (minor): reduce allocations when processing compound rules (apache#12316) (cherry picked from commit a454388) * hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323) there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation (cherry picked from commit 4bf1b94) * TestHunspell: reduce the flakiness probability (apache#12351) * TestHunspell: reduce the flakiness probability We need to check how the timeout interacts with custom exception-throwing checkCanceled. The default timeout seems not enough for some CI agents, so let's increase it. Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com> (cherry picked from commit 5b63a18) * This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376) --------- Signed-off-by: gashutos <gashutos@amazon.com> Co-authored-by: Alan Woodward <romseygeek@apache.org> Co-authored-by: Luca Cavanna <javanna@apache.org> Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Armin Braun <me@obrown.io> Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com> Co-authored-by: Marcus <marcuseagan@gmail.com> Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com> Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com> Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> Co-authored-by: tang donghai <tangdhcs@gmail.com> Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com> Co-authored-by: Greg Miller <gsmiller@gmail.com> Co-authored-by: Jerry Chin <metrxqin@gmail.com> Co-authored-by: Patrick Zhai <zhai7631@gmail.com> Co-authored-by: Andrey Bozhko <andybozhko@gmail.com> Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com> Co-authored-by: Robert Muir <rmuir@apache.org> Co-authored-by: Jonathan Ellis <jbellis@datastax.com> Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com> Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com> Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru> Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org> Co-authored-by: Ignacio Vera <ivera@apache.org> Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com> Co-authored-by: Chris Fournier <chris.fournier@shopify.com> Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com> Co-authored-by: Adrien Grand <jpountz@gmail.com> Co-authored-by: Elia Porciani <e.porciani@sease.io> Co-authored-by: Peter Gromov <peter@jetbrains.com>

zhaih requested a review from msokolov April 19, 2023 03:31

msokolov reviewed Apr 20, 2023

View reviewed changes

lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java Outdated Show resolved Hide resolved

lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java Outdated Show resolved Hide resolved

zhaih requested a review from msokolov April 24, 2023 03:34

MarcusSorealheis reviewed Apr 24, 2023

View reviewed changes

zhaih force-pushed the main branch 3 times, most recently from c61460e to 08ca98a Compare April 27, 2023 04:22

msokolov approved these changes Apr 27, 2023

View reviewed changes

zhaih and others added 10 commits May 12, 2023 22:44

Optimize HNSW diversity calculation

6f6844c

Add CHNAGES.txt entry

c647e67

Update ramBytesUsed for OnHeapHnswGraph

1b5fa2f

tidy

33871b9

mark all reused nodes as unchecked

bf806fd

Sort nodes lazily

b219ada

Rename variable

cdff89a

more rename

7dbb7a8

Move CHANGES entry to 9.7

69d4605

Fix failing test

f56df4a

zhaih force-pushed the main branch from 08ca98a to f56df4a Compare May 13, 2023 05:48

Merge branch 'main' into main

b0eafad

zhaih merged commit 8af3058 into apache:main May 16, 2023

zhaih added a commit that referenced this pull request May 16, 2023

Optimize HNSW diversity calculation (#12235)

310360c

zhaih mentioned this pull request May 16, 2023

Set word2vec getSynonyms method thread-safe #12246

Merged

asfgit pushed a commit that referenced this pull request May 30, 2023

Word2VecSynonymProvider to use standard Integer max value for hnsw se…

e134578

…arches (#12235) We observed this change was not ported previously from main in an old cherry-pick

benwtrent mentioned this pull request Nov 15, 2023

HnwsGraph creates disconnected components #12627

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize HNSW diversity calculation #12235

Optimize HNSW diversity calculation #12235

zhaih commented Apr 19, 2023 •

edited

Loading

msokolov left a comment

zhaih commented Apr 21, 2023

zhaih commented Apr 24, 2023

msokolov commented Apr 24, 2023

MarcusSorealheis Apr 24, 2023

zhaih Apr 24, 2023

MarcusSorealheis Apr 24, 2023

zhaih Apr 25, 2023

MarcusSorealheis Apr 26, 2023

zhaih Apr 27, 2023

MarcusSorealheis Apr 27, 2023

zhaih commented Apr 26, 2023

msokolov left a comment •

edited

Loading

msokolov Apr 27, 2023

zhaih Apr 28, 2023

zhaih commented Apr 28, 2023

benwtrent commented May 12, 2023

zhaih commented May 13, 2023

Optimize HNSW diversity calculation #12235

Optimize HNSW diversity calculation #12235

Conversation

zhaih commented Apr 19, 2023 • edited Loading

Description

This patch

Mainline

msokolov left a comment

Choose a reason for hiding this comment

zhaih commented Apr 21, 2023

zhaih commented Apr 24, 2023

msokolov commented Apr 24, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhaih commented Apr 26, 2023

msokolov left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhaih commented Apr 28, 2023

benwtrent commented May 12, 2023

zhaih commented May 13, 2023

zhaih commented Apr 19, 2023 •

edited

Loading

msokolov left a comment •

edited

Loading