
add ConcurrentOnHeapHnswGraph and Builder #12254

Closed
jbellis wants to merge 88 commits

Conversation

@jbellis (Contributor) commented Apr 28, 2023

Motivation

  1. It is faster to query a single on-heap HNSW index than to query multiple such indexes and combine the results.
  2. Even with some contention necessarily occurring while building the index, total efficiency is still far better than creating per-thread indexes and combining them: combining such indexes boils down to "pick the largest and then add all the other nodes normally," so you don't really benefit from having computed the others previously.

Performance

On my i9-12900 (8 performance cores and 8 efficiency cores) I get a bit less than a 10x speedup in wall-clock time for building and querying the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/. The small dataset is 10k vectors while the large is 1M. This speedup feels pretty good for a data structure that isn't completely parallelizable, and it's good to see that it's consistent as the dataset gets larger.

The concurrent classes achieve recall identical to the non-concurrent versions, within my ability to test it, and are drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I use ThreadLocals to work around the places where the existing API assumes no concurrency.
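
For illustration, a minimal sketch of that ThreadLocal workaround (hypothetical names, not the actual patch): anywhere the old API hands out a single mutable scratch object, the concurrent version can give each thread its own copy instead.

```java
// Hypothetical sketch: per-thread scratch state via ThreadLocal, so callers that assume
// single-threaded access each get an isolated copy instead of sharing one mutable object.
class ScratchStateSketch {
  private final ThreadLocal<float[]> scratch;

  ScratchStateSketch(int dimension) {
    this.scratch = ThreadLocal.withInitial(() -> new float[dimension]);
  }

  float[] scratchBuffer() {
    return scratch.get(); // each calling thread sees its own buffer
  }
}
```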

The timing harness (which includes recall checking) is here: https://github.com/jbellis/hnswdemo/tree/concurrent

Design

ConcurrentOnHeapHnswGraph is basically the same design as OnHeapHnswGraph but with the neighbor lists changed to use ConcurrentHashMap at both level 0 and higher levels. The internal guarantees change a bit along with this, e.g., addNode only initializes a neighbor list for node N when N is added, rather than all nodes <= N.
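
A rough sketch of the layout this implies (illustrative only, not the actual class): per-level maps keyed by node ordinal, backed by ConcurrentHashMap, where addNode only creates an entry for the node being added.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the structure described above; names are hypothetical.
class ConcurrentGraphSketch {
  // level -> (node ordinal -> neighbor ordinals); level 0 holds every node, upper levels a subset
  private final ConcurrentHashMap<Integer, ConcurrentHashMap<Integer, Set<Integer>>> levels =
      new ConcurrentHashMap<>();

  // only the entry for this node is created, not entries for all ordinals <= node
  void addNode(int level, int node) {
    levels.computeIfAbsent(level, l -> new ConcurrentHashMap<>())
          .computeIfAbsent(node, n -> ConcurrentHashMap.newKeySet());
  }

  Set<Integer> neighborsOf(int level, int node) {
    return levels.get(level).get(node);
  }
}
```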

ConcurrentHnswGraphBuilder changes more, because the existing design puts all the logic in the Builder, which calls primitive operations on the Graph and the Neighbors as necessary. Unfortunately you can’t build threadsafe concurrent structures that way without massive synchronized blocks and a ton of contention, so I’ve moved a lot of the logic around adding neighbors into ConcurrentNeighborSet.
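
To illustrate the kind of encapsulation this means (a sketch under assumed names, not the real ConcurrentNeighborSet): the builder only proposes candidates, and the set itself decides atomically whether to keep them and what to evict. The real patch is finer-grained; this uses a coarse lock for clarity.

```java
import java.util.TreeMap;

// Hypothetical sketch: the neighbor set owns its own insertion/pruning logic instead of
// the builder mutating raw arrays.
class NeighborSetSketch {
  private final int maxConnections;
  private final TreeMap<Float, Integer> neighborsByScore = new TreeMap<>(); // lowest score first

  NeighborSetSketch(int maxConnections) {
    this.maxConnections = maxConnections;
  }

  // the builder calls this with a candidate; acceptance and eviction happen atomically here
  synchronized void insert(int node, float score) {
    neighborsByScore.put(score, node); // ties on score overwrite; fine for a sketch
    if (neighborsByScore.size() > maxConnections) {
      neighborsByScore.pollFirstEntry(); // drop the weakest neighbor
    }
  }
}
```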

The key to achieving good concurrency without synchronization is that we track in-progress node additions in a ConcurrentSkipListSet. After a thread adds its own node to that set, it takes a snapshot and considers all other in-progress additions as neighbor candidates (subject to the normal level constraints).
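
A minimal sketch of that pattern (illustrative names; the real builder interleaves this with the HNSW neighbor search):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch: register the node being inserted, snapshot the other in-progress
// insertions, and treat them as extra neighbor candidates so concurrent additions can
// still find each other even though neither is fully linked into the graph yet.
class InProgressTrackingSketch {
  private final ConcurrentSkipListSet<Integer> inProgress = new ConcurrentSkipListSet<>();

  void addGraphNode(int node) {
    inProgress.add(node);
    try {
      List<Integer> concurrentCandidates = new ArrayList<>();
      for (int other : inProgress) { // weakly consistent snapshot
        if (other != node) {
          concurrentCandidates.add(other);
        }
      }
      // ... run the normal HNSW neighbor search, then also consider concurrentCandidates
      // (subject to the usual per-level constraints) ...
    } finally {
      inProgress.remove(node);
    }
  }
}
```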

jbellis added 4 commits April 27, 2023 11:42
Switch OnHeapHnswGraph from representing a single node's neighbors
as a TreeMap to a HashMap, and update its callers to no longer assume that
NodeIterator results are ordered by ordinal.
@jbellis commented Apr 28, 2023

The first four commits are from this pull request: #12248

@jbellis commented Apr 28, 2023

TODO: get RAM usage estimation working

jbellis marked this pull request as draft April 28, 2023 15:06
@jbellis commented Apr 28, 2023

TODO: update test suite to test both Concurrent and normal classes.

@msokolov (Contributor) commented:

Very exciting!

After a very quick glance through the code, a few high-level comments/questions:

To use this when building indexes in Lucene, I think we'd need a way to control the threads used; I don't think parallel streams would allow for that. Have you considered accepting an Executor and using that to run the updates?

TBH I am not an expert on Lucene's IndexWriter threading model, but here is my stab at some background info: we need to keep threads free so that indexing can go on concurrently with the flush/merge that would invoke this. ConcurrentMergeScheduler, for one, lets callers specify the number of threads used for merging. Typically these are allocated per merged segment, but in theory it could be freed to allow a single merge to use multiple threads. As for flush (initial creation of a new segment), I believe this is always single-threaded today. Perhaps we could use concurrency there too somehow, but it's a secondary target I think.

I don't see the need for these classes to conform to the existing HnswGraph API. If we decide we want to use this implementation when writing the index, we can switch LucenexxHnswWriter to do so. Given that, could we drop the use of ThreadLocal?

I wonder what the single-threaded performance of these new classes is as compared to the existing impl. Have you tested that?

@jbellis commented May 1, 2023

I've made a couple optimizations and merged to main.

With the optimizations it's still about 38% slower than the non-concurrent code (including the HashMap optimization for the non-concurrent code that went in). Nothing stands out in the profiler as a problem anymore; I guess that's what you get for replacing HashMap with ConcurrentHashMap, and especially for replacing ArrayList with ConcurrentHashMap in L0.

I still don't know how to fix the RAM usage estimates; any suggestions?

@jbellis commented May 1, 2023

Replaced the parallel stream with an ExecutorService. Only ThreadPoolExecutor is supported, since it knows how many threads it has available. After a quick look, that is the only type I see Lucene creating.

I was pleasantly surprised that the TPE is no slower than parallel streams in my testing.
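
For reference, a hedged sketch of what driving the build with a ThreadPoolExecutor could look like (addGraphNode and the pool setup here are illustrative, not the PR's exact API); ThreadPoolExecutor is convenient because getMaximumPoolSize() tells the builder how many workers it can assume.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative only: submit one insertion task per vector and wait for all of them.
class ParallelBuildSketch {
  static void buildAll(ThreadPoolExecutor pool, int nodeCount) throws Exception {
    int workers = pool.getMaximumPoolSize(); // lets the builder size internal structures
    List<Future<?>> futures = new ArrayList<>();
    for (int node = 0; node < nodeCount; node++) {
      final int n = node;
      futures.add(pool.submit(() -> addGraphNode(n)));
    }
    for (Future<?> f : futures) {
      f.get(); // propagate any failure from a worker thread
    }
  }

  static void addGraphNode(int node) {
    // placeholder for the concurrent insertion described in this PR
  }

  public static void main(String[] args) throws Exception {
    ThreadPoolExecutor pool =
        (ThreadPoolExecutor) Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
      buildAll(pool, 10_000);
    } finally {
      pool.shutdown();
    }
  }
}
```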

@tang-hi (Contributor) commented May 1, 2023

> I've made a couple optimizations and merged to main.
>
> With the optimizations it's still about 38% slower than the non-concurrent code (including the HashMap optimization for the non-concurrent code that went in). Nothing stands out in the profiler as a problem anymore; I guess that's what you get for replacing HashMap with ConcurrentHashMap, and especially for replacing ArrayList with ConcurrentHashMap in L0.
>
> I still don't know how to fix the RAM usage estimates; any suggestions?

I believe the issue is related to the Java Platform Module System, which was introduced in Java 9. You can modify the following section in the gradlew script to avoid the exception:

lucene/gradlew, line 221 (at 3c16374):

    eval set -- $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"

Modified version:

    JVM_ARGS=" -Ptests.jvmargs=\"--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED\""

    # Collect all arguments for the java command, following the shell quoting and substitution rules
    eval set -- $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS" "$JVM_ARGS"

After making this change, the exception should no longer be thrown. However, the test may still fail due to an AssertionError, as shown below:

  java.lang.AssertionError: expected:<1185336.0> but was:<486483.0>
   >         at __randomizedtesting.SeedInfo.seed([A41F62241729FCF:CF0592627AF94878]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.failNotEquals(Assert.java:835)
   >         at org.junit.Assert.assertEquals(Assert.java:555)
   >         at org.junit.Assert.assertEquals(Assert.java:685)
   >         at org.apache.lucene.util.hnsw.HnswGraphTestCase.testRamUsageEstimate(HnswGraphTestCase.java:789)
   >         at org.apache.lucene.util.hnsw.TestHnswFloatVectorGraph.testRamUsageEstimate(TestHnswFloatVectorGraph.java:37)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccesso

@jbellis commented May 1, 2023

Thanks! For me it is working:

$ ./gradlew -p lucene/core test --tests "*Hnsw*"
...
:lucene:core:test (SUCCESS): 75 test(s)

Note that these args do not seem to be picked up by IntelliJ when running tests; is there a better place to add them?

jbellis added 23 commits May 13, 2023 17:01
… ability to grow itself

extract JavaUtilBitSet -> GrowableBitSet to handle this use case

this makes concurrent build about 8% faster (~162s to ~150s on the Texmex benchmark)
…pListSet

CSLS handles concurrent inserts better, but reads are much more expensive:
- Iterate references instead of an array
- Unbox from Long
- Decode to int

Even in a construction-only workload, there are so many more reads than updates that it's better to optimize for reads.
…ing at the end, so we can just check size instead of the performance hit of filling unused slots with -1
…lace the array reference while other threads updated the old one. Fix using StampedLock
@jbellis commented Jul 6, 2023

Closing in favor of #12421.

jbellis closed this Jul 6, 2023