
add ConcurrentOnHeapHnswGraph and Builder #12254

Closed
jbellis wants to merge 88 commits

Conversation

@jbellis (Contributor) commented Apr 28, 2023

Motivation

  1. It is faster to query a single on-heap HNSW index than to query multiple such indexes and combine the results.
  2. Even with some contention necessarily occurring while building the index, total efficiency is still far better than creating per-thread indexes and combining them: combining such indexes boils down to "pick the largest and then add all the other nodes normally," so you don't really benefit from having computed the others previously.

Performance

On my i9-12900 (8 performance cores and 8 efficiency cores) I get a bit less than a 10x speedup in wall-clock time for building and querying the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/. The small dataset is 10k vectors while the large is 1M. This speedup feels pretty good for a data structure that isn't completely parallelizable, and it's good to see that it's consistent as the dataset gets larger.

The concurrent classes achieve recall identical to the non-concurrent versions, within my ability to test it, and are drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I use ThreadLocals to work around the places where the existing API assumes no concurrency.
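
For illustration, a minimal sketch of that ThreadLocal workaround (hypothetical names, not the actual patch): anywhere the old API hands out a single mutable scratch object, the concurrent version can give each thread its own copy instead.

```java
// Hypothetical sketch: per-thread scratch state via ThreadLocal, so callers that assume
// single-threaded access each get an isolated copy instead of sharing one mutable object.
class ScratchStateSketch {
  private final ThreadLocal<float[]> scratch;

  ScratchStateSketch(int dimension) {
    this.scratch = ThreadLocal.withInitial(() -> new float[dimension]);
  }

  float[] scratchBuffer() {
    return scratch.get(); // each calling thread sees its own buffer
  }
}
```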

The timing harness (which includes recall checking) is here: https://github.com/jbellis/hnswdemo/tree/concurrent

Design

ConcurrentOnHeapHnswGraph is basically the same design as OnHeapHnswGraph but with the neighbor lists changed to use ConcurrentHashMap at both level 0 and higher levels. The internal guarantees change a bit along with this, e.g., addNode only initializes a neighbor list for node N when N is added, rather than all nodes <= N.
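
A rough sketch of the layout this implies (illustrative only, not the actual class): per-level maps keyed by node ordinal, backed by ConcurrentHashMap, where addNode only creates an entry for the node being added.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the structure described above; names are hypothetical.
class ConcurrentGraphSketch {
  // level -> (node ordinal -> neighbor ordinals); level 0 holds every node, upper levels a subset
  private final ConcurrentHashMap<Integer, ConcurrentHashMap<Integer, Set<Integer>>> levels =
      new ConcurrentHashMap<>();

  // only the entry for this node is created, not entries for all ordinals <= node
  void addNode(int level, int node) {
    levels.computeIfAbsent(level, l -> new ConcurrentHashMap<>())
          .computeIfAbsent(node, n -> ConcurrentHashMap.newKeySet());
  }

  Set<Integer> neighborsOf(int level, int node) {
    return levels.get(level).get(node);
  }
}
```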

ConcurrentHnswGraphBuilder changes more, because the existing design puts all the logic in the Builder, which calls primitive operations on the Graph and the Neighbors as necessary. Unfortunately you can’t build threadsafe concurrent structures that way without massive synchronized blocks and a ton of contention, so I’ve moved a lot of the logic around adding neighbors into ConcurrentNeighborSet.
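
To illustrate the kind of encapsulation this means (a sketch under assumed names, not the real ConcurrentNeighborSet): the builder only proposes candidates, and the set itself decides atomically whether to keep them and what to evict. The real patch is finer-grained; this uses a coarse lock for clarity.

```java
import java.util.TreeMap;

// Hypothetical sketch: the neighbor set owns its own insertion/pruning logic instead of
// the builder mutating raw arrays.
class NeighborSetSketch {
  private final int maxConnections;
  private final TreeMap<Float, Integer> neighborsByScore = new TreeMap<>(); // lowest score first

  NeighborSetSketch(int maxConnections) {
    this.maxConnections = maxConnections;
  }

  // the builder calls this with a candidate; acceptance and eviction happen atomically here
  synchronized void insert(int node, float score) {
    neighborsByScore.put(score, node); // ties on score overwrite; fine for a sketch
    if (neighborsByScore.size() > maxConnections) {
      neighborsByScore.pollFirstEntry(); // drop the weakest neighbor
    }
  }
}
```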

The key to achieving good concurrency without synchronization is that we track in-progress node additions in a ConcurrentSkipListSet. After a thread adds its own node to that set, it takes a snapshot and considers all other in-progress additions as neighbor candidates (subject to the normal level constraints).
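
A minimal sketch of that pattern (illustrative names; the real builder interleaves this with the HNSW neighbor search):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch: register the node being inserted, snapshot the other in-progress
// insertions, and treat them as extra neighbor candidates so concurrent additions can
// still find each other even though neither is fully linked into the graph yet.
class InProgressTrackingSketch {
  private final ConcurrentSkipListSet<Integer> inProgress = new ConcurrentSkipListSet<>();

  void addGraphNode(int node) {
    inProgress.add(node);
    try {
      List<Integer> concurrentCandidates = new ArrayList<>();
      for (int other : inProgress) { // weakly consistent snapshot
        if (other != node) {
          concurrentCandidates.add(other);
        }
      }
      // ... run the normal HNSW neighbor search, then also consider concurrentCandidates
      // (subject to the usual per-level constraints) ...
    } finally {
      inProgress.remove(node);
    }
  }
}
```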

jbellis added 4 commits April 27, 2023 11:42
Switch OnHeapHnswGraph from representing a single node's neighbors
as a TreeMap to a HashMap, and update its callers to no longer assume that
NodeIterator results are ordered by ordinal.
@jbellis commented Apr 28, 2023

The first four commits are from this pull request: #12248

@jbellis commented Apr 28, 2023

TODO: get RAM usage estimation working

jbellis marked this pull request as draft April 28, 2023 15:06
@jbellis commented Apr 28, 2023

TODO: update test suite to test both Concurrent and normal classes.

@msokolov (Contributor) commented:

Very exciting!

After a very quick glance through the code, a few high-level comments/questions:

To use this when building indexes in Lucene, I think we'd need a way to control the threads used; I don't think parallel streams would allow for that. Have you considered accepting an Executor and using that to run the updates?

TBH I am not an expert on Lucene's IndexWriter threading model, but here is my stab at some background info: we need to keep threads free so that indexing can go on concurrently with the flush/merge that would invoke this. ConcurrentMergeScheduler, for one, lets callers specify the number of threads used for merging. Typically these are allocated per merged segment, but in theory it could be freed to allow a single merge to use multiple threads. As for flush (initial creation of a new segment), I believe this is always single-threaded today. Perhaps we could use concurrency there too somehow, but it's a secondary target I think.

I don't see the need for these classes to conform to the existing HnswGraph API. If we decide we want to use this implementation when writing the index, we can switch LucenexxHnswWriter to do so. Given that, could we drop the use of ThreadLocal?

I wonder what the single-threaded performance of these new classes is as compared to the existing impl. Have you tested that?

@jbellis commented May 1, 2023

I've made a couple optimizations and merged to main.

With the optimizations it's still about 38% slower than the non-concurrent code (including the HashMap optimization for the non-concurrent code that went in). Nothing stands out in the profiler as a problem anymore; I guess that's what you get for replacing HashMap with ConcurrentHashMap, and especially for replacing ArrayList with ConcurrentHashMap in L0.

I still don't know how to fix the RAM usage estimates; any suggestions?

@jbellis commented May 1, 2023

Replaced the parallel stream with an ExecutorService. Only ThreadPoolExecutor is supported, since it knows how many threads it has available. After a quick look, that is the only type I see Lucene creating.

I was pleasantly surprised that the TPE is no slower than parallel streams in my testing.
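
For reference, a hedged sketch of what driving the build with a ThreadPoolExecutor could look like (addGraphNode and the pool setup here are illustrative, not the PR's exact API); ThreadPoolExecutor is convenient because getMaximumPoolSize() tells the builder how many workers it can assume.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative only: submit one insertion task per vector and wait for all of them.
class ParallelBuildSketch {
  static void buildAll(ThreadPoolExecutor pool, int nodeCount) throws Exception {
    int workers = pool.getMaximumPoolSize(); // lets the builder size internal structures
    List<Future<?>> futures = new ArrayList<>();
    for (int node = 0; node < nodeCount; node++) {
      final int n = node;
      futures.add(pool.submit(() -> addGraphNode(n)));
    }
    for (Future<?> f : futures) {
      f.get(); // propagate any failure from a worker thread
    }
  }

  static void addGraphNode(int node) {
    // placeholder for the concurrent insertion described in this PR
  }

  public static void main(String[] args) throws Exception {
    ThreadPoolExecutor pool =
        (ThreadPoolExecutor) Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
      buildAll(pool, 10_000);
    } finally {
      pool.shutdown();
    }
  }
}
```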

@tang-hi (Contributor) commented May 1, 2023

> I've made a couple optimizations and merged to main.
>
> With the optimizations it's still about 38% slower than the non-concurrent code (including the HashMap optimization for the non-concurrent code that went in). Nothing stands out in the profiler as a problem anymore; I guess that's what you get for replacing HashMap with ConcurrentHashMap, and especially for replacing ArrayList with ConcurrentHashMap in L0.
>
> I still don't know how to fix the RAM usage estimates; any suggestions?

I believe the issue is related to the Java Platform Module System, which was introduced in Java 9. You can modify the following section in the gradlew script to avoid the exception:

lucene/gradlew, line 221 (at 3c16374):

    eval set -- $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"

Modified version:

    JVM_ARGS=" -Ptests.jvmargs=\"--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED\""

    # Collect all arguments for the java command, following the shell quoting and substitution rules
    eval set -- $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS" "$JVM_ARGS"

After making this change, the exception should no longer be thrown. However, the test may still fail due to an AssertionError, as shown below:

  java.lang.AssertionError: expected:<1185336.0> but was:<486483.0>
   >         at __randomizedtesting.SeedInfo.seed([A41F62241729FCF:CF0592627AF94878]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.failNotEquals(Assert.java:835)
   >         at org.junit.Assert.assertEquals(Assert.java:555)
   >         at org.junit.Assert.assertEquals(Assert.java:685)
   >         at org.apache.lucene.util.hnsw.HnswGraphTestCase.testRamUsageEstimate(HnswGraphTestCase.java:789)
   >         at org.apache.lucene.util.hnsw.TestHnswFloatVectorGraph.testRamUsageEstimate(TestHnswFloatVectorGraph.java:37)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccesso

@jbellis commented May 1, 2023

Thanks! For me it is working:

$ ./gradlew -p lucene/core test --tests "*Hnsw*"
...
:lucene:core:test (SUCCESS): 75 test(s)

Note that these args do not seem to be picked up by IntelliJ when running tests; is there a better place to add them?

jbellis added 23 commits May 13, 2023 17:01
… ability to grow itself

extract JavaUtilBitSet -> GrowableBitSet to handle this use case

this makes concurrent build about 8% faster (~162s to ~150s on the Texmex benchmark)
…pListSet

CSLS handles concurrent inserts better, but reads are much more expensive:
- Iterate references instead of an array
- Unbox from Long
- Decode to int

Even in a construction-only workload, there are so many more reads than updates that it's better to optimize for reads.
…ing at the end, so we can just check size instead of the performance hit of filling unused slots with -1
…lace the array reference while other threads updated the old one. Fix using StampedLock
@jbellis commented Jul 6, 2023

Closing in favor of #12421.

jbellis closed this Jul 6, 2023