add ConcurrentOnHeapHnswGraph and Builder #12254
Conversation
Switch OnHeapHnswGraph from representing a single node's neighbors as a TreeMap to a HashMap, and update its callers to no longer assume that NodeIterator results are ordered by ordinal.
…nd Collections.sort instead
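(A hedged illustration of the change described above, not code from this PR: once iteration order is no longer sorted by ordinal, a caller that needs ordering can collect the ordinals and sort them explicitly. PrimitiveIterator.OfInt stands in here for the graph's node iterator.)

import java.util.Arrays;
import java.util.PrimitiveIterator;

final class SortedNodes {
  // Collect node ordinals from an unordered iterator and sort them,
  // since a HashMap-backed graph no longer yields them in ordinal order.
  static int[] sortedOrdinals(PrimitiveIterator.OfInt nodes, int size) {
    int[] ordinals = new int[size];
    for (int i = 0; i < size; i++) {
      ordinals[i] = nodes.nextInt();
    }
    Arrays.sort(ordinals);
    return ordinals;
  }
}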
The first four commits are from this pull request: #12248
TODO: get RAM usage working
TODO: update test suite to test both Concurrent and normal classes.
Tests run against the new Concurrent classes, except where that doesn't make sense.
Very exciting! After a very quick glance through the code, a few high-level comments/questions:

To use this when building indexes in Lucene, I think we'd need a way to control the threads used, so e.g. I don't think the parallel streams would allow for that. Have you considered accepting an Executor and using that to run the updates? TBH I am not an expert on Lucene's IndexWriter threading model, but here is my stab at some background info: we need to maintain free threads that enable indexing to go on concurrently with the flush / merge that would invoke this. ConcurrentMergeScheduler, for one, enables callers to specify the number of threads used for merging. Typically these are allocated per merged segment, but in theory it can be free to allow a single merge to use multiple threads. As for flush (initial creation of a new segment), I believe this is always single-threaded today. Perhaps we could use concurrency there too somehow, but it's a secondary target I think.

I don't see the need for these classes to conform to the existing HnswGraph API. If we decide we want to use this implementation when writing the index, we can switch LucenexxHnswWriter to do so. Given that, could we drop the use of ThreadLocal?

I wonder what the single-threaded performance of these new classes is as compared to the existing impl. Have you tested that?
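(A hedged sketch of the suggestion, with hypothetical names: the builder could accept a caller-supplied ExecutorService and submit one task per node instead of relying on parallel streams, so Lucene controls which threads do the work.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.IntConsumer;

final class ExecutorDrivenBuild {
  // Submit one "add node" task per vector to the supplied executor and wait for all of them.
  static void addAllNodes(ExecutorService executor, int vectorCount, IntConsumer addNode)
      throws Exception {
    List<Future<?>> futures = new ArrayList<>();
    for (int node = 0; node < vectorCount; node++) {
      final int n = node;
      futures.add(executor.submit(() -> addNode.accept(n)));
    }
    for (Future<?> f : futures) {
      f.get(); // propagates any failure and waits for completion
    }
  }
}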
I've made a couple of optimizations and merged to main. With the optimizations it's still about 38% slower than the non-concurrent code (including the HashMap optimization for the non-concurrent code that went in). Nothing stands out in the profiler as a problem anymore; I guess that's what you get replacing HashMap with ConcurrentHashMap, and especially replacing ArrayList with ConcurrentHashMap in L0. I still do not know how to fix the RAM usage estimates; any suggestions?
Replaced the parallel stream with an ExecutorService. Only ThreadPoolExecutor is supported, since it knows how many threads it has available. After a quick look, that is the only type I see Lucene creating. I was pleasantly surprised that the TPE is no slower than parallel streams in my testing.
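(A small illustration of why a ThreadPoolExecutor is workable where a bare Executor is not: it reports its own pool size, which per-thread state can be sized against. The names below are illustrative, not the PR's code.)

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolSizeExample {
  public static void main(String[] args) {
    // Executors.newFixedThreadPool is backed by a ThreadPoolExecutor, so this cast is safe here.
    ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(8);
    int parallelism = pool.getMaximumPoolSize(); // a plain Executor cannot tell us this
    System.out.println("can size per-thread scratch state for " + parallelism + " threads");
    pool.shutdown();
  }
}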
I believe the issue is related to the Java Platform Module System, which was introduced in Java 9. You can modify the following section of the gradlew script (line 221 at commit 3c16374) to avoid the exception.

Modified version:

JVM_ARGS=" -Ptests.jvmargs=\"--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED\""

# Collect all arguments for the java command, following the shell quoting and substitution rules
eval set -- $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS" "$JVM_ARGS"

After making this change, the exception should no longer be thrown. However, the test may still fail due to an AssertionError.
Thanks! For me it is working:
Note that these args do not seem to be picked up by IntelliJ when running tests; is there a better place to add them?
… ability to grow itself; extract JavaUtilBitSet -> GrowableBitSet to handle this use case. This makes concurrent build about 8% faster (~162s on the Texmex benchmark down to ~150s).
…pListSet. CSLS handles concurrent inserts better, but reads are much more expensive:
- iterate references instead of an array
- unbox from Long
- decode to int
Even in a construction-only workload, there are so many more reads than updates that it's better to optimize for reads.
…hMap<Integer, Integer>
…ing at the end, so we can just check size instead of the performance hit of filling unused slots with -1
…h may get added to as we search
…lace the array reference while other threads updated the old one. Fix using StampedLock
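(A hedged sketch of the race the commit above describes, under the assumption that the graph keeps a growable backing array: concurrent writers take the shared read lock, and the resize that swaps the array reference takes the exclusive write lock. GrowableIntArray is a hypothetical name, not the PR's class.)

import java.util.Arrays;
import java.util.concurrent.locks.StampedLock;

final class GrowableIntArray {
  private final StampedLock lock = new StampedLock();
  private volatile int[] data = new int[16];

  // Caller is expected to have called ensureCapacity(index + 1) first.
  void set(int index, int value) {
    long stamp = lock.readLock(); // shared: many writers may update entries concurrently
    try {
      data[index] = value;
    } finally {
      lock.unlockRead(stamp);
    }
  }

  void ensureCapacity(int size) {
    if (size <= data.length) return;
    long stamp = lock.writeLock(); // exclusive: nobody is still writing into the old array
    try {
      if (size > data.length) {
        data = Arrays.copyOf(data, Math.max(size, data.length * 2));
      }
    } finally {
      lock.unlockWrite(stamp);
    }
  }
}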
Closing in favor of #12421.
Motivation
Performance
On my i9-12900 (8 performance cores and 8 efficiency cores) I get a bit less than 10x speedup of wall clock time for building and querying the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/. The small dataset is 10k vectors while the large is 1M. This speedup feels pretty good for a data structure that isn't completely parallelizable, and it's good to see that it's consistent as the dataset gets larger.
The concurrent classes achieve identical recall compared to the non-concurrent versions within my ability to test it, and are drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals to work around the places where the existing API assumes no concurrency.
The timing harness (which includes recall checking) is here: https://github.com/jbellis/hnswdemo/tree/concurrent
Design
ConcurrentOnHeapHnswGraph is basically the same design as OnHeapHnswGraph but with the neighbor lists changed to use ConcurrentHashMap at both level 0 and higher levels. The internal guarantees change a bit along with this, e.g., addNode only initializes a neighbor list for node N when N is added, rather than all nodes <= N.
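(A hedged structural sketch, not the actual class: one map per level, keyed by node ordinal, with the entry for node N created lazily when N is added.)

import java.util.concurrent.ConcurrentHashMap;

final class ConcurrentGraphSketch {
  // level -> (node ordinal -> neighbor ordinals); int[] stands in for the real neighbor sets
  private final ConcurrentHashMap<Integer, ConcurrentHashMap<Integer, int[]>> levels =
      new ConcurrentHashMap<>();

  void addNode(int level, int node) {
    // initializes a neighbor list only for node N, not for all nodes <= N
    levels.computeIfAbsent(level, l -> new ConcurrentHashMap<>())
        .putIfAbsent(node, new int[0]);
  }
}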
ConcurrentHnswGraphBuilder changes more, because it’s designed around “all the logic is in Builder and it calls primitive operations on the Graph and the Neighbors as necessary.” Unfortunately you can’t build threadsafe concurrent structures this way, not without massive synchronized blocks and a ton of contention. So I’ve moved a lot of the logic around adding neighbors into ConcurrentNeighborSet.
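(A hedged sketch of the idea only; the real ConcurrentNeighborSet is more involved. The point is that the insert-plus-pruning logic lives inside the set and is applied with a retry-on-conflict update rather than a builder-level synchronized block.)

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

final class NeighborSetSketch {
  private final int maxConnections;
  private final AtomicReference<int[]> neighbors = new AtomicReference<>(new int[0]);

  NeighborSetSketch(int maxConnections) {
    this.maxConnections = maxConnections;
  }

  void insert(int node) {
    while (true) {
      int[] current = neighbors.get();
      int[] updated = insertAndPrune(current, node);
      if (neighbors.compareAndSet(current, updated)) {
        return; // a lost race simply retries against the newer snapshot
      }
    }
  }

  private int[] insertAndPrune(int[] current, int node) {
    // placeholder pruning: keep at most maxConnections neighbors; the real code
    // would apply the HNSW diversity heuristic instead of simple truncation
    int keep = Math.min(current.length, maxConnections - 1);
    int[] updated = Arrays.copyOf(current, keep + 1);
    updated[keep] = node;
    return updated;
  }
}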
The key to achieving good concurrency without synchronization is that we track in-progress node additions in a ConcurrentSkipListSet. After adding ourselves, we take a snapshot of this set and consider all other in-progress updates as neighbor candidates (subject to normal level constraints).
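(A hedged sketch of that bookkeeping, with illustrative names; the real builder interleaves this with the HNSW search itself.)

import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

final class InProgressTracking {
  private final ConcurrentSkipListSet<Integer> inProgress = new ConcurrentSkipListSet<>();

  void addNode(int node) {
    inProgress.add(node); // register ourselves before searching for neighbors
    try {
      // snapshot: every other concurrent insertion becomes a neighbor candidate,
      // subject to the usual level constraints
      Set<Integer> candidatesInFlight = new TreeSet<>(inProgress);
      candidatesInFlight.remove(node);
      // ... run the normal HNSW search + neighbor selection, also considering candidatesInFlight ...
    } finally {
      inProgress.remove(node); // the node is now fully linked into the graph
    }
  }
}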