-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Draft] Code to reproduce the disconnectedness in HNSW using open source datasets #236
base: master
Are you sure you want to change the base?
Conversation
…arate folder. From now the project should be compiled using ./gradlew build added 2 methods to StringFieldDocSelector to allow compilation with lucene-9.8 and current main branch.
changes for creating knn index independently renamed the python file with .py extension
|
||
import java.io.IOException; | ||
import java.nio.file.Paths; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid *-import
|
||
public class CheckHNSWConnectedness { | ||
private static int getReachableNodes(HnswGraph graph, int level) throws IOException { | ||
Set<Integer> visited = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have this nice IntIntHashMap that stores primitives
|
||
visited.add(node); | ||
// emulate collapsing all the edges | ||
for (int level = 0; level < numLevels; level++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for levels > 0 can we iterate over the nodes in the level rather than all the nodes?
Should we commit what we have here even though it's still draft? |
Hi @mikemccand. I will do it in parts. First part is to refactor the existing code for which I have raised the PR. Its mostly refactoring so we can immediately commit with someones help. If you find time please review it. : #254 CC : @msokolov |
Adding this code for someone to checkout and comment but not merge yet. This change picks up changes from earlier pending PR #234.
Added
CheckHNSWConnectedness
for finding the connectedness of HNSW graph at each level of the graph and also overall disconnected nodes of the graph.Refactored KnnIndexer out of KnnGraphTester to be able to use standalone and created KnnIndexerMain for it.
##Tests
ant vectors300-docs
ant vectors100-docs
./gradlew :src:main:run -PmainClass=knn.KnnIndexerMain --args=" -docvectorspath lucene/benchmarks/data/enwiki-20120502-lines-1k-100d.vec -indexpath lucene/benchmarks/indices/vector_index -maxconn 16 -beamwidth 100 -vectorencoding FLOAT32 -similarityfunction DOT_PRODUCT -numdocs 1000000 -dimension 100"
./gradlew :src:main:checkHnswConnected -Pindex="lucene/benchmarks/indices/vector300_index" -Pknn-field="knn"
The script currently only generated single segment indexes. I will try to create multiple segments and check again. But this proves that the there is disconnectedness in graph even when there is less dynamism in creating and updating documents.
Results
100-Dim 1M-vectors
~ 0.4 % disconnectedness
300-Dim 1M-vectors
~ 1 % disconnectedness