HnwsGraph creates disconnected components #12627
Comments
I was able to run tests on the wiki dataset using the luceneutil package. The results show that even with a single-segment index and no updates, around 1% of nodes get disconnected for about 1M vectors. It would be great if someone else could take a look as well. This may or may not be an issue for different systems, given that this is 'approximate' nearest neighbour search, but in my opinion it is worth exploring more and, if possible, fixing. Next I will try to reproduce with multiple segments and try to find the cause and a fix.
Do you have any results on how connectivity varies with maxconn? I found this article that talks about connectivity when filtering; in any case it shows a suggestive graph that made me think there may be non-linear behavior and we should try to fit maxconn to the dataset better. I also believe that this GloVe data is somewhat pathological - it doesn't have the best numerical properties. Maybe try re-running your test with some of the other wikipedia-based datasets we generated for luceneutil, like the one using the minilm model? Also - can you think of some way to preserve global connectivity while removing links? It seems somewhat intractable to me, but maybe there is some way?
I'm thinking of something like this: first build a fully-connected graph (no link removal, no enforcement of maxconn). Then apply a global pruning algorithm of some sort that attempts to preserve connectivity. For example, you could iterate over all the vectors in the graph, searching for each one using a consistent entry point (say node 0, since it will probably be maximally connected), and "coloring" the visited nodes and edges during the searches. Stop once every node has been visited. Some of the edges will not have been visited, and we could prune those from nodes with large numbers of edges.
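The coloring idea can be sketched on a plain adjacency-list graph. This is an illustrative toy (class and method names are mine, not Lucene's HnswGraph API), and it substitutes a BFS for real HNSW searches: the edges used to discover new nodes form a spanning tree from the entry point, so keeping those "colored" edges guarantees the graph stays connected while uncolored edges are pruned from over-full nodes.

```java
import java.util.*;

// Toy "color and prune" pass: a BFS from node 0 colors discovery edges (a
// spanning tree); pruning then keeps colored edges unconditionally and fills
// each node's list up to maxConn with uncolored ones.
public class ColorPrune {

  static List<Set<Integer>> prune(List<Set<Integer>> graph, int maxConn) {
    int n = graph.size();
    boolean[] visited = new boolean[n];
    Set<Long> colored = new HashSet<>(); // edge (u,v) encoded as u * n + v
    Deque<Integer> queue = new ArrayDeque<>();
    queue.add(0);
    visited[0] = true;
    while (!queue.isEmpty()) {
      int u = queue.poll();
      for (int v : graph.get(u)) {
        if (!visited[v]) {
          visited[v] = true;
          colored.add((long) u * n + v); // color the discovery edge, both ways
          colored.add((long) v * n + u);
          queue.add(v);
        }
      }
    }
    List<Set<Integer>> pruned = new ArrayList<>();
    for (int u = 0; u < n; u++) {
      Set<Integer> kept = new LinkedHashSet<>();
      for (int v : graph.get(u)) {
        if (colored.contains((long) u * n + v) || kept.size() < maxConn) {
          kept.add(v);
        }
      }
      pruned.add(kept);
    }
    return pruned;
  }

  // Count nodes reachable from an entry point; used to verify connectivity.
  static int reachable(List<Set<Integer>> graph, int entry) {
    boolean[] seen = new boolean[graph.size()];
    Deque<Integer> q = new ArrayDeque<>();
    q.add(entry);
    seen[entry] = true;
    int count = 0;
    while (!q.isEmpty()) {
      int u = q.poll();
      count++;
      for (int v : graph.get(u)) {
        if (!seen[v]) { seen[v] = true; q.add(v); }
      }
    }
    return count;
  }
}
```

A real implementation would color the edges visited during actual searches rather than a BFS, and would prune starting from the highest-degree nodes.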
Thanks @msokolov: these are really good suggestions. I will try to incorporate these ideas into solutions. I think in the end there can be multiple ways to allow more connectivity. I was also worried about how the connections are made and pruned. I checked these lines in our algorithm. Here we connect a node to the neighbour list only if it is not closer to any of the current neighbours. Think of a situation where for some nodes (say ...). To check this I added a lot of code to collect the events (like ...). To fix this I checked that the original paper proposes the concept of keepPrunedConnections. I implemented that locally and ran tests. But something is not right; I will keep checking whether I did something wrong in the implementation etc. Will submit a "draft" PR soon for others' comments. I think there is another issue in the current implementation. Look at this code where we finally connect the edge from ... Note that adding the concept of ...
Final connections were: a full 2-way connection 297 <---> 298 and a one-sided connection 297 ---> 293 (left-over connection).
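For reference, the diversity rule and the paper's keepPrunedConnections option can be sketched as follows. This is a toy with made-up names (not Lucene's NeighborArray code): a candidate is kept only if it is closer to the base node than to every already-selected neighbor, and with keepPrunedConnections the pruned candidates back-fill any remaining slots up to max-conn.

```java
import java.util.*;

// Toy version of the paper's SELECT-NEIGHBORS-HEURISTIC with the
// keepPrunedConnections option. Candidates must be sorted by increasing
// distance to the base vector.
public class DiversitySketch {

  // Squared Euclidean distance (monotone, so fine for comparisons).
  static double dist(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  static List<Integer> selectNeighbors(float[][] vectors, float[] base,
                                       List<Integer> candidates, int maxConn,
                                       boolean keepPrunedConnections) {
    List<Integer> selected = new ArrayList<>();
    List<Integer> prunedList = new ArrayList<>();
    for (int c : candidates) {
      if (selected.size() >= maxConn) {
        prunedList.add(c);
        continue;
      }
      boolean diverse = true;
      for (int s : selected) {
        // c is dropped if some already-selected neighbor is closer to it
        // than the base node is.
        if (dist(vectors[c], vectors[s]) < dist(vectors[c], base)) {
          diverse = false;
          break;
        }
      }
      if (diverse) selected.add(c); else prunedList.add(c);
    }
    if (keepPrunedConnections) {
      // Back-fill remaining slots with the closest pruned candidates.
      for (int c : prunedList) {
        if (selected.size() >= maxConn) break;
        selected.add(c);
      }
    }
    return selected;
  }
}
```

With keepPrunedConnections off, a newly inserted node whose candidates cluster together can end up with very few links, which matches the behavior described above.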
This is by design; the resulting graph is not expected to be symmetric (i.e. undirected, bidirectional). It is allowed to have a->b with no corresponding b->a.
Hi @msokolov! Thanks for clarifying. But I think it can help to remove the 'less important' edge from both sides, since that frees up a degree of the "other" node to accept a new connection without going through the diversity check. Most of the time, the problem I saw with the diversity check is that the most recently added connection to a node itself turns out to be non-diverse and gets removed immediately. Because of this, the new incoming node does not even get enough connections to begin with in the graph. Even in the case where I removed the diversity check while connecting the incoming node (a) to its neighbours (say b, c) and allowed the full max-conn number of connections, most of the connections from the neighbours back to the node (b->a) are removed by the diversity check, leaving mostly half connections (a->b). Now when a new node, say ...
I was able to conduct some perf tests as well. I would like to propose the following changes.
Proposed heuristic
Performance number:
Since we are now increasing the number of connections per node, we also see a good increase in recall for the same max-conn. For the same reason we do expect avgCpuTime to increase at query time. I can work a bit on improving the indexing time, as my current implementation is very naive: it finds common nodes with two for loops, and in some other places it recalculates scores where not needed. But overall the indexing time will increase for sure. Another data point I will post next is a histogram of the number of edges each node has; expect it to increase from a lower number towards max-conn on average. I will upload my (rough) draft PR now.

How I reached this heuristic: before settling on this heuristic I tried quite a few simpler ones, but in those cases I was either getting much higher disconnections than mainline, or very few (even just one), but some disconnection was still happening. I am open to removing or adding any other condition considered for edge removal. Initially I tried just (1) and left the rest of the algorithm the same as mainline. This increased the disconnections by a large factor instead of decreasing them. The issue was that if, say, nodes 0 to 32 are added in a star-like formation and the 33rd node comes in, it will form connections with nodes 1-32, and each of them may find 0 to be the non-diverse node, so each one removes 0, leaving it disconnected from the graph. Other than this extreme case there were others too.
@nitirajrathore I know https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc has various heuristics; have you explored those to see how they can be modified or added here?
@benwtrent : I have added draft PR. The code is not at all optimized right now for performance and I am hoping to fix some obvious stuff and will post perf results here.
No, I haven't checked those. Thanks for pointing them out. I will check and report back.
@nitirajrathore @msokolov I had an idea around this, and it will cost an extra 4 bytes per node on each layer it's a member of (maybe we only need this on the bottom layer...). What if we added an "incoming connection" count for every node? Then when disconnecting non-diverse neighbors, we enforce that we can never disconnect the last connection to a node.
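A minimal sketch of that counter (names are hypothetical, not Lucene API): track each node's in-degree as edges are added, and refuse to prune an edge that is the target's last incoming link.

```java
// Toy guard for the "incoming connection count" idea: inDegree tracks how many
// edges point at each node, and tryPrune refuses to sever the last one.
public class LastLinkGuard {
  final int[] inDegree;

  LastLinkGuard(int numNodes) {
    inDegree = new int[numNodes];
  }

  void addEdge(int from, int to) {
    inDegree[to]++;
  }

  // Returns true if the edge from -> to may be removed as non-diverse.
  boolean tryPrune(int from, int to) {
    if (inDegree[to] <= 1) {
      return false; // never sever the last incoming link
    }
    inDegree[to]--;
    return true;
  }
}
```

As noted below, this prevents fully isolated nodes but not disconnected multi-node clusters, since every node inside a cluster can still have incoming links from within the cluster.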
Is the problem primarily to do with single isolated nodes, or do we also see disconnected subgraphs containing multiple nodes? I think this idea would prevent the isolated nodes, but not fix the other case.
@msokolov good point. It seems to me we would only fully disconnect a sub-graph if it is very clustered. Is there a way to detect this in the diversity selection? One other thing I just found that confuses me: #12235. This slightly changed the diversity connection in a way that attempts to improve performance. The key issue is here: if we have "checked the indices", we always just remove the furthest one, which doesn't seem correct to me. We should check for diversity starting at the furthest one, not always remove it? @nitirajrathore for your connection checker, are you testing on Lucene > 9.7.0?
My memory of the way this diversity criterion has evolved is kind of hazy, but I believe in the very first implementation we would not impose any diversity check until the neighbor array was full? It seems as if that would have tended to preserve more links, but then we (I think?) decided that wasn't the approach advocated by the academic papers we were emulating, and applied the diversity check to every added neighbor. Now we have this checked/unchecked distinction. I seem to remember that the checked ones have had a diversity check applied -- so we know that as a group they are diverse (none of them is closer to any of the others than it is to the target node). So this would seem to support the idea that we can add unchecked nodes and then only check when the list gets full?? So, what are we doing?! Sorry, I've lost track.
@nitirajrathore could you add something to KnnGraphTester that tests for connectedness? If you can't get to it, I can try. It would be good to have it alongside all our other performance testing as another metric to record and optimize for.
I will write a small piece of code to check the type of disconnectedness: clustered vs. single.
The paper proposes 3 different heuristics.
Our current implementation is like (1) with some optimizations for performance. My proposed approach is similar to (2), with stricter conditions: it checks diversity only for nodes which are surely connected via their neighbours. I think there is a huge difference in the number of connections between heuristics (1) and (2), and by inference between the current implementation and my proposed one, so I will post numbers around that. But I don't think that is necessarily bad: if the average number of connections is increasing, it is also increasing the recall against ExactKNN, meaning we can get the same recall with a smaller max-conn value. I will also try to get some numbers for the same recall with the current implementation at lower max-conn and then check the search latency. This may also reduce the indexing time; let's see by how much.
I have separate scripts for those; I will post the links here and will try to get the code into luceneutil, although compilation is an issue in that package (which we are trying to fix separately). We can think of adding that as an optional check in KnnGraphTester as well.
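The clustered-vs-single check amounts to sizing connected components. A small standalone sketch (treating the neighbour lists as an undirected graph; not the actual script): a lone isolated node shows up as a size-1 component, while a disconnected cluster shows up larger.

```java
import java.util.*;

// Toy component sizer: iterative DFS over an undirected adjacency-list graph,
// returning the size of each connected component in discovery order.
public class ComponentSizes {

  static List<Integer> componentSizes(List<List<Integer>> graph) {
    int n = graph.size();
    boolean[] seen = new boolean[n];
    List<Integer> sizes = new ArrayList<>();
    for (int start = 0; start < n; start++) {
      if (seen[start]) continue;
      int size = 0;
      Deque<Integer> stack = new ArrayDeque<>();
      stack.push(start);
      seen[start] = true;
      while (!stack.isEmpty()) {
        int u = stack.pop();
        size++;
        for (int v : graph.get(u)) {
          if (!seen[v]) {
            seen[v] = true;
            stack.push(v);
          }
        }
      }
      sizes.add(size);
    }
    return sizes;
  }
}
```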
I ran some tests with max-conn = 16 and max-conn = 8, and it seems that with my proposal even max-conn = 8 is better compared to max-conn = 16 on mainline. I will also add more stats. I diverged from ...
candidate with max conn = 16
candidate with max conn = 8
Interesting fact: a simple implementation using 2 for loops to find the common neighbours works better than using a HashSet or IntIntHashMap(). I think the major contribution to indexing time comes from the increased number of connections. I will update with more scripts + info/stats and some code improvements next.
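Both variants look roughly like this (a standalone sketch, not the PR's code). For neighbour arrays bounded by a small max-conn, the nested loop avoids the boxing and hashing overhead of a HashSet, which is consistent with the observation that it indexes faster.

```java
// Two ways to count common neighbours between two int[] neighbour arrays.
public class CommonNeighbors {

  // Nested-loop version: O(|a| * |b|) comparisons, but allocation-free.
  static int commonCountLoops(int[] a, int[] b) {
    int count = 0;
    for (int x : a) {
      for (int y : b) {
        if (x == y) count++;
      }
    }
    return count;
  }

  // HashSet version: O(|a| + |b|) expected, but boxes every int and allocates.
  static int commonCountSet(int[] a, int[] b) {
    java.util.Set<Integer> set = new java.util.HashSet<>();
    for (int x : a) set.add(x);
    int count = 0;
    for (int y : b) {
      if (set.contains(y)) count++;
    }
    return count;
  }
}
```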
@nitirajrathore very interesting findings. This makes me wonder if the heuristic should take a middle ground and, instead of keeping all pruned connections, keep half. I am interested in seeing the connectedness between the heuristics.
Yes, this is a promising avenue to explore! One note of caution: we should avoid drawing strong inferences from a single dataset. I'm especially wary of GloVe because I've noticed it seems to have poor numerical properties. We especially should not be testing with random vectors. Ideally we would try several datasets, but if I had to pick one I'd recommend the minilm (384-dim) vectors we computed from wikipedia, or some internal Amazon dataset, or I know Elastic folks have been testing with a Cohere dataset. You can download the minilm data from sftp @home.apache.org (cd /home/sokolov/public_html) if you have an Apache login. You can also regenerate it using infer_vectors.py in luceneutil, but it takes a little while.
I +1 this issue and would like to try the gains in the Amazon codebase. I think with lowered max-conn this can give some latency gains. I also feel there is a use case for indexes with few updates, where disconnected nodes might never be connected again. Looking forward to the results from the latest benchmark run. I'm also wondering if there is any way to make this configurable as an indexing flag, to enable backwards compatibility.
OK, I added mikemccand/luceneutil#253 and am doing some local benchmarking. It seems that the more merges occur, the worse things can get. Sometimes I get good graphs like this:
Other times I get graphs that are pretty abysmal:
The numbers are so bad, I almost think this is a bug in my measurements, but it isn't clear to me where it would be. I am going to validate older versions of Lucene to see if this changes. |
I have done this test back on Lucene 9.4, and we still end up, every once in a while, with a graph where the mean number of connections hovers around ...
OK, I did some more digging, and it seems like my data is garbage, or at least how I am reading it in. I looked at one of these extremely disconnected graphs and found that all the float32 vectors were exactly the same. So, this appears to be a bug in my code. |
Hi @benwtrent, I left Amazon, but I was able to run some tests with an open dataset and also with the Amazon dataset before leaving. I cannot share a whole lot of detail about the Amazon dataset, but some aggregated results are shown. Rest assured, I am picking this issue up on priority now. I was able to run tests trying out different heuristics. Basically, I tried the Lucene ... Anyway, I also did the experiment using the Amazon dataset for the PR with all heuristics: https://github.com/apache/lucene/pull/12783/commits max-conn=16 & num_docs = 50K (unless specified)
@nitirajrathore very interesting results. This sort of indicates to me that no matter the heuristic, we just need a second pass over the graph to ensure connectedness and fix it up :/ Even though the graph is different, the idea is the same here: https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java (see: reconnectOrphanedNodes()). I will see about testing your "new heuristic with remove otherhalf and honour max-conn", as if I am reading this correctly it seems the most promising. But I suspect that even in extreme scenarios we will still need a second pass over the graph to ensure graph connectedness.
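The shape of such a second pass can be sketched like this. It is a toy on adjacency lists, not JVector's actual reconnectOrphanedNodes() (which picks neighbors by distance and respects the connection budget): find nodes unreachable from the entry point and splice each orphaned component back in with a bidirectional edge.

```java
import java.util.*;

// Toy reconnect pass: BFS from the entry point, then link any unreachable node
// back to the entry point and re-run BFS so its whole component becomes reachable.
public class ReconnectPass {

  static void reconnect(List<Set<Integer>> graph, int entryPoint) {
    int n = graph.size();
    boolean[] reachable = bfs(graph, entryPoint);
    for (int u = 0; u < n; u++) {
      if (!reachable[u]) {
        graph.get(entryPoint).add(u); // splice the orphan in, both directions
        graph.get(u).add(entryPoint);
        reachable = bfs(graph, entryPoint); // u's component is now reachable
      }
    }
  }

  static boolean[] bfs(List<Set<Integer>> graph, int start) {
    boolean[] seen = new boolean[graph.size()];
    Deque<Integer> q = new ArrayDeque<>();
    q.add(start);
    seen[start] = true;
    while (!q.isEmpty()) {
      int u = q.poll();
      for (int v : graph.get(u)) {
        if (!seen[v]) { seen[v] = true; q.add(v); }
      }
    }
    return seen;
  }
}
```

A production version would attach each orphaned component to its nearest reachable neighbor rather than the entry point, and relax or respect max-conn as discussed elsewhere in this thread.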
I want to revive this discussion about disconnectedness. I think the two-pass idea is where we would have to go in order to ensure a connected graph, and in order to implement that we need to incorporate a segmentation or labeling step. Before we get to trying to implement that in a "production" indexing setup, though, I would like to incorporate some of these tools into our test framework. This will enable us to provide test assertions that are aware of graph connectedness. Today we occasionally see random failures due to disconnected graphs, and this has led us to have looser test assertions than we might otherwise be able to use. As a first step we can create (or re-use -- did we already create them here?) some tools for counting and sizing connected components, and then also for patching them together. Could we relax maxconn when patching in order to avoid re-disconnecting?
Oh! I see we added some tooling in mikemccand/luceneutil#253 as part of KnnGraphTester. Maybe we can migrate some of this to Lucene's test-framework.
@msokolov I think adding a "reachable" test to Lucene would be nice. The main goal of such a test would be ensuring that every node is eventually reachable on every layer. The tricky part is that I am sure such a test would be noisy until we have a "connectedness check", especially for random vectors. I do like having something in luceneutil too, as I have found the larger statistics useful when debugging why a particular graph or change breaks. So, something in both places would be good.
I was thinking of a baby step: count the number of nodes that are reachable and then use that in assertions like lucene/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java Line 144 in bf193a7
and this one lucene/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java Line 310 in bf193a7
which failed randomly on policeman jenkins the other day with an off-by-one - I couldn't prove it in the time I had but I suspect it was due to disconnectedness |
I struggle to make this work, though, since the changes to make everything more typesafe have also made the interesting bits inaccessible. E.g. I thought of adding something like this:
but it fails with: class org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader cannot be cast to class org.apache.lucene.codecs.HnswGraphProvider (org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader and org.apache.lucene.codecs.HnswGraphProvider are in unnamed module of loader 'app'). Not sure what the approved way to go about this would be.
@msokolov in the HNSW codec, we do something like this already when gathering the underlying graph. I would do something like:
Thanks @benwtrent I opened #13260 |
Is there a clean separation between:
1. disconnectedness that is inherent to the approximate HNSW algorithm itself, and
2. disconnectedness that stems from this particular implementation?
If this is mostly about the first, then it's a known issue with known mitigations. If it's mostly about the second, doesn't that point to some specific defect in the implementation, a defect which could/should be fixed (in contrast, e.g., to two-pass approaches that may address both 1 and 2) ? |
I'm not aware of any Lucene-specific issue here. We see this mostly in unit tests, but it has also been replicated with production data. Although one can speculate this might be more of an issue in Lucene because of its segmented multi-graph structure, I'm not sure, since I don't have all that much experience with HNSW in other settings. I would be interested to hear about the known mitigations you mentioned.
@msokolov I am not sure what particular mitigations @CloudMarc is talking about, but I know of a couple of options outlined here: #12627 (comment). Though I would prefer not having yet another configuration item for HNSW, I think we should do a second "clean up pass" over HNSW (like JVector does for Vamana) to ensure we don't have orphaned nodes or clusters. A second pass over the layers wouldn't be too bad performance-wise (we wouldn't have to do many, if any, vector comparisons in the second pass), and it would mostly consist of adding back a couple of backlinks that were originally removed.
Re "known mitigations" - I simply meant common HNSW/ANN configuration changes like increasing MaxConnections and BeamWidth. Thanks for mentioning precedent for the two-pass approach. This seems useful: https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw#vamana-implementation-details I'm +1 on the two-pass approach, fwiw.
One might hypothesize that this issue is due to adapting HNSW to Lucene's approach to segmentation. This seems undercut by these observations: #12627 (comment). Key question: does single-segment Lucene have the same rate of disconnectedness as a stock or reference HNSW implementation, or more? If it's significantly more, that may indicate an implementation defect.
Hello, nitirajrathore! I came across your discussion regarding the various tests you conducted with the "minilm" dataset and found it particularly interesting. I'm currently engaged in similar research and am keen on exploring this further. Could you please share how I might access the "minilm" dataset, or point me to any resources where it might be available? It seems like it is an open dataset. Any guidance you could provide would be greatly appreciated.
I'd like to take a stab at the "second pass" idea for patching up disconnected graph components. As a first step I think we ought to add state to the ...
Also backported to 9.x.
Description
I work for Amazon Retail Product Search, and we are using Lucene KNN for semantic search of products. We recently noticed that the generated HNSW graphs are not always strongly connected, and in the worst-case scenario some products may be undiscoverable. Connectedness of a hierarchical graph can be complicated, so below I describe my experiment in detail.
Experiment:
I took the Lucene indexes from our production servers and ran the following test for each segment (HNSW graph). At each level I took the same entry point, the entry point of the HNSW graph, and checked how many nodes are reachable from it. Note that connectedness at each level was checked independently of the other levels. Sample code attached. My observations are as below.
- Of the graphs considered, one graph for each "level" of HNSW, almost 18% of the graphs had some disconnectedness.
- There were at most 3 to 4 levels in the HNSW graphs.
- The percentage disconnected varied from small (eg. disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308).
- In one extreme case the entry point (EP) in the level-0 graph was disconnected from the rest of the graph, making the percentage disconnected as high as 99.9% (only 65 reachable nodes from the EP out of 252275). But this does not necessarily mean that 99.9% of the nodes were not discoverable; it just means that if we unluckily end up at the EP in the level-0 graph for a query, at most 65 nodes can be reached. Had we diverted our path from the EP to some other node in the upper-level graphs, more nodes might have been discoverable via that node.
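The per-level check amounts to a reachability count from the level's entry point. A reconstruction sketch (the actual attached sample code is not shown here; this toy works on plain adjacency lists rather than Lucene's HnswGraph):

```java
import java.util.*;

// Toy per-level check: BFS over the directed neighbour lists from the level's
// entry point; anything unreached counts as disconnected at that level.
public class LevelCheck {

  // Returns the percentage of nodes NOT reachable from the entry point.
  static double percentDisconnected(List<List<Integer>> levelGraph, int entryPoint) {
    boolean[] seen = new boolean[levelGraph.size()];
    Deque<Integer> queue = new ArrayDeque<>();
    queue.add(entryPoint);
    seen[entryPoint] = true;
    int reachable = 0;
    while (!queue.isEmpty()) {
      int u = queue.poll();
      reachable++;
      for (int v : levelGraph.get(u)) {
        if (!seen[v]) { seen[v] = true; queue.add(v); }
      }
    }
    return 100.0 * (levelGraph.size() - reachable) / levelGraph.size();
  }
}
```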
Reproducibility
Although it is fairly easy to reproduce with Amazon's internal data, I found it really hard to reproduce with random vectors. I will attach a simple main class for my test and the parameters that I tried. I only got some disconnectedness, as listed in the table below. For higher numbers of vectors the execution becomes painfully slow because of HNSW graph creation. I will continue working to find parameters for reproducibility and also to explore open-source datasets for the same.
DIS-L0 means: percentage of disconnected nodes in the graph at level 0 of the HNSW graph.
No direct solution comes to mind, but as per my discussion in this email thread, I will look into the part where some connections are removed to maintain the max-conn property (diversity check etc.).
Version and environment details
Lucene version : 9.7