Reuse HNSW graphs when merging segments? [LUCENE-10318] #11354
Comments
Jack Mazanec (migrated from JIRA) Hi @jtibshirani I was thinking about something similar and would be interested in working on this. I can run some experiments to see if this would improve performance, if you haven't already started to do so.
Additionally, I am wondering if it would make sense to extend this to support graphs that contain deleted nodes. I can think of an approach, but it is a little messy. It would follow the same idea for merging: add vectors from the smaller graph into the larger graph. However, before adding vectors from the smaller graph, all of the deleted nodes would need to be removed from the larger graph. To remove a node from the graph, I think we would need to remove it from the list of neighbor arrays for each level it is in. In addition, because removal would break the ordinals, we would have to update all of the ordinals in the graph, which for the OnHeapHNSW graph would mean updating all nodes by level and also potentially each neighbor in each NeighborArray in the graph.
Because removing a node could cause a number of nodes in the graph to lose a neighbor, we would need to repair the graph. To do this, I think we could create a repair_list that tracks the nodes that lost a connection due to the deleted node. To fill the list, we would need to iterate over all of the nodes in the graph and check whether any of their m connections point to the deleted node (I think this could be done while the ordinals are being updated). If so, remove the connection and add the node to the repair_list. Once the repair_list is complete, for each node in the list, search the graph for new neighbors to fill the node's connections back up to the desired amount. At this point, I would expect the time it takes to finish merging to be equal to the time it takes to insert the number of live vectors in the smaller graph, plus the size of the repair list, into the large graph.
All that being said, I am not sure whether removing deleted nodes from the graph would be faster than just building the graph from scratch. From the logic above, we would need to at least iterate over each connection in the graph and potentially perform several list deletions. My guess is that it would be faster when the repair list is small, but probably not when it is large. I am going to start playing around with this idea, but please let me know what you think!
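For illustration, here is a minimal, single-level sketch of the delete-and-repair idea described above. The class and method names are hypothetical stand-ins for Lucene's OnHeapHnswGraph / NeighborArray, not the actual implementation, and the neighbor search used to refill damaged nodes is left out.

```java
import java.util.*;

class DeleteRepairSketch {
  // neighbors.get(ord) holds the neighbor ordinals of node ord (level 0 only in this sketch).
  static List<List<Integer>> removeAndRepair(List<List<Integer>> neighbors,
                                             Set<Integer> deleted,
                                             int maxConn) {
    // 1. Compute new ordinals for the surviving nodes (removal breaks the old ordinals).
    int[] newOrd = new int[neighbors.size()];
    int next = 0;
    for (int ord = 0; ord < neighbors.size(); ord++) {
      newOrd[ord] = deleted.contains(ord) ? -1 : next++;
    }

    // 2. Rebuild neighbor lists with remapped ordinals, collecting nodes that lost a
    //    connection to a deleted neighbor into the repair list.
    List<List<Integer>> out = new ArrayList<>();
    List<Integer> repairList = new ArrayList<>();
    for (int ord = 0; ord < neighbors.size(); ord++) {
      if (deleted.contains(ord)) continue;
      List<Integer> rebuilt = new ArrayList<>();
      boolean lostConnection = false;
      for (int nbr : neighbors.get(ord)) {
        if (deleted.contains(nbr)) {
          lostConnection = true;       // drop the connection to the deleted node
        } else {
          rebuilt.add(newOrd[nbr]);    // remap the surviving neighbor's ordinal
        }
      }
      if (lostConnection) repairList.add(newOrd[ord]);
      out.add(rebuilt);
    }

    // 3. Repair: each damaged node would be refilled up to maxConn neighbors via a
    //    normal HNSW search over the repaired graph (omitted here).
    for (int damaged : repairList) {
      // out.get(damaged) refilled by graph search in a real implementation
    }
    return out;
  }
}
```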
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA) Thanks for looking into this, Jack. We have not done any development on this, but some thoughts from us:
So maybe a good start could be to have a very lean prototype with a lot of performance benchmarks.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
Removing nodes and repairing the graph could be a nice direction. But for now we can keep things simple and assume there's a segment without deletes. If that's looking good and shows a nice improvement in index/merge benchmarks, then we can handle deletes in a follow-up. Edit: Oops, I didn't refresh the page so I missed Mayya's comment. It looks like we're in agreement!
Michael Sokolov (@msokolov) (migrated from JIRA) Another idea I played with at one point was to preserve all the graphs from the existing segments (remapping their ordinals) and link them together with additional links. But a lot of links needed to be created in order to get close to the recall of a new "from scratch" graph, and I struggled to get any improvement. At the time I wasn't even concerning myself with deletions.
Thanks @mayya-sharipova @jtibshirani. I started working on this this week. Hopefully I will have some experimental results by the end of this week or early next week. One potential issue I see is that the scores of the neighbors are not available through the KnnVectorReader's graphs at merge time. In the version I'm working on now, I just recompute the scores.
@msokolov That's interesting. I can't think of a way to guarantee that the graphs merge fully; what approach did you take? I guess one method for this might be to semi-insert nodes from the smaller graph into the larger graph (a node retains some of its neighbors but also gets new ones). I think that a node in general should have a proportional number of neighbors from each graph based on the size of each graph. So if 2/3rds of all nodes are in the larger graph, 2/3rds of the neighbors of a node from the smaller graph could be updated to point to nodes in the larger graph. I think this approach would still require looking up new neighbors for all of the nodes in the smaller graph; however, the search space would be smaller and M would also be smaller, so there might be some potential for improvement.
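As a rough sketch of that proportional split (the helper name and the rounding choice are assumptions for illustration, not something from the PR):

```java
class ProportionalNeighborSplit {
  // How many of a node's maxConn neighbors to keep from the smaller (source) graph
  // when semi-inserting it; the remainder would be found by searching the larger graph.
  static int neighborsToKeepFromSmallGraph(int maxConn, int smallGraphSize, int largeGraphSize) {
    double smallFraction = (double) smallGraphSize / (smallGraphSize + largeGraphSize);
    return (int) Math.round(maxConn * smallFraction);
  }
}
```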
We indeed don't store neighbours' scores, as they are not needed for graph search. How are you using neighbours' scores? Perhaps there is a way to calculate them once during merge and reuse them?
Right, I think the scores are required when checking whether an already added node becomes a new node's neighbor, that is, when we check whether the new node should be added as an existing node's neighbor. I was thinking about potentially waiting to recompute the neighbors' distances until they are needed for that check. Also, I found some issues with my original implementation; they should be fixed. I'll post results from experiments soon.
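A minimal sketch of what deferring that score recomputation could look like, assuming a hypothetical VectorScorer that recomputes similarity from the raw vectors; this is not the actual NeighborArray code:

```java
import java.util.Arrays;

class LazyScoredNeighbors {
  interface VectorScorer {
    float score(int ordA, int ordB); // recomputes similarity from the raw vectors
  }

  final int[] nodes;    // neighbor ordinals copied from the initializer graph
  final float[] scores; // NaN = not computed yet

  LazyScoredNeighbors(int[] nodes) {
    this.nodes = nodes;
    this.scores = new float[nodes.length];
    Arrays.fill(this.scores, Float.NaN);
  }

  // Score of neighbor i relative to the owning node, computed only on first use,
  // e.g. when the neighbor/diversity check during insertion actually needs it.
  float score(int i, int ownerOrd, VectorScorer scorer) {
    if (Float.isNaN(scores[i])) {
      scores[i] = scorer.score(ownerOrd, nodes[i]);
    }
    return scores[i];
  }
}
```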
I just finished the initial set of experiments. Seems like there still may be some issues with the implementation.
Setup
For the data set, I used the sift data set from ann-benchmarks. I created a small script to put the data into the correct format: hdf5-dump.py. I ran all tests on a c5.4xlarge instance. To test merge, I uncommented these two lines and varied maxBufferedDocs as 10K, 100K and 500K. I used the following command after building the repo:
Then I grepped for the merge metrics:
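The actual KnnGraphTester command and grep invocation are not reproduced here. As an illustration of the knob being varied, the sketch below shows roughly equivalent plain-Lucene settings: flush a new segment every maxBufferedDocs documents, then time the force-merge. It assumes a Lucene 9.x-era API (KnnVectorField; later versions rename it KnnFloatVectorField), and loadSiftVectors() is a placeholder for reading the vectors dumped by hdf5-dump.py.

```java
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class MergeBenchSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig();
    cfg.setMaxBufferedDocs(10_000);                               // vary: 10K, 100K, 500K
    cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH); // flush only by doc count
    try (FSDirectory dir = FSDirectory.open(Paths.get("knn-index"));
         IndexWriter writer = new IndexWriter(dir, cfg)) {
      for (float[] vector : loadSiftVectors()) {                  // 1M SIFT vectors, 128 dims
        Document doc = new Document();
        doc.add(new KnnVectorField("knn", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      writer.forceMerge(1);                                       // the merge being timed
    }
  }

  static Iterable<float[]> loadSiftVectors() {
    // Placeholder: load the raw vectors produced by hdf5-dump.py from sift-128-euclidean.
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}
```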
I also captured recall on 100 queries to see how search was influenced. I ran the 3 sets of experiments, 3 times each.
Results
10K
100K
500K
Conclusions
From the experiments above, it seems that initializing from a graph during merge works well when few segments are being merged, but adds a cost when a lot of segments are being merged. I need to investigate why this might be happening. Additionally, my implementation appears to reduce recall slightly compared to the control. I'm going to see if I can figure out why this is happening.
Hi @jmazanec15, I had a quick question. Currently, how are segment merges happening in Lucene for the HNSW graph? Is the graph being reconstructed from scratch?
Update: Sorry for the delay, I am still working on this but got a little sidetracked with other work. Hi @harishankar-gopalan, yes, what currently happens is that the graph gets reconstructed from scratch. In #11719, I am working on selecting the largest graph from a segment and using it to initialize the newly created segment's graph. Posted above are my initial benchmark results. However, I am running into some issues where the recall is slightly lower with the test setup and the merge time is higher. I have been debugging why this is happening but have not yet made progress. I am going to take another try at it this week or next week.
Hi @jmazanec15 thanks for the update. Are there any public stats available for the current segment merges for HNSW-based graph indexes in Lucene? To be more clear, are there any performance benchmarks comparing Lucene segment merges for documents with and without KnnVectorFields indexed as an HNSW graph? If you are aware of any initial benchmarks that you are using as a reference, I would be grateful if you could share links to them.
@harishankar-gopalan I am not sure. The benchmarks I ran above compare merge with and without my draft PR changes; all runs contain a KNN vector field.
Hi @mayya-sharipova @jtibshirani @msokolov I figured out the recall issue in the previous tests: I was not using a copy of the vectors when recomputing the distances. I fixed that and re-ran the benchmarks, and it looks like the recall values are fixed:
Results
10K
100K
500K
That being said, I think initialization from a graph has benefits when a larger segment is being merged with other segments. For instance, on the 1M data set, when the segment size is 500K, merge time looks a lot better; however, when the segment size is 10K, merge time differences between the test and the control are not noticeable. I am wondering what you think might be good next steps. I was thinking that I could either take the PR out of draft state for review or focus on running more experiments on different data sets. Before doing either, I wanted to see what you think of the results so far.
Results for new PR #12050:
10K
100K
500K
Hi Jack, thanks for persisting and returning to this. I haven't had a chance to review the PR yet; just looking at the results here, I have a few questions. First, it looks to me as if we see some very nice improvement for the larger graphs, preserve the same recall, and the changes to QPS are probably noise. I guess the assumption is we are producing similar results with less work? Just so we can understand these results a little better, could you document how you arrived at them? What dataset did you use? How did you measure the times and recall (was it using KnnGraphTester? luceneutil? some other benchmarking tool?)? I'd also be curious to see the numbers and sizes of the segments in the results: I assume they would be unchanged from Control to Test, but it would be nice to be able to verify. Thanks again!
Hi @msokolov,
Right, basically instead of adding the first 0-X ordinals to the graph, we manually insert the nodes and their neighbors from the initializer graph into the merge graph, avoiding the search-for-neighbors step. I think QPS is mostly noise. Recall is roughly the same; it is not always exactly the same because in the PR the random number generation gets a bit mixed up.
Sure, I used the same procedure for the latest results as outlined here: #11354 (comment). I used the sift 1M 128 dimensional L2 data set. This was using KnnGraphTester, controlling the number of initial segments and then forcemerging to 1 segment.
I would assume so too. Let me get these numbers as well - will post soon.
I see. Given that you're force-merging, the latency variability seems quite high - over 10% I think. Do you see the same variability testing without re-indexing?
That's a good idea. Let me retry without reindexing; it should be faster as well.
@msokolov here are the results re-using a single index for each experiment. Overall there is still some variability, but it seems like there is less. For the 10K results, it appears that the control performed better; however, the recall is slightly worse. Also included are the sizes of the files after merge.
10K
100K
500K
Yeah, thanks, that seems to have reduced the noise some. Probably what remains is down to GC, system hiccups, etc.; it's inevitable to see some variance.
Currently when merging segments, the HNSW vectors format rebuilds the entire graph from scratch. In general, building these graphs is very expensive, and it'd be nice to optimize it in any way we can. I was wondering if during merge, we could choose the largest segment with no deletes, and load its HNSW graph into heap. Then we'd add vectors from the other segments to this graph, through the normal build process. This could cut down on the number of operations we need to perform when building the graph.
This is just an early idea; I haven't run experiments to see if it would help. I'd guess that whether it helps would also depend on details of the MergePolicy.
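A simplified, single-level sketch of that proposal, with plain adjacency lists standing in for Lucene's on-heap graph and the ordinal remapping made explicit (this is not the actual HnswGraphBuilder code):

```java
import java.util.*;

class MergeInitSketch {
  // Seed the merged graph with the largest no-deletes segment's graph, remapping its
  // ordinals into merged-index ordinals; only the remaining vectors go through the
  // normal (expensive) HNSW insert path.
  static List<List<Integer>> initializeMergedGraph(List<List<Integer>> largestGraph,
                                                   int[] oldToNewOrd, // largest-segment ord -> merged ord
                                                   int totalMergedVectors) {
    // 1. Pre-size the merged graph's adjacency lists.
    List<List<Integer>> merged = new ArrayList<>(totalMergedVectors);
    for (int i = 0; i < totalMergedVectors; i++) {
      merged.add(new ArrayList<>());
    }

    // 2. Copy nodes and neighbor lists from the largest graph with remapped ordinals;
    //    no neighbor search is needed for these nodes.
    for (int oldOrd = 0; oldOrd < largestGraph.size(); oldOrd++) {
      List<Integer> copied = merged.get(oldToNewOrd[oldOrd]);
      for (int oldNbr : largestGraph.get(oldOrd)) {
        copied.add(oldToNewOrd[oldNbr]);
      }
    }

    // 3. Vectors from the other segments would then be inserted through the normal
    //    HNSW build path (search for neighbors, diversity check) -- omitted here.
    return merged;
  }
}
```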
Migrated from LUCENE-10318 by Julie Tibshirani (@jtibshirani), 2 votes, updated Aug 19 2022