
Reuse HNSW graphs when merging segments? [LUCENE-10318] #11354

Closed
asfimport opened this issue Dec 16, 2021 · 20 comments · Fixed by #12050

Comments

@asfimport

Currently when merging segments, the HNSW vectors format rebuilds the entire graph from scratch. In general, building these graphs is very expensive, and it'd be nice to optimize it in any way we can. I was wondering if during merge, we could choose the largest segment with no deletes, and load its HNSW graph into heap. Then we'd add vectors from the other segments to this graph, through the normal build process. This could cut down on the number of operations we need to perform when building the graph.
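As a rough illustration of the proposal (a hedged Python sketch; the segment and graph representations are stand-ins, not Lucene's actual classes):

```python
# Illustrative sketch: reuse the graph of the largest delete-free segment
# as the starting point for the merged graph, then add the rest normally.

def merge_hnsw(segments, build_from_scratch, add_vector):
    """segments: dicts with 'graph', 'vectors', 'has_deletes' (stand-ins).
    build_from_scratch / add_vector stand in for the normal HNSW build path."""
    candidates = [s for s in segments if not s["has_deletes"]]
    if not candidates:
        # No reusable graph: fall back to building from scratch.
        return build_from_scratch([v for s in segments for v in s["vectors"]])
    base = max(candidates, key=lambda s: len(s["vectors"]))
    graph = list(base["graph"])  # load the chosen graph onto the heap
    for seg in segments:
        if seg is base:
            continue
        for v in seg["vectors"]:
            add_vector(graph, v)  # normal HNSW insert for the other vectors
    return graph
```

The potential saving comes from skipping the insert step for every vector already in the base graph, so the benefit would scale with how large that segment is relative to the whole merge.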

This is just an early idea, I haven't run experiments to see if it would help. I'd guess that whether it helps would also depend on details of the MergePolicy.


Migrated from LUCENE-10318 by Julie Tibshirani (@jtibshirani), 2 votes, updated Aug 19 2022

@asfimport

Jack Mazanec (migrated from JIRA)

Hi @jtibshirani  I was thinking about something similar and would be interested in working on this. I can run some experiments to see if this would improve performance, if you haven’t already started to do so.

Additionally, I am wondering if it would make sense to extend this to support graphs that contain deleted nodes. I can think of an approach, but it is a little messy. It would follow the same idea for merging — add vectors from smaller graph into larger graph. However, before adding vectors from smaller graph, all of the deleted nodes would need to be removed from the larger graph.

In order to remove a node from the graph, I think we would need to remove it from list of neighbor arrays for each level it is in. In addition to this, because removal would break the ordinals, we would have to update all of the ordinals in the graph, which for OnHeapHNSW graph would mean updating all nodes by levels and also potentially each neighbor in each NeighborArray in the graph.

Because removing a node could cause a number of nodes in the graph to lose a neighbor, we would need to repair the graph. To do this, I think we could create a repair_list that tracks the nodes that lost a connection due to the deleted node. To fill the list, we would need to iterate over all of the nodes in the graph and check whether any of their m connections point to the deleted node (I think this could be done while the ordinals are being updated). If so, remove the connection and add the node to the repair_list.

Once the repair_list is complete, for each node in the list, search the graph to get new neighbors to fill up the node’s connections to the desired amount. At this point, I would expect the time it takes to finish merging to be equal to the time it takes to insert the number of live vectors in the smaller graph plus the size of the repair list into the large graph.

All that being said, I am not sure if removing deleted nodes in the graph would be faster than just building the graph from scratch. From the logic above, we would need to at least iterate over each connection in the graph and potentially perform several list deletions. My guess is that when the repair list is small, I think it would be faster, but when it is large, probably not. I am going to try to start playing around with this idea, but please let me know what you think!
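A minimal sketch of the delete-and-repair idea described above, assuming a single-level graph stored as an adjacency dict, with `find_neighbors` standing in for a real graph search (all names here are illustrative, not Lucene's):

```python
def remove_node(graph, deleted, find_neighbors, m):
    """graph: dict ordinal -> list of neighbor ordinals (one level).
    find_neighbors(graph, node, m): stand-in for an HNSW search."""
    graph.pop(deleted)
    # Removal breaks the ordinals: everything above the deleted one shifts down.
    remap = lambda o: o - 1 if o > deleted else o
    repaired, repair_list = {}, []
    for node, neighbors in graph.items():
        if deleted in neighbors:
            # This node lost a connection and will need repairing.
            neighbors = [n for n in neighbors if n != deleted]
            repair_list.append(remap(node))
        repaired[remap(node)] = [remap(n) for n in neighbors]
    # Search the graph again to top up each damaged node's connections.
    for node in repair_list:
        have = set(repaired[node])
        for cand in find_neighbors(repaired, node, m):
            if len(have) >= m:
                break
            if cand != node and cand not in have:
                have.add(cand)
        repaired[node] = sorted(have)
    return repaired
```

A real implementation would do this per level and apply the usual diversity criteria when picking replacement neighbors; this only shows the bookkeeping (ordinal remapping plus the repair_list pass).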

@asfimport

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Thanks for looking into this, Jack. 

We have not done any development on this, but some thoughts from us:

  • It looks like the way MergePolicy works, it chooses segments of approximately the same size. So during a merge we may not have one single big segment whose graph we can reuse. I would imagine that for many use cases reusing graphs may not be worth it (especially if segments are relatively small): the extra complexity would not justify a very small speedup.
  • I agree with your thoughts on deletions: it may also not be worth reusing graphs when heavy deletions are present.

So maybe a good start could be a very lean prototype with a lot of performance benchmarks.

@asfimport

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

@jmazanec15 it's great you're interested in looking into this! I don't have any prototype or experiments, you're welcome to pick it up.

Removing nodes and repairing the graph could be a nice direction. But for now we can keep things simple and assume there's a segment without deletes. If that's looking good and shows a nice improvement in index/merge benchmarks, then we can handle deletes in a follow-up.

Edit: Oops, I didn't refresh the page so I missed Mayya's comment. It looks like we're in agreement!

@asfimport

Michael Sokolov (@msokolov) (migrated from JIRA)

Another idea I played with at one point was to preserve all the graphs from the existing segments (remapping their ordinals) and linking them together with additional links. But a lot of links needed to be created in order to get close to the recall for a new "from scratch" graph, and I struggled to get any improvement. At the time I wasn't even concerning myself about deletions.

@jmazanec15

Removing nodes and repairing the graph could be a nice direction. But for now we can keep things simple and assume there's a segment without deletes. If that's looking good and shows a nice improvement in index/merge benchmarks, then we can handle deletes in a follow-up.

Thanks @mayya-sharipova @jtibshirani. I started working on this this week and hopefully will have some experimental results by the end of this week or early next week. One potential issue I see is that the scores of the neighbors are not available through the KnnVectorsReader graphs at merge time. In the version I'm working on now, I just recompute the scores for the graph that will be retained in the merge process. However, this adds some overhead. Short of serializing the neighbor scores with the graph, do you have any other ideas for avoiding score recomputation?

Another idea I played with at one point was to preserve all the graphs from the existing segments (remapping their ordinals) and linking them together with additional links. But a lot of links needed to be created in order to get close to the recall for a new "from scratch" graph, and I struggled to get any improvement. At the time I wasn't even concerning myself about deletions.

@msokolov That's interesting. I can't think of a way to guarantee that the graphs merge fully; what approach did you take? I guess one method might be to semi-insert nodes from the smaller graph into the larger graph (each node retains some of its neighbors but also gets new ones). In general, I think a node should have a number of neighbors from each graph proportional to the graphs' sizes: if 2/3rds of all nodes are in the larger graph, 2/3rds of the neighbors of a node from the smaller graph could be updated to point at nodes in the larger graph. This approach would still require looking up new neighbors for all of the nodes in the smaller graph, but the search space would be smaller and M would also be smaller, so there might be some potential for improvement.

@mayya-sharipova

mayya-sharipova commented Aug 30, 2022

One potential issue I see is that the scores of the neighbors are not available through the KnnVectorReaders graphs at merge time. ... Short of serializing the neighbor scores with the graph, do you have any other ideas for avoiding score re-computation?

We indeed don't store neighbours' scores, as they are not needed for graph search. How are you using the neighbours' scores? Perhaps there is a way to calculate them once during the merge and reuse them?

@jmazanec15

Right, I think the scores are required when checking whether an already-added node becomes a new node's neighbor, and also when checking whether the new node should be added as an existing node's neighbor.

I was thinking about potentially waiting to recompute the neighbors' distances until they are needed in the line above.
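The lazy recomputation could look something like this sketch (hypothetical names, not Lucene's NeighborArray API): scores are computed on first access and cached, so vectors whose scores are never needed during the merge are never compared.

```python
import math

class LazyNeighbors:
    """Hypothetical sketch: neighbor scores recomputed only when needed."""

    def __init__(self, node_vec, neighbor_ids, vectors):
        self.node_vec = node_vec
        self.ids = neighbor_ids
        self.vectors = vectors   # ordinal -> vector
        self._scores = {}        # ordinal -> cached score

    def score(self, nid):
        if nid not in self._scores:
            # Negative L2 distance, so a larger score means closer.
            d = sum((a - b) ** 2
                    for a, b in zip(self.node_vec, self.vectors[nid]))
            self._scores[nid] = -math.sqrt(d)
        return self._scores[nid]
```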

Also, I found some issues with my original implementation; they should be fixed now. I'll post results from experiments soon.

@jmazanec15

I just finished the initial set of experiments. Seems like there still may be some issues with the implementation.

Setup

For the data set, I used the sift data set from ann-benchmarks. I created a small script to put the data into the correct format: hdf5-dump.py.

I ran all tests on a c5.4xlarge instance.

To test merging, I uncommented these two lines and varied maxBufferedDocs over 10K, 100K, and 500K. I used the following command after building the repo:

cd lucene/core/build
java -cp libs/lucene-core-10.0.0-SNAPSHOT.jar:classes/java/test:../../test-framework/build/classes/java/main:../../codecs/build/classes/java/main/ org.apache.lucene.util.hnsw.KnnGraphTester -search /home/ec2-user/merge-test/data/sift-128-euclidean.test -docs /home/ec2-user/merge-test/data/sift-128-euclidean.train -ndoc 1000000 -maxConn 16 -beamWidthIndex 100 -niter 100 -dim 128 -reindex -metric euclidean -forceMerge > test-1 2>&1

Then I grepped for the merge metrics:

cat test-1 | grep "msec to merge numeric"

I also captured recall on 100 queries to see how search was influenced.

I ran the 3 sets of experiments, 3 times each.

Results

10K

Exper.     time to merge (ms)  QPS  Recall
Control 1  611190              740  0.977
Control 2  621678              769  0.977
Control 3  619656              769  0.977
Test 1     691339              657  0.976
Test 2     760199              689  0.976
Test 3     685738              704  0.976

100K

Exper.     time to merge (ms)  QPS  Recall
Control 1  621603              775  0.977
Control 2  627452              769  0.977
Control 3  628613              833  0.977
Test 1     616133              746  0.973
Test 2     636186              699  0.973
Test 3     638978              709  0.973

500K

Exper.     time to merge (ms)  QPS  Recall
Control 1  671704              763  0.977
Control 2  643735              714  0.977
Control 3  639047              800  0.977
Test 1     398604              699  0.962
Test 2     409549              751  0.962
Test 3     370152              775  0.962

Conclusions

From the experiments above, it seems that initializing from an existing graph during merge works well when few segments are being merged, but adds a cost when a lot of segments are being merged. I need to investigate why this might be happening.

Additionally, my implementation appears to reduce recall slightly compared to the control. I'm going to see if I can figure out why this might be happening.

@harishankar-gopalan

Hi @jmazanec15, I had a quick question: how are segment merges currently handled in Lucene for the HNSW graph? Is the graph reconstructed from scratch?

@jmazanec15

Update: Sorry for delay, I am still working on this but got a little side tracked with other work.

Hi @harishankar-gopalan, yes, what currently happens is that the graph gets reconstructed from scratch. In #11719, I am working on selecting the largest graph from a segment and using it to initialize the newly created segment's graph. Posted above are my initial benchmark results. However, I am running into some issues where the recall is slightly lower with the test setup and the merge time is higher. I have been debugging why this is happening but have not yet made progress. I am going to take another try at it this week or next week.

@harishankar-gopalan

Update: Sorry for delay, I am still working on this but got a little side tracked with other work.

Hi @harishankar-gopalan, yes, what currently happens is that the graph gets reconstructed from scratch. In #11719, I am working on selecting the largest graph from a segment and using it to initialize the newly created segment's graph. Posted above are my initial benchmark results. However, I am running into some issues where the recall is slightly lower with the test setup and the merge time is higher. I have been debugging why this is happening but have not yet made progress. I am going to take another try at it this week or next week.

Hi @jmazanec15, thanks for the update. Are there any public stats available for current segment merges of HNSW-based graph indexes in Lucene? To be more clear, are there any performance benchmarks comparing Lucene segment merges for documents with and without KnnVectorFields indexed as an HNSW graph? If you are aware of any initial benchmarks that you are using as a reference, I would be grateful if you could share links to them.

@jmazanec15

@harishankar-gopalan I am not sure. The benchmarks I ran above compare merges with and without my draft PR changes; all runs contain a KNN vector field.

@jmazanec15

Hi @mayya-sharipova @jtibshirani @msokolov

I figured out the issue with the recall in the previous tests: I was not using the copy of the vectors when recomputing the distances. I fixed that and re-ran the benchmarks, and the recall values look correct now:

Results

10K

Exper.     time to merge (ms)  QPS  Recall
Control 1  611190              740  0.977
Control 2  621678              769  0.977
Control 3  619656              769  0.977
Test 1     649492              793  0.977
Test 2     663221              813  0.977
Test 3     594122              775  0.977

100K

Exper.     time to merge (ms)  QPS  Recall
Control 1  621603              775  0.977
Control 2  627452              769  0.977
Control 3  628613              833  0.977
Test 1     583509              813  0.977
Test 2     608190              735  0.977
Test 3     602910              763  0.977

500K

Exper.     time to merge (ms)  QPS  Recall
Control 1  671704              763  0.977
Control 2  643735              714  0.977
Control 3  639047              800  0.977
Test 1     369440              787  0.977
Test 2     361934              787  0.977
Test 3     346221              735  0.977

That being said, I think initialization from a graph has benefits when a larger segment is being merged with other segments. For instance, on the 1M data set, when the segment size is 500K, merge time looks a lot better; however, when the segment size is 10K, merge time differences between test and control are not noticeable.

I am wondering what you think good next steps might be. I could either take the PR out of draft state for review, or focus on running more experiments on different data sets. Before doing either, I wanted to see what you thought of the results so far.

@jmazanec15

Results for new PR #12050:

10K

Exper.     time to merge (ms)  QPS  Recall
Control 1  756582              662  0.979
Control 2  757324              662  0.979
Control 3  702034              763  0.979
Test 1     699830              671  0.98
Test 2     756010              694  0.98
Test 3     710235              617  0.98

100K

Exper.     time to merge (ms)  QPS  Recall
Control 1  700465              704  0.981
Control 2  685446              694  0.981
Control 3  690448              781  0.981
Test 1     671778              740  0.981
Test 2     638535              666  0.981
Test 3     652043              729  0.981

500K

Exper.     time to merge (ms)  QPS  Recall
Control 1  686805              763  0.98
Control 2  668690              735  0.98
Control 3  692826              671  0.98
Test 1     386561              724  0.98
Test 2     358238              735  0.98
Test 3     379240              781  0.98

@msokolov

msokolov commented Jan 1, 2023

Hi Jack, thanks for persisting and returning to this. I haven't had a chance to review the PR yet; just looking at the results here, I have a few questions. First, it looks to me as if we see a very nice improvement for the larger graphs, preserve the same recall, and the changes to QPS are probably noise. I guess the assumption is we are producing similar results with less work? Just so we can understand these results a little better, could you document how you arrived at them? What dataset did you use? How did you measure the times and recall (was it using KnnGraphTester? luceneutil? some other benchmarking tool?). I'd also be curious to see the numbers and sizes of the segments in the results: I assume they would be unchanged from Control to Test, but it would be nice to be able to verify. Thanks again!

@jmazanec15

Hi @msokolov,

First, it looks to me as if we see a very nice improvement for the larger graphs, preserve the same recall, and the changes to QPS are probably noise. I guess the assumption is we are producing similar results with less work?

Right. Basically, instead of adding ordinals 0 through X to the graph via the normal build process, we manually insert the nodes and their neighbors from the initializer graph into the merge graph, avoiding the search-for-neighbors step. I think the QPS differences are mostly noise. Recall is roughly the same, though not always exactly the same, because in the PR the random number generation gets a bit mixed up.
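The copy-instead-of-search step can be pictured roughly like this (plain dicts standing in for Lucene's graph structures; `old_to_new` is the assumed mapping from the initializer segment's ordinals to the merged segment's ordinals):

```python
def init_from_graph(init_graph, old_to_new):
    """init_graph: dict old ordinal -> list of old neighbor ordinals."""
    merged = {}
    for old, neighbors in init_graph.items():
        # Copy each node and its neighbor list under remapped ordinals,
        # skipping the usual search-for-neighbors insertion step.
        merged[old_to_new[old]] = [old_to_new[n] for n in neighbors]
    return merged
```

Vectors from the remaining segments would then be inserted into `merged` through the normal build path.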

Just so we can understand these results a little better, could you document how you arrived at them? What dataset did you use? How did you measure the times and recall (was it using KnnGraphTester? luceneutil? some other benchmarking tool?).

Sure, I used the same procedure for the latest results as outlined here: #11354 (comment). I used the sift 1M 128-dimensional L2 data set. This was using KnnGraphTester, controlling the number of initial segments and then force-merging to 1 segment.

I'd also be curious to see the numbers and sizes of the segments in the results: I assume they would be unchanged from Control to Test, but it would be nice to be able to verify.

I would assume so too. Let me get these numbers as well - will post soon.

@msokolov

msokolov commented Jan 3, 2023

This was using KnnGraphTester, controlling the number of initial segments and then forcemerging to 1 segment.

I see. Given that you're force-merging, the latency variability seems quite high (over 10%, I think). Do you see the same variability when testing without re-indexing?

@msokolov msokolov linked a pull request Jan 3, 2023 that will close this issue
@jmazanec15

That's a good idea. Let me retry without reindexing; that should be faster as well.

@jmazanec15

jmazanec15 commented Jan 5, 2023

@msokolov here are the results, re-using a single index for each experiment. Overall there is still some variability, but it seems like there is less. For the 10K results, it appears that control performed better; however, its recall is slightly worse. Also included are the sizes of the files after the merge.

10K

Exper.     time to merge (ms)  QPS  Recall  Size vec (MB)  Size vem (KB)  Size vex (MB)
Control 1  696096              684  0.979   512.0001       70.172         60.62953
Control 2  695400              724  0.979   512.0001       70.172         60.62953
Control 3  710602              699  0.979   512.0001       70.172         60.62953
Test 1     736711              649  0.98    512.0001       70.129         60.62525
Test 2     742799              751  0.98    512.0001       70.129         60.62525
Test 3     742263              746  0.98    512.0001       70.129         60.62525

100K

Exper.     time to merge (ms)  QPS  Recall  Size vec (MB)  Size vem (KB)  Size vex (MB)
Control 1  714349              689  0.981   512.0001       70.172         60.44963
Control 2  703428              763  0.981   512.0001       70.172         60.44963
Control 3  721943              666  0.981   512.0001       70.172         60.44963
Test 1     669922              729  0.981   512.0001       70.26          60.45246
Test 2     682579              729  0.981   512.0001       70.26          60.45246
Test 3     659374              724  0.981   512.0001       70.26          60.45246

500K

Exper.     time to merge (ms)  QPS  Recall  Size vec (MB)  Size vem (KB)  Size vex (MB)
Control 1  674606              751  0.98    512.0001       70.172         59.69535
Control 2  657207              699  0.98    512.0001       70.172         59.69535
Control 3  664536              694  0.98    512.0001       70.172         59.69535
Test 1     381532              793  0.98    512.0001       70.256         59.69746
Test 2     371540              793  0.98    512.0001       70.256         59.69746
Test 3     382440              800  0.98    512.0001       70.256         59.69746

@msokolov

msokolov commented Jan 9, 2023

Yeah, thanks, that seems to have reduced the noise some. Probably what remains is down to GC, system hiccups, etc.; it's inevitable to see some variance.
