Add support for similarity-based vector searches #12679
Conversation
Thanks for adding this @kaivalnp! The idea makes sense to me; looking forward to the benchmark results. I left some minor comments. Sharing some thoughts below:
If I read correctly, this query ends up calling In my opinion, there are two options: either we force this query to take a
No, we're calling the new API (from here) with a custom collector. I think you're worried that we'll end up performing brute-force KNN on all documents in the segment, and then retain vectors above the threshold? What we instead aim to do is: starting from the entry node in the last level of the HNSW graph, we keep visiting candidates as long as they are above the traversal threshold. This is not necessarily slower than normal HNSW searches, provided the traversal threshold is high enough.
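A minimal sketch of this traversal on a toy graph (all names, scores, and edges here are hypothetical stand-ins, with precomputed scores in place of real vector comparisons; this is not the Lucene implementation): keep expanding the best-scoring candidate while it is at or above the traversal threshold, and collect every visited node at or above the result threshold.

```java
import java.util.*;

public class ThresholdGraphSearch {
  // Best-first traversal: expand candidates while the best one scores at or
  // above traversalThreshold; collect nodes scoring at or above resultThreshold.
  static Set<Integer> search(Map<Integer, List<Integer>> graph,
                             Map<Integer, Double> score,
                             int entry, double traversalThreshold, double resultThreshold) {
    Set<Integer> visited = new HashSet<>();
    Set<Integer> results = new HashSet<>();
    PriorityQueue<Integer> candidates =
        new PriorityQueue<>(Comparator.comparingDouble((Integer n) -> score.get(n)).reversed());
    candidates.add(entry);
    visited.add(entry);
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      if (score.get(node) < traversalThreshold) break; // best candidate too far: stop
      if (score.get(node) >= resultThreshold) results.add(node);
      for (int nbr : graph.getOrDefault(node, List.of())) {
        if (visited.add(nbr)) candidates.add(nbr);
      }
    }
    return results;
  }

  public static void main(String[] args) {
    Map<Integer, List<Integer>> graph =
        Map.of(0, List.of(1, 3), 1, List.of(2), 2, List.of(4), 3, List.of(), 4, List.of());
    Map<Integer, Double> score = Map.of(0, 0.5, 1, 0.8, 2, 0.9, 3, 0.3, 4, 0.95);
    // Nodes above the result threshold, reached via nodes above the traversal threshold
    System.out.println(search(graph, score, 0, 0.4, 0.85));
  }
}
```

Note how node 3 (score 0.3) is popped last and immediately stops the search, so the traversal cost is bounded by how many nodes clear the traversal threshold, not by the segment size.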
Thanks for explaining, I had overlooked how the traversal works.
Thanks for the review @shubhamvishu! Addressed some of the comments above
I think of it as finding all results within a high-dimensional circle / sphere / equivalent, and "radius-based search" seems to capture the essence. Although "threshold-based search" may be more appropriate (since a radius is tied to Euclidean distance, and may not be easy to relate to Cosine Similarity or Dot Product). No strong opinions here; looking for others' thoughts on more appropriate naming.
The problem here is that we'll have to generalize many other internal classes (unrelated to this change). I'll keep this for a separate issue.
dot-product, cosine, etc. don't really follow that same idea as you point out. I would prefer something like
I agree with @jpountz's concerns. The topDocs collector gets a replay of the matched documents. We should put sane limits here and prevent folks from getting 100,000s of matches (int and float value arrays) via approximate search. It seems like having a huge number like that could cause issues.
Benchmarks

Using the vector file from https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec (enwiki dataset, unit vectors, 100 dimensions). The setup was 1M doc vectors in a single HNSW graph. The baseline for the new objective is "all vectors above a score threshold" (as opposed to the best-scoring topK). Here are some statistics for the result counts in the new baseline:
This is used to get an estimate of the query-result count distribution for various thresholds. Here we will benchmark the new API against a high topK.

K-NN Search (current system)
R-NN Search (new system)
If the goal is to "get all vectors within a radius", then it looks like the new radius-based search API scales better than having a large topK.
Totally agreed @jpountz! It is very easy to go wrong with the new API, especially if the user passes a low threshold (high radius -> low threshold). As we can see from the benchmarks above, the number of nodes to visit may jump very fast with a slight reduction in the threshold.
This makes sense to me. Something like a lazy-loading iterator, where we perform vector comparisons and determine whether a doc matches on the fly?
I like this, thanks for the suggestion @benwtrent!
I think @kaivalnp the thing to do would be to say the Collector is full by flagging "incomplete" (I think this is possible) once a threshold is reached. You can do this independently from a "maxvisit", as we don't care about visiting the vector; we just care about adding it to the result set.
The results in #12679 (comment) are astounding! I will try to replicate with luceneutil. The numbers seem almost too good ;)
Do you mean that we return incomplete results? Instead, maybe we can:
This "lazy-loading" works very well for our use case because whether a vector matches our query is independent of other vectors (unlike in K-NN, where given a query and an arbitrary doc vector, we cannot say whether the doc vector will be in the topK). Is this what you had in mind earlier @jpountz?
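A minimal sketch of this per-doc lazy evaluation (toy names; the PR itself builds on Lucene's FilteredDocIdSetIterator, which this only imitates): Java streams evaluate the filter predicate only as the iterator advances, so each similarity is computed on demand and independently of other docs.

```java
import java.util.*;

public class LazyMatcher {
  // Stand-in for a per-document similarity computation.
  interface Scorer { double score(int doc); }

  // Returns a lazy iterator: scorer.score(doc) runs only when the consumer
  // advances past that doc, and sub-threshold docs are skipped.
  static Iterator<Integer> matches(int[] candidates, Scorer scorer, double threshold) {
    return Arrays.stream(candidates)
        .filter(doc -> scorer.score(doc) >= threshold) // evaluated lazily, per doc
        .iterator();
  }

  public static void main(String[] args) {
    double[] sims = {0.9, 0.2, 0.7, 0.95};
    Iterator<Integer> it = matches(new int[] {0, 1, 2, 3}, doc -> sims[doc], 0.6);
    List<Integer> out = new ArrayList<>();
    it.forEachRemaining(out::add);
    System.out.println(out); // [0, 2, 3]
  }
}
```

Because each match decision depends only on that doc's own score, a consumer that stops early never pays for the remaining comparisons.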
Yes, I took inspiration from there.
Sorry for the confusion, I tried renaming the branch from |
OK, I tried testing with KnnGraphTester. I indexed 100_000 normalized Cohere vectors (768 dims). With regular knn, recall@10:
I tried the similarity threshold and it's way worse.
@kaivalnp I see the issue with my test: you are specifically testing "post-filtering" on the top values, not just getting the top 10. I understand my issue. Could you post your testing code or something in a gist?
Thanks for running this @benwtrent! I just had a couple of questions:
As I'm writing this, I see your comment. I'll post my setup in a while |
Here is the gist of my benchmark: https://gist.github.com/kaivalnp/79808017ed7666214540213d1e2a21cf I'm calculating the baseline / individual results as the "count of vectors above the threshold". Note that we do not need the actual vectors, because any vector with a score >= the threshold counts as a match. I had some other helper functions mainly for calling these and formatting output, but kept the important functions in the gist (how I'm calculating the baseline, KNN / RNN results, and time taken).
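The baseline described above amounts to a brute-force count over all indexed vectors. A minimal sketch of that idea (hypothetical names, dot-product similarity assumed, not the gist's actual code):

```java
public class BruteForceBaseline {
  // Count every indexed vector whose similarity with the query is at or above
  // the threshold -- no graph involved, so this is the exact ground truth.
  static long countAbove(float[][] docs, float[] query, float threshold) {
    long count = 0;
    for (float[] doc : docs) {
      float dot = 0;
      for (int i = 0; i < query.length; i++) {
        dot += doc[i] * query[i]; // dot-product similarity
      }
      if (dot >= threshold) count++;
    }
    return count;
  }

  public static void main(String[] args) {
    float[][] docs = {{1, 0}, {0.6f, 0.8f}, {0, 1}};
    System.out.println(countAbove(docs, new float[] {1, 0}, 0.5f)); // 2
  }
}
```

Only the count is needed for recall, exactly as the comment notes: a vector either clears the threshold or it doesn't.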
Hi @benwtrent! Curious to hear if you've been able to reproduce the benchmark? |
@kaivalnp I have been busy doing other things. I hope to look into this in the next week or so. |
Thank you! I'll try to incorporate earlier suggestions in the meantime
- Make use of inherent independence of segment-level results
- Do not greedily collect exact matches, return a lazy-loading iterator instead
Summary of new changes:
Please let me know if this approach makes sense.
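The first point (segment-level independence) can be illustrated with a self-contained sketch (hypothetical helper names, per-segment score lists standing in for per-segment results): top-K merging needs a global re-rank across segments, while threshold-based results can simply be concatenated.

```java
import java.util.*;

public class SegmentMerge {
  // Top-K: results from all segments must be pooled and re-ranked to find the
  // index-level best K, so segments are not independent.
  static List<Double> mergeTopK(List<List<Double>> perSegment, int k) {
    return perSegment.stream().flatMap(List::stream)
        .sorted(Comparator.reverseOrder()).limit(k).toList();
  }

  // Threshold-based: every segment-level match is an index-level match,
  // so merging is plain concatenation -- no heap, no re-rank.
  static List<Double> mergeThreshold(List<List<Double>> perSegment) {
    List<Double> out = new ArrayList<>();
    perSegment.forEach(out::addAll);
    return out;
  }

  public static void main(String[] args) {
    List<List<Double>> segs = List.of(List.of(0.9, 0.7), List.of(0.8));
    System.out.println(mergeTopK(segs, 2));   // [0.9, 0.8]
    System.out.println(mergeThreshold(segs)); // [0.9, 0.7, 0.8]
  }
}
```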
Right, I meant that we need not score all other vectors to determine if a given vector is a "hit" or not (we just need its similarity score to be above the threshold).
I've tried to re-use some of this work to directly reject vectors that are not above the threshold. I wonder if we can extend this further: this is also usable in the current KNN-based search, wherever we fall back from approximate to exact search. Right now we score all vectors present in the filter. Here are some very rough changes to support this -- what do you think @benwtrent?
Thanks, the gist was really helpful and gave some files including normalized and un-normalized vectors (I assume these are the files you mentioned). I saw ~476k vectors of 768 dimensions there and indexed the first 400k in a single segment, while querying the next 10k, using the following command:

./gradlew :lucene:core:similarity-benchmark --args=" --vecPath=/home/kaivalnp/working/similarity-benchmark/cohere-768.vec --indexPath=/home/kaivalnp/working/similarity-benchmark/cohere-indexes --dim=768 --function=MAXIMUM_INNER_PRODUCT --numDocs=400000 --numQueries=10000 --topKs=5000,2500,1000,500,100 --topK-thresholds=300,305,310,315,320 --traversalSimilarities=295,300,305,310,315 --resultSimilarities=300,305,310,315,320"

KNN search
Similarity-based search
If the goal is to "get all vectors above a similarity", then it looks like the new similarity-based search API scales better than having a large topK.
Thank you for all the benchmarking due diligence @kaivalnp. From that standpoint, it looks good to me.
I do worry a bit around the post-filtering. It seems likely in a restrictive search scenario, we would do a bunch of searching to no avail. However, I don't know a good way around it.
What I would like to see are tests covering this query, its expected behavior, and edge cases (no docs, no matching docs with a filter, does the fallback to exact search work OK, etc.).
- Set traversalSimilarity = resultSimilarity by default
- Continue graph search until better nodes are available
- Add filter to determine visitLimit for falling back to exact search
Thanks @benwtrent! I also simplified the queries. I realized that the API may be difficult to use in the current state (we are leaving two parameters for the user to configure). I noticed from the above benchmarks that the two thresholds can be kept close, so traversalSimilarity = resultSimilarity by default. Another issue previously encountered (amplified by the above change) is that we stop graph search too early when the entry node is far away from the query. To overcome this, can we continue the search as long as we find better-scoring nodes (so we know there is a possibility of reaching nodes above the result threshold)? For configuring visitLimit, the filter is used to determine the limit for falling back to exact search. Here is the benchmark setup and results with these changes (same range as before): https://gist.github.com/kaivalnp/07d6a96d22adfad4d3cd5924b13ed524 Also added some tests.
Agreed, we do some work in graph search (like similarity computations, collecting results, etc.) which should be reusable from exact search. I had opened #12820 to discuss this issue (it also affects KNN queries); perhaps we can include these similarity-based queries if we arrive at a solution there?
@kaivalnp thank you so much for all the hard work here. I think this is ready to merge as an experimental query.
Could you add changes for Lucene 9.10? I can merge and backport.
It would also be good to update this branch with latest main to make sure CI is still happy.
lucene/CHANGES.txt
* GITHUB#12679: Add support for similarity-based vector searches. Finds all vectors scoring above a `resultSimilarity`
  while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of
  `traversalSimilarity` in the lowest level. (Aditya Prakash, Kaival Parikh)
Maybe add the vector query names?
Sorry, didn't get what you mean here?
Add support for similarity-based vector searches
Well, what are the query names? :D
Ahh got it.. Updated now :)
Thanks for all the help here @benwtrent !
Added an entry under "New Features" (also added one of my teammates along with whom this change was designed).
### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:

1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes? Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
I see random test failures that could be related to this change:
Discovered in #12921, and introduced in #12679.

The first issue is that we weren't advancing the `VectorScorer` [here](https://github.com/apache/lucene/blob/cf13a9295052288b748ed8f279f05ee26f3bfd5f/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L257-L262) -- so it was still un-positioned while trying to compute the similarity score.

Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](https://github.com/apache/lucene/blob/cad565439be512ac6e95a698007b1fc971173f00/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L107)) -- so we didn't need to explicitly advance it. Later, we decided to maintain parity with `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](https://github.com/apache/lucene/blob/5096790f281e477c529a7c8311aeb353ccdffdeb/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L255)) and I missed advancing the `VectorScorer` explicitly.

After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search. Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit).

Sorry for missing this earlier!
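The positioning bug can be illustrated with a self-contained toy model (the names below only mirror the Lucene classes; this is not the actual implementation): a scorer must be advanced to the current doc before its score is read, otherwise it is "un-positioned".

```java
import java.util.*;

public class AdvanceBeforeScore {
  // Toy stand-in for a positional scorer: score() is only valid after advance().
  static class ToyScorer {
    private final Map<Integer, Double> scores;
    private int current = -1;
    ToyScorer(Map<Integer, Double> scores) { this.scores = scores; }
    void advance(int doc) { current = doc; }
    double score() {
      if (current < 0) throw new IllegalStateException("scorer not positioned");
      return scores.getOrDefault(current, 0.0);
    }
  }

  // Filter accepted docs by similarity, advancing the scorer first (the fix).
  static List<Integer> match(List<Integer> acceptedDocs, ToyScorer scorer, double threshold) {
    List<Integer> out = new ArrayList<>();
    for (int doc : acceptedDocs) {
      scorer.advance(doc); // without this call, score() would fail
      if (scorer.score() >= threshold) out.add(doc);
    }
    return out;
  }

  public static void main(String[] args) {
    ToyScorer scorer = new ToyScorer(Map.of(1, 0.9, 2, 0.4, 3, 0.8));
    System.out.println(match(List.of(1, 2, 3), scorer, 0.5)); // [1, 3]
  }
}
```

When the accepted-docs iterator drives the loop (as after the filtering change), nothing else advances the scorer, so the explicit `advance` call is what keeps the two in sync.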
Hi, do we have any scheduled release date for this exciting feature?
This feature will ship with Lucene 9.10. I'm not sure when that will be released, though I see ~2-4 months between previous minor versions.
Hi @kaivalnp, thanks for this contribution! My question is: why do we have two thresholds, one for graph traversal (used to decide if it's worth exploring a candidate neighbour) and one for results (used to decide whether to accept a result)?
Yes @alessandrobenedetti, that is correct -- some results may be missed if nodes along their path from the entry node score below the result threshold (but still higher than a traversal threshold <= result threshold). This traversal threshold exists purely as a tunable parameter for recall v/s latency.
### Next steps

Run benchmarks with this new query to see how it compares to the `topK`-based search