[DISCUSS] Should there be a threshold-based vector search API? #12579
Comments
Interesting - so this is kind of like a noisy radius search in high dimensions? It makes sense to me intuitively, since we don't generally expect searches to have the same number of results. I found one system that supports a threshold: https://typesense.org/docs/0.25.0/api/vector-search.html#distance-threshold but I don't think we support that yet in KnnVector*Query, do we? Perhaps we ought to lead with that in order to introduce the utility of radius-searching, since you can simulate this with a threshold + large K, right?
Thanks @msokolov, this nicely summarizes what I'm trying to say!
I took a look here, and it seems like they first perform kNN searches and then remove results below the threshold (like the post-filtering step you mentioned)
Yes, this would be the easiest way to find all (or a large number of) vectors within a radius using the existing API. Interestingly, the threshold-based search proposed above may not need any invasive changes. Opening a PR to discuss next steps..
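To make the post-filtering idea above concrete, here is a minimal sketch: run a normal topK search with a large K, then drop hits scoring below the threshold. The `ScoreDoc` record and `filterByThreshold` helper are hypothetical stand-ins, not Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class PostFilterSketch {
  // Hypothetical hit representation: a doc id and its similarity score
  record ScoreDoc(int doc, float score) {}

  /** Keep only hits whose score clears the threshold (post-filtering). */
  static List<ScoreDoc> filterByThreshold(List<ScoreDoc> topKHits, float threshold) {
    List<ScoreDoc> kept = new ArrayList<>();
    for (ScoreDoc hit : topKHits) {
      if (hit.score() >= threshold) {
        kept.add(hit); // only "matches" survive
      }
    }
    return kept;
  }
}
```

The obvious drawback, as discussed above, is that K must be guessed large enough up front: too small and matches are missed, too large and time is wasted collecting hits that are discarded anyway.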
@kaivalnp yes. One of the tricky things I can see is that it's possible that the bottom-layer entry point is so bad that it itself isn't within the threshold. But a quick overview of your PR shows that this type of logic fits pretty easily. I look forward to the performance and recall tests!
@kaivalnp one other thing to think about is https://weaviate.io/blog/weaviate-1-20-release#autocut I wonder if we could do something similar by dynamically adjusting the "traversal_similarity" to indicate that exploring more would be useless.
Interesting! They seem to make "cuts" when there is an abnormal jump in scores. An "abnormal jump" occurs when the score of the next candidate does not match the gradual decrease in scores (for example: if we have collected 10 vectors with max and min scores of 1.0 and 0.8, the "step" is basically 0.02 per vector, so we make a "cut" if the next-best candidate is below a score of 0.78). Might be a good next step to simplify the API!
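The cut rule described above can be sketched in a few lines: estimate the average per-vector "step" from the scores collected so far, and cut when the next candidate falls below the minimum collected score by more than one step. This is a rough approximation of Weaviate's autocut as summarized in the comment, not its actual implementation; the class and method names are made up.

```java
public class AutocutSketch {
  /**
   * Decide whether to "cut" before accepting the next candidate.
   * collected: scores gathered so far, in descending order.
   */
  static boolean shouldCut(float[] collected, float nextScore) {
    int n = collected.length;
    if (n < 2) {
      return false; // not enough points to estimate a step
    }
    float max = collected[0];
    float min = collected[n - 1];
    // Average decrease per collected vector (0.02 in the example above)
    float step = (max - min) / n;
    // Abnormal jump: next candidate falls below the expected gradual decrease
    return nextScore < min - step;
  }
}
```

With the example numbers from the comment (10 scores from 1.0 down to 0.8), the cut point lands at 0.78: a next candidate scoring 0.77 triggers a cut, while 0.79 does not.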
### Description

Background in #12579

Add support for getting "all vectors within a radius", as opposed to getting the "topK closest vectors" in the current system.

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:

1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper: we just concatenate results, as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results required HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes? Moreover, we needed explicit concurrency (#12160) to perform these in parallel. These shortcomings are naturally overcome with the new objective of finding "all vectors within a radius", which is inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even radius-based search, for that matter - please add links to existing work if someone is aware of any), and consequently marked all classes as `@lucene.experimental`.

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned briefly above).
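The two-radius traversal described in the Considerations can be sketched on a toy adjacency-list graph. This is only an illustration of the collection criteria (explore nodes above the outer "traversal" similarity, collect nodes above the inner "result" similarity), not Lucene's `HnswGraph` API; all names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ThresholdTraversalSketch {
  /**
   * Explore all nodes (reachable from entryNode) whose similarity to the
   * query is above traversalSim, and collect the subset above resultSim.
   * similarity[node] stands in for computing the query-vector score.
   */
  static List<Integer> search(Map<Integer, int[]> neighbors, float[] similarity,
                              int entryNode, float traversalSim, float resultSim) {
    List<Integer> results = new ArrayList<>(); // plain list, no heap needed
    Deque<Integer> candidates = new ArrayDeque<>();
    Set<Integer> visited = new HashSet<>();
    candidates.push(entryNode);
    visited.add(entryNode);
    while (!candidates.isEmpty()) {
      int node = candidates.pop();
      if (similarity[node] < traversalSim) {
        continue; // outside the outer "traversal" radius: do not expand
      }
      if (similarity[node] >= resultSim) {
        results.add(node); // inside the inner "result" radius: collect
      }
      for (int next : neighbors.getOrDefault(node, new int[0])) {
        if (visited.add(next)) {
          candidates.push(next);
        }
      }
    }
    return results;
  }
}
```

Note how a node between the two thresholds (explored but not collected) acts as the "buffer" described in the issue: it can still lead the traversal to collectible neighbors.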
#12679 added a new experimental query for this.
Context
Almost all vector search algorithms focus on getting the `topK` highest-scoring results for a given query vector. This, however, may not be the best objective to pursue because:

1. Some of the `topK` results will be low scoring (and not relevant to the query). We may want some kind of score `threshold`, below which a vector is not considered as a "match"?
2. We may not want to be limited to `topK` when we have some more (slightly lower scoring) results that are still above the `threshold`. Considering a query and a vector as a "match" if their similarity score is above the `threshold`, we may want ALL possible matches?

I found the same thing being discussed in #11946 as well.. However, I'm not aware of any algorithms designed for this objective (or how common of a use-case this may be)
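The "match" criterion described above is simply: a vector matches the query iff their similarity score clears the `threshold`. A toy illustration, using dot-product similarity as one possible (assumed) similarity function:

```java
public class ThresholdMatchSketch {
  /** Dot-product similarity; one of several similarity functions one could use. */
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** A vector is a "match" iff its similarity to the query clears the threshold. */
  static boolean matches(float[] query, float[] vector, float threshold) {
    return dotProduct(query, vector) >= threshold;
  }
}
```

Under this definition the result set size is query-dependent: a query in a dense region of the vector space may have thousands of matches, while another may have none.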
Proposal
IF this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (as it was designed for getting the K-Nearest Neighbors). On a high level, the current algorithm is:

1. Maintain a bounded set of `topK` results (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores
2. Pop the best candidate and explore its neighbors, adding a node to the `topK` results if (1) we have not yet collected `topK` results, or (2) it is better scoring than any existing result (also removing the least scoring result in this case)

For the new objective, the underlying graphs created by HNSW may still be useful - where each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore further
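The bounded result collection described above can be sketched with a min-heap of size K, where the worst result sits on top and is evicted by any better-scoring node. This is a hypothetical helper for illustration, not Lucene's internal classes.

```java
import java.util.PriorityQueue;

public class TopKQueueSketch {
  private final int k;
  // Each entry is {score, node}; the heap keeps the WORST score on top
  private final PriorityQueue<float[]> heap =
      new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));

  TopKQueueSketch(int k) {
    this.k = k;
  }

  /** Returns true if the node was kept as a result. */
  boolean collect(int node, float score) {
    if (heap.size() < k) { // (1) fewer than K results so far: always keep
      heap.add(new float[] {score, node});
      return true;
    }
    if (score > heap.peek()[0]) { // (2) beats the current worst result
      heap.poll(); // evict the least-scoring result
      heap.add(new float[] {score, node});
      return true;
    }
    return false;
  }

  /** Score of the current worst result (the collection cutoff). */
  float minScore() {
    return heap.isEmpty() ? Float.NEGATIVE_INFINITY : heap.peek()[0];
  }
}
```

Every `collect` call on a full queue pays a `heapify` cost; the threshold-based proposal below avoids this entirely because there is no per-insertion ordering to maintain.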
Can we instead change the graph traversal (and result collection) criteria such that:

1. We visit all nodes that are above a lower graph-traversal `threshold`
2. We collect all nodes that are above the actual (higher) `threshold`

The lower graph-traversal threshold acts as a buffer, to retain results where all nodes along its path may not be above the actual `threshold` (and is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries).

The upper levels of search will remain the same (find the single best entry point for the next layer), and this new criterion is only applicable to the lowest layer (with all nodes), to collect an unbounded number of results above the `threshold`.

This will also eliminate the need to maintain results and candidates as priority queues, instead keeping them as simple lists (reducing the overhead of `heapify` calls that maintain the min-max heap property) and just sorting them once at the end.

Looking for inputs on both the use-case and tweaked algorithm..
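The "plain list, sort once" collection mentioned above is straightforward to sketch: append hits during traversal with no ordering constraint, then sort a single time when the search finishes. The `Hit` record is a hypothetical stand-in for a scored document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ListCollectSketch {
  // Hypothetical scored hit: a graph node and its similarity score
  record Hit(int node, float score) {}

  /** Sort collected hits once, best score first (no per-insertion heapify). */
  static List<Hit> sortOnce(List<Hit> unsorted) {
    List<Hit> sorted = new ArrayList<>(unsorted);
    sorted.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
    return sorted;
  }
}
```

Merging per-segment results is then just list concatenation (optionally followed by one final sort), rather than the index-level `topK` selection the current system performs in `#rewrite`.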