-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796
Conversation
numDocs = reader.maxDoc(); | ||
for (int i = 0; i < warmCount; i++) { | ||
// warm up | ||
targets.get(target); | ||
results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout); | ||
doKnnSearch(reader, KNN_FIELD, target, topK, fanout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be fine to use doKnnVectorQuery
here so we could delete doKnnSearch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense, I don't see why not
@@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][] | |||
TopDocs[] results = new TopDocs[numIters]; | |||
long elapsed, totalCpuTime, totalVisited = 0; | |||
try (FileChannel q = FileChannel.open(queryPath)) { | |||
int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could just assert warmCount < numIters, seems unusual to warm up with queries that you don't use in the benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we can simply replace warmCount with numIters
@msokolov it'd be great to get this in when you have the time! |
whoa sorry I lost track. I will address the comments and push soon. I have other changes I want to make, relating to testing pre-filtering. Some folks here have been testing and finding very interesting results; I think we are going to want to have a good test harness for different filter selectivity. |
Thanks @msokolov ! So I'm not waiting in suspense too much, what sorts of interesting results have you found related to prefiltering? |
I'm trying not to steal the thunder of the folks who are actually working on this, but at a high level: we were seeing prefiltering being more expensive than postfiltering (over collecting and then filtering) for the same "yield" of top K, but were able to recover the cost, and flip the balance the other way by using the non-matching nodes to traverse the graph up until some threshold. Basically, we play with how we update the lower bound, allowing it to increase based on non-matching (the filter) nodes, until we get close to where we want to be. I guess the intuition is that nodes that are never going to be included in the top K are still useful for traversing the graph. |
Super interesting, looking forward to hearing more! I do hope we can stick with a prefiltering-like approach (and just improve its performance), since it feels easier to work with for users. If you request This doesn't sound like what you're talking about, but I did notice that prefiltering can be expensive when the filter matches a lot of documents. Unlike in postfiltering, the filter is not allowed skip any of the docs and you need to convert all the matches into a bit set, which is not cheap. |
* main: LUCENE-10532: remove @Slow annotation (apache#832) LUCENE-10312: Add PersianStemmer (apache#540) LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871) LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869) Disable liftbot, we have our own tools LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860) Make CONTRIBUTING.md a bit more succinct (apache#866) LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796) Add change line for LUCENE-9848 LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)
* main: (24 commits) LUCENE-10532: remove @Slow annotation (apache#832) LUCENE-10312: Add PersianStemmer (apache#540) LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871) LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869) Disable liftbot, we have our own tools LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860) Make CONTRIBUTING.md a bit more succinct (apache#866) LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796) Add change line for LUCENE-9848 LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862) LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md (apache#853) LUCENE-10552: KnnVectorQuery has incorrect equals/ hashCode (apache#859) LUCENE-10534: MinFloatFunction / MaxFloatFunction calls exists twice (apache#837) LUCENE-10188: Give SortedSetDocValues a docValueCount() (apache#663) Allow to link to github PR from changes (apache#854) LUCENE-10551: improve testing of LowercaseAsciiCompression (apache#858) LUCENE-10542: FieldSource exists implementations can avoid value retrieval (apache#847) LUCENE-10539: Return a stream of completions from FSTCompletion. (apache#844) gradle 7.3.3 quick upgrade (apache#856) LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (apache#848) ...
* LUCENE-10504: KnnGraphTester to use KnnVectorQuery
This really has two changes: