LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796

msokolov · 2022-04-06T20:14:32Z

This really has two changes:

it switches the vector searches it runs to use the Query impl, as the description says
it becomes a bit more clever about managing its cache of "exact" NN that are used for recall comparisons. Previously, if you changed the source data files it would still potentially re-use the cached NN file. Now it stores a hash of the file name and looks at the modification times to see if it should regenerate the NN file

jtibshirani · 2022-04-06T21:18:50Z

lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java

        numDocs = reader.maxDoc();
        for (int i = 0; i < warmCount; i++) {
          // warm up
          targets.get(target);
-          results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout);
+          doKnnSearch(reader, KNN_FIELD, target, topK, fanout);


Would it be fine to use doKnnVectorQuery here so we could delete doKnnSearch?

Yes, that makes sense, I don't see why not

jtibshirani · 2022-04-06T21:25:40Z

lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java

@@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][]
    TopDocs[] results = new TopDocs[numIters];
    long elapsed, totalCpuTime, totalVisited = 0;
    try (FileChannel q = FileChannel.open(queryPath)) {
+      int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES;


Maybe we could just assert warmCount < numIters, seems unusual to warm up with queries that you don't use in the benchmark?

Yeah, I think we can simply replace warmCount with numIters

jtibshirani · 2022-05-04T20:20:28Z

@msokolov it'd be great to get this in when you have the time!

msokolov · 2022-05-04T21:05:02Z

@msokolov it'd be great to get this in when you have the time!

whoa sorry I lost track. I will address the comments and push soon. I have other changes I want to make, relating to testing pre-filtering. Some folks here have been testing and finding very interesting results; I think we are going to want to have a good test harness for different filter selectivity.

jtibshirani · 2022-05-04T22:45:47Z

Thanks @msokolov ! So I'm not waiting in suspense too much, what sorts of interesting results have you found related to prefiltering?

msokolov · 2022-05-04T23:09:11Z

I'm trying not to steal the thunder of the folks who are actually working on this, but at a high level: we were seeing prefiltering being more expensive than postfiltering (over collecting and then filtering) for the same "yield" of top K, but were able to recover the cost, and flip the balance the other way by using the non-matching nodes to traverse the graph up until some threshold. Basically, we play with how we update the lower bound, allowing it to increase based on non-matching (the filter) nodes, until we get close to where we want to be. I guess the intuition is that nodes that are never going to be included in the top K are still useful for traversing the graph.

jtibshirani · 2022-05-04T23:23:57Z

Super interesting, looking forward to hearing more! I do hope we can stick with a prefiltering-like approach (and just improve its performance), since it feels easier to work with for users. If you request k documents, you always get k back -- there's no guessing about how many candidates you need to retrieve as in post-filtering.

This doesn't sound like what you're talking about, but I did notice that prefiltering can be expensive when the filter matches a lot of documents. Unlike in postfiltering, the filter is not allowed skip any of the docs and you need to convert all the matches into a bit set, which is not cheap.

* main: LUCENE-10532: remove @Slow annotation (apache#832) LUCENE-10312: Add PersianStemmer (apache#540) LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871) LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869) Disable liftbot, we have our own tools LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860) Make CONTRIBUTING.md a bit more succinct (apache#866) LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796) Add change line for LUCENE-9848 LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)

* main: (24 commits) LUCENE-10532: remove @Slow annotation (apache#832) LUCENE-10312: Add PersianStemmer (apache#540) LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871) LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869) Disable liftbot, we have our own tools LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860) Make CONTRIBUTING.md a bit more succinct (apache#866) LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796) Add change line for LUCENE-9848 LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862) LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md (apache#853) LUCENE-10552: KnnVectorQuery has incorrect equals/ hashCode (apache#859) LUCENE-10534: MinFloatFunction / MaxFloatFunction calls exists twice (apache#837) LUCENE-10188: Give SortedSetDocValues a docValueCount() (apache#663) Allow to link to github PR from changes (apache#854) LUCENE-10551: improve testing of LowercaseAsciiCompression (apache#858) LUCENE-10542: FieldSource exists implementations can avoid value retrieval (apache#847) LUCENE-10539: Return a stream of completions from FSTCompletion. (apache#844) gradle 7.3.3 quick upgrade (apache#856) LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (apache#848) ...

* LUCENE-10504: KnnGraphTester to use KnnVectorQuery

LUCENE-10504: KnnGraphTester to use KnnVectorQuery

17aefef

jtibshirani reviewed Apr 6, 2022

View reviewed changes

remove KnnGraphTester -warm parameter; just use numIters

baf3bd8

also use KnnVectorQuery for warming

7563390

msokolov merged commit 7fbaa63 into apache:main May 4, 2022

msokolov deleted the LUCENE-10504 branch May 4, 2022 22:22

jtibshirani pushed a commit that referenced this pull request Jul 29, 2022

LUCENE-10504: KnnGraphTester to use KnnVectorQuery (#796)

2cb0e26

* LUCENE-10504: KnnGraphTester to use KnnVectorQuery

asfimport mentioned this pull request Jul 29, 2022

KnnGraphTester should use KnnVectorQuery [LUCENE-10504] #11540

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796

LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796

msokolov commented Apr 6, 2022

jtibshirani Apr 6, 2022

msokolov Apr 6, 2022

jtibshirani Apr 6, 2022

msokolov Apr 6, 2022

jtibshirani commented May 4, 2022

msokolov commented May 4, 2022

jtibshirani commented May 4, 2022

msokolov commented May 4, 2022

jtibshirani commented May 4, 2022

LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796

LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796

Conversation

msokolov commented Apr 6, 2022

jtibshirani Apr 6, 2022

Choose a reason for hiding this comment

msokolov Apr 6, 2022

Choose a reason for hiding this comment

jtibshirani Apr 6, 2022

Choose a reason for hiding this comment

msokolov Apr 6, 2022

Choose a reason for hiding this comment

jtibshirani commented May 4, 2022

msokolov commented May 4, 2022

jtibshirani commented May 4, 2022

msokolov commented May 4, 2022

jtibshirani commented May 4, 2022