Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10504: KnnGraphTester to use KnnVectorQuery #796

Merged
merged 3 commits into from
May 4, 2022

Conversation

msokolov
Copy link
Contributor

@msokolov msokolov commented Apr 6, 2022

This really has two changes:

  1. it switches the vector searches it runs to use the Query impl, as the description says
  2. it becomes a bit more clever about managing its cache of "exact" NN that are used for recall comparisons. Previously, if you changed the source data files it would still potentially re-use the cached NN file. Now it stores a hash of the file name and looks at the modification times to see if it should regenerate the NN file

numDocs = reader.maxDoc();
for (int i = 0; i < warmCount; i++) {
// warm up
targets.get(target);
results[i] = doKnnSearch(reader, KNN_FIELD, target, topK, fanout);
doKnnSearch(reader, KNN_FIELD, target, topK, fanout);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be fine to use doKnnVectorQuery here so we could delete doKnnSearch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense, I don't see why not

@@ -349,8 +353,9 @@ private void testSearch(Path indexPath, Path queryPath, Path outputPath, int[][]
TopDocs[] results = new TopDocs[numIters];
long elapsed, totalCpuTime, totalVisited = 0;
try (FileChannel q = FileChannel.open(queryPath)) {
int bufferSize = Math.max(numIters, warmCount) * dim * Float.BYTES;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could just assert warmCount < numIters, seems unusual to warm up with queries that you don't use in the benchmark?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can simply replace warmCount with numIters

@jtibshirani
Copy link
Member

@msokolov it'd be great to get this in when you have the time!

@msokolov
Copy link
Contributor Author

msokolov commented May 4, 2022

@msokolov it'd be great to get this in when you have the time!

whoa sorry I lost track. I will address the comments and push soon. I have other changes I want to make, relating to testing pre-filtering. Some folks here have been testing and finding very interesting results; I think we are going to want to have a good test harness for different filter selectivity.

@msokolov msokolov merged commit 7fbaa63 into apache:main May 4, 2022
@msokolov msokolov deleted the LUCENE-10504 branch May 4, 2022 22:22
@jtibshirani
Copy link
Member

Thanks @msokolov ! So I'm not waiting in suspense too much, what sorts of interesting results have you found related to prefiltering?

@msokolov
Copy link
Contributor Author

msokolov commented May 4, 2022

I'm trying not to steal the thunder of the folks who are actually working on this, but at a high level: we were seeing prefiltering being more expensive than postfiltering (over collecting and then filtering) for the same "yield" of top K, but were able to recover the cost, and flip the balance the other way by using the non-matching nodes to traverse the graph up until some threshold. Basically, we play with how we update the lower bound, allowing it to increase based on non-matching (the filter) nodes, until we get close to where we want to be. I guess the intuition is that nodes that are never going to be included in the top K are still useful for traversing the graph.

@jtibshirani
Copy link
Member

Super interesting, looking forward to hearing more! I do hope we can stick with a prefiltering-like approach (and just improve its performance), since it feels easier to work with for users. If you request k documents, you always get k back -- there's no guessing about how many candidates you need to retrieve as in post-filtering.

This doesn't sound like what you're talking about, but I did notice that prefiltering can be expensive when the filter matches a lot of documents. Unlike in postfiltering, the filter is not allowed skip any of the docs and you need to convert all the matches into a bit set, which is not cheap.

wjp719 added a commit to wjp719/lucene that referenced this pull request May 10, 2022
* main:
  LUCENE-10532: remove @Slow annotation (apache#832)
  LUCENE-10312: Add PersianStemmer (apache#540)
  LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871)
  LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869)
  Disable liftbot, we have our own tools
  LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860)
  Make CONTRIBUTING.md a bit more succinct (apache#866)
  LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796)
  Add change line for LUCENE-9848
  LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)
wjp719 added a commit to wjp719/lucene that referenced this pull request May 10, 2022
* main: (24 commits)
  LUCENE-10532: remove @Slow annotation (apache#832)
  LUCENE-10312: Add PersianStemmer (apache#540)
  LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (apache#871)
  LUCENE-10436: Reinstate public getdocValuesdocIdSetIterator method on DocValues (apache#869)
  Disable liftbot, we have our own tools
  LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (apache#860)
  Make CONTRIBUTING.md a bit more succinct (apache#866)
  LUCENE-10504: KnnGraphTester to use KnnVectorQuery (apache#796)
  Add change line for LUCENE-9848
  LUCENE-9848 Sort HNSW graph neighbors for construction (apache#862)
  LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md (apache#853)
  LUCENE-10552: KnnVectorQuery has incorrect equals/ hashCode (apache#859)
  LUCENE-10534: MinFloatFunction / MaxFloatFunction calls exists twice (apache#837)
  LUCENE-10188: Give SortedSetDocValues a docValueCount() (apache#663)
  Allow to link to github PR from changes (apache#854)
  LUCENE-10551: improve testing of LowercaseAsciiCompression (apache#858)
  LUCENE-10542: FieldSource exists implementations can avoid value retrieval (apache#847)
  LUCENE-10539: Return a stream of completions from FSTCompletion. (apache#844)
  gradle 7.3.3 quick upgrade (apache#856)
  LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (apache#848)
  ...
jtibshirani pushed a commit that referenced this pull request Jul 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants