Add knn result consistency test #14167

Merged: 3 commits into apache:main on Jan 29, 2025

Conversation

@benwtrent (Member)

Inspired by some weird behavior I have seen, this adds a consistency test.

I found that indeed, this fails over some seeds.

Frustratingly, the seeded failures do not seem to be repeatable. But running

    ./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnFloatVectorQuery.testRandomConsistency" -Dtests.iters=1000

results in failures, though not consistently. This seems to indicate some funky race condition.

Obviously, this shouldn't be merged until we figure out the consistency issue.

@mikemccand (Member)

> Frustratingly, the seeded failures do not seem to be repeatable.

Hmm that is bad ... it means there is a test bug or test infra bug (separate from the scary bug this test is chasing!)?

Oh, maybe force SerialMergeScheduler on your RandomIndexWriter? Since CMS (Lucene's default, which RIW will sometimes pick) launches threads, and we don't know how to determinize the JVM's/OS's thread scheduling, that might explain the non-reproducibility? E.g.:

    IndexWriterConfig iwc = LuceneTestCase.newIndexWriterConfig(r, new MockAnalyzer(r));
    iwc.setMergeScheduler(new SerialMergeScheduler());
    RandomIndexWriter riw = new RandomIndexWriter(random(), dir, iwc);

or so?

@msokolov (Contributor)

As for the reproducibility problem, that may be caused by concurrent HNSW merging, which is nondeterministic.

@benwtrent (Member, Author)

@msokolov @mikemccand maybe the consistency I am testing isn't clear:

  • First: index a bunch of vectors.
  • Second: run a single query on the static index to get the top-k.
  • Repeat N times: verify that the exact same query on the exact same index, without changes, returns the same docs and scores.

I am not sure any merging or indexing-time changes would affect this, no?
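The invariant the steps above describe can be sketched in plain Java. This is not the actual Lucene test; it is a pure-JDK illustration where a brute-force dot-product "search" stands in for the real kNN query, and all class and method names are made up:

```java
import java.util.*;
import java.util.stream.*;

public class ConsistencySketch {
    // Brute-force top-k by dot product; a stand-in for the real kNN query.
    // Ties are broken by doc id so the reference ranking is total.
    public static int[] topK(float[][] index, float[] query, int k) {
        Integer[] docs = IntStream.range(0, index.length).boxed().toArray(Integer[]::new);
        Arrays.sort(docs, (a, b) -> {
            int cmp = Float.compare(dot(index[b], query), dot(index[a], query));
            return cmp != 0 ? cmp : Integer.compare(a, b);
        });
        return Arrays.stream(docs).limit(k).mapToInt(Integer::intValue).toArray();
    }

    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // The consistency invariant: repeated searches over an unchanged index
    // must return identical docs (and, implicitly here, identical scores).
    public static boolean consistent(float[][] index, float[] query, int k, int repeats) {
        int[] first = topK(index, query, k);
        for (int i = 1; i < repeats; i++) {
            if (!Arrays.equals(first, topK(index, query, k))) return false;
        }
        return true;
    }
}
```

A single-threaded exhaustive search like this trivially satisfies the invariant; the PR's question is whether the multi-segment, multi-threaded HNSW path does too.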

@msokolov (Contributor)

I think our comments relate to the observation that the test does not reproducibly fail with the same seed.

@benwtrent (Member, Author)

> I think our comments relate to the observation that the test does not reproducibly fail with the same seed

🤦 for sure. Let me see if I can shore it up.

@benwtrent (Member, Author)

OK, I cleaned it all up, and now have two separate tests: one multi-threaded and one single-threaded.

The multi-threaded one is the only one that fails, and only periodically, which explains the difficulty in replicating. Threads might be racing to explore their segments first and thus stop exploring other graphs sooner than in other runs.

As for the single-threaded test, I haven't had it fail in tens of thousands of runs. That doesn't 100% mean there isn't an issue there as well; I just haven't seen a failure yet.

@benwtrent (Member, Author)

OK, if I change the query to never use MultiLeafKnnCollector, the multi-threaded consistency test passes. But when using that collector, it fails a couple of times over 10k+ repeats.

@mayya-sharipova (Contributor)

@benwtrent Thanks for raising this. This indeed happens because of MultiLeafKnnCollector and search threads exchanging information about the globally collected results. Because it is nondeterministic when each segment's search thread shares information with the global queue, we may get inconsistent results between runs.

So far, I have not found a way to make it deterministic.
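To make the mechanism concrete, here is a pure-JDK sketch (a deliberately simplified stand-in, not the actual MultiLeafKnnCollector code; the class name, the step-level `schedule`, and the give-up rule are all invented for illustration). Each "segment" visits its candidates in a fixed graph order and gives up once its next candidate scores below the shared global top-k floor, so the thread interleaving decides which candidates are ever seen:

```java
import java.util.*;

public class SharedThresholdSketch {
    // Simulate per-segment searchers pruning against a shared global top-k
    // min-heap. `schedule` lists which segment takes the next step; a segment
    // that has pruned itself skips its remaining steps.
    public static List<Double> run(double[][] segments, int[] schedule, int k) {
        PriorityQueue<Double> global = new PriorityQueue<>(); // min-heap = shared top-k
        int[] pos = new int[segments.length];
        boolean[] stopped = new boolean[segments.length];
        for (int seg : schedule) {
            if (stopped[seg] || pos[seg] >= segments[seg].length) continue;
            double score = segments[seg][pos[seg]++];
            if (global.size() == k && score < global.peek()) {
                stopped[seg] = true; // below the shared floor: give up on this segment
                continue;
            }
            global.offer(score);
            if (global.size() > k) global.poll();
        }
        List<Double> result = new ArrayList<>(global);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

With segments `{0.9, 0.8}` and `{0.1, 0.95}` and k=2, running segment 0 first makes segment 1 give up before reaching its 0.95 candidate, while the reverse order finds it: same index, same query, different hits, purely from scheduling.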

@benwtrent (Member, Author)

@mayya-sharipova maybe a search-time flag is possible, but it would stink to have an "inconsistent but fast" flag that users then have to worry about.

I don't know of another query where multiple passes over a static dataset can return different docs.

It seems that the default behavior should be consistency.

I think we need to do one of the following:

  • fix multi-threaded consistency with information sharing (my first choice if I had a magic wand)
  • turn off direct multi-threaded queries in kNN (my second choice)
  • turn off information sharing between segments (the last resort)

I would rather keep doing less work via information sharing and use fewer threads than do more work while also using more threads. However, if I can magically have both, I prefer that.

I am curious about the opinions of others here: @jpountz @msokolov @mikemccand

Another consideration is whether this warrants a bugfix in Lucene 9.12.x.

@jpountz (Contributor) commented Jan 27, 2025

> I don't know of another query where multiple passes over a static dataset can return different docs.

Currently, this does not happen because Lucene only enables so-called "rank-safe" optimizations to top-k query processing for lexical search. So regardless of how search threads race with one another, Top(ScoreDoc|Field)CollectorManager is guaranteed to always return the same (correct) hits. However, if we were to enable "rank-unsafe" optimizations (e.g. #12446), we would observe the same issue that you are seeing here.

I suspect that users may indeed struggle with this behavior, e.g. if running the same query multiple times on an e-commerce website doesn't return the same hits every time. It probably makes it hard to write integration tests as well. I believe the Anserini IR toolkit wouldn't be happy either, given how much it cares about reproducibility. The direction you are suggesting makes sense to me; I have no idea how hard it is.
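The "rank-safe" property can be illustrated with a small pure-JDK sketch (an illustration only, not the actual Top(ScoreDoc|Field)CollectorManager code; the `Hit` record and `merge` helper are invented here): if every slice independently computes its exact top-k, the reduce step is commutative, so the final hits cannot depend on which thread finished first. Breaking score ties by doc id keeps the merge fully deterministic:

```java
import java.util.*;

public class RankSafeMergeSketch {
    public record Hit(int doc, double score) {}

    // Merge per-slice exact top-k lists into a global top-k. Sorting by
    // (score desc, doc asc) makes the result independent of slice order.
    public static List<Hit> merge(List<List<Hit>> perSlice, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> slice : perSlice) all.addAll(slice);
        all.sort(Comparator.comparingDouble((Hit h) -> -h.score())
                           .thenComparingInt(Hit::doc));
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

The race the PR exposes comes from the kNN path additionally sharing competitiveness information between slices *while they run*, which is exactly what this rank-safe reduce avoids.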

@jpountz (Contributor) commented Jan 27, 2025

Somewhat related, thinking out loud: I have been wondering about the best way to parallelize top-k query processing. Lexical search has a similar issue to knn search in that it is not very CPU-efficient to let search threads independently make similar decisions about what it means for a hit to be competitive. This made me wonder if it would be a better trade-off to let just one slice run on its own first, and then let all other N-1 slices run in parallel with one another, taking advantage of what we "learned" from processing the first slice. If these N-1 slices only looked at what we learned from this first slice and ignored everything about any other slice, I believe there wouldn't be any consistency issues due to races, while query processing would still be mostly parallel and likely more CPU-efficient (as in total CPU time per query).
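The two-phase idea above could look something like this toy sketch (a hypothetical helper, not a Lucene API; class name, thread count, and the simple threshold filter are all assumptions). Phase one scores the first slice on the calling thread and freezes its kth-best score as an immutable threshold; phase two scores the remaining slices in parallel against only that frozen threshold. Because no mutable state is shared while the threads run and the final merge is order-independent, every run returns the same hits:

```java
import java.util.*;
import java.util.concurrent.*;

public class TwoPhaseSketch {
    public static List<Double> search(List<double[]> slices, int k) {
        // Phase 1: exact scoring of slice 0; its kth-best score becomes the
        // frozen pruning threshold for everyone else.
        double[] first = slices.get(0).clone();
        Arrays.sort(first);
        double threshold = first.length >= k ? first[first.length - k] : Double.NEGATIVE_INFINITY;

        List<Double> all = new ArrayList<>();
        for (double s : first) all.add(s);

        // Phase 2: remaining slices in parallel, reading only the frozen threshold.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<Double>>> futures = new ArrayList<>();
            for (int i = 1; i < slices.size(); i++) {
                final double[] slice = slices.get(i);
                futures.add(pool.submit(() -> {
                    List<Double> competitive = new ArrayList<>();
                    for (double s : slice) if (s >= threshold) competitive.add(s);
                    return competitive;
                }));
            }
            for (Future<List<Double>> f : futures) all.addAll(f.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        all.sort(Collections.reverseOrder());
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

The trade-off is that the parallel slices prune against a possibly stale threshold (they never see each other's results), so they may do some wasted work compared to live sharing, in exchange for determinism.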

@benwtrent (Member, Author)

> This made me wonder if it would be a better trade-off to let just one slice run on its own first, and then let all other N-1 slices run in parallel with one another,

I really like this idea. For kNN search, it seems best to take the largest tiers, gather information from them, and then run the smaller tiers in parallel.

The major downside on the kNN side is that there is no slicing at all: every segment is just its own worker, which is sort of crazy. We should, at a minimum, combine all the tiny segments together into a single thread.

What do you think @mayya-sharipova? Slice the segments, pick the "largest" slice, and search it in the current thread, then use that information to help the subsequent parallel threads?
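The "combine all the tiny segments" part could look something like this greedy grouping, loosely in the spirit of how IndexSearcher slices segments for lexical search (the class, method, and threshold here are hypothetical, chosen only to illustrate the shape of the idea): sort segments by size descending so each large segment gets its own slice, and pack segments below a size floor together until a slice reaches the floor:

```java
import java.util.*;

public class SliceSketch {
    // Greedily pack segment doc counts into slices: any segment at or above
    // minDocsPerSlice effectively gets its own slice (it closes the slice it
    // starts); smaller segments accumulate until the floor is reached.
    public static List<List<Integer>> slices(int[] segmentDocCounts, int minDocsPerSlice) {
        Integer[] order = new Integer[segmentDocCounts.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(segmentDocCounts[b], segmentDocCounts[a]));

        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int currentDocs = 0;
        for (int seg : order) {
            current.add(seg);
            currentDocs += segmentDocCounts[seg];
            if (currentDocs >= minDocsPerSlice) {
                result.add(current);
                current = new ArrayList<>();
                currentDocs = 0;
            }
        }
        if (!current.isEmpty()) result.add(current); // leftover tiny segments share a slice
        return result;
    }
}
```

Because the largest segment sorts first, the first slice is the natural candidate to search in the calling thread before fanning out the rest.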

@msokolov (Contributor) commented Jan 28, 2025 via email

@benwtrent (Member, Author)

To aid the conversation, I opened an issue: #14180

I plan on merging this new test, but with the multi-threaded case muted until we can fix #14180.

@benwtrent benwtrent merged commit feb0e18 into apache:main Jan 29, 2025
5 checks passed
@benwtrent benwtrent deleted the test/add-knn-consistency-test branch January 29, 2025 15:12
benwtrent added a commit that referenced this pull request Jan 29, 2025