Add knn result consistency test #14167

Merged: 3 commits into apache:main on Jan 29, 2025

Conversation

@benwtrent (Member)

Inspired by some weird behavior I have seen, this adds a consistency test.

I found that indeed, this fails over some seeds.

Frustratingly, the seeded failures do not seem to be repeatable. But running

    ./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestSeededKnnFloatVectorQuery.testRandomConsistency" -Dtests.iters=1000

results in failures, though not consistently. This seems to indicate some funky race condition.

Obviously, this shouldn't be merged until we figure out the consistency issue.

@mikemccand (Member)

> Frustratingly, the seeded failures do not seem to be repeatable.

Hmm that is bad ... it means there is a test bug or test infra bug (separate from the scary bug this test is chasing!)?

Oh, maybe force SerialMergeScheduler on your RandomIndexWriter? Since CMS (Lucene's default, which RIW will sometimes pick) launches threads, and we don't know how to determinize the JVM's/OS's thread scheduling, that might explain the non-reproducibility? E.g.:

    IndexWriterConfig iwc = LuceneTestCase.newIndexWriterConfig(r, new MockAnalyzer(r));
    iwc.setMergeScheduler(new SerialMergeScheduler());
    RandomIndexWriter riw = new RandomIndexWriter(random(), dir, iwc);

or so?

@msokolov (Contributor)

As for the reproducibility problem, that may be caused by concurrent HNSW merging, which is nondeterministic.

@benwtrent (Member, Author)

@msokolov @mikemccand maybe the consistency I am testing isn't clear:

  • First: index a bunch of vectors.
  • Second: run a single query on the static index to get the top-k.
  • Repeat N times: verify that the exact same query on the exact same index, without changes, returns the same docs and scores.

I am not sure any merging or indexing-time changes would affect this, no?
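The invariant the steps above describe can be sketched in plain Java. This is not the actual Lucene test; it is a pure-JDK illustration where a brute-force dot-product "search" stands in for the real kNN query, and all class and method names are made up:

```java
import java.util.*;
import java.util.stream.*;

public class ConsistencySketch {
    // Brute-force top-k by dot product; a stand-in for the real kNN query.
    // Ties are broken by doc id so the reference ranking is total.
    public static int[] topK(float[][] index, float[] query, int k) {
        Integer[] docs = IntStream.range(0, index.length).boxed().toArray(Integer[]::new);
        Arrays.sort(docs, (a, b) -> {
            int cmp = Float.compare(dot(index[b], query), dot(index[a], query));
            return cmp != 0 ? cmp : Integer.compare(a, b);
        });
        return Arrays.stream(docs).limit(k).mapToInt(Integer::intValue).toArray();
    }

    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // The consistency invariant: repeated searches over an unchanged index
    // must return identical docs (and, implicitly here, identical scores).
    public static boolean consistent(float[][] index, float[] query, int k, int repeats) {
        int[] first = topK(index, query, k);
        for (int i = 1; i < repeats; i++) {
            if (!Arrays.equals(first, topK(index, query, k))) return false;
        }
        return true;
    }
}
```

A single-threaded exhaustive search like this trivially satisfies the invariant; the PR's question is whether the multi-segment, multi-threaded HNSW path does too.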

@msokolov (Contributor)

I think our comments relate to the observation that the test does not reproducibly fail with the same seed.

@benwtrent (Member, Author)

> I think our comments relate to the observation that the test does not reproducibly fail with the same seed

🤦 for sure. Let me see if I can shore it up.

@benwtrent (Member, Author)

OK, I cleaned it all up, and now have two separate tests: one multi-threaded and one single-threaded.

The multi-threaded one is the only one that fails, and only periodically, which explains the difficulty in replicating. Threads might be racing to explore their segments first and thus stop exploring other graphs sooner than in other runs.

As for the single-threaded test, I haven't had it fail in tens of thousands of runs. That doesn't 100% mean there isn't an issue there as well; I just haven't seen a failure yet.

@benwtrent (Member, Author)

OK, if I change the query to never use MultiLeafKnnCollector, the multi-threaded consistency test passes. But when using that collector, it fails a couple of times over 10k+ repeats.

@mayya-sharipova (Contributor)

@benwtrent Thanks for raising this. This indeed happens because of MultiLeafKnnCollector and search threads exchanging information about the globally collected results. Because it is nondeterministic when each segment's search thread shares information with the global queue, we may get inconsistent results between runs.

So far, I have not found a way to make it deterministic.
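To make the mechanism concrete, here is a pure-JDK sketch (a deliberately simplified stand-in, not the actual MultiLeafKnnCollector code; the class name, the step-level `schedule`, and the give-up rule are all invented for illustration). Each "segment" visits its candidates in a fixed graph order and gives up once its next candidate scores below the shared global top-k floor, so the thread interleaving decides which candidates are ever seen:

```java
import java.util.*;

public class SharedThresholdSketch {
    // Simulate per-segment searchers pruning against a shared global top-k
    // min-heap. `schedule` lists which segment takes the next step; a segment
    // that has pruned itself skips its remaining steps.
    public static List<Double> run(double[][] segments, int[] schedule, int k) {
        PriorityQueue<Double> global = new PriorityQueue<>(); // min-heap = shared top-k
        int[] pos = new int[segments.length];
        boolean[] stopped = new boolean[segments.length];
        for (int seg : schedule) {
            if (stopped[seg] || pos[seg] >= segments[seg].length) continue;
            double score = segments[seg][pos[seg]++];
            if (global.size() == k && score < global.peek()) {
                stopped[seg] = true; // below the shared floor: give up on this segment
                continue;
            }
            global.offer(score);
            if (global.size() > k) global.poll();
        }
        List<Double> result = new ArrayList<>(global);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

With segments `{0.9, 0.8}` and `{0.1, 0.95}` and k=2, running segment 0 first makes segment 1 give up before reaching its 0.95 candidate, while the reverse order finds it: same index, same query, different hits, purely from scheduling.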

@benwtrent (Member, Author)

@mayya-sharipova maybe a search-time flag is possible, but it would stink to have an "inconsistent but fast" flag that users then have to worry about.

I don't know of another query where multiple passes over a static dataset can return different docs.

It seems that the default behavior should be consistency.

I think we need to do one of the following:

  • fix multi-threaded consistency with information sharing (my first choice if I had a magic wand)
  • turn off direct multi-threaded queries in kNN (my second choice)
  • turn off information sharing between segments (the last resort)

I would rather keep doing less work via information sharing and use fewer threads than do more work while also using more threads. However, if I can magically have both, I prefer that.

I am curious about the opinions of others here: @jpountz @msokolov @mikemccand

Another consideration is whether this warrants a bugfix in Lucene 9.12.x.

@jpountz (Contributor) commented Jan 27, 2025

> I don't know of another query where multiple passes over a static dataset can return different docs.

Currently, this does not happen because Lucene only enables so-called "rank-safe" optimizations to top-k query processing for lexical search. So regardless of how search threads race with one another, Top(ScoreDoc|Field)CollectorManager is guaranteed to always return the same (correct) hits. However, if we were to enable "rank-unsafe" optimizations (e.g. #12446), we would observe the same issue that you are seeing here.

I suspect that users may indeed struggle with this behavior, e.g. if running the same query multiple times on an e-commerce website doesn't return the same hits every time. It probably makes it hard to write integration tests as well. I believe the Anserini IR toolkit wouldn't be happy either, given how much it cares about reproducibility. The direction you are suggesting makes sense to me; I have no idea how hard it is.
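The "rank-safe" property can be illustrated with a small pure-JDK sketch (an illustration only, not the actual Top(ScoreDoc|Field)CollectorManager code; the `Hit` record and `merge` helper are invented here): if every slice independently computes its exact top-k, the reduce step is commutative, so the final hits cannot depend on which thread finished first. Breaking score ties by doc id keeps the merge fully deterministic:

```java
import java.util.*;

public class RankSafeMergeSketch {
    public record Hit(int doc, double score) {}

    // Merge per-slice exact top-k lists into a global top-k. Sorting by
    // (score desc, doc asc) makes the result independent of slice order.
    public static List<Hit> merge(List<List<Hit>> perSlice, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> slice : perSlice) all.addAll(slice);
        all.sort(Comparator.comparingDouble((Hit h) -> -h.score())
                           .thenComparingInt(Hit::doc));
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

The race the PR exposes comes from the kNN path additionally sharing competitiveness information between slices *while they run*, which is exactly what this rank-safe reduce avoids.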

@jpountz (Contributor) commented Jan 27, 2025

Somewhat related, thinking out loud: I have been wondering about the best way to parallelize top-k query processing. Lexical search has a similar issue to knn search in that it is not very CPU-efficient to let search threads independently make similar decisions about what it means for a hit to be competitive. This made me wonder if it would be a better trade-off to let just one slice run on its own first, and then let all other N-1 slices run in parallel with one another, taking advantage of what we "learned" from processing the first slice. If these N-1 slices only looked at what we learned from this first slice and ignored everything about any other slice, I believe there wouldn't be any consistency issues due to races, while query processing would still be mostly parallel and likely more CPU-efficient (as in total CPU time per query).
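The two-phase idea above could look something like this toy sketch (a hypothetical helper, not a Lucene API; class name, thread count, and the simple threshold filter are all assumptions). Phase one scores the first slice on the calling thread and freezes its kth-best score as an immutable threshold; phase two scores the remaining slices in parallel against only that frozen threshold. Because no mutable state is shared while the threads run and the final merge is order-independent, every run returns the same hits:

```java
import java.util.*;
import java.util.concurrent.*;

public class TwoPhaseSketch {
    public static List<Double> search(List<double[]> slices, int k) {
        // Phase 1: exact scoring of slice 0; its kth-best score becomes the
        // frozen pruning threshold for everyone else.
        double[] first = slices.get(0).clone();
        Arrays.sort(first);
        double threshold = first.length >= k ? first[first.length - k] : Double.NEGATIVE_INFINITY;

        List<Double> all = new ArrayList<>();
        for (double s : first) all.add(s);

        // Phase 2: remaining slices in parallel, reading only the frozen threshold.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<Double>>> futures = new ArrayList<>();
            for (int i = 1; i < slices.size(); i++) {
                final double[] slice = slices.get(i);
                futures.add(pool.submit(() -> {
                    List<Double> competitive = new ArrayList<>();
                    for (double s : slice) if (s >= threshold) competitive.add(s);
                    return competitive;
                }));
            }
            for (Future<List<Double>> f : futures) all.addAll(f.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        all.sort(Collections.reverseOrder());
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

The trade-off is that the parallel slices prune against a possibly stale threshold (they never see each other's results), so they may do some wasted work compared to live sharing, in exchange for determinism.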

@benwtrent (Member, Author)

> This made me wonder if it would be a better trade-off to let just one slice run on its own first, and then let all other N-1 slices run in parallel with one another,

I really like this idea. For kNN search, it seems best to take the largest tiers, gather information from them, and then run the smaller tiers in parallel.

The major downside on the kNN side is that there is no slicing at all: every segment is just its own worker, which is sort of crazy. We should, at a minimum, combine all the tiny segments together into a single thread.

What do you think @mayya-sharipova? Slice the segments, pick the "largest" slice, and search it in the current thread, then use that information to help the subsequent parallel threads?
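The "combine all the tiny segments" part could look something like this greedy grouping, loosely in the spirit of how IndexSearcher slices segments for lexical search (the class, method, and threshold here are hypothetical, chosen only to illustrate the shape of the idea): sort segments by size descending so each large segment gets its own slice, and pack segments below a size floor together until a slice reaches the floor:

```java
import java.util.*;

public class SliceSketch {
    // Greedily pack segment doc counts into slices: any segment at or above
    // minDocsPerSlice effectively gets its own slice (it closes the slice it
    // starts); smaller segments accumulate until the floor is reached.
    public static List<List<Integer>> slices(int[] segmentDocCounts, int minDocsPerSlice) {
        Integer[] order = new Integer[segmentDocCounts.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(segmentDocCounts[b], segmentDocCounts[a]));

        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int currentDocs = 0;
        for (int seg : order) {
            current.add(seg);
            currentDocs += segmentDocCounts[seg];
            if (currentDocs >= minDocsPerSlice) {
                result.add(current);
                current = new ArrayList<>();
                currentDocs = 0;
            }
        }
        if (!current.isEmpty()) result.add(current); // leftover tiny segments share a slice
        return result;
    }
}
```

Because the largest segment sorts first, the first slice is the natural candidate to search in the calling thread before fanning out the rest.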

@msokolov (Contributor) commented Jan 28, 2025 via email

@benwtrent (Member, Author)

To aid the conversation, I opened an issue: #14180

I plan on merging this new test, but with the multi-threaded case muted until we can fix #14180.

@benwtrent benwtrent merged commit feb0e18 into apache:main Jan 29, 2025
5 checks passed
@benwtrent benwtrent deleted the test/add-knn-consistency-test branch January 29, 2025 15:12
benwtrent added a commit that referenced this pull request Jan 29, 2025