testMergeStability failing for Knn formats #13640

Closed
benwtrent opened this issue Aug 9, 2024 · 16 comments · Fixed by #14172

Comments

@benwtrent
Member

Description

All KNN formats are periodically failing testMergeStability.

I have verified it's due to #13566.

The stability failure is due to a different size of the vex file (i.e. the vector graph connections).
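For readers unfamiliar with the test, here is a minimal sketch of the kind of check involved, assuming the test compares total bytes per file extension after one merge and after a re-merge. The class and method names (`MergeStabilitySketch`, `sizesByExtension`, `isStable`) and the file sizes are hypothetical illustrations, not Lucene's actual test code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToLongFunction;

/** Hypothetical sketch of a merge-stability check; not Lucene's actual test code. */
public class MergeStabilitySketch {

  /** Sums file sizes per extension, so "_0.vex" and "_1.vex" both count toward "vex". */
  public static Map<String, Long> sizesByExtension(
      List<String> fileNames, ToLongFunction<String> sizeOf) {
    Map<String, Long> totals = new TreeMap<>();
    for (String name : fileNames) {
      int dot = name.lastIndexOf('.');
      String ext = dot < 0 ? "" : name.substring(dot + 1);
      totals.merge(ext, sizeOf.applyAsLong(name), Long::sum);
    }
    return totals;
  }

  /** Stable here means: re-merging leaves every per-extension total unchanged. */
  public static boolean isStable(Map<String, Long> firstMerge, Map<String, Long> secondMerge) {
    return firstMerge.equals(secondMerge);
  }

  public static void main(String[] args) {
    // Made-up sizes: the vex file grows by a few bytes on the second merge.
    Map<String, Long> first =
        sizesByExtension(List.of("_2.vec", "_2.vex"), n -> n.endsWith(".vex") ? 120L : 4096L);
    Map<String, Long> second =
        sizesByExtension(List.of("_3.vec", "_3.vex"), n -> n.endsWith(".vex") ? 128L : 4096L);
    System.out.println("stable: " + isStable(first, second));
  }
}
```

Under this reading, any extra graph connection written during the second merge changes the vex total and fails the comparison, even if search results are unaffected.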

Gradle command to reproduce

./gradlew test --tests TestLucene99HnswScalarQuantizedVectorsFormat.testMergeStability -Dtests.seed=84298DFFA7C134B7 -Dtests.locale=kn -Dtests.timezone=Hongko -Dtests.asserts=true -Dtests.file.encoding=UTF-8
@benwtrent
Member Author

@msokolov ^ I haven't been able to look into fixing it yet. Just now noticed it.

@msokolov
Contributor

hmm thanks I'll take a look soon

@msokolov
Contributor

I didn't know about this constraint until now. Basically what happens is that during merge we check for disconnected components and attempt to add connections to connect them. So it makes sense that we might be adding some bytes to the vex file. Maybe a way to avoid this is to skip recreating the HNSW graph when merging a single segment. Honestly I don't know why we would be doing that. I'll dig a bit more.
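To illustrate the idea of checking for disconnected components and adding connections, here is a minimal sketch on a plain adjacency-list graph. It assumes symmetric (undirected) edges and a naive strategy of linking each extra component to the first one; `ConnectComponentsSketch` and its methods are hypothetical and do not reflect how Lucene's connectComponents picks its links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of component reconnection; not Lucene's implementation. */
public class ConnectComponentsSketch {

  /** Returns one representative node per connected component, found by BFS. */
  public static List<Integer> componentRoots(List<List<Integer>> neighbors) {
    boolean[] seen = new boolean[neighbors.size()];
    List<Integer> roots = new ArrayList<>();
    for (int start = 0; start < neighbors.size(); start++) {
      if (seen[start]) {
        continue;
      }
      roots.add(start); // first unseen node starts a new component
      Deque<Integer> queue = new ArrayDeque<>();
      queue.add(start);
      seen[start] = true;
      while (!queue.isEmpty()) {
        int node = queue.poll();
        for (int next : neighbors.get(node)) {
          if (!seen[next]) {
            seen[next] = true;
            queue.add(next);
          }
        }
      }
    }
    return roots;
  }

  /** Links every extra component to the first one; returns the number of links added. */
  public static int connectComponents(List<List<Integer>> neighbors) {
    List<Integer> roots = componentRoots(neighbors);
    for (int i = 1; i < roots.size(); i++) {
      int a = roots.get(0);
      int b = roots.get(i);
      neighbors.get(a).add(b); // assumption: edges are stored in both directions
      neighbors.get(b).add(a);
    }
    return Math.max(0, roots.size() - 1);
  }
}
```

Each added link is an extra neighbor entry that has to be stored, which is consistent with the vex file growing when reconnection kicks in during a merge.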

@msokolov
Contributor

OK, I guess we would need to actually build a graph when merging a single segment in case there are deletions. In any case it would be nice if the graph reconnection were stable. This test exposes some interesting problems! Yay for our tests.

@msokolov
Contributor

msokolov commented Aug 11, 2024

Apparently that patch did not fix all the things; this failure was generated on that patched version (reproduced for me on the 9x branch):

./gradlew test --tests TestPerFieldKnnVectorsFormat.testMergeStability -Dtests.seed=A6835D6A0735B851 -Dtests.multiplier=3 -Dtests.locale=ar-IL -Dtests.timezone=Europe/Berlin -Dtests.asserts=true -Dtests.file.encoding=UTF-8

@msokolov
Contributor

Found and fixed a branch_9x-only problem. Hopefully this calms down now.

@benwtrent
Member Author

Not to be the bearer of bad news:

./gradlew test --tests TestLucene95HnswVectorsFormat.testMergeStability -Dtests.seed=3D48DB65BA75CD03 -Dtests.locale=uz-Latn-UZ -Dtests.timezone=Europe/Tirane -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Fails on main.

It's interesting that the old formats are adjusting their outputs at all. I would expect all older formats to be unchanged. I wonder if the default HnswGraph builder behavior changed.

@msokolov
Contributor

See #13654

@msokolov
Contributor

No, it's not HnswGraphBuilder per se but HnswConcurrentMergeBuilder, which calls finish() and thence connectComponents. Since it's part of the o.a.l.u.hnsw package, it's not versioned and is shared by all these different index versions. This seems like a testing-only issue to me, since when we read old indexes and merge them we write in the new format and expect changes.

@msokolov
Contributor

It's been a few days since I've seen any automated failures and all known instances have been addressed. I think we can close, wdyt @benwtrent ?

@benwtrent
Member Author

agreed, closing.

@benwtrent
Member Author

@msokolov it's reared its head again.

./gradlew test --tests TestPerFieldKnnVectorsFormat.testMergeStability -Dtests.seed=FF1182F3FC600FF -Dtests.locale=mni-Beng-IN -Dtests.timezone=SystemV/PST8 -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Makes me think that we need to mark this merging as unstable.

I wanted to verify the format, and confirmed it's:

 Lucene99HnswVectorsFormat(name=Lucene99HnswVectorsFormat, maxConn=5, beamWidth=40, flatVectorFormat=Lucene99FlatVectorsFormat(vectorsScorer=DefaultFlatVectorScorer()))

If I turn off connectComponents, this seed passes.

@benwtrent
Member Author

I discovered two other weird behaviors while digging into this test failure, but addressing them did not seem to fix this inconsistency: #14174

@msokolov
Contributor

Curious if you tried git bisect to see if there was any recent change that reintroduced this?

@msokolov
Contributor

I did the git bisect dance and found that this test seed starts failing with the commit "Randomize KnnVector codec params in RandomCodec". That was basically a test-only change, though, so I guess it merely exposed this condition.

@benwtrent
Member Author

Interesting, the randomized case isn't anything special. It's just a plain ol' Lucene99Hnsw index. No quantization or anything :/
