-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BaseVectorSimilarityQueryTestCase assumes connected hnsw graph #13260
Conversation
…rityQueryTestCase
@@ -165,6 +171,9 @@ public void testRandomFilter() throws IOException { | |||
|
|||
try (Directory indexStore = getIndexStore(getRandomVectors(numDocs, dim)); | |||
IndexReader reader = DirectoryReader.open(indexStore)) { | |||
if (graphIsDisconnected(reader)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use assumeFalse(graphIsDisconnected(reader))
as this will also log a short notice that the test was not executed at all.
@@ -522,4 +534,20 @@ final Directory getIndexStore(V... vectors) throws IOException { | |||
} | |||
return dir; | |||
} | |||
|
|||
private boolean graphIsDisconnected(IndexReader reader) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit crazy.... Maybe move it to the test-util, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, maybe we want to use in some other test class too. Ill refactor and pull out the field name
lucene/test-framework/src/java/org/apache/lucene/tests/util/hnsw/HnswTestUtil.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @msokolov for fixing this!.....I left some comments
private static int nextClearBit(FixedBitSet bits, int index) { | ||
// Depends on the ghost bits being clear! | ||
long[] barray = bits.getBits(); | ||
assert index >= 0 && index < bits.length() : "index=" + index + ", numBits=" + bits.length(); | ||
int i = index >> 6; | ||
long word = ~(barray[i] >> index); // skip all the bits to the right of index | ||
|
||
if (word != 0) { | ||
return index + Long.numberOfTrailingZeros(word); | ||
} | ||
|
||
while (++i < barray.length) { | ||
word = ~barray[i]; | ||
if (word != 0) { | ||
int next = (i << 6) + Long.numberOfTrailingZeros(word); | ||
if (next >= bits.length()) { | ||
return NO_MORE_DOCS; | ||
} else { | ||
return next; | ||
} | ||
} | ||
} | ||
return NO_MORE_DOCS; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems to be motivated from FixedBitSet#nextSetBit
and would be more suitable for FixedBitSet
class?....I wonder why it does not have it already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it could go there - it's a near clone of FixedBitSet.nextSetBit
but I guess the motivation is to limit the surface area of supported functions. Perhaps if this gets wider use we could move it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess one would also want to discourage its use in performance-sensitive code because it requires extra work beyond what nextSetBit
does
* Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected | ||
* there will only be a single component. If the graph is empty, the returned list will be empty. | ||
*/ | ||
public static List<Integer> componentSizes(HnswGraph hnsw) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the purpose of this function is to be a utility for future use cases(if any) because if we only need to know the graph is fully connected we could just pick a node and traverse only once to check if its size is equal to graph size i.e. early terminate instead of going through all components. Alternatively, I think we should keep componentSizes
as-is but implement isFullyConnected
like below?
public static boolean isFullyConnected(HnswGraph hnsw) throws IOException {
FixedBitSet connectedNodes = new FixedBitSet(hnsw.size());
assert hnsw.size() == hnsw.getNodesOnLevel(0).size();
System.out.println("size=" + hnsw.size());
return connectedNodes.length() == traverseConnectedNodes(hnsw, connectedNodes);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is a tiny bit if lookahead. I would like to be able to use this to identify distinct components so we can patch them up while merging. TBH I look at this whole class as somewhat provisional and expect it to eventually move into the normal util package, and get tested better. In the mean time I agree we can reduce the cost of isFullyConnected in the best case; I can do that. But I'd like to hold on to the componentSizes method even though it will then be unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason to keep it this way (counting components) is if we want to extend it to work with liveDocs or some other Bits filter then we won't necessarily know the expected size in advance, but the topological invariant will still hold.
I think the Unwrappable for FilterReader should be added in a separate PR. For now this is fine. I remember vaguely that I left our FilterReader because of the static method conflicting with the interface. This may be a good job for cleanup 10.x (remove static and just implement interface). |
thanks for the reviews! |
hm a failure popped up when I pushed:
I don't think it's related but I'll check |
@msokolov I have had something similar fail in other code, so it isn't related to your change here. I think its a flaky test. It doesn't reliably reproduce for me. |
…rityQueryTestCase
Description