explore truncation of search results based on overlap in the `Index.find(...)` protocol. #1925

ctb · 2022-04-03T16:34:12Z

While working on #1808, I started thinking about the find protocol implementation for several Index types (SBT, LCA_Database, and the maybe-soon SqliteIndex).

The internalIndex.find(...) implementations work as follows:

potential matches are discovered based on size of overlap between query and subject
these matches are then processed in order of that overlap, and Jaccard similarity or containment is calculated;
results are checked against the threshold and any picklist and returned if they pass

This is a nice optimization that takes advantage of intersection operations that can be done quickly on these databases; note that SBT does something slightly different that could maybe be adjusted per this issue.

So, why bring this up?

There is a potential optimization for containment-based searches where we could truncate the results checking. Since containment is monotonically decreasing in the results-checking loop, once the containment drops below the the threshold, we could stop checking any more results. This cannot be done for Jaccard similarity, however, because of situations where you might have a large Jaccard but a small overlap: specifically, consider:

for three sketches A, B, C,           
    |A intersect B| is greater than |A intersect C|                           
_but_                                                                     
|A jaccard B| is less than |A intersect B|

The current implementation does not do any truncation, because the Index.find protocol doesn't support it. We could change the protocol to support this level of flexibility.

Elsewhere, I'm introducing tests (*jaccard_ordering*) that will test that there is no improper truncation in the future.

Note, the Index.find(...) protocol was introduced in #1392 and #1477.

The text was updated successfully, but these errors were encountered:

ctb · 2022-04-21T13:21:15Z

A test to make sure truncation doesn't happen in Jaccard similarity was introduced in #1926.

ctb mentioned this issue Apr 3, 2022

[MRG] add some tests for Jaccard output ordering #1926

Merged

ctb mentioned this issue Jul 3, 2023

Implement find as an iterator in SBT to provide library users more options #2665

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

explore truncation of search results based on overlap in the `Index.find(...)` protocol. #1925

explore truncation of search results based on overlap in the `Index.find(...)` protocol. #1925

ctb commented Apr 3, 2022

ctb commented Apr 21, 2022

explore truncation of search results based on overlap in the Index.find(...) protocol. #1925

explore truncation of search results based on overlap in the Index.find(...) protocol. #1925

Comments

ctb commented Apr 3, 2022

ctb commented Apr 21, 2022

explore truncation of search results based on overlap in the `Index.find(...)` protocol. #1925

explore truncation of search results based on overlap in the `Index.find(...)` protocol. #1925