Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Rework the find functionality for Index classes #1392

Merged
merged 117 commits into from
Apr 22, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Mar 13, 2021

"They who control find control the universe."

This PR implements a common Index.find generator function for Jaccard similarity and containment on Index classes.

This new generator function takes a JaccardSearch object and a query signature as inputs, and yields all signatures that meet the criteria.

The JaccardSearch object in src/sourmash/search.py is the workhorse for find; it scores potential matches (and can truncate searches) based on query_size, intersection, subject_size, and union, which is sufficient to calculate similarity, containment, and max_containment.

This provides some nice simplifications -

Moreover,

  • the logic around downsampling and database search is greatly simplified, and is now much more thoroughly tested;
  • the logic for abundance searches is split off into a separate function, which seems appropriate given that it's largely unused and largely untested...;

The overall result is a pretty good code consolidation and simplification.

This PR also:

Specifically, this PR:

Fixes #829 - sourmash categorize now takes all database types
Fixes #1377 - provides a location property on Index objects
Fixes #1389 - --best-only now works for similarity and containment
Fixes #1454 - sourmash index now flattens SBT leaves.

Larger thoughts:

Questions and comments for reviewers

  • One goal for this is to better support a prefetch linear pass search, other makeshift strategies for large scale database search - the "greyhound" issue #1226. I'm not clear on whether I've gotten the abstractions right in such a way that they can be used/optimized in Rust code. Comments please!
  • the search and gather methods in the Index class, and the find functions in the various Index subclasses, are really the key changes in this PR.
  • this breaks backwards compatibility for what I think is a bug (at the very least it is previously undocumented behavior :) - we allowed storage of abund signatures in SBTs, and sourmash categorize took advantage of this to return angular similarity calculations when the query signature had abundances.
    • This is literally the only place in the code base where this was used, so... I disabled it. See "Larger thoughts", above, and check out sourmash index does not flatten the signatures when building an SBT #1454.
    • In terms of user notification, sourmash now produces an appropriate error message requiring the user to specify --ignore-abundance explicitly, which should steer people in the right direction and minimize the impact of this.
  • this breaks backwards compatibility for loading v1 and v2 SBT databases. I think this is correct. Briefly,

TODO items:

  • look into v1/v2 SBT breakage/tests in more detail.
  • fix docstring in sbt.py and lca_db.py
  • look into test_sbt.py lines 419 and 458, why are they not run!?
  • see @ctb note in test_search.py
  • look into commented out line in test_sbt_categorize_ignore_abundance_2
  • create new "good first issue" to replace the tuple indices in results checking in test_index.py with the new IndexSearchResult namedtuple attributes.
  • fix up and test categorize code
  • write more better tests of sourmash gather with abundance signatures & save unassigned
  • double check multigather tests; look at save unassigned there, too.
  • look into removing categorize use of LoadSingleSignature refactor LoadSingleSignatures? #1077
  • write tests for new get_search_obj function
  • clean up, document, and test all of the downsampling etc code in find
  • implement/test find_best code fixing 'search --best-only' on SBTs only works for similarity, not for containment #1389
  • test num logic in minhash.py
  • implement unload_data

@codecov
Copy link

codecov bot commented Mar 13, 2021

Codecov Report

Merging #1392 (c8d8cd6) into latest (eb2b210) will increase coverage by 0.13%.
The diff coverage is 96.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1392      +/-   ##
==========================================
+ Coverage   89.58%   89.71%   +0.13%     
==========================================
  Files         122      123       +1     
  Lines       18989    19464     +475     
  Branches     1455     1483      +28     
==========================================
+ Hits        17011    17463     +452     
- Misses       1750     1775      +25     
+ Partials      228      226       -2     
Flag Coverage Δ
python 94.86% <97.71%> (+0.06%) ⬆️
rust 67.20% <0.00%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/ffi/minhash.rs 0.00% <0.00%> (ø)
src/core/src/sketch/minhash.rs 91.31% <ø> (ø)
src/sourmash/sbt.py 80.82% <86.53%> (-3.13%) ⬇️
src/sourmash/minhash.py 92.69% <89.47%> (+0.05%) ⬆️
src/sourmash/commands.py 83.84% <94.33%> (+0.50%) ⬆️
src/sourmash/search.py 93.23% <95.74%> (+2.00%) ⬆️
tests/test_search.py 98.48% <98.48%> (ø)
src/sourmash/index.py 94.62% <98.85%> (+2.51%) ⬆️
src/sourmash/cli/categorize.py 100.00% <100.00%> (ø)
src/sourmash/lca/lca_db.py 92.30% <100.00%> (+0.96%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb2b210...c8d8cd6. Read the comment docs.

@ctb ctb changed the title [WIP] Rework the find functionality for Index classes [MRG] Rework the find functionality for Index classes Apr 19, 2021
@ctb
Copy link
Contributor Author

ctb commented Apr 19, 2021

OK, I think this is done, provisionally.

@luizirber
Copy link
Member

Review ongoing, but while I don't finish it: the gather implementation for SBTs became way slower, and memory consumption also went up:

image

@ctb
Copy link
Contributor Author

ctb commented Apr 19, 2021

yeesh! that's not good!

I wonder if unload_data's default changed?

@ctb
Copy link
Contributor Author

ctb commented Apr 19, 2021

(there's no obvious algorithmic reason for the time or memory to go up)

@luizirber
Copy link
Member

(there's no obvious algorithmic reason for the time or memory to go up)

All set calculations are done in Python now:
https://github.com/dib-lab/sourmash/pull/1392/files#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16R421

While not algorithmic different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...

@ctb
Copy link
Contributor Author

ctb commented Apr 19, 2021

(there's no obvious algorithmic reason for the time or memory to go up)

All set calculations are done in Python now:
https://github.com/dib-lab/sourmash/pull/1392/files#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16R421

While not algorithmic different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...

ahh! good point! and that's pretty easy to fix with MinHash code... I'll see if I can get it working using the Rust code.

@ctb
Copy link
Contributor Author

ctb commented Apr 19, 2021

(great job on finding that, I struggled to get that code (a) working and then (b) clean and (c) tested, so now it's time for (d) optimization 😂)

@luizirber
Copy link
Member

(great job on finding that, I struggled to get that code (a) working and then (b) clean and (c) tested, so now it's time for (d) optimization joy)

(that's the code I need to change for #1137 #1138 #1201 #1221, so...)

@ctb
Copy link
Contributor Author

ctb commented Apr 19, 2021

(that's the code I need to change for #1137 #1138 #1201 #1221, so...)

image

@ctb
Copy link
Contributor Author

ctb commented Apr 20, 2021

(there's no obvious algorithmic reason for the time or memory to go up)

All set calculations are done in Python now:
https://github.com/dib-lab/sourmash/pull/1392/files#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16R421
While not algorithmic different, it is a lot of memory copying to pull hashes out of Rust, create sets in Python, and then calculate...

ahh! good point! and that's pretty easy to fix with MinHash code... I'll see if I can get it working using the Rust code.

#1474 provides a MinHash.intersection implemented in Rust. Would be interested in benchmarks!

@ctb
Copy link
Contributor Author

ctb commented Apr 21, 2021

note to self: @bluegenes and I want to add a match argument to JaccardSearch.collect(...) so that we can ignore matches to specific signatures, either by identity or by name or by ...

this would enable #985 and #849 more generically.

Edit: added in #1477

* add MinHash.intersection method
* rearrange order of intersection
* swizzle SBT search code over to using Rust-based intersection code, too
* add intersection_and_union_size method to MinHash
* make flatten a no-op if track_abundance=False
* intersection_union_size in the FFI

Co-authored-by: Luiz Irber <luiz.irber@gmail.com>
Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already a large PR, and there are PRs merging more code into this, so I vote for merging now and rebasing the other ones.

Overall a nice cleanup of the codebase, no obvious performance regressions, and I think it mostly fits with future indices written in Rust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants