[MRG] Add MinHash.intersection(...) #1474

Merged Apr 21, 2021 (13 commits)

ctb
Contributor

@ctb ctb commented Apr 20, 2021

Note: merge into #1392.

Bring MinHash.intersection(...) in from Rust implementation.

Question: Do we need to bump the rust/sourmash version?

@codecov

codecov bot commented Apr 20, 2021

Codecov Report

Merging #1474 (36ae5b7) into refactor/index_find (57467cd) will decrease coverage by 5.12%.
The diff coverage is 90.44%.

@@                   Coverage Diff                   @@
##           refactor/index_find    #1474      +/-   ##
=======================================================
- Coverage                94.84%   89.71%   -5.13%     
=======================================================
  Files                       96      123      +27     
  Lines                    15730    19464    +3734     
  Branches                  1466     1483      +17     
=======================================================
+ Hits                     14919    17463    +2544     
- Misses                     586     1775    +1189     
- Partials                   225      226       +1     
| Flag | Coverage Δ |
|---|---|
| python | 94.86% <98.40%> (+0.02%) ⬆️ |
| rust | 67.20% <0.00%> (?) |

| Impacted Files | Coverage Δ |
|---|---|
| src/core/src/ffi/minhash.rs | 0.00% <0.00%> (ø) |
| src/core/src/sketch/minhash.rs | 91.31% <ø> (ø) |
| src/sourmash/minhash.py | 92.69% <89.47%> (-0.28%) ⬇️ |
| src/sourmash/index.py | 94.62% <100.00%> (-0.08%) ⬇️ |
| src/sourmash/sbt.py | 80.82% <100.00%> (-0.10%) ⬇️ |
| tests/test__minhash.py | 99.74% <100.00%> (+0.02%) ⬆️ |
| src/core/src/index/search.rs | 100.00% <0.00%> (ø) |
| src/core/src/ffi/signature.rs | 0.00% <0.00%> (ø) |
| src/core/src/ffi/utils.rs | 0.00% <0.00%> (ø) |
| src/core/src/index/bigsi.rs | 89.47% <0.00%> (ø) |
... and 23 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 57467cd...36ae5b7.

@luizirber
Member

Hmm, this still doesn't quite work, because it still pulls all the calculations into Python.

[image not preserved]

I merged #1137 into #1392 in find+remove, moving the changes into the node_search method:
refactor/index_find...find+remove#diff-796cf35ae8d09c8df495f82265c2593075e00b9d91f5fed92f3e17e47d155a16L420

This is still wasteful, because count_common and similarity both calculate the intersection, but it is less wasteful than the approach in this PR:
[image not preserved]

(and there is something VERY wrong going on with the memory consumption of #1392...)

For this PR, I think a better approach might be a function that returns the size of the intersection and the union of two MHs (while going thru the data only once), because that's the info that #1392 wants now.
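
A minimal sketch of that idea, not code from this PR: it assumes each MH can be reduced to a sorted list of unique hash values and ignores any num/scaled downsampling. The function name here is only illustrative. One merge-style pass over both lists yields both sizes:

```python
def intersection_and_union_size(a, b):
    """Return (|A ∩ B|, |A ∪ B|) for two sorted lists of unique hashes.

    Illustrative helper only; assumes both inputs are sorted and deduplicated.
    """
    i = j = 0
    common = 0
    union = 0
    # Walk both lists together, advancing whichever side holds the smaller hash.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
        union += 1  # every step consumes exactly one distinct hash value
    # Whatever remains in either list is unique to that list.
    union += (len(a) - i) + (len(b) - j)
    return common, union


print(intersection_and_union_size([1, 3, 5, 7], [3, 4, 5, 8, 9]))  # (2, 7)
```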

@luizirber
Member

> For this PR, I think a better approach might be a function that returns the size of the intersection and the union of two MHs (while going thru the data only once), because that's the info that #1392 wants now.

Err, that's what the intersection/intersection_size methods in Rust already do! But there is no function in the FFI that returns both values...

@ctb
Contributor Author

ctb commented Apr 20, 2021

oh. duh. I realized I hadn't changed the intersection code in the SBT. Fixed in 102b504. Curious if this changes performance!

(Obviously it is better to implement the counts directly in rust. Will work on that next.)

@ctb
Contributor Author

ctb commented Apr 20, 2021

ok, encapsulated key calculations in new method MinHash.intersection_and_union_size(...). Still in Python, but getting there!
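
As a rough illustration of why one combined call helps (a sketch under assumptions, not the method added in this PR): once both sizes are known, the shared-hash count and the Jaccard similarity follow without a second intersection. The helper names `sizes` and `common_and_similarity` are hypothetical, and plain Python sets stand in for MinHash objects.

```python
def sizes(a, b):
    """Hypothetical stand-in returning (intersection size, union size)."""
    a, b = set(a), set(b)
    return len(a & b), len(a | b)


def common_and_similarity(a, b):
    """Derive the shared-hash count and Jaccard similarity from one call."""
    n_common, n_union = sizes(a, b)
    # Guard against two empty inputs, where the ratio is undefined.
    jaccard = n_common / n_union if n_union else 0.0
    return n_common, jaccard


n_common, jaccard = common_and_similarity([1, 2, 3, 4], [3, 4, 5, 6])
print(n_common, round(jaccard, 3))
```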

@luizirber
Member

> oh. duh. I realized I hadn't changed the intersection code in the SBT. Fixed in 102b504. Curious if this changes performance!

Yup, getting better!

[image not preserved]

@luizirber
Member

From the experiments in #1475, it is possible that intersection is not the main problem. I'm starting to suspect that all the calls to .flatten() when reaching leaf nodes might be the culprit...

@ctb
Contributor Author

ctb commented Apr 21, 2021

hmm. Can you run the benchmark on this branch again when you get a chance? If flatten is the culprit, I should have fixed that.

@luizirber
Member

> hmm. Can you run the benchmark on this branch again when you get a chance? If flatten is the culprit, I should have fixed that.

It's not really flatten either... mh_intersect_rust uses the intersection from #1475 (and the fixed FFI function), but it doesn't save much time in the end:
[image not preserved]

The clue seems to be the memory consumption figure. The search algorithm seems to have changed in a way that triggers more internal node checks. Comparing a gather run on latest:
[image not preserved]
with one on mh_intersect_rust:
[image not preserved]
the two look somewhat similar, but mh_intersect_rust takes twice as long (~5900 samples versus ~3000, at ~100 samples per second, so roughly 59 s versus 30 s).

I'll pull the sunburst plots from #1201 to see what's going on.

@ctb
Contributor Author

ctb commented Apr 21, 2021

huh! that is super interesting and weird and unintentional :). I'll dig a little bit to see if there's something obvious.

Edit: fixed by 57467cd

@luizirber
Member

Fixed!
[image not preserved]

@ctb
Contributor Author

ctb commented Apr 21, 2021

🎉 🎉

shall I merge this?

@ctb
Contributor Author

ctb commented Apr 21, 2021

(or, well, go ahead and merge it :)

@luizirber
Member

I changed a few things in 36ae5b7

  • Made kmerminhash_intersection in the FFI return a new MinHash, instead of operating in place
  • Renamed kmerminhash_intersection_size to kmerminhash_intersection_union_size, which now returns both the intersection and union sizes (and fixed a bug along the way 🙈)
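
To illustrate the first change in general terms, here is a toy sketch of the difference between an in-place intersection and one that returns a new object. `ToySketch` and its method names are hypothetical stand-ins, not sourmash's MinHash or the FFI functions named above.

```python
class ToySketch:
    """Toy stand-in for a sketch object; not sourmash's MinHash."""

    def __init__(self, hashes=()):
        self.hashes = set(hashes)

    def intersect_inplace(self, other):
        # Mutates self: the caller loses the original contents.
        self.hashes &= other.hashes

    def intersection(self, other):
        # Returns a new object; both operands are left untouched.
        return ToySketch(self.hashes & other.hashes)


a = ToySketch([1, 2, 3])
b = ToySketch([2, 3, 4])
c = a.intersection(b)
print(sorted(c.hashes), sorted(a.hashes))  # [2, 3] [1, 2, 3]
```

Returning a new object means a caller can keep reusing the same query sketch across many comparisons without copying it first.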

@luizirber luizirber changed the title from "[WIP] Add MinHash.intersection(...)" to "[MRG] Add MinHash.intersection(...)" Apr 21, 2021
@luizirber luizirber merged commit c8d8cd6 into refactor/index_find Apr 21, 2021
@luizirber luizirber deleted the add/mh_intersect branch April 21, 2021 23:58