Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oxidize parts of LCA_Database? #948

Closed
ctb opened this issue Apr 18, 2020 · 13 comments · Fixed by #1808
Closed

Oxidize parts of LCA_Database? #948

ctb opened this issue Apr 18, 2020 · 13 comments · Fixed by #1808

Comments

@ctb
Copy link
Contributor

ctb commented Apr 18, 2020

An implicit goal of #946 refactoring and testing the LCA_Database class is to support optimization of searches by moving them into a Rust extension module. I think the function _find_signatures(...) would be the main target (along with _signatures(...)). Could be good to look at some of the other LCA submodule commands to see if the hashvalue gathering done in classify and summarize could be pulled back to an LCA_Database method as well.

@luizirber
Copy link
Member

I was thinking about doing the revindex parts in Rust, and the LCA bits in Python. I took a quick look at #946 and as you pointed out it is almost there, since _signatures and _find_signatures don't do any LCA/taxonomy calculations.

The most complicated part is loading/saving, because parts of it would be in Rust (revindex) and parts in Python (the lineage stuff). It's not terribly bad, but may need to read the JSON file twice on load, and save the revindex bits to JSON first and then later adding the lineage data in Python on save.

@ctb
Copy link
Contributor Author

ctb commented Apr 20, 2020 via email

@luizirber
Copy link
Member

I think the lineage stuff is also quite easy in rust, at least for saving and loading.

Then you have my blessing to go implement it 🪄

@ctb
Copy link
Contributor Author

ctb commented Apr 23, 2020

I went and took a look at what LCA_Database was doing with respect to storing lineages, b/c I didn't remember whether we were doing anything sophisticated. Good news there, I think - rather than doing any collapsing or treeify-ing of lineage tuples, we store the following in the LCA database JSON file:

  "lid_to_lineage": {
    "0": [
      [ 
        "superkingdom",
        "Archaea"
      ],
      [ 
        "phylum",
        "Euryarchaeota"
      ],
      [ 
        "class",
        "Archaeoglobi"
      ],
      [ 
        "order",
        "Archaeoglobales"
      ],
      [ 
        "family",
        "Archaeoglobaceae"
      ],
...

here each signature int id (idx) has an optional lineage id associated with it via a mapping idx_to_lid, and then the lineage id goes to a list of rank/name pairs.

Oxidation of this code in save/load would then "just" be saving/loading these lists of rank/name pairs.

@erikyoung85
Copy link
Contributor

Is this going to be a similar move to the optimization of the MinHash class in rust?

@ctb
Copy link
Contributor Author

ctb commented Jul 22, 2020

you could do it that way, and move all the code over at once; OR you could do it piecemeal (which for me is my preferred way with Python) - slowly transfer code over into Rust by writing functioning code chunks and keeping the tests working, and refactor the Rust code regularly and eventually make it into a new class. e.g. you could transfer over _signatures and _find_signatures first, and then extend from there. Personally I like the regular endorphin hit of refactoring -- actually getting code working the same as before (but faster/better) -- but I also have a short attention span 😂

something tells me that Rust would reward the all-at-once approach more, but @luizirber should weigh in here.

@luizirber
Copy link
Member

Hopefully you can go more piecemeal than the MinHash transition... That was a lot, and took quite some time. But we also have more structure nowadays to support the oxidation =]

I agree with Titus: transferring _signatures and _find_signatures first is a good idea. This will probably require having at least load working, using serde on the Rust side.

@luizirber
Copy link
Member

Upon closer inspection, I changed my mind: It's probably easier to move the whole LCA_Database to Rust. It will be clumsy to pass the data for the lineage parts otherwise, and the class is not that big.

@luizirber
Copy link
Member

(just need to check that we are not accessing any attributes of LCA_Database in other parts of sourmash. We could expose them as properties for now, but it would be better to have methods to access any internal structure)

@ctb
Copy link
Contributor Author

ctb commented Jul 22, 2020 via email

@erikyoung85 erikyoung85 linked a pull request Jul 27, 2020 that will close this issue
5 tasks
@luizirber
Copy link
Member

(just need to check that we are not accessing any attributes of LCA_Database in other parts of sourmash. We could expose them as properties for now, but it would be better to have methods to access any internal structure)
we, umm, are. at least in the tests. sorry.

If they are only used in the tests, we can move the tests to Rust and avoid exposing internals.

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2022

#1808 implements SqliteIndex, which (as currently written) provides both find and signatures functionality.

If adjusted to interoperate with #1651, it may also provide full LCA-like functionality, in which case I suspect this issue will be moot.

@ctb
Copy link
Contributor Author

ctb commented Apr 16, 2022

note that #1936 refactors LCA_Database into a much leaner and clearer public API, and provides fairly thorough tests for said API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants