Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Build a LineageDB interface for taxonomy databases/information #1651

Merged
merged 31 commits into from
Jul 8, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jul 2, 2021

This PR adds support for SQLite taxonomy databases, as well as a new sourmash tax prepare command.

As of sourmash 4.2, the new taxonomy subcommand operations take CSV input files, based on the format originally developed for sourmash lca index. However, for large taxonomy databases (e.g. GTDB rs202, which has ~250,000 rows), CSV takes a few seconds to load and puts everything in memory.

This PR:

  • adds support for SQLite databases that contain taxonomic information, which makes the loading time essentially zero and dramatically speeds up identifier lookups.
  • adds sourmash tax prepare -t <tax1> [<tax2> <tax3>... ] -o <combined_tax>, which outputs a combined taxonomy file where entries in later tax files (e.g. tax3) override earlier ones.
  • refactors internal tax_utils functionality:
    • creates a LineageDB class to replace the assignments dictionary
    • adds a LineageDB_Sqlite class
    • adds a MultiLineageDB class to wrap multiple LineageDB objects, with a load method to load multiple taxonomy files
    • aligns internal keyword arguments with CLI arguments (so now --keep-full-identifiers, --keep-identifier-versions)

Some other notes -

  • prepare can also be a way to validate, summarize, and/or debug taxonomy DBs.
  • consider supporting NCBI input files as well as greengenes TSV

Demo

tl;dr Everything works as before, except that you can provide a .db file as well as a CSV file :)

First, create a taxonomy sqlite database:

% sourmash tax prepare -t gtdb-rs202.taxonomy.v2.csv -o gtdb-rs202.taxonomy.v2.db

== This is sourmash version 4.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

created sqlite table
loading taxonomies...
...loaded 258406 entries.
processed 258405 taxonomy rows total
writing to gtdb-rs202.taxonomy.v2.db...
...done!

and then use it:

% sourmash tax annotate -t gtdb-rs202.taxonomy.v2.db -g ../genome-grist/outputs.paper/genbank/SRR606249.x.genbank.gather.csv 

== This is sourmash version 4.2.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 73 gather results.
of 73, missed 0 lineage assignments.
loaded results from 1 gather CSVs
saving `annotate` output to SRR606249.x.genbank.gather.with-lineages.csv.

TODO

  • support saving in multiple formats based on extension and/or args?
  • maybe add information about earlier tax IDs being shadowed by later tax entries?
  • add documentation
  • think about how to support avail_ranks better
  • fix 'assert 0' issue around split_identifiers/keep_identifier_versions; test
  • write more tests
  • refactor load_ functions into class methods on LineageDBs
  • evaluate if we need anything else in the LineageDB class for charcoal, while we're at it
  • figure out what to do about 'already exists'
  • can we alias taxonomy_csv to something else?
  • add some end-to-end tests to make sure that results are the same for genome and metagenome when using sqlite db

@codecov
Copy link

codecov bot commented Jul 2, 2021

Codecov Report

Merging #1651 (4225f53) into latest (512d961) will increase coverage by 0.38%.
The diff coverage is 96.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1651      +/-   ##
==========================================
+ Coverage   81.83%   82.22%   +0.38%     
==========================================
  Files         112      113       +1     
  Lines       11492    11704     +212     
  Branches     1444     1478      +34     
==========================================
+ Hits         9405     9624     +219     
+ Misses       1827     1820       -7     
  Partials      260      260              
Flag Coverage Δ
python 89.63% <96.35%> (+0.37%) ⬆️
rust 66.28% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/cli/tax/annotate.py 89.47% <ø> (ø)
src/sourmash/cli/tax/genome.py 83.87% <ø> (ø)
src/sourmash/cli/tax/metagenome.py 82.14% <ø> (ø)
src/sourmash/tax/tax_utils.py 97.28% <95.58%> (+4.89%) ⬆️
src/sourmash/cli/tax/__init__.py 100.00% <100.00%> (ø)
src/sourmash/cli/tax/prepare.py 100.00% <100.00%> (ø)
src/sourmash/tax/__main__.py 88.29% <100.00%> (-0.72%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 512d961...4225f53. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented Jul 3, 2021

@bluegenes @taylorreiter @hehouts I'd be interested in your thoughts on the CLI functionality!

  • I suspect there are other uses for combine_tax that I haven't thought of - validation, debugging, summarization?
  • are there better names than sourmash tax combine_tax to use?
  • other thoughts?

Note there's a lot of missing functionality and it's got bad UX for the moment - before putting it up for review I will do things like support content sniffing so it doesn't use filename extensions, and support other output formats than sqlite DB - but curious what y'all think of the naming scheme and basic functionality!

@bluegenes
Copy link
Contributor

  • I suspect there are other uses for combine_tax that I haven't thought of - validation, debugging, summarization?
  • are there better names than sourmash tax combine_tax to use?
  • other thoughts?

re: naming -- I wonder if tax prepare or tax index might be better? Both of these sort of emphasize the formatting over the combining of taxonomies, but I think it might be helpful to think about this as a preparation step before doing taxonomic summarization.

I think you have "output other formats" in your to do list, but just +1 to optionally outputting a CSV of the combined/prepared taxonomy information.

This might be out of scope for this PR, but one of the issues we need to deal with in order to support CAMI output (#1606) is handling NCBI taxids/taxpaths .. it would be super FANTASTIC if it were possible to easily use/integrate NCBI taxids during database preparation.

Note there's a lot of missing functionality and it's got bad UX for the moment - before putting it up for review I will do things like support content sniffing so it doesn't use filename extensions, and support other output formats than sqlite DB - but curious what y'all think of the naming scheme and basic functionality!

haven't played much yet, but I think this will simplify and robust-ify our taxonomy functions!

@ctb
Copy link
Contributor Author

ctb commented Jul 6, 2021

  • I suspect there are other uses for combine_tax that I haven't thought of - validation, debugging, summarization?
  • are there better names than sourmash tax combine_tax to use?
  • other thoughts?

re: naming -- I wonder if tax prepare or tax index might be better? Both of these sort of emphasize the formatting over the combining of taxonomies, but I think it might be helpful to think about this as a preparation step before doing taxonomic summarization.

Probably not index, since we already use that in other places to mean other things. But I like 'prepare'!

I think you have "output other formats" in your to do list, but just +1 to optionally outputting a CSV of the combined/prepared taxonomy information.

yep.

This might be out of scope for this PR, but one of the issues we need to deal with in order to support CAMI output (#1606) is handling NCBI taxids/taxpaths .. it would be super FANTASTIC if it were possible to easily use/integrate NCBI taxids during database preparation.

ugh, yeah, this probably belongs to a separate issue. It would be good to delineate the output requirements so we know where we need to end up.

@taylorreiter
Copy link
Contributor

I don't have much to add, but +1 for prepare!

Will it go the opposite direction as well? sqlite db to csv? I could see that as being useful e.g. if someone inherits a db from someone else but wants a csv to look into its contents or something.

@ctb
Copy link
Contributor Author

ctb commented Jul 6, 2021

I don't have much to add, but +1 for prepare!

Will it go the opposite direction as well? sqlite db to csv? I could see that as being useful e.g. if someone inherits a db from someone else but wants a csv to look into its contents or something.

yep, that's my plan! No reason to restrict output formats and the classes should present a uniform interface either way.

@ctb
Copy link
Contributor Author

ctb commented Jul 7, 2021

OK, updated the PR to add prepare instead of combine_tax, and also it now does format switching between CSV and sqlite (you can read a mix of formats, and write to either format).

Any other formats we should support? For input, I was planning to look at the greengenes format, as well as the NCBI taxdump format (names.dmp etc). We could also output the greengenes tax format, perhaps.

@ctb
Copy link
Contributor Author

ctb commented Jul 7, 2021

OK, I think this is feature complete; moving on to documentation.

@ctb ctb changed the title [WIP] Build a LineageDB interface for taxonomy databases/information [MRG] Build a LineageDB interface for taxonomy databases/information Jul 8, 2021
@ctb
Copy link
Contributor Author

ctb commented Jul 8, 2021

Ready for review, @bluegenes @taylorreiter!

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great to me!

@ctb
Copy link
Contributor Author

ctb commented Jul 8, 2021

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants