[MRG] Build a `LineageDB` interface for taxonomy databases/information #1651
Codecov Report
@@ Coverage Diff @@
## latest #1651 +/- ##
==========================================
+ Coverage 81.83% 82.22% +0.38%
==========================================
Files 112 113 +1
Lines 11492 11704 +212
Branches 1444 1478 +34
==========================================
+ Hits 9405 9624 +219
+ Misses 1827 1820 -7
Partials 260 260
|
@bluegenes @taylorreiter @hehouts I'd be interested in your thoughts on the CLI functionality!
Note there's a lot of missing functionality and it's got bad UX for the moment - before putting it up for review I will do things like support content sniffing so it doesn't use filename extensions, and support other output formats than sqlite DB - but curious what y'all think of the naming scheme and basic functionality! |
re: naming -- I wonder if I think you have "output other formats" in your to do list, but just +1 to optionally outputting a CSV of the combined/prepared taxonomy information. This might be out of scope for this PR, but one of the issues we need to deal with in order to support CAMI output (#1606) is handling NCBI taxids/taxpaths .. it would be super FANTASTIC if it were possible to easily use/integrate NCBI taxids during database preparation.
haven't played much yet, but I think this will simplify and robust-ify our taxonomy functions! |
Probably not
yep.
ugh, yeah, this probably belongs to a separate issue. It would be good to delineate the output requirements so we know where we need to end up. |
I don't have much to add, but +1 for Will it go the opposite direction as well? sqlite db to csv? I could see that as being useful e.g. if someone inherits a db from someone else but wants a csv to look into its contents or something. |
yep, that's my plan! No reason to restrict output formats and the classes should present a uniform interface either way. |
…o add/lineage_db
OK, updated the PR to add Any other formats we should support? For input, I was planning to look at the greengenes format, as well as the NCBI taxdump format (names.dmp etc). We could also output the greengenes tax format, perhaps. |
OK, I think this is feature complete; moving on to documentation. |
Ready for review, @bluegenes @taylorreiter! |
looks great to me!
🎉 |
This PR adds support for SQLite taxonomy databases, as well as a new `sourmash tax prepare` command.

As of sourmash 4.2, the new `taxonomy` subcommand operations take CSV input files, based on the format originally developed for `sourmash lca index`. However, for large taxonomy databases (e.g. GTDB rs202, which has ~250,000 rows), CSV takes a few seconds to load and puts everything in memory.

This PR:
- `sourmash tax prepare -t <tax1> [<tax2> <tax3>... ] -o <combined_tax>`, which outputs a combined taxonomy file where entries in later tax files (e.g. tax3) override earlier ones.
- `LineageDB` class to replace the `assignments` dictionary
- `LineageDB_Sqlite` class
- `MultiLineageDB` class to wrap multiple `LineageDB` objects, with a `load` method to load multiple taxonomy files
- identifier options (`--keep-full-identifiers`, `--keep-identifier-versions`)

Some other notes -
- `prepare` can also be a way to validate, summarize, and/or debug taxonomy DBs.

Demo
tl;dr Everything works as before, except that you can provide a .db file as well as a CSV file :)
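To illustrate how a `.db` file can be used interchangeably with a CSV-derived dictionary, here is a minimal sketch using only the Python standard library. It is not sourmash's implementation: the class name `SqliteLineages`, the `taxonomy` table, and its columns are all hypothetical. The point is only that a SQLite-backed object can expose the same mapping interface as an in-memory dict while fetching rows on demand.

```python
import sqlite3
from collections.abc import Mapping


class SqliteLineages(Mapping):
    """Hypothetical read-only mapping of identifier -> lineage tuple,
    backed by a SQLite table instead of an in-memory dict."""

    def __init__(self, conn):
        self.conn = conn

    def __getitem__(self, ident):
        # Fetch a single row on demand; nothing else is loaded into memory.
        row = self.conn.execute(
            "SELECT superkingdom, phylum FROM taxonomy WHERE ident = ?",
            (ident,),
        ).fetchone()
        if row is None:
            raise KeyError(ident)
        return tuple(row)

    def __iter__(self):
        for (ident,) in self.conn.execute("SELECT ident FROM taxonomy"):
            yield ident

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM taxonomy").fetchone()[0]


# Build a tiny example database (table name and schema are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomy (ident TEXT PRIMARY KEY, superkingdom TEXT, phylum TEXT)"
)
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?)",
    [
        ("GCF_1", "d__Bacteria", "p__Firmicutes"),
        ("GCF_2", "d__Archaea", "p__Halobacteriota"),
    ],
)
conn.commit()

# The same lookups work on a plain dict and on the SQLite-backed mapping.
in_memory = {"GCF_1": ("d__Bacteria", "p__Firmicutes")}
on_disk = SqliteLineages(conn)

for db in (in_memory, on_disk):
    print(db["GCF_1"])
```

Because both objects satisfy the mapping protocol, calling code that only does lookups, membership tests, and iteration does not need to know which backend it was handed.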
First, create a taxonomy sqlite database:
and then use it:
TODO
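
The override rule described for `sourmash tax prepare` (entries in later taxonomy files replace earlier ones) can be sketched in plain Python. This is an illustration of the rule only, not the code in this PR: the function name `combine_taxonomies` and the column names (`ident`, `superkingdom`, `phylum`) are assumptions made for the example.

```python
import csv
import io


def combine_taxonomies(csv_texts):
    """Merge taxonomy tables given as CSV strings.

    Tables are processed in order, so a row in a later table replaces
    any earlier row with the same identifier.
    """
    combined = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            combined[row["ident"]] = row  # later tables overwrite earlier entries
    return combined


# Two small example tables; the column names are hypothetical.
tax1 = (
    "ident,superkingdom,phylum\n"
    "GCF_1,d__Bacteria,p__Firmicutes\n"
    "GCF_2,d__Archaea,p__Halobacteriota\n"
)
tax3 = (
    "ident,superkingdom,phylum\n"
    "GCF_1,d__Bacteria,p__Proteobacteria\n"
)

combined = combine_taxonomies([tax1, tax3])
print(combined["GCF_1"]["phylum"])  # the entry from tax3 wins
```

Identifiers that appear in only one table (here `GCF_2`) pass through unchanged, so the result is the union of all inputs with conflicts resolved in favor of the last file listed.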