[MRG] add kreport output format to tax metagenome #2239
Codecov Report
@@ Coverage Diff @@
## latest #2239 +/- ##
==========================================
+ Coverage 84.67% 92.06% +7.38%
==========================================
Files 131 100 -31
Lines 15521 11268 -4253
Branches 2213 2219 +6
==========================================
- Hits 13143 10374 -2769
+ Misses 2085 601 -1484
Partials 293 293
@ctb ready for review |
One suggestion - you could add a brief mention of |
ok - will add. What do you think re: abundance weighting? Since total bp is not currently weighted by abundance, maybe I should just leave it blank as well? I think we can get at this number, but I think we need to summarize it while adding each gather result in summarize_gather_at? |
I don't have an informed opinion, I'm afraid. It has to do with what's being reported and I don't grok the format! Can you explain more? |
So, for the cumulative number of reads for this taxon and all descendants, I think the summarized abundance-weighted bp would probably be a better proxy than the unique (non-abund weighted) bp. The problem is that I'm not currently summarizing weighted bp -- just unique containment, weighted containment, and unique bp. Summarization lines: I think I'm just afraid of oversimplifying -- to get abund-weighted bp, can I just multiply the |
punted to #2240 |
This PR adds kraken-style kreport output to tax metagenome, which is useful for comparison with other taxonomic profiling methods. While this format typically records the percentage of reads assigned to each taxon, we can create comparable output by reporting the percentage of k-mers (percent containment) and the total number of k-mers matched.

standard kreport columns:
- Percent Reads Contained in Taxon: The cumulative percentage of reads for this taxon and all descendants.
- Number of Reads Contained in Taxon: The cumulative number of reads for this taxon and all descendants.
- Number of Reads Assigned to Taxon: The number of reads assigned directly to this taxon (not a cumulative count of all descendants).
- Rank Code: (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
- NCBI Taxon ID: Numerical ID from the NCBI taxonomy database.
- Scientific Name: The scientific name of the taxon.

Example reads-based kreport with all columns:
Description from here: https://github.com/dportik/LRSW-Taxonomic-Profiling-Tutorial.
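For reference, the six columns described above could be read back with a small parser like the sketch below. It assumes the columns are tab-separated and that the name column may carry leading indentation; `KreportRow` and `parse_kreport_line` are hypothetical names for illustration, not part of sourmash or kraken, and the sample values in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KreportRow:
    """One kreport line, using the six columns listed above (hypothetical helper)."""
    percent_contained: float      # cumulative percent for taxon + descendants
    num_contained: int            # cumulative count for taxon + descendants
    num_assigned: Optional[int]   # directly assigned count; None if blank
    rank_code: str                # U, R, D, K, P, C, O, F, G, or S
    ncbi_taxid: Optional[int]     # None if blank
    name: str                     # scientific name, indentation stripped


def parse_kreport_line(line: str) -> KreportRow:
    # Assumption: columns are tab-separated; blank entries are empty strings.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    pct, contained, assigned, rank, taxid, name = fields
    return KreportRow(
        percent_contained=float(pct),
        num_contained=int(contained),
        num_assigned=int(assigned) if assigned.strip() else None,
        rank_code=rank.strip(),
        ncbi_taxid=int(taxid) if taxid.strip() else None,
        name=name.strip(),
    )


# Illustrative values only (not taken from a real report).
full = parse_kreport_line("50.00\t100\t10\tG\t561\t  Escherichia\n")
blanks = parse_kreport_line("50.00\t100\t\tG\t\tEscherichia")
```

Treating blank entries as `None` lets the same parser handle reports that omit the directly-assigned count or the taxon ID.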
current sourmash kreport caveats:
- Percent Reads [k-mers] Contained in Taxon: weighted by k-mer abundance
- Number of Reads [bp from k-mers] Contained in Taxon: NOT WEIGHTED BY ABUNDANCE
- Number of Reads Assigned to Taxon and NCBI Taxon ID will not be reported (blank entries).

In the future, we may wish to report the NCBI taxid when we can (NCBI taxonomy only).
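As a rough illustration of the caveats above, a row with the two blank columns might be assembled as follows. This is only a sketch: `RANK_CODES` and `format_kreport_row` are hypothetical names, not the code in this PR, and the tab separator, the two-decimal percentage, and the mapping of superkingdom to (D)omain are all assumptions.

```python
# Rank name -> kreport rank code, following the codes listed in the description.
# Assumption: the "superkingdom" rank is reported as (D)omain.
RANK_CODES = {
    "unclassified": "U",
    "root": "R",
    "superkingdom": "D",
    "kingdom": "K",
    "phylum": "P",
    "class": "C",
    "order": "O",
    "family": "F",
    "genus": "G",
    "species": "S",
}


def format_kreport_row(weighted_fraction: float, unique_bp: int,
                       rank: str, name: str) -> str:
    """Build one tab-separated kreport-style row (hypothetical helper).

    weighted_fraction: abundance-weighted containment as a fraction (0 to 1)
    unique_bp: bp from matched k-mers, NOT weighted by abundance
    """
    percent = f"{weighted_fraction * 100:.2f}"
    columns = [
        percent,            # Percent [k-mers] Contained in Taxon
        str(unique_bp),     # Number of [bp from k-mers] Contained in Taxon
        "",                 # Number Assigned to Taxon: left blank
        RANK_CODES[rank],   # Rank Code
        "",                 # NCBI Taxon ID: left blank
        name,               # Scientific Name
    ]
    return "\t".join(columns)


# Illustrative values only.
row = format_kreport_row(0.5, 1000, "genus", "Escherichia")
```

Keeping the blank columns as empty strings preserves the six-column layout, so tools that split on tabs still see every field in its expected position.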
@ctb since total bp is not currently weighted by abundance, maybe I should just leave it blank as well? I think we can get at this number, but I think we need to summarize it while adding each gather result in summarize_gather_at?

example sourmash kreport (tiny test gather):