[MRG] add kreport output format to tax metagenome #2239

bluegenes · 2022-08-26T19:33:39Z

This PR adds kraken-style kreport output to tax metagenome, which is useful for comparison with other taxonomic profiling methods. While this format typically records the percent of number of reads assigned to taxa, we can create comparable output by reporting the percent of k-mers (percent containment) and the total number of k-mers matched.

standard kreport columns:

Percent Reads Contained in Taxon: The cumulative percentage of reads for this taxon and all descendants.
Number of Reads Contained in Taxon: The cumulative number of reads for this taxon and all descendants.
Number of Reads Assigned to Taxon: The number of reads assigned directly to this taxon (not a cumulative count of all descendants).
Rank Code: (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
NCBI Taxon ID: Numerical ID from the NCBI taxonomy database.
Scientific Name: The scientific name of the taxon.

Example reads-based kreport with all columns:

    88.41	2138742	193618	K	2	Bacteria
    0.16	3852	818	P	201174	  Actinobacteria
    0.13	3034	0	C	1760	    Actinomycetia
    0.13	3034	45	O	85009	      Propionibacteriales
    0.12	2989	1847	F	31957	        Propionibacteriaceae
    0.05	1142	352	G	1912216	          Cutibacterium
    0.03	790	790	S	1747	            Cutibacterium acnes

Description from here: https://github.com/dportik/LRSW-Taxonomic-Profiling-Tutorial.

current sourmash kreport caveats:

Percent Reads [k-mers] Contained in Taxon: weighted by k-mer abundance
Number of Reads [bp from k-mers] Contained in Taxon: NOT WEIGHTED BY ABUNDANCE
Number of Reads Assigned to Taxon and NCBI Taxon ID will not be reported (blank entries).

In the future, we may wish to report the NCBI taxid when we can (NCBI taxonomy only).

@ctb since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

example sourmash kreport (tiny test gather):

0.13	1024000		D		d__Bacteria
0.87	3990000		U		unclassified
0.07	582000		P		p__Bacteroidota
0.06	442000		P		p__Proteobacteria
0.07	582000		C		c__Bacteroidia
0.06	442000		C		c__Gammaproteobacteria
0.07	582000		O		o__Bacteroidales
0.06	442000		O		o__Enterobacterales
0.07	582000		F		f__Bacteroidaceae
0.06	442000		F		f__Enterobacteriaceae
0.06	444000		G		g__Prevotella
0.06	442000		G		g__Escherichia
0.02	138000		G		g__Phocaeicola
0.06	444000		S		s__Prevotella copri
0.06	442000		S		s__Escherichia coli
0.02	138000		S		s__Phocaeicola vulgatus

codecov · 2022-08-26T19:42:04Z

Codecov Report

Merging #2239 (986c1d0) into latest (b1ddabc) will increase coverage by 7.38%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2239      +/-   ##
==========================================
+ Coverage   84.67%   92.06%   +7.38%     
==========================================
  Files         131      100      -31     
  Lines       15521    11268    -4253     
  Branches     2213     2219       +6     
==========================================
- Hits        13143    10374    -2769     
+ Misses       2085      601    -1484     
  Partials      293      293

Flag	Coverage Δ
python	`92.06% <100.00%> (+0.01%)`	⬆️
rust	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/sourmash/cli/tax/metagenome.py	`89.28% <ø> (ø)`
src/sourmash/tax/__main__.py	`90.31% <100.00%> (+0.15%)`	⬆️
src/sourmash/tax/tax_utils.py	`98.30% <100.00%> (+0.05%)`	⬆️
src/core/src/index/sbt/mhbt.rs
src/core/src/sketch/minhash.rs
src/core/src/ffi/utils.rs
src/core/src/index/bigsi.rs
src/core/src/ffi/minhash.rs
src/core/src/errors.rs
src/core/tests/minhash.rs
... and 24 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

bluegenes · 2022-08-26T20:00:03Z

@ctb ready for review

ctb · 2022-08-26T20:04:27Z

One suggestion - you could add a brief mention of kreport output format to the command-line docs as well.

bluegenes · 2022-08-26T20:09:49Z

One suggestion - you could add a brief mention of kreport output format to the command-line docs as well.

ok - will add.

What do you think re: abundance weighting??

since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

ctb · 2022-08-26T20:22:14Z

What do you think re: abundance weighting??

since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

I don't have an informed opinion, I'm afraid. It has to do with what's being reported and I don't grok the format! Can you explain more?

bluegenes · 2022-08-26T20:45:42Z

What do you think re: abundance weighting??
since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

I don't have an informed opinion, I'm afraid. It has to do with what's being reported and I don't grok the format! Can you explain more?

So kreport is reporting the cumulative percentage of reads for this taxon and all descendants and the cumulative number of reads for this taxon and all descendants. In sourmash terms, I think the rank-summarized f_unique_weighted is a good proxy for the cumulative percentage of reads for a taxon, which is the first column.

For the cumulative number of reads for this taxon and all descendants, I think the summarized abundance-weighted bp would probably be a better proxy than the unique (non-abund weighted) bp. The problem is that I'm not currently summarizing weighted bp-- just unique containment, weighted containment, and unique bp. Summarization lines:
https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/tax/tax_utils.py#L259-L261

I think I'm just afraid of oversimplifying -- to get abund-weighted bp, can I just multiply the summarized_f_weighted * query_scaled?

bluegenes · 2022-08-26T22:04:28Z

punted to #2240

bluegenes added 4 commits August 26, 2022 12:03

add kreport output format to tax metagenome

e01d90d

better fn info

a2f9d39

rm change not for this pr

b9f1cb7

upd description

68e4c33

Merge branch 'latest' into tax-kreport

367a31c

bluegenes changed the title ~~[WIP] add kreport output format to tax metagenome~~ [MRG] add kreport output format to tax metagenome Aug 26, 2022

ctb approved these changes Aug 26, 2022

View reviewed changes

add kreport to docs

986c1d0

bluegenes mentioned this pull request Aug 26, 2022

report abundance weighted hashes/bp from gather and tax? #2240

Closed

bluegenes merged commit b1b4ef7 into latest Aug 26, 2022

bluegenes deleted the tax-kreport branch August 26, 2022 22:44

This was referenced Aug 29, 2022

draft release notes for v4.5.0 #2241

Closed

[MRG] add abundance-weighted columns to gather output #2249

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] add kreport output format to tax metagenome #2239

[MRG] add kreport output format to tax metagenome #2239

bluegenes commented Aug 26, 2022 •

edited

Loading

codecov bot commented Aug 26, 2022 •

edited

Loading

bluegenes commented Aug 26, 2022

ctb commented Aug 26, 2022

bluegenes commented Aug 26, 2022

ctb commented Aug 26, 2022

bluegenes commented Aug 26, 2022 •

edited

Loading

bluegenes commented Aug 26, 2022

[MRG] add kreport output format to tax metagenome #2239

[MRG] add kreport output format to tax metagenome #2239

Conversation

bluegenes commented Aug 26, 2022 • edited Loading

codecov bot commented Aug 26, 2022 • edited Loading

Codecov Report

bluegenes commented Aug 26, 2022

ctb commented Aug 26, 2022

bluegenes commented Aug 26, 2022

ctb commented Aug 26, 2022

bluegenes commented Aug 26, 2022 • edited Loading

bluegenes commented Aug 26, 2022

bluegenes commented Aug 26, 2022 •

edited

Loading

codecov bot commented Aug 26, 2022 •

edited

Loading

bluegenes commented Aug 26, 2022 •

edited

Loading