Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add kreport output format to tax metagenome #2239

Merged
merged 6 commits into from
Aug 26, 2022
Merged

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Aug 26, 2022

This PR adds kraken-style kreport output to tax metagenome, which is useful for comparison with other taxonomic profiling methods. While this format typically records the percent of number of reads assigned to taxa, we can create comparable output by reporting the percent of k-mers (percent containment) and the total number of k-mers matched.

standard kreport columns:

  • Percent Reads Contained in Taxon: The cumulative percentage of reads for this taxon and all descendants.
  • Number of Reads Contained in Taxon: The cumulative number of reads for this taxon and all descendants.
  • Number of Reads Assigned to Taxon: The number of reads assigned directly to this taxon (not a cumulative count of all descendants).
  • Rank Code: (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
  • NCBI Taxon ID: Numerical ID from the NCBI taxonomy database.
  • Scientific Name: The scientific name of the taxon.

Example reads-based kreport with all columns:

    88.41	2138742	193618	K	2	Bacteria
    0.16	3852	818	P	201174	  Actinobacteria
    0.13	3034	0	C	1760	    Actinomycetia
    0.13	3034	45	O	85009	      Propionibacteriales
    0.12	2989	1847	F	31957	        Propionibacteriaceae
    0.05	1142	352	G	1912216	          Cutibacterium
    0.03	790	790	S	1747	            Cutibacterium acnes

Description from here: https://github.com/dportik/LRSW-Taxonomic-Profiling-Tutorial.

current sourmash kreport caveats:

  • Percent Reads [k-mers] Contained in Taxon: weighted by k-mer abundance
  • Number of Reads [bp from k-mers] Contained in Taxon: NOT WEIGHTED BY ABUNDANCE
  • Number of Reads Assigned to Taxon and NCBI Taxon ID will not be reported (blank entries).

In the future, we may wish to report the NCBI taxid when we can (NCBI taxonomy only).

@ctb since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

example sourmash kreport (tiny test gather):

0.13	1024000		D		d__Bacteria
0.87	3990000		U		unclassified
0.07	582000		P		p__Bacteroidota
0.06	442000		P		p__Proteobacteria
0.07	582000		C		c__Bacteroidia
0.06	442000		C		c__Gammaproteobacteria
0.07	582000		O		o__Bacteroidales
0.06	442000		O		o__Enterobacterales
0.07	582000		F		f__Bacteroidaceae
0.06	442000		F		f__Enterobacteriaceae
0.06	444000		G		g__Prevotella
0.06	442000		G		g__Escherichia
0.02	138000		G		g__Phocaeicola
0.06	444000		S		s__Prevotella copri
0.06	442000		S		s__Escherichia coli
0.02	138000		S		s__Phocaeicola vulgatus

@codecov
Copy link

codecov bot commented Aug 26, 2022

Codecov Report

Merging #2239 (986c1d0) into latest (b1ddabc) will increase coverage by 7.38%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2239      +/-   ##
==========================================
+ Coverage   84.67%   92.06%   +7.38%     
==========================================
  Files         131      100      -31     
  Lines       15521    11268    -4253     
  Branches     2213     2219       +6     
==========================================
- Hits        13143    10374    -2769     
+ Misses       2085      601    -1484     
  Partials      293      293              
Flag Coverage Δ
python 92.06% <100.00%> (+0.01%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/cli/tax/metagenome.py 89.28% <ø> (ø)
src/sourmash/tax/__main__.py 90.31% <100.00%> (+0.15%) ⬆️
src/sourmash/tax/tax_utils.py 98.30% <100.00%> (+0.05%) ⬆️
src/core/src/index/sbt/mhbt.rs
src/core/src/sketch/minhash.rs
src/core/src/ffi/utils.rs
src/core/src/index/bigsi.rs
src/core/src/ffi/minhash.rs
src/core/src/errors.rs
src/core/tests/minhash.rs
... and 24 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bluegenes bluegenes changed the title [WIP] add kreport output format to tax metagenome [MRG] add kreport output format to tax metagenome Aug 26, 2022
@bluegenes
Copy link
Contributor Author

@ctb ready for review

@ctb
Copy link
Contributor

ctb commented Aug 26, 2022

One suggestion - you could add a brief mention of kreport output format to the command-line docs as well.

@bluegenes
Copy link
Contributor Author

One suggestion - you could add a brief mention of kreport output format to the command-line docs as well.

ok - will add.

What do you think re: abundance weighting??

since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

@ctb
Copy link
Contributor

ctb commented Aug 26, 2022

What do you think re: abundance weighting??

since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

I don't have an informed opinion, I'm afraid. It has to do with what's being reported and I don't grok the format! Can you explain more?

@bluegenes
Copy link
Contributor Author

bluegenes commented Aug 26, 2022

What do you think re: abundance weighting??
since total bp is not currently weighted by abund, maybe I should just leave it blank as well? I think we can get at this number, but I we need to summarize it while adding each gather result in summarize_gather_at?

I don't have an informed opinion, I'm afraid. It has to do with what's being reported and I don't grok the format! Can you explain more?

So kreport is reporting the cumulative percentage of reads for this taxon and all descendants and the cumulative number of reads for this taxon and all descendants. In sourmash terms, I think the rank-summarized f_unique_weighted is a good proxy for the cumulative percentage of reads for a taxon, which is the first column.

For the cumulative number of reads for this taxon and all descendants, I think the summarized abundance-weighted bp would probably be a better proxy than the unique (non-abund weighted) bp. The problem is that I'm not currently summarizing weighted bp-- just unique containment, weighted containment, and unique bp. Summarization lines:
https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/tax/tax_utils.py#L259-L261

I think I'm just afraid of oversimplifying -- to get abund-weighted bp, can I just multiply the summarized_f_weighted * query_scaled?

@bluegenes
Copy link
Contributor Author

punted to #2240

@bluegenes bluegenes merged commit b1b4ef7 into latest Aug 26, 2022
@bluegenes bluegenes deleted the tax-kreport branch August 26, 2022 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants