be clearer about when abundance weighting is used in gather output #1805

Closed
ctb opened this issue Jan 20, 2022 · 1 comment · Fixed by #1819

Comments

@ctb
Contributor

ctb commented Jan 20, 2022

I was looking at sourmash gather output and trying to figure out how this metagenome was 99.2% classified:

== This is sourmash version 4.2.4.dev0+g73aeb155.d20220116. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: SRR12324253... (k=31, DNA)
loaded 1 databases.                                                            

Starting prefetch sweep across databases.
Found 18 signatures via prefetch; now doing gather.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
15.8 Mbp       0.4%   62.7%       2.6    SSXJ01000001.1 Cryptococcus neoforman...
11.3 Mbp       0.7%   97.6%       7.4    WMJW01000001.1 Saccharomyces cerevisi...
7.4 Mbp       30.0%   67.7%     461.6    JSPL01000060.1 Escherichia coli strai...
4.7 Mbp       24.7%   98.4%     607.9    VFAF01000002.1 Salmonella enterica st...
4.0 Mbp        2.8%   99.8%      80.1    CP039755.1 Bacillus subtilis strain N...
5.2 Mbp       13.3%   43.3%     474.1    FREP01002036.1 Escherichia coli isola...
2.8 Mbp        1.7%  100.0%      68.6    CP039751.1 Listeria monocytogenes str...
2.7 Mbp        4.6%   99.8%     192.8    VFAE01000004.1 Staphylococcus aureus ...
10.3 Mbp       0.0%   10.0%       2.2    AE017341.1 Cryptococcus neoformans va...
1.8 Mbp        6.2%   99.8%     395.3    CP039750.1 Limosilactobacillus fermen...
6.8 Mbp        6.2%   26.3%     394.4    CP041013.1 Pseudomonas aeruginosa str...
8.4 Mbp        0.0%    9.1%       1.9    CP003820.1 Cryptococcus neoformans va...
2.8 Mbp        3.5%   36.0%     391.4    CP039752.1 Enterococcus faecalis stra...
4.7 Mbp        5.0%   20.0%     604.4    VFAG01000002.1 Escherichia coli strai...
10.6 Mbp       0.1%    2.4%      16.5    LOQK01000001.1 Saccharomyces pastoria...
0.5 Mbp        0.0%    9.6%       1.3    VEMH01000100.1 Bacillus paranthracis ...
10.3 Mbp       0.0%    2.5%      10.7    CM010209.1 Saccharomyces cerevisiae s...
10.4 Mbp       0.0%    1.3%       3.8    CM006175.1 Saccharomyces cerevisiae s...
found less than 50.0 kbp in common. => exiting

found 18 matches total;
the recovered matches hit 99.2% of the query

That was when I realized that it is only 99.2% classified when abundance weighting is used; with --ignore-abundance, the percentage drops to 42.9%:

selecting default query k=31.
loaded query: SRR12324253... (k=31, DNA)
loaded 1 databases.                                                            

Starting prefetch sweep across databases.
Found 18 signatures via prefetch; now doing gather.

overlap     p_query p_match
---------   ------- -------
15.8 Mbp      10.9%   62.7%    SSXJ01000001.1 Cryptococcus neoforman...
11.3 Mbp       7.8%   97.6%    WMJW01000001.1 Saccharomyces cerevisi...
7.4 Mbp        5.1%   67.7%    JSPL01000060.1 Escherichia coli strai...
4.7 Mbp        3.2%   98.4%    VFAF01000002.1 Salmonella enterica st...
4.0 Mbp        2.7%   99.8%    CP039755.1 Bacillus subtilis strain N...
5.2 Mbp        2.2%   43.3%    FREP01002036.1 Escherichia coli isola...
2.8 Mbp        1.9%  100.0%    CP039751.1 Listeria monocytogenes str...
2.7 Mbp        1.9%   99.8%    VFAE01000004.1 Staphylococcus aureus ...
10.3 Mbp       1.3%   10.0%    AE017341.1 Cryptococcus neoformans va...
1.8 Mbp        1.2%   99.8%    CP039750.1 Limosilactobacillus fermen...
6.8 Mbp        1.2%   26.3%    CP041013.1 Pseudomonas aeruginosa str...
8.4 Mbp        1.1%    9.1%    CP003820.1 Cryptococcus neoformans va...
2.8 Mbp        0.7%   36.0%    CP039752.1 Enterococcus faecalis stra...
4.7 Mbp        0.6%   20.0%    VFAG01000002.1 Escherichia coli strai...
10.6 Mbp       0.5%    2.4%    LOQK01000001.1 Saccharomyces pastoria...
0.5 Mbp        0.4%    9.6%    VEMH01000100.1 Bacillus paranthracis ...
10.3 Mbp       0.2%    2.5%    CM010209.1 Saccharomyces cerevisiae s...
10.4 Mbp       0.1%    1.3%    CM006175.1 Saccharomyces cerevisiae s...
found less than 50.0 kbp in common. => exiting

found 18 matches total;
the recovered matches hit 42.9% of the query
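
The gap between the two summary lines comes down to what is being counted. As a rough illustration (a toy sketch, not sourmash's actual implementation; the function names and the tiny hash-to-abundance dictionary are made up for this example), an unweighted fraction counts the distinct query hashes that were matched, while a weighted fraction sums their abundances:

```python
def unweighted_fraction(query_abunds, matched_hashes):
    """Fraction of distinct query hashes that appear in matched_hashes."""
    matched = sum(1 for h in query_abunds if h in matched_hashes)
    return matched / len(query_abunds)


def weighted_fraction(query_abunds, matched_hashes):
    """Fraction of total query abundance carried by the matched hashes."""
    total = sum(query_abunds.values())
    matched = sum(a for h, a in query_abunds.items() if h in matched_hashes)
    return matched / total


# Hypothetical query: two high-abundance hashes that match a reference,
# plus eight abundance-1 hashes that match nothing (e.g. sequencing errors).
query_abunds = {1: 500, 2: 400}
query_abunds.update({h: 1 for h in range(3, 11)})
matched_hashes = {1, 2}

print(f"unweighted: {unweighted_fraction(query_abunds, matched_hashes):.1%}")  # 20.0%
print(f"weighted:   {weighted_fraction(query_abunds, matched_hashes):.1%}")    # 99.1%
```

In this made-up case only 2 of 10 distinct hashes are matched, yet they carry 900 of the 908 total counts, so the same set of matches reads as either 20.0% or 99.1% depending on the weighting.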

Between that and some of @drtamermansour's experiences here, I think we need to be clearer about these numbers, in particular about whether the reported percentages are abundance-weighted.

On the plus side, it's pretty clear in this particular mock metagenome's case that most of the low-abundance k-mers are errors! (This is the Zymo mock, so it is a low-complexity mock community.)
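
A back-of-the-envelope check of that reading, using only the two summary percentages above. This assumes the weighted figure is a fraction of total k-mer abundance and the unweighted figure is a fraction of distinct k-mers; the variable names are just for illustration, and the inputs are rounded, so the result is approximate:

```python
# Summary percentages reported by the two gather runs above.
weighted_hit = 0.992    # with abundance weighting
unweighted_hit = 0.429  # with --ignore-abundance

# Share of distinct k-mers, and of total abundance, left unmatched.
unmatched_distinct = 1 - unweighted_hit
unmatched_abundance = 1 - weighted_hit

# Mean abundance per k-mer relative to the overall mean, for each group.
matched_rel = weighted_hit / unweighted_hit
unmatched_rel = unmatched_abundance / unmatched_distinct

ratio = matched_rel / unmatched_rel

print(f"unmatched distinct k-mers: {unmatched_distinct:.1%}")   # 57.1%
print(f"unmatched abundance:       {unmatched_abundance:.1%}")  # 0.8%
print(f"matched k-mers are ~{ratio:.0f}x more abundant on average")  # ~165x
```

Under those assumptions, about 57% of the distinct k-mers account for under 1% of the total abundance, which is what you would expect if most of them are low-abundance errors.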

@ctb
Contributor Author

ctb commented Feb 2, 2022

fixed in #1819.

@ctb ctb closed this as completed in #1819 Feb 2, 2022