[EXP] add sourmash distance estimation #1788

bluegenes · 2022-01-13T21:13:25Z

Implement a minimal version of https://github.com/KoslickiLab/mutation-rate-ci-calculator to estimate distance from scaled containment and jaccard.

See https://www.biorxiv.org/content/10.1101/2022.01.11.475870v1

May fix #1242

actually use distance utilities within sourmash:
- add signature methods for jaccard, containment, max_containment --> ANI
- compare
- prefetch
- gather
- search

add'l to do:

Add functions for edge case detection/reporting (no sketch overlap, 100% sketch overlap by chance).
decide when/where to report confidence intervals.
- Report in search, prefetch, gather, NOT in compare. This maintains compatibility with sourmash plot.
handle sketch downsampling in signature ANI functions
add confidence param to minhash, signature ANI functions to allow tuning confidence level. Defaults to 0.95.
add --ani-threshold to tax genome
ensure @dkoslicki and @mahmudhera are in sourmash authors list (for JOSS papers)
understand benchmarking issues

Questions/Thoughts:

Do we also want to report additional metrics, e.g. fraction genome matched?
Is there any issue with directly reporting the estimated ANI's (1-distance)? Should we enable both "distance" and ANI and let user choose?

NOTE: ANI estimation is not backwards compatible with previous prefetch/gather results (needs additional columns).

codecov · 2022-01-13T21:19:36Z

Codecov Report

Merging #1788 (e8c9997) into latest (6f7eb06) will decrease coverage by 62.68%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           latest    #1788       +/-   ##
===========================================
- Coverage   83.17%   20.48%   -62.69%     
===========================================
  Files         126      124        -2     
  Lines       13954    13666      -288     
  Branches     1910     1871       -39     
===========================================
- Hits        11606     2800     -8806     
- Misses       2075    10842     +8767     
+ Partials      273       24      -249

Flag	Coverage Δ
python	`0.11% <0.00%> (-91.03%)`	⬇️
rust	`65.16% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/sourmash/cli/compare.py	`0.00% <0.00%> (-100.00%)`	⬇️
src/sourmash/cli/utils.py	`0.00% <0.00%> (-100.00%)`	⬇️
src/sourmash/commands.py	`0.00% <0.00%> (-88.56%)`	⬇️
src/sourmash/compare.py	`0.00% <0.00%> (-100.00%)`	⬇️
src/sourmash/search.py	`0.00% <0.00%> (-97.01%)`	⬇️
src/sourmash/tax/__main__.py	`0.00% <0.00%> (-87.57%)`	⬇️
src/sourmash/tax/tax_utils.py	`0.00% <0.00%> (-97.78%)`	⬇️
src/sourmash/__main__.py	`0.00% <0.00%> (-100.00%)`	⬇️
src/sourmash/cli/info.py	`0.00% <0.00%> (-100.00%)`	⬇️
src/sourmash/cli/plot.py	`0.00% <0.00%> (-100.00%)`	⬇️
... and 91 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f7eb06...e8c9997. Read the comment docs.

ctb · 2022-01-14T19:14:48Z

lookin' good. do you want comments now or do you want to wait until it's more fully baked?

bluegenes · 2022-01-14T19:17:35Z

@ctb Comments now would be good! I'm done adding things for now, just gonna go take it for a spin :)

src/sourmash/commands.py

src/sourmash/compare.py

ctb · 2022-01-14T19:27:11Z

Questions/Thoughts:

* Currently not reporting confidence interval in `compare`, as the results are stored as a single np matrix. How can we store + report these? Larger CSV output, rather than symmetric matrix? This would elim compatibility with current`plot`. What other issues would larger CSV bring about?

I think leave it out for now. Would it be included in search output?

* Do we also want to report additional metrics, e.g. fraction genome matched? If yes, same issues as ^.

Same answer :). But we might want to NA or zero out the numbers where they are particularly sketch-y (heh)

* Is there any issue with directly reporting the estimated ANI's (1-distance)?  Should we enable both "distance" and ANI and let user choose?

ANI good, IMO!

Co-authored-by: C. Titus Brown <titus@idyll.org>

mahmudhera · 2022-02-18T21:58:24Z

Excellent! The point estimate formula would essentially stay the same, I just want to do some theoretical analysis first, so that we have justification for using the formula. I will do some analyses in the next couple of days and give you an update on this.

mahmudhera · 2022-02-23T06:10:56Z

It looks like the point estimate that we have right now: it would only be reasonable to use when the ANI is large. Otherwise, the formula may work well by chance, but there is no theoretical guarantee. The exact range of ANI depends on other parameters.

We can approach this in one of two ways: we can either keep the code as is, and besides the point estimate, we can return an estimated error. If the error is above some threshold, we can report that Jaccard->mutation_rate code cannot be trusted. Or, we can try to improve the error rate by solving for the point estimate differently, which may take a bit more work. I will try both locally, let me know which one of these you would prefer.

Meanwhile, do you think you can send the exact settings where the code failed? Sending the values of L, k, p, s for the cases where you got this RuntimeWarning would suffice.

bluegenes · 2022-02-23T20:47:19Z

thanks @mahmudhera!

It looks like the point estimate that we have right now: it would only be reasonable to use when the ANI is large. Otherwise, the formula may work well by chance, but there is no theoretical guarantee. The exact range of ANI depends on other parameters.

ok, good to know!

We can approach this in one of two ways: we can either keep the code as is, and besides the point estimate, we can return an estimated error. If the error is above some threshold, we can report that Jaccard->mutation_rate code cannot be trusted. Or, we can try to improve the error rate by solving for the point estimate differently, which may take a bit more work. I will try both locally, let me know which one of these you would prefer.

I think if it's possible to generate the jaccard ANI point estimate differently without too much extra work, that would be great. If not, the other option is ok -- I would probably just zero out the ANI estimate when the error is too high (using some default, user-modifiable threshold). If the error is much higher, we can recommend only containment --> ANI and provide jaccard equations with warnings. Alternatively, we could recommend using containment only.

Meanwhile, do you think you can send the exact settings where the code failed? Sending the values of L, k, p, s for the cases where you got this RuntimeWarning would suffice.

I need to do some digging to get the exact parameters for when the warnings appeared -- will send over when I have them!

bluegenes · 2022-02-23T22:09:12Z

Relevant portion of a traceback for the RuntimeWarnings encountered w/ jaccard --> ANI

  File "/home/ntpierce/miniconda3/8d22aeebeaab4ce824f4fd7c3e1b98c6/lib/python3.8/site-packages/sourmash/distance_utils.py", line 148, in jaccard_to_distance
    sol1 = brentq(f1, 0.0000001, 0.9999999)
  File "/home/ntpierce/miniconda3/8d22aeebeaab4ce824f4fd7c3e1b98c6/lib/python3.8/site-packages/scipy/optimize/_zeros_py.py", line 783, in brentq
    r = _zeros._brentq(f, a, b, xtol, rtol, maxiter, args, full_output, disp)
  File "/home/ntpierce/miniconda3/8d22aeebeaab4ce824f4fd7c3e1b98c6/lib/python3.8/site-packages/sourmash/distance_utils.py", line 145, in <lambda>
    f1 = lambda pest: 2.0/(2- (1-pest)**ksize ) - 1 + z_alpha * sqrt(var_direct(pest)) - jaccard
RuntimeWarning: invalid value encountered in sqrt

Note, when this shows up, it shows up for both f1 and f2, never just one.

I re-ran a number of analyses and recorded incidence of errors. Sent csv via email!

mahmudhera · 2022-03-04T05:35:17Z

I did this PR #1860, which adds codes for jaccard to point-estimate into your branch. I tested the obvious corner cases, seem to be passing. Just to note: this code does have some error, and therefore also returns an error measure with the point-estimate. More details are in the code docstrings. Let me know if you have any questions. I also added code to calculate the likelihood that nothing may be common in sketches. Look at dist_utils.py for more details.

* enabled running main * added point estimate and error bound * added docstring * added usage code * added nothing_common code, usage, docstrings, tested

bluegenes · 2022-04-27T17:45:22Z

This PR was split & refactored into

The only remaining functionality needed is adding ANI thresholding to tax summarization.

bluegenes · 2022-04-27T18:47:23Z

punted tax ANI to #2005 and JOSS updates to #2006. I think this can safely be closed now!

* init with tax code from #1788 * tax threshold arg from #1788 * add back sql changes from latest * make sure we can still use tax without ANI col * dont warn about ignoring ANI more than once * check query stats * fix gather test * add sentence on ani thresh * apply sugg from code review

init dist utils

f728db1

bluegenes added 10 commits January 13, 2022 13:22

add confidence tests

70d2ae2

optionally return sim/ident instead

6c061a0

add ani estimation to signature

6da758e

return confidence interval in addition to point estimate

bd6485d

return ANI from compare

2177843

add containment ANI tests to test_compare

4d61a38

init compare ANI tests

0e351b8

add option

66a570f

ani in prefetch and gather

530ab76

test ani from precalculated score

361bb4d