[EXP] add sourmash distance estimation #1788
Codecov Report
@@ Coverage Diff @@
## latest #1788 +/- ##
===========================================
- Coverage 83.17% 20.48% -62.69%
===========================================
Files 126 124 -2
Lines 13954 13666 -288
Branches 1910 1871 -39
===========================================
- Hits 11606 2800 -8806
- Misses 2075 10842 +8767
+ Partials 273 24 -249
lookin' good. do you want comments now or do you want to wait until it's more fully baked?
@ctb Comments now would be good! I'm done adding things for now, just gonna go take it for a spin :)
I think leave it out for now. Would it be included in
Same answer :). But we might want to NA or zero out the numbers where they are particularly sketch-y (heh)
ANI good, IMO!
Co-authored-by: C. Titus Brown <titus@idyll.org>
Excellent! The point estimate formula would essentially stay the same; I just want to do some theoretical analysis first, so that we have justification for using the formula. I will do some analyses in the next couple of days and give you an update on this.
It looks like the point estimate that we have right now is only reasonable to use when the ANI is large. Otherwise, the formula may work well by chance, but there is no theoretical guarantee. The exact range of ANI depends on the other parameters. We can approach this in one of two ways. We can keep the code as is and return an estimated error alongside the point estimate; if the error is above some threshold, we can report that the Jaccard->mutation_rate result cannot be trusted. Or we can try to reduce the error by solving for the point estimate differently, which may take a bit more work. I will try both locally; let me know which of these you would prefer. Meanwhile, could you send the exact settings where the code failed? The values of L, k, p, s for the cases where you got this RuntimeWarning would suffice.
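For readers following along, here is a minimal sketch of a Jaccard-based point estimate. It assumes the simple mutation model (each base mutates independently at rate p, so a k-mer survives unmutated with probability (1-p)^k) and that both sequences have about the same number of distinct k-mers, which gives J = q/(2-q) with q = (1-p)^k. The function name `jaccard_to_ani_point_estimate` is hypothetical and is not the code in this branch.

```python
def jaccard_to_ani_point_estimate(jaccard: float, ksize: int) -> float:
    """Rough ANI point estimate from a Jaccard index.

    Assumption (not from the PR code): simple mutation model with
    equally sized k-mer sets, so J = q / (2 - q) where q = (1 - p)^k.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    # Invert J = q / (2 - q) to recover the k-mer survival probability q.
    kmer_survival = 2.0 * jaccard / (1.0 + jaccard)
    # ANI = 1 - p = q^(1/k).
    return kmer_survival ** (1.0 / ksize)
```

Because the inversion leans on both assumptions, an estimate like this carries no guarantee on its own, which is why an accompanying error measure is being discussed.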
thanks @mahmudhera!
ok, good to know!
I think if it's possible to generate the jaccard ANI point estimate differently without too much extra work, that would be great. If not, the other option is ok -- I would probably just zero out the ANI estimate when the error is too high (using some default, user-modifiable threshold). If the error is much higher, we can recommend only containment --> ANI and provide jaccard equations with warnings. Alternatively, we could recommend using containment only.
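The "zero out when the error is too high" idea could look roughly like the sketch below. The function name `apply_error_threshold` and the default of `1e-4` are placeholders chosen for illustration only; neither comes from this PR.

```python
def apply_error_threshold(ani: float, est_error: float,
                          max_error: float = 1e-4) -> float:
    """Return the ANI estimate, or 0.0 if its estimated error is too high.

    Hypothetical helper: `max_error` is a user-modifiable threshold and
    its default value here is an arbitrary placeholder.
    """
    if est_error < 0.0:
        raise ValueError("est_error must be non-negative")
    # Keep the estimate only when the error is within the threshold.
    return ani if est_error <= max_error else 0.0
```

Returning `0.0` mirrors the "zero out" suggestion; returning `None` or NaN instead would make untrusted values easier to tell apart from a true zero.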
I need to do some digging to get the exact parameters for when the warnings appeared -- will send over when I have them!
Relevant portion of a traceback for the RuntimeWarnings encountered w/ jaccard --> ANI
Note, when this shows up, it shows up for both
I re-ran a number of analyses and recorded incidence of errors. Sent csv via email!
I opened PR #1860, which adds code for the jaccard to point-estimate conversion into your branch. I tested the obvious corner cases, and they seem to be passing. Just to note: this code does have some error, and therefore also returns an error measure with the point estimate. More details are in the code docstrings. Let me know if you have any questions. I also added code to calculate the likelihood that nothing may be common in the sketches. Look at dist_utils.py for more details.
* enabled running main
* added point estimate and error bound
* added docstring
* added usage code
* added nothing_common code, usage, docstrings, tested
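To give a feel for the "nothing in common" calculation mentioned above, here is a rough back-of-the-envelope sketch. It is not the code in dist_utils.py: it assumes each of the L k-mers independently survives mutation with probability (1-p)^k and is kept in the sketch with probability 1/scaled, which ignores the dependence between overlapping k-mers. The function name `prob_nothing_in_common` is hypothetical.

```python
def prob_nothing_in_common(n_kmers: int, ksize: int, scaled: int,
                           mutation_rate: float) -> float:
    """Approximate probability that two sketches share no hashes.

    Assumption (illustration only): k-mers are treated as independent,
    each unmutated with probability (1 - p)^k and sketched with
    probability 1 / scaled.
    """
    if not 0.0 <= mutation_rate <= 1.0:
        raise ValueError("mutation_rate must be between 0 and 1")
    if n_kmers < 0 or ksize < 1 or scaled < 1:
        raise ValueError("n_kmers, ksize and scaled must be valid sizes")
    # Chance that a single k-mer is both unmutated and retained.
    p_shared = (1.0 - mutation_rate) ** ksize / scaled
    # Chance that none of the n_kmers k-mers ends up shared.
    return (1.0 - p_shared) ** n_kmers
```

For short sequences, large `scaled` values, or low ANI this probability grows quickly, which is one reason an ANI estimate from sketches can be unreliable in those regimes.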
This PR was split & refactored into
The only remaining functionality needed is adding ANI thresholding to
Implement a minimal version of https://github.com/KoslickiLab/mutation-rate-ci-calculator to estimate distance from scaled containment and jaccard.
See https://www.biorxiv.org/content/10.1101/2022.01.11.475870v1
May fix #1242
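As a rough illustration of the containment side of this, the sketch below turns a containment value into a distance under the simple mutation model, where containment is approximately (1-p)^k and the distance is the mutation rate p (so ANI = 1 - p). This is a simplified stand-in with a hypothetical name, `containment_to_distance`; it leaves out the confidence intervals that the linked calculator provides.

```python
def containment_to_distance(containment: float, ksize: int) -> float:
    """Rough distance (mutation rate) point estimate from containment.

    Assumption (illustration only): simple mutation model, where
    containment ~= (1 - p)^k for mutation rate p and k-mer size k.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    # Invert C = (1 - p)^k to get p = 1 - C^(1/k).
    return 1.0 - containment ** (1.0 / ksize)
```

ANI under the same assumption is `1.0 - containment_to_distance(c, k)`.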
* `compare`
* `prefetch`
* `gather`
* `search`
add'l to do:
* `search`, `prefetch`, `gather`, NOT in `compare`. This maintains compatibility with `sourmash plot`.
* `signature` ANI functions
* `confidence` param to `minhash`, `signature` ANI functions to allow tuning confidence level. Defaults to `0.95`.
* `--ani-threshold` to `tax genome`
Questions/Thoughts:
NOTE: ANI estimation is not backwards compatible with previous prefetch/gather results (needs additional columns).