what to do about ANI estimate for two very small scaled sketches? #2003

bluegenes · 2022-04-27T17:19:08Z

With #1967, we will now estimate ANI for any comparison between scaled sketches, regardless of sketch size. These estimates may be inaccurate for viruses and other small genomes.

context from #1967 (comment):
@ctb:

A question that I didn't see clearly addressed anywhere (but I might have missed!) is what happens when you try to calculate ANI for two very small scaled sketches? Does it just end up being 0, or None, or ?

@bluegenes:
At the moment, we just report the ANI, except for extremely tiny test data where we can't actually estimate ANI.

For Jaccard → ANI, we estimate the error on the Jaccard estimate itself and raise a warning when the error may be too high (but currently still report ANI). I have an item in the SearchResult class that keeps track of whether the Jaccard estimate error is too high. I think we should at least consider doing that, but I would also be open to zeroing out the ANI estimate.

From #1798 (comment):

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4 x 10^6 elements.
Interestingly, this approach will always be less space efficient than HLL. HLL experiences a relative error of something like 1.04/sqrt(m) for a sketch of size m. For FracMinHash, you end up with a relative error of at least 2.07944/sqrt(m) (that is the absolute minimum; it is often much worse) for a sketch of size m.
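To make the two error expressions quoted above concrete, here is a minimal Python sketch that evaluates them and inverts them to get the smallest sketch size meeting a target relative error. The function names are hypothetical and not part of sourmash. The FracMinHash constant is only the quoted lower bound, so real errors may be larger, and this does not reproduce the probabilistic argument behind the ~4.4 x 10^6 figure.

```python
import math

HLL_CONSTANT = 1.04             # quoted HLL relative error: 1.04 / sqrt(m)
FRACMINHASH_CONSTANT = 2.07944  # quoted FracMinHash lower bound: 2.07944 / sqrt(m)


def relative_error(constant: float, sketch_size: int) -> float:
    """Evaluate constant / sqrt(m) for a sketch holding m hashes."""
    if sketch_size <= 0:
        raise ValueError("sketch_size must be positive")
    return constant / math.sqrt(sketch_size)


def min_sketch_size(constant: float, target_error: float) -> int:
    """Smallest m such that constant / sqrt(m) <= target_error."""
    if target_error <= 0:
        raise ValueError("target_error must be positive")
    return math.ceil((constant / target_error) ** 2)


if __name__ == "__main__":
    for name, c in [("HLL", HLL_CONSTANT), ("FracMinHash", FRACMINHASH_CONSTANT)]:
        print(name, "min sketch size for 5% error:", min_sketch_size(c, 0.05))
```

Under these formulas FracMinHash needs roughly four times as many hashes as HLL for the same target error, since the required size scales with the square of the constant.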

I was hoping we might be able to use HLL to avoid issues with small sketches, but I suppose we could instead use this to estimate an error based on the sketch size, and either raise a warning or zero out the ANI when the error is too high?
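A rough sketch of what a size-based check could look like, assuming the 2.07944/sqrt(m) lower bound from the quote is used as the error estimate. The function `ani_or_none`, its parameters, and the 5% default threshold are all hypothetical illustrations, not the sourmash implementation or its actual thresholds.

```python
import math
import warnings
from typing import Optional

FRACMINHASH_CONSTANT = 2.07944  # quoted lower bound on relative error: c / sqrt(m)


def ani_or_none(
    ani: float,
    sketch_size: int,
    max_relative_error: float = 0.05,  # hypothetical threshold
) -> Optional[float]:
    """Return the ANI estimate, or None (with a warning) if the sketch is too small."""
    if sketch_size <= 0:
        est_error = math.inf  # an empty sketch carries no information
    else:
        est_error = FRACMINHASH_CONSTANT / math.sqrt(sketch_size)

    if est_error > max_relative_error:
        warnings.warn(
            f"sketch has {sketch_size} hashes; estimated relative error "
            f"{est_error:.3f} exceeds {max_relative_error}; ANI set to None",
            RuntimeWarning,
        )
        return None
    return ani
```

Returning None rather than 0 keeps "could not estimate" distinguishable from a genuine ANI of zero.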


ctb · 2022-04-28T15:08:22Z

Hot take: Warning + None-ing it out as in #2004 seems good for now.

bluegenes · 2022-05-13T22:11:14Z

Handled by #2032.

bluegenes · 2022-05-19T16:28:35Z

This is happening much more often than I expected, and for some applications (prefetch, gather) it can yield very verbose output (#2058).

Can deal with verbosity by changing the warning strategy, but I'm not sure zeroing out is the right call, if it's happening this often...
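One possible warning strategy, sketched here only as an illustration: count the suppressed estimates during a run and emit a single summary warning at the end, instead of one warning per comparison. The class `SuppressedAniCounter` is hypothetical and not part of sourmash.

```python
import warnings


class SuppressedAniCounter:
    """Hypothetical helper: tally suppressed ANI estimates, warn once at the end."""

    def __init__(self) -> None:
        self.count = 0

    def record(self) -> None:
        """Call whenever an ANI estimate is withheld for a too-small sketch."""
        self.count += 1

    def summarize(self) -> None:
        """Emit one summary warning if anything was suppressed."""
        if self.count:
            warnings.warn(
                f"{self.count} ANI estimate(s) were not reported because "
                "the sketches were too small for a reliable estimate",
                RuntimeWarning,
            )
```

A prefetch- or gather-style loop would call `record()` per affected comparison and `summarize()` once after the loop, so the output grows by one line regardless of how many comparisons were affected.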

bluegenes · 2022-07-08T22:26:58Z

Thresholds modified in #2074.

bluegenes mentioned this issue Apr 27, 2022

[MRG] output estimated ANI from sourmash compare, search, prefetch, and gather #1967

Merged

6 tasks

bluegenes mentioned this issue Apr 29, 2022

debiasing FracMinHash - plans and progress #1798

Open

bluegenes closed this as completed May 13, 2022

bluegenes mentioned this issue May 16, 2022

add script that computes and prints the three different ways to make … mahmudhera/phylogenetic-tree-using-fracminhash#2

Merged

bluegenes reopened this May 19, 2022

bluegenes mentioned this issue Jun 2, 2022

Sourmash ANI estimate in some cases does not match manual computation, although using the same sketch signature mahmudhera/sourmash-ani-implementation-test#1

Closed
