what to do about ANI estimate for two very small scaled sketches? #2003

Open
bluegenes opened this issue Apr 27, 2022 · 4 comments

@bluegenes

bluegenes commented Apr 27, 2022

With #1967, we will now estimate ANI for any scaled sketch comparisons, regardless of sketch size. These estimates may be inaccurate for viruses/small genomes.

context from #1967 (comment):
@ctb:

A question that I didn't see clearly addressed anywhere (but I might have missed!) is what happens when you try to calculate ANI for two very small scaled sketches? Does it just end up being 0, or None, or ?

@bluegenes:
At the moment, we just report the ANI, except for extremely tiny test data where we can't actually estimate ANI.

For Jaccard --> ANI, we estimate the error on the Jaccard estimate itself and raise a warning when that error may be too high (but currently still report ANI). The SearchResult class has an attribute that keeps track of whether the Jaccard estimate error is too high. I think we should at least consider flagging results that way, but I would also be open to zeroing out the ANI estimate.

From #1798 (comment):

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.
Interestingly, this approach will always be less space efficient than HLL. HLL experiences a relative error of something like 1.04/Sqrt(m) for a sketch of size m. For FracMinHash, you end up with a relative error of (at absolute minimum, often much worse) 2.07944/Sqrt(m) for a sketch of size m.
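
For reference, the ~4.4x10^6 figure can be reproduced with a short calculation. This is a minimal sketch that assumes the number comes from a two-sided multiplicative Chernoff bound of the form 2*exp(-delta^2 * s * n / 3); the thread does not state the bound, so that form and the function names `min_set_size` and `hll_relative_error` are illustrative assumptions, not sourmash code.

```python
import math


def min_set_size(scale, rel_error=0.05, confidence=0.95):
    """Smallest set size n such that 2*exp(-rel_error**2 * scale * n / 3)
    is at most 1 - confidence (assumed Chernoff-style bound)."""
    alpha = 1.0 - confidence
    return 3.0 * math.log(2.0 / alpha) / (rel_error ** 2 * scale)


def hll_relative_error(m):
    """Standard HyperLogLog relative error for a sketch of size m."""
    return 1.04 / math.sqrt(m)


# s = 1/1000, 95% confidence, 5% relative error:
# about 4.4 million elements, in line with the quoted figure.
n = min_set_size(scale=1 / 1000)
print(f"required set size: {n:.3g}")

# A set of that size yields roughly scale * n hashes in the sketch.
print(f"expected sketch size: {n / 1000:.0f}")
print(f"HLL relative error at m=10000: {hll_relative_error(10000):.4f}")
```

Under this assumption, a 5% error target at scaled=1000 corresponds to a sketch of a few thousand hashes, which is more than many viral genomes produce.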

I was hoping we might be able to use HLL to avoid issues with small sketches, but I suppose we could instead use this to estimate an error based on the sketch size, and then raise a warning, or zero out the ANI, when the error is too high?
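
A rough sketch of what that could look like, under stated assumptions: the error bound is the same Chernoff-style form (an assumption, not stated in the thread), the Jaccard-to-ANI conversion is the common point estimate (2J/(1+J))^(1/k), and `relative_error_bound`, `ani_or_none`, and the 5% threshold are hypothetical names and values chosen for illustration rather than anything in sourmash.

```python
import math
import warnings
from typing import Optional


def relative_error_bound(sketch_size: int, confidence: float = 0.95) -> float:
    """Relative error implied by an assumed Chernoff-style bound
    for a sketch holding `sketch_size` hashes."""
    if sketch_size <= 0:
        return math.inf
    alpha = 1.0 - confidence
    return math.sqrt(3.0 * math.log(2.0 / alpha) / sketch_size)


def ani_or_none(jaccard: float, ksize: int, sketch_size: int,
                max_rel_error: float = 0.05) -> Optional[float]:
    """Return an ANI point estimate, or None (with a warning) when the
    sketch is too small to trust. `sketch_size` is assumed to be the
    smaller of the two sketches being compared."""
    err = relative_error_bound(sketch_size)
    if err > max_rel_error:
        warnings.warn(
            f"sketch has {sketch_size} hashes; estimated relative error "
            f"{err:.3f} exceeds {max_rel_error}; not reporting ANI"
        )
        return None
    if jaccard <= 0:
        return 0.0
    # Convert Jaccard to containment, then take the k-th root.
    containment = 2.0 * jaccard / (1.0 + jaccard)
    return containment ** (1.0 / ksize)


print(ani_or_none(jaccard=0.5, ksize=31, sketch_size=10000))  # an ANI value
print(ani_or_none(jaccard=0.5, ksize=31, sketch_size=10))     # None, with a warning
```

Returning None rather than 0 keeps "could not estimate" distinct from "no similarity", at the cost of callers having to handle the missing value.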

@ctb

ctb commented Apr 28, 2022

Hot take: Warning + None-ing it out as in #2004 seems good for now.

@bluegenes

Handled by #2032.

@bluegenes

bluegenes commented May 19, 2022

This is happening much more often than I expected, and for some applications (prefetch, gather) it can yield very verbose output (#2058).

Can deal with verbosity by changing the warning strategy, but I'm not sure zeroing out is the right call, if it's happening this often...
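
One possible warning strategy, sketched here only as an illustration: count the affected comparisons and emit a single summary warning at the end of a run instead of one warning per comparison. The `WarningTally` class is hypothetical and is not part of sourmash.

```python
import warnings
from collections import Counter


class WarningTally:
    """Collect repeated warning reasons and report each one once."""

    def __init__(self):
        self.counts = Counter()

    def note(self, reason):
        # Record the reason silently instead of warning immediately.
        self.counts[reason] += 1

    def summary(self):
        # One line per distinct reason, with the number of occurrences.
        return [f"{reason} ({n} comparisons)" for reason, n in self.counts.items()]

    def flush(self):
        # Emit one warning per distinct reason, then reset the tally.
        for message in self.summary():
            warnings.warn(message)
        self.counts.clear()


tally = WarningTally()
for _ in range(3):
    tally.note("ANI not reported: sketch too small")
tally.flush()  # a single warning covering all three comparisons
```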

@bluegenes

thresholds modified in #2074
