use hash bits more effectively #69
base: master
Conversation
Any data on how much this helps? It seems like this is going to be
single-digit percent in hash utilization, and that doesn’t directly
translate into improved recognition or discrimination.
Using a non-power-of-2 FFT size might make the FFT much, much slower. I
think the default FFT is 2^9 = 512. That plus 2 is 514, whose prime
factors are 2 and 257. So it’s essentially a non-fast FT over 257 points
plus a few tweaks, which might be ~20x slower than the 512-point case.
DAn.
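A quick way to sanity-check that concern is a timing sketch like the one below (not part of the PR; the actual cost depends on which numpy/scipy FFT backend is in use):

```python
import timeit
import numpy as np

# Compare a power-of-2 size with the 514-point size discussed above.
# 514 = 2 * 257, so the 257-point prime factor dominates the cost on FFT
# implementations without a good large-prime strategy.
for n in (512, 514, 1024):
    x = np.random.randn(n).astype(np.float32)
    per_call = timeit.timeit(lambda: np.fft.rfft(x), number=10000) / 10000
    print("rfft(%4d): %.1f us per call" % (n, per_call * 1e6))
```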
On Sun, Sep 22, 2019 at 20:19, Daniel Brooks wrote:
Audfprint uses most of the possible hash values, but leaves out a few.
Fixing this improves recognition rates while storing the same number of
hashes, or allows you to reduce the number of hashes stored while keeping
acceptable recognition rates.
Of course, this is not backwards-compatible; queries generated with these
fixes won't match very well against databases generated without them. I've
not made any changes to the hash table code here, so these get saved with
the current version number.
Commit Summary
- use all bits of dt in the hash
- use all bits of df in the hash
- compute two additional fft bins, then discard the bottom and top bin
File Changes
- *M* audfprint_analyze.py (19)
Patch Links:
- https://github.com/dpwe/audfprint/pull/69.patch
- https://github.com/dpwe/audfprint/pull/69.diff
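For context on what "use all bits of dt" and "use all bits of df" mean in the commit summary above, here is a rough sketch of how a landmark gets packed into a 20-bit hash. The 8/6/6 bit split is one reading of audfprint_analyze.py's defaults, so treat the exact widths and names as assumptions rather than the PR's final layout:

```python
# Minimal sketch of landmark-to-hash packing (bit widths are an assumption).
F1_BITS, DF_BITS, DT_BITS = 8, 6, 6

def pack_hash(f1, f2, dt):
    """Pack peak frequency f1, frequency delta (f2 - f1), and time delta dt."""
    df = (f2 - f1) & ((1 << DF_BITS) - 1)          # wrap the signed delta into DF_BITS
    return (((f1 & ((1 << F1_BITS) - 1)) << (DF_BITS + DT_BITS))
            | (df << DT_BITS)
            | (dt & ((1 << DT_BITS) - 1)))

# 2**(8 + 6 + 6) == 2**20 == 1048576 possible hash values, the denominator
# in the occupancy figures quoted below.
```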
With these changes I've reduced the number of hashes that I need to keep by ~40%.
I hadn't considered this, and I'm not really measuring CPU performance at all. Performance while creating fingerprints is fine, and querying is slow, but as far as I can tell it's no slower than before.
The total loss of hash values is 9%. Before the fixes this sample of audio resulted in 895846 unique hashes (out of 2^20 possible hash values); after the fix it results in 983840 unique values. The remaining missing values are due to the dependency between f and df: when f is low, df is likely to be positive (if f is zero, then df cannot be negative at all). I've been thinking about a fix for that as well, but I've not tried it yet.
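Occupancy figures like these can be reproduced with a short script along these lines (a sketch; it assumes Analyzer() builds with default settings and that wavfile2hashes() returns (time, hash) pairs, which is one reading of audfprint_analyze.py):

```python
import numpy as np
import audfprint_analyze

# Assumes the Analyzer / wavfile2hashes interface; adjust if it differs.
analyzer = audfprint_analyze.Analyzer()

seen = set()
for filename in ["clip1.wav", "clip2.wav"]:     # placeholder file list
    hashes = np.asarray(analyzer.wavfile2hashes(filename))
    seen.update(int(h) for h in hashes[:, 1])   # keep only the hash column

print("%d unique hashes out of 2**20 = %d (%.1f%% of the hash space)"
      % (len(seen), 2 ** 20, 100.0 * len(seen) / 2 ** 20))
```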
More fully occupying the hash space is nice, but it seems second-order. It
certainly doesn't translate into needing fewer hashes to maintain match
accuracy. I don't mean to dismiss your improvements, but a 40% reduction in
hashes is huge, and I'd really love to understand what it relies on. I can
imagine that setting a larger minimum time-difference for hashes might
exclude a lot of high-occurrence, low-value hashes; I wonder if that's it?
One problem is that it's very hard to measure "efficiency" here (even
ignoring the fact that the existing implementation doesn't gain much from a
less-full database). The real performance limit is the false alarm - false
reject tradeoff, but the false alarm curve is *very* steep: you get almost
none, then if you drop your --min-count threshold too low, you get lots.
It might be interesting to look at statistics of the number of hashes at
each time/frequency offset. I'm not sure how best to illustrate that 3D
dataset, but even just looking at a histogram of overall hash frequency vs.
overall hash count might be informative. We could implement a pruning
strategy that fills the hash table for a while, then identifies the most
popular hashes, then "blacklists" them. Eliminating a small fraction of
the hash values could result in removing a large portion of the database,
which could in theory save space (for some alternative hash storage
mechanism) and will also accelerate matching (because fewer matching hashes
have to be tallied). The *question* is how much it hurts retrieval.
DAn.
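The blacklist idea could be prototyped offline along these lines (a sketch with invented names; all_hashes stands for every hash entry stored in the database, not an existing audfprint structure):

```python
from collections import Counter
import numpy as np

def blacklist_popular(all_hashes, top_fraction=0.001):
    """Drop the most frequent hash values from a flat array of stored hashes.

    Returns (blacklist, kept): the set of banned hash values and the
    surviving entries. top_fraction is the share of *distinct* values to ban.
    """
    counts = Counter(int(h) for h in all_hashes)
    n_ban = max(1, int(top_fraction * len(counts)))
    blacklist = {h for h, _ in counts.most_common(n_ban)}
    kept = np.array([h for h in all_hashes if int(h) not in blacklist])
    return blacklist, kept
```

If the occurrence distribution is as skewed as the "more than half of the files" observation below suggests, banning even a tiny fraction of distinct values could remove a much larger share of stored entries.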
It may be that the gains are more than expected because my audio is very clean. The queries aren't recorded on a cell phone in a nightclub; they come directly from the corpus. We're not looking to duplicate Shazam; we're looking to build a way to find all the television and radio programs that played a commercial, jingle, soundbite, etc.

That said, I think the dt component has the same minimum value of 2 time hops as before; I simply encode that as a zero and extend the top end of the range to fill out the space in the hash. You can actually see that in the graph.
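A minimal sketch of that dt re-encoding, assuming a 6-bit dt field and a minimum of 2 hops (the constant names are illustrative, not the PR's):

```python
MIN_DT = 2    # dt never goes below 2 time hops, as noted above
DT_BITS = 6   # assumed width of the dt field in the hash

def encode_dt(dt):
    # Store dt - MIN_DT so codes 0 and 1 aren't wasted and the top of the
    # range gains two extra usable values.
    return (dt - MIN_DT) & ((1 << DT_BITS) - 1)

def decode_dt(code):
    return code + MIN_DT
```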
Agreed, but as I've said, I've never even seen any false positives. The closest I've seen is matches between very closely related pieces of audio: two versions of a commercial where the start and end are identical, but there's an extra line of dialogue in the middle of one version. It may be that the 40% reduction is overly optimistic, and
This is probably true; some hashes occur in more than half of the files in a dataset of 2k files that I've tested with.
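Per-file occurrence counts like that can be gathered with something along these lines (a sketch; per_file_hashes is an assumed mapping from filename to that file's hash values, however they were extracted):

```python
from collections import Counter

def file_occurrence_counts(per_file_hashes):
    """Count, for each distinct hash value, how many files it appears in."""
    occurrence = Counter()
    for hashes in per_file_hashes.values():
        occurrence.update(set(int(h) for h in hashes))  # each file counts once
    return occurrence

# Example: report hashes that show up in more than half of the files.
# occurrence = file_occurrence_counts(per_file_hashes)
# n_files = len(per_file_hashes)
# common = [h for h, n in occurrence.items() if n > n_files / 2]
```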