Speeding up qgram distances with pre-counting of qgrams #34
Comments
Yeah, using sorted arrays (sorted on the qgram) for pre-calculation makes quite a big difference (7x to 13x faster than "vanilla" StringDistances.evaluate on each comparison on my machine) in scenarios where you need to compare some strings multiple times: https://gist.github.com/robertfeldt/072147dc606c878080cd70972d76c8dd Note also the possible re-design of the QgramDistances by using counters. To me it seems a bit cleaner and more flexible than having the counting code inside each QgramDistance, but YMMV.
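For readers who want a picture of what "sorted arrays of qgram counts" could look like, here is a minimal sketch. It is not the code from the gist: the helper name `sorted_qgram_counts` and the choice of a `Vector{Pair{String,Int}}` as the representation are assumptions made purely for illustration.

```julia
# Hypothetical helper (not part of StringDistances.jl or the gist):
# return the q-grams of `s` with their counts, sorted by q-gram.
function sorted_qgram_counts(s::AbstractString, q::Integer)
    chars = collect(s)  # work on characters so non-ASCII strings are safe
    # All q-grams, sorted so that equal q-grams end up next to each other.
    grams = sort!(String[String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1)])
    counts = Pair{String,Int}[]
    for g in grams
        if !isempty(counts) && first(counts[end]) == g
            # Same q-gram as the previous entry: bump its count.
            counts[end] = g => last(counts[end]) + 1
        else
            push!(counts, g => 1)
        end
    end
    return counts
end

counts = sorted_qgram_counts("banana", 2)
```

The counts for a string are computed once and can then be reused in every later comparison involving that string, which is where the savings come from.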
Actually this can be sped up further with an optimized loop over the sorted qgram count arrays by:
Full code is here: This takes the speedup from the 7x-13x range up to 17x-20x faster than vanilla StringDistances.evaluate, so roughly another factor of 1.5. This makes a big difference in, for example, distance matrix calculations.
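As a generic illustration of looping over two sorted qgram count arrays in a single pass, here is a merge-style sketch. It does not reproduce the specific optimizations referred to above; the names `merge_tally` and `jaccard`, and the pair-vector input format, are assumptions for this example only.

```julia
# Hypothetical sketch: walk two q-gram count vectors, each sorted by q-gram,
# and tally how many distinct q-grams occur only in `a`, in both, or only in `b`.
function merge_tally(a::Vector{Pair{String,Int}}, b::Vector{Pair{String,Int}})
    left = shared = right = 0
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        ga, gb = first(a[i]), first(b[j])
        if ga == gb
            shared += 1; i += 1; j += 1
        elseif ga < gb
            left += 1; i += 1    # q-gram only in `a`
        else
            right += 1; j += 1   # q-gram only in `b`
        end
    end
    # Whatever remains in either vector has no partner in the other one.
    left += length(a) - i + 1
    right += length(b) - j + 1
    return (left = left, shared = shared, right = right)
end

# Jaccard distance on the q-gram sets, derived from the tally.
jaccard(t) = 1 - t.shared / (t.left + t.shared + t.right)

a = ["an" => 2, "ba" => 1, "na" => 2]                        # bigrams of "banana"
b = ["an" => 2, "ba" => 1, "da" => 1, "na" => 1, "nd" => 1]  # bigrams of "bandana"
t = merge_tally(a, b)
```

Because both inputs are sorted, each entry is visited once and no hashing or lookup is needed during the comparison.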
Thanks Robert. This looks amazing. Please do a PR if you can! I have two comments for the PR. First, could you find a way to implement it without increasing timings for simple comparisons? I'm fine with having two code paths, even if it leads to some code duplication. Second, I'd like to avoid defining
Ok, thanks Matthieu. I've started converting to a PR. No major problems so far...
Ok, PR now done. Hopefully I understood your two comments and have managed to align with them: When we have discussed and hopefully merged this, there are two more PRs I can do if there is interest:
Note that there are some additional speedup ideas that can be done for the IntersectionDists that I didn't do now since I wanted to stay close to the code of my gists. I doubt it will make a big difference but this:

    @inline countleft!(c::ThreeCounters{Int, QD}, n1::Integer) where {QD<:IntersectionDist} =
        c.left += (n1 > 0)

can actually be written simply:

    @inline countleft!(c::ThreeCounters{Int, QD}, n1::Integer) where {QD<:IntersectionDist} =
        c.left += 1

since

A potentially larger gain might be had by introducing calls to something like
Yes, a
Beyond
Closing this since the speedup PR has been merged. Will now think about the
and post as issues (before implementing) or do PRs, as needed. Thanks.
Thanks again for a great package.
I repeatedly need to calculate distances to some set of strings, so I tried to speed things up by pre-counting qgrams. This can be very useful when calculating, for example, distance matrices.
I get about 2.5x-3x speedups after pre-counting. Of course, it is slower (about twice the time on my machine) if you only want to calculate each distance once.
If there is any interest in merging this I can try doing a PR at some point; if not, here is the code in case someone else has a similar need:
https://gist.github.com/robertfeldt/103f078b3154c5621f52cee3d061bf81
I'll try to also use a pre-sorted array of counts rather than a dict and see if I can push this further.
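To make the dict-based pre-counting idea concrete, here is a rough sketch (not the gist's code). The helper names `qgram_dict`, `qgram_dist` and `distance_matrix` are made up for this example, and the distance used is the plain q-gram distance, i.e. the sum of absolute differences between the two count tables.

```julia
# Hypothetical helper: count the q-grams of `s` in a Dict (illustration only).
function qgram_dict(s::AbstractString, q::Integer)
    chars = collect(s)  # work on characters so non-ASCII strings are safe
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - q + 1)
        g = String(chars[i:i+q-1])
        counts[g] = get(counts, g, 0) + 1
    end
    return counts
end

# q-gram distance: sum of absolute differences between two count tables.
function qgram_dist(c1::Dict{String,Int}, c2::Dict{String,Int})
    d = 0
    for (g, n) in c1
        d += abs(n - get(c2, g, 0))
    end
    for (g, n) in c2
        haskey(c1, g) || (d += n)  # q-grams that occur only in c2
    end
    return d
end

# Count once per string, then reuse the counts for every pair.
function distance_matrix(strs::Vector{String}, q::Integer)
    counts = [qgram_dict(s, q) for s in strs]
    n = length(strs)
    D = zeros(Int, n, n)
    for i in 1:n, j in (i+1):n
        D[i, j] = D[j, i] = qgram_dist(counts[i], counts[j])
    end
    return D
end

D = distance_matrix(["banana", "bandana", "band"], 2)
```

With n strings, the counting is done n times instead of once per pair, which is why the approach pays off for distance matrices but not for a single one-off comparison.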