Speeding up qgram distances with pre-counting of qgrams #34

robertfeldt · 2020-10-20T09:40:07Z

Thanks again, for a great package.

I repeatedly need to calculate distances to some set of strings so I tried to speed up by pre-counting of qgrams. This can be very useful when calculating for example distance matrices etc.

I can get about 2.5-3 times speedups after pre-counting. Of course, it is slower (double time on my machine) if you only want to calculate distances once.

If there is any interest in merging this I can try doing a PR at some point, if not here is the code if someone else have a similar need:

https://gist.github.com/robertfeldt/103f078b3154c5621f52cee3d061bf81

I'll try to also use a pre-sorted array of counts rather than a dict and see if I can push this further.

robertfeldt · 2020-10-21T08:12:26Z

Yeah, using sorted arrays (sorted on the qgram) for pre-calculation makes quite a big difference (7x to 13x faster than "vanilla" StringDistancces.evaluate) on each comparison on my machine) in scenarios where you need to compare multiple times on some strings:

https://gist.github.com/robertfeldt/072147dc606c878080cd70972d76c8dd

Note also the possible re-design of the QgramDistances by using counters. To me it seems a bit cleaner and more flexible than having the counting code inside each Qgramdistance, but YMMV.

robertfeldt · 2020-10-21T11:59:49Z

Actually this can be sped up further with an optimized loop over the sorted qgram count arrays by:

Using an inner for-loop over one of the arrays when the other one is "empty"
Using Base.cmp once instead of multiple comparisons on the SubString's

Full code is here:
https://gist.github.com/robertfeldt/289b58f6efecaf051b3c0ad69cd0bd85
but the only real change is the _yield_on_co_count_pairs2

This takes the speedup from the 7x-13x range up to 17-20x faster than vanilla StringDistances.evaluate, so about another factor of about 1.5. This makes a big difference in for example distance matrix calculations.

matthieugomez · 2020-10-21T15:41:33Z

Thanks Robert. This looks amazing. Please do a PR if you can!

I have two comments for the PR. First, could you find a way to implement it without increasing timings for simple comparisons? I'm fine having two code paths, even it leads to some code duplication.

Second, I'd like to avoid defining(dist::QGramDistance)(s1::Vector{Pair}, s2::Vector{Pair}) since it conflicts with the general definition of dist(::QGramDistance)(s1, s2) for two iterators. It seems better to only define (dist::QGramDistance)(s1::QGramCount, s2::QGramCount), which is not ambiguous if QGramCount is not an iterator.

robertfeldt · 2020-10-23T13:41:47Z

Ok, thanks Matthie. I've started converting to a PR. No major problems so far...

robertfeldt · 2020-10-23T21:45:22Z

Ok, PR now done. Hopefully I understood your two comments and have managed to align with them:

#36

When we have discussed and hopefully merged this there are two more PRs I can do if there is interest:

pairwise(dist, strings::Vector{AbstractString}; precalc::Bool = true) - which calculated the distance matrix between all pairs of strings and uses precalculation if the flag is true (and if there are at least, say, 5 strings in the vector, since precalculation saves time from 3-4 or so)
Add more Qgram distances, in particular the different compression distances based on qgram dictionaries. This should be easy with the new design of the PR.

robertfeldt · 2020-10-23T22:05:45Z

Note that there are some additional speedup ideas that can be done for the IntersectionDists that I didn't do now since I wanted to stay close to the code of my gists. I doubt it will make a big difference but this:

@inline countleft!(c::ThreeCounters{Int, QD}, n1::Integer) where {QD<:IntersectionDist} =
	c.left += (n1 > 0)

can actually be written simply:

@inline countleft!(c::ThreeCounters{Int, QD}, n1::Integer) where {QD<:IntersectionDist} =
	c.left += 1

since countleft! (and its "friends" for the same counter) will only be called when the arguments are >0. I doubt this will make a big difference and there is some risk in this if someone "forgets" about this in the future.

A potentially larger gain might be had by introducing calls to something like initleft!(counter, length(d1)) before starting to loop in the countmatches! methods since we actually know how many unique qgrams each of the strings/objects has upfront. Then the countleft!and countright! methods can be blank/do nothing for the IntersectionDists which Julia can hopefully optimize quite a lot. Thus Jaccard, SorensenDice, and Overlap should have quite some gains from this, I think.

matthieugomez · 2020-10-24T18:37:06Z

Yes, a pairwise method, following the Distances API, would be great too.

matthieugomez · 2020-10-24T19:20:24Z

Beyond pairwise, it would also be great to use this precounting for the findnearest and findall methods

robertfeldt · 2020-10-25T09:03:11Z

Closing this since the speedup PR has been merged. Will now think about the

pairwise,
more qgram distances, and
findnearest/findall updates

and post as issues (before implementing) or do PRs, as needed. Thanks.

robertfeldt closed this as completed Oct 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speeding up qgram distances with pre-counting of qgrams #34

Speeding up qgram distances with pre-counting of qgrams #34

robertfeldt commented Oct 20, 2020

robertfeldt commented Oct 21, 2020

robertfeldt commented Oct 21, 2020

matthieugomez commented Oct 21, 2020 •

edited

Loading

robertfeldt commented Oct 23, 2020

robertfeldt commented Oct 23, 2020

robertfeldt commented Oct 23, 2020 •

edited

Loading

matthieugomez commented Oct 24, 2020

matthieugomez commented Oct 24, 2020

robertfeldt commented Oct 25, 2020

Speeding up qgram distances with pre-counting of qgrams #34

Speeding up qgram distances with pre-counting of qgrams #34

Comments

robertfeldt commented Oct 20, 2020

robertfeldt commented Oct 21, 2020

robertfeldt commented Oct 21, 2020

matthieugomez commented Oct 21, 2020 • edited Loading

robertfeldt commented Oct 23, 2020

robertfeldt commented Oct 23, 2020

robertfeldt commented Oct 23, 2020 • edited Loading

matthieugomez commented Oct 24, 2020

matthieugomez commented Oct 24, 2020

robertfeldt commented Oct 25, 2020

matthieugomez commented Oct 21, 2020 •

edited

Loading

robertfeldt commented Oct 23, 2020 •

edited

Loading