Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pairwise for calculating distance matrices #38

Merged
merged 15 commits into from
Nov 9, 2020

Conversation

robertfeldt
Copy link
Collaborator

Adds pairwise methods similar to Distances.pairwise for calculating distance matrices. Both symmetric and asymmetric modes are supported and multiple threads should be used if available. Pre-calculation is used by default if there are more than 5 objects to be compared.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 97.753% when pulling ff1daea on robertfeldt:master into 6b32f2d on matthieugomez:master.

@coveralls
Copy link

coveralls commented Nov 7, 2020

Coverage Status

Coverage increased (+0.1%) to 98.015% when pulling 9267d07 on robertfeldt:master into 6b32f2d on matthieugomez:master.

@robertfeldt
Copy link
Collaborator Author

Now also added docs for pairwise.

@robertfeldt
Copy link
Collaborator Author

I'm seeing quite healthy speedups from precalculation and also from multithreading:

$ julia -t 2 test/performance/pairwise.jl 1000 1000
For 2 threads and 1000 strings of max length 1000:
  - time WITHOUT pre-calculation: 72.019872545
  - time WITH    pre-calculation: 2.746832681
  - speedup with pre-calculation: 26.219

$ julia -t 2 test/performance/pairwise.jl 100 1000
For 2 threads and 100 strings of max length 1000:
  - time WITHOUT pre-calculation: 0.544617846
  - time WITH    pre-calculation: 0.034407444
  - speedup with pre-calculation: 15.828

$ julia test/performance/pairwise.jl 100 1000
For 1 threads and 100 strings of max length 1000:
  - time WITHOUT pre-calculation: 1.026
  - time WITH    pre-calculation: 0.058
  - speedup with pre-calculation: 17.7

Copy link
Owner

@matthieugomez matthieugomez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That looks great. I have written a few small comments.

Btw, all edit distances (and maybe qgram distances?) compute the length of the String at some point (this is done in reorder, which returns a tuple of StringWithLength). One thing that would be useful is to see whether, when X and Y are AbstractArray{<:AbstractString}, is it worth preprocessing them using StringWithLength to compute length once and for all? Of course, not required for me to merge this.

src/pairwise.jl Outdated
_allocmatrix(X, Y, T) = Matrix{T}(undef, length(X), length(Y))
_allocmatrix(X, T) = Matrix{T}(undef, length(X), length(X))

import Distances: pairwise
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove import Distances: pairwise, and just do function Distances.pairwise() when defining the function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yes will do. For my own understanding, is there a benefit to doing it this way?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure the doc string is registered in the correct way after I changed this. Please check and see if I got it wrong...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed now.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No benefit — it's just how I do it in the rest of the package. I personally find it clearer.

src/pairwise.jl Outdated

@doc """
pairwise(dist::StringDistance, itr; eltype = Float64, precalc = nothing)
pairwise(dist::StringDistance, itr1, itr2; eltype = Float64, precalc = nothing)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preprocessing instead or precalc? I think verbose is better, especially for options that are not used that often

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer preprocess so changed to that. Ok?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thanks

src/pairwise.jl Outdated
import Distances: pairwise

@doc """
pairwise(dist::StringDistance, itr; eltype = Float64, precalc = nothing)
Copy link
Owner

@matthieugomez matthieugomez Nov 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a nthreads = Threads.nthreads() option to allow users to change the number of threads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but I'm not sure how to control the number of threads used after Julia starts. We would have to manually spawn then or what do you propose?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, nevermind. You cannot change the number of threads when using @threads.

@robertfeldt
Copy link
Collaborator Author

Ok, I think I fixed most of the things based on your comments but a few additional questions from me above.

It doesn't sound to me like the preprocessing by calling length once and for all would make much difference; isn't that time very small compared to the actual calculation of the edit distances?

For QGramDistances you are right and the plan is to fix this directly on the implementation for each distance, I think we discussed adding, say, an init function on the counters so that they can extract the number of qgrams once and for all, upfront, and then would not need to count them as they loop. However, for my next PR I plan to generalize the QGramDistances to DictionaryDistances since they really can take any dictionary as input, having for example variable-length q-gram "words". This is quite often used nowadays and would allow for example Lempel-Ziv based dictionaries which, at least in theory, could be better than fixed-length ones. Then we can add for example Normalized Dictionary Distance and other dictionary distances that have been almost as good as normalized compression distances while not needing to actually compress strings. And their performance could be very close to the current QGramDistances for fixed-length dictionaries/q-grams (ok, the Lempel-ziv will likely take somewhat longer than creating fixed-length dictionaires, but not by much).

@robertfeldt
Copy link
Collaborator Author

Ran some performance testing with different number of threads on a 16 vCPU machine and it seems that with preprocessing there are benefits all the way up to 16 threads but without preprocessing the performance tapers off already for 8 threads. Not really sure why this happens but in general, preprocessing is a good default it seems:

$ julia -t 1 test/performance/pairwise.jl 500 1000
Creating 500 random strings.
Saving cache file with 500 strings: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 1 threads and 500 strings of max length 1000:
  - time WITHOUT pre-calculation: 16.129
  - time WITH    pre-calculation: 1.216
  - speedup with pre-calculation: 13.3

$ julia -t 2 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 2 threads and 500 strings of max length 1000:
  - time WITHOUT pre-calculation: 8.181
  - time WITH    pre-calculation: 0.642
  - speedup with pre-calculation: 12.7

$ julia -t 4 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 4 threads and 500 strings of max length 1000:
  - time WITHOUT pre-calculation: 4.472
  - time WITH    pre-calculation: 0.386
  - speedup with pre-calculation: 11.6

$ julia -t 8 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 8 threads and 500 strings of max length 1000:
  - time WITHOUT pre-calculation: 7.236
  - time WITH    pre-calculation: 0.251
  - speedup with pre-calculation: 28.8

$ julia -t 12 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 12 threads and 500 strings of max length 1000:
  - time WITHOUT pre-calculation: 11.343
  - time WITH    pre-calculation: 0.204
  - speedup with pre-calculation: 55.7

$ julia -t 16 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 16 threads and 500 strings of max length 1000:
  - time WITHOUT pre-calculation: 15.288
  - time WITH    pre-calculation: 0.189
  - speedup with pre-calculation: 81.1
```

@matthieugomez
Copy link
Owner

matthieugomez commented Nov 9, 2020

Thanks for this! Computing length requires to iterate the whole string in Julia (i.e. O(n) rather than O(1)), so it may be a not negligible cost. Anyway, one could do that in another pull request in the future.

@matthieugomez matthieugomez merged commit a0e5347 into matthieugomez:master Nov 9, 2020
@matthieugomez
Copy link
Owner

Ok, the length time seems negligible, so not worth it. I updated a bit the code to make it simpler for me. I hope it's ok for you too.

@robertfeldt
Copy link
Collaborator Author

Looks great thanks.

Just realized that since we need to use length to decide the dims of the resulting matrix it might not make sense to allow any iterator as argument (since we would in general not know their length). But, in practice this should rarely be a problem so leave as is?

Maybe we also want to add a brief doc section to the readme to mention pairwise?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants