pairwise for calculating distance matrices #38
Now also added docs for |
I'm seeing quite healthy speedups from precalculation and also from multithreading:
$ julia -t 2 test/performance/pairwise.jl 1000 1000
For 2 threads and 1000 strings of max length 1000:
- time WITHOUT pre-calculation: 72.019872545
- time WITH pre-calculation: 2.746832681
- speedup with pre-calculation: 26.219
$ julia -t 2 test/performance/pairwise.jl 100 1000
For 2 threads and 100 strings of max length 1000:
- time WITHOUT pre-calculation: 0.544617846
- time WITH pre-calculation: 0.034407444
- speedup with pre-calculation: 15.828
$ julia test/performance/pairwise.jl 100 1000
For 1 threads and 100 strings of max length 1000:
- time WITHOUT pre-calculation: 1.026
- time WITH pre-calculation: 0.058
- speedup with pre-calculation: 17.7
Thanks! That looks great. I have written a few small comments.
Btw, all edit distances (and maybe qgram distances?) compute the length of the String at some point (this is done in `reorder`, which returns a tuple of `StringWithLength`). One thing that would be useful to check: when X and Y are `AbstractArray{<:AbstractString}`, is it worth preprocessing them with `StringWithLength` to compute the length once and for all? Of course, this is not required for me to merge this.
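The idea of computing the length once per string can be sketched as follows. This is only an illustration: `CachedLengthString` and `precompute_lengths` are hypothetical names, not the package's `StringWithLength`, and the sketch assumes the inputs are held in a vector.

```julia
# Hypothetical wrapper (not the package's own StringWithLength): stores the
# character count next to the string so it is computed only once.
struct CachedLengthString{S<:AbstractString}
    s::S
    len::Int
end

# Convenience constructor: pays the O(n) character count a single time.
CachedLengthString(s::AbstractString) = CachedLengthString(s, length(s))

# Constant-time lookup instead of another scan over the UTF-8 data.
Base.length(c::CachedLengthString) = c.len

# Wrap every element once, before any pairwise loop starts.
precompute_lengths(xs::AbstractVector{<:AbstractString}) = CachedLengthString.(xs)

words = ["kitten", "sitting", "naïve"]
wrapped = precompute_lengths(words)
```

With n strings, a pairwise loop touches each string about n times, so the wrapper replaces roughly n length scans per string with one. Whether that matters next to the cost of the distance itself is the open question in this thread.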
src/pairwise.jl
Outdated
_allocmatrix(X, Y, T) = Matrix{T}(undef, length(X), length(Y))
_allocmatrix(X, T) = Matrix{T}(undef, length(X), length(X))

import Distances: pairwise
Can you remove `import Distances: pairwise`, and just do `function Distances.pairwise()` when defining the function?
Ok, yes will do. For my own understanding, is there a benefit to doing it this way?
Hmm, not sure the doc string is registered in the correct way after I changed this. Please check and see if I got it wrong...
Should be fixed now.
No benefit; it's just how I do it in the rest of the package. I personally find it clearer.
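For readers unfamiliar with the two styles discussed here, a small sketch may help. It uses a stand-in module named `Upstream` and a toy type `CharCountDistance`, both hypothetical, so that it runs without installing Distances.

```julia
# Stand-in for an upstream package such as Distances (hypothetical module).
module Upstream
    pairwise(f, xs) = [f(x, y) for x in xs, y in xs]
end

# Hypothetical toy distance type, used only to have something to dispatch on.
struct CharCountDistance end

# Style 1: bring the name into scope with `import`, then extend it unqualified.
#   import .Upstream: pairwise
#   pairwise(d::CharCountDistance, xs) = ...

# Style 2: no import; qualify the function name at the definition site.
function Upstream.pairwise(d::CharCountDistance, xs)
    return [abs(length(x) - length(y)) for x in xs, y in xs]
end

result = Upstream.pairwise(CharCountDistance(), ["a", "abc"])
```

Both styles add a method to the same function. The qualified form makes it visible at the definition which module owns the function being extended.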
src/pairwise.jl
Outdated
@doc """
pairwise(dist::StringDistance, itr; eltype = Float64, precalc = nothing)
pairwise(dist::StringDistance, itr1, itr2; eltype = Float64, precalc = nothing)
`preprocessing` instead of `precalc`? I think verbose is better, especially for options that are not used that often.
I prefer `preprocess`, so I changed it to that. Ok?
yes, thanks
src/pairwise.jl
Outdated
import Distances: pairwise

@doc """
pairwise(dist::StringDistance, itr; eltype = Float64, precalc = nothing)
Could you add an `nthreads = Threads.nthreads()` option to allow users to change the number of threads?
Ok, but I'm not sure how to control the number of threads used after Julia starts. We would have to manually spawn them, or what do you propose?
Right, never mind. You cannot change the number of threads when using `@threads`.
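For reference, the "manually spawn" alternative mentioned in this exchange could look roughly like the sketch below. It is not part of the PR: `chunked_map` and its `ntasks` keyword are hypothetical, and the keyword limits how many tasks run concurrently rather than changing the number of threads Julia was started with.

```julia
using Base.Threads: @spawn

# Hypothetical helper: apply `f` to every element of `xs`, splitting the work
# into at most `ntasks` chunks that are each processed by one spawned task.
function chunked_map(f, xs::AbstractVector; ntasks::Integer = Threads.nthreads())
    isempty(xs) && return map(f, xs)
    chunksize = cld(length(xs), max(1, ntasks))
    chunks = Iterators.partition(xs, chunksize)
    # One task per chunk; the scheduler places tasks on available threads.
    tasks = map(chunks) do chunk
        @spawn map(f, chunk)
    end
    # Chunks are fetched in order, so the result keeps the input order.
    return reduce(vcat, fetch.(tasks))
end

lengths = chunked_map(length, ["a", "bb", "ccc", "dddd"]; ntasks = 2)
```

The cost of this approach is extra code for chunking and result assembly, which `@threads` handles implicitly.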
Ok, I think I fixed most of the things based on your comments, but I have a few additional questions above. It doesn't sound to me like preprocessing by calling length once and for all would make much difference; isn't that time very small compared to the actual calculation of the edit distances? For QGramDistances you are right, and the plan is to fix this directly in the implementation of each distance. I think we discussed adding, say, an |
Ran some performance tests with different numbers of threads on a 16 vCPU machine. With preprocessing there are benefits all the way up to 16 threads, but without preprocessing the performance already tapers off at 8 threads. Not really sure why this happens, but in general preprocessing seems to be a good default:
$ julia -t 1 test/performance/pairwise.jl 500 1000
Creating 500 random strings.
Saving cache file with 500 strings: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 1 threads and 500 strings of max length 1000:
- time WITHOUT pre-calculation: 16.129
- time WITH pre-calculation: 1.216
- speedup with pre-calculation: 13.3
$ julia -t 2 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 2 threads and 500 strings of max length 1000:
- time WITHOUT pre-calculation: 8.181
- time WITH pre-calculation: 0.642
- speedup with pre-calculation: 12.7
$ julia -t 4 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 4 threads and 500 strings of max length 1000:
- time WITHOUT pre-calculation: 4.472
- time WITH pre-calculation: 0.386
- speedup with pre-calculation: 11.6
$ julia -t 8 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 8 threads and 500 strings of max length 1000:
- time WITHOUT pre-calculation: 7.236
- time WITH pre-calculation: 0.251
- speedup with pre-calculation: 28.8
$ julia -t 12 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 12 threads and 500 strings of max length 1000:
- time WITHOUT pre-calculation: 11.343
- time WITH pre-calculation: 0.204
- speedup with pre-calculation: 55.7
$ julia -t 16 test/performance/pairwise.jl 500 1000
Read 500 strings from cache file: /home/ubuntu/.julia/packages/StringDistances/j8PlO/test/performance/perfteststrings_1000.juliabin
For 16 threads and 500 strings of max length 1000:
- time WITHOUT pre-calculation: 15.288
- time WITH pre-calculation: 0.189
- speedup with pre-calculation: 81.1
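To make the scaling in these runs easier to read, the reported wall-clock times can be turned into speedups relative to the single-threaded run of the same variant. The snippet below only reuses the numbers printed above; the variable names are mine.

```julia
# Wall-clock times in seconds, copied from the benchmark output above.
threads   = [1, 2, 4, 8, 12, 16]
t_without = [16.129, 8.181, 4.472, 7.236, 11.343, 15.288]
t_with    = [1.216, 0.642, 0.386, 0.251, 0.204, 0.189]

# Speedup relative to the single-threaded run of the same variant.
scaling(times) = times[1] ./ times

for (n, a, b) in zip(threads, scaling(t_without), scaling(t_with))
    println(rpad("threads = $n", 14),
            "without: ", round(a; digits = 2), "x   ",
            "with: ", round(b; digits = 2), "x")
end
```

Read this way, the variant without preprocessing peaks at 4 threads (about 3.6x over one thread) and then gets slower again, while the variant with preprocessing keeps improving up to 16 threads (about 6.4x over one thread).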
Thanks for this! Computing length requires iterating over the whole string in Julia (i.e. |
Ok, the |
Looks great thanks. Just realized that since we need to use length to decide the dims of the resulting matrix it might not make sense to allow any iterator as argument (since we would in general not know their length). But, in practice this should rarely be a problem so leave as is? Maybe we also want to add a brief doc section to the readme to mention |
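One possible way to handle iterators of unknown length, should it ever become a problem, is to consult Julia's `Base.IteratorSize` trait and collect only when needed. This is a sketch of that idea, not something the PR does; `materialize` is a hypothetical helper name.

```julia
# Hypothetical helper: return something with a known length, collecting only
# when the iterator cannot report its size up front.
function materialize(itr)
    size_trait = Base.IteratorSize(itr)
    if size_trait isa Base.IsInfinite
        throw(ArgumentError("cannot build a distance matrix from an infinite iterator"))
    elseif size_trait isa Union{Base.HasLength, Base.HasShape}
        return itr            # length(itr) is already available
    end
    return collect(itr)       # SizeUnknown: pay one pass to learn the length
end

sized   = materialize(["a", "b", "c"])
unsized = materialize(Iterators.filter(!isempty, ["a", "", "c"]))
```

Note that an iterator with a known length is returned as-is, so it is not guaranteed to support indexing; a real implementation might prefer to collect everything that is not an `AbstractArray`.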
Adds `pairwise` methods similar to `Distances.pairwise` for calculating distance matrices. Both symmetric and asymmetric modes are supported and multiple threads should be used if available. Pre-calculation is used by default if there are more than 5 objects to be compared.
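A minimal sketch of the symmetric mode described here is given below, under stated assumptions. It is not the PR's implementation: `pairwise_sketch`, `toy_distance`, and the `threshold` keyword are hypothetical, and the distance is a toy that accepts both raw strings and preprocessed character vectors.

```julia
using Base.Threads: @threads

# Toy distance for illustration only: mismatching positions plus the
# difference in length. Works on strings and on vectors of characters.
toy_distance(a, b) =
    count(((x, y),) -> x != y, zip(a, b)) + abs(length(a) - length(b))

# Hypothetical sketch. `preprocess` is applied once per element before the
# O(n^2) loop, but only when there are more than `threshold` elements.
function pairwise_sketch(dist, xs::AbstractVector; preprocess = identity, threshold = 5)
    items = length(xs) > threshold ? map(preprocess, xs) : xs
    n = length(items)
    R = zeros(Float64, n, n)   # diagonal stays at zero
    @threads for j in 1:n
        for i in (j + 1):n
            # Symmetric mode: compute each pair once and mirror it.
            R[i, j] = R[j, i] = dist(items[i], items[j])
        end
    end
    return R
end

xs = ["abc", "abd", "a", "abcd", "xyz", "abc", "b"]
R = pairwise_sketch(toy_distance, xs; preprocess = collect)
```

Each pair (i, j) with i > j is written by exactly one loop iteration, so the threads never write to the same matrix cell.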