Add efficient SparseVector method for some metrics #235
Conversation
The new method calls a method `_binary_map_reduce1`, which is also introduced in this PR. It is general enough to handle other similar use cases.
In the original PR, two useless loops are executed. I'll simplify the function.
This method is only for the non-weighted metrics. It's not clear what a sparse version of weighted metrics would look like.
I added support for
Codecov Report

```
@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   97.39%   97.57%   +0.17%
==========================================
  Files           8        8
  Lines         806      865      +59
==========================================
+ Hits          785      844      +59
  Misses         21       21
==========================================
```

Continue to review full report at Codecov.
These tests just convert dense matrices to sparse. EDIT: the following is done.
If you have a sparse matrix `m` and take `@view m[:, i]`, the result is a `SparseVectorView`, which is also made efficient by the routines in this PR.
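A quick illustration (the sizes and density here are arbitrary, not from the PR):

```julia
using SparseArrays, Distances

m = sprand(1_000, 4, 0.01)   # random sparse matrix with ~1% stored entries
v = @view m[:, 1]            # a column view; no copy is made
w = @view m[:, 2]

euclidean(v, w)              # dispatches to the sparse-aware routines
```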
For nice code coverage, can you add a quick test with two sparse vectors of (i) different lengths and (ii) both of length 0, please? That should yield 100% diff coverage, and then we're ready to go IMO.
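Something like this should cover both cases (a sketch; I'm assuming mismatched lengths throw a `DimensionMismatch`, as the dense methods do, and that the distance between two empty vectors is zero):

```julia
using Test, SparseArrays, Distances

# (i) different lengths: should throw like the dense code paths
@test_throws DimensionMismatch euclidean(sprand(5, 0.5), sprand(6, 0.5))

# (ii) both of length 0: empty reduction, distance zero
@test euclidean(spzeros(0), spzeros(0)) == 0.0
```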
Just a few (minor style) comments.
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>
Shall we leave a comment about the assumptions underlying the sparse methods?
Yes, I think this is correct.
This PR makes the following efficient for `SparseVector`, partially addressing #5:

- `bhattacharyya`, `bhattacharyya_coeff`
- `HellingerDist`, which follows directly from the previous item
- `UnionMetrics`, for example `euclidean(v1, v2)`. Specifically, a method for `_evaluate(d::UnionMetrics, a::SparseVector, b::SparseVector, ::Nothing)` is included.

EDIT: The method `_binary_map_reduce1` mentioned below has been removed. The implementation has been simplified.

The new method calls a method `_binary_map_reduce1`, which is also introduced in this PR. It is general enough to handle other similar use cases. I appended the `1` in anticipation of several of these methods, in analogy to these functions in `SparseArrays` in the stdlib. Perhaps removing the `1` from the name until it is needed would be a good idea.

The function `_binary_map_reduce1` works at least if `f(0, 0) == 0` and `op(v, 0) == v`, as well as in some other cases noted in the code.

The numbered functions linked to above are for binary maps, not for binary mapreduce, and the reason for implementing numbered variants is different there: an output vector is built, and the functions are split according to which inputs can produce a zero.

Which functions like `_binary_map_reduce1` to implement, and where to put them, is to be decided. It may make sense to settle on some generally useful binary mapreduce helper functions and put them in `SparseArrays`. In the short term, I put one of them here. They are not in the API, so they can easily be moved to another repo later.
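To make those assumptions concrete, here is a minimal sketch of a binary mapreduce over two sparse vectors. `binary_mapreduce_sketch` is a hypothetical name, not the implementation from this PR (which was removed); it shows why `f(0, 0) == 0` and `op(v, 0) == v` let all index positions stored in neither vector be skipped.

```julia
using SparseArrays

function binary_mapreduce_sketch(f, op, a::SparseVector, b::SparseVector, init)
    length(a) == length(b) || throw(DimensionMismatch("lengths must match"))
    ainds, binds = SparseArrays.nonzeroinds(a), SparseArrays.nonzeroinds(b)
    avals, bvals = nonzeros(a), nonzeros(b)
    s = init
    i = j = 1
    # Walk only the stored entries of both vectors in index order; positions
    # stored in neither vector contribute f(0, 0) == 0, and op(s, 0) == s,
    # so they can be skipped entirely.
    while i <= length(ainds) && j <= length(binds)
        if ainds[i] == binds[j]                       # stored in both
            s = op(s, f(avals[i], bvals[j])); i += 1; j += 1
        elseif ainds[i] < binds[j]                    # stored only in a
            s = op(s, f(avals[i], zero(eltype(b)))); i += 1
        else                                          # stored only in b
            s = op(s, f(zero(eltype(a)), bvals[j])); j += 1
        end
    end
    while i <= length(ainds)                          # remaining entries of a
        s = op(s, f(avals[i], zero(eltype(b)))); i += 1
    end
    while j <= length(binds)                          # remaining entries of b
        s = op(s, f(zero(eltype(a)), bvals[j])); j += 1
    end
    return s
end
```

For example, `sqrt(binary_mapreduce_sketch((x, y) -> abs2(x - y), +, a, b, 0.0))` computes the same value as `euclidean(a, b)` while touching only stored entries.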
I also tried an alternative based on `sum`. This is also efficient for `SparseVector`, because `sum` is efficient for a single `SparseVector`. Yesterday I thought I had found some cases in which it was faster than using `_binary_map_reduce1`, but I can't find them now. However, it is often within 10 or 20 percent as fast. It is much less complex, which makes it attractive. Also, `sum` uses pairwise summation, which is more accurate in general. Has Distances.jl consciously opted for speed over correctness? I can understand making this choice for image processing.

See also https://github.com/JuliaLang/julia/issues/43248
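For illustration, a minimal sketch of what such a `sum`-based variant could look like; `sparse_euclidean_alt` is a hypothetical name, and the function actually tried in the PR isn't shown above:

```julia
using SparseArrays, Distances

# Euclidean distance as a single reduction: `a - b` is itself a SparseVector,
# and `sum` over it uses pairwise summation, which is generally more accurate
# than naive left-to-right accumulation.
sparse_euclidean_alt(a::SparseVector, b::SparseVector) = sqrt(sum(abs2, a - b))

a, b = sprand(10_000, 0.001), sprand(10_000, 0.001)
sparse_euclidean_alt(a, b) ≈ euclidean(a, b)   # same value up to rounding
```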