
Add efficient SparseVector method for some metrics #235

Merged
merged 12 commits into from
Dec 2, 2021

Conversation

jlapeyre
Contributor

@jlapeyre jlapeyre commented Nov 28, 2021

This PR makes the following efficient for SparseVector, partially addressing #5.

  • bhattacharyya and bhattacharyya_coeff
  • HellingerDist, which follows directly from the previous item
  • non-weighted UnionMetrics, for example euclidean(v1, v2). Specifically, a method _evaluate(d::UnionMetrics, a::SparseVector, b::SparseVector, ::Nothing) is included.
julia> n = 1000; dens = 0.1; v1 = sprand(n, dens); v2 = sprand(n, dens);

julia> @btime bhattacharyya($v1, $v2)  # before this PR
  14.877 μs (0 allocations: 0 bytes)
2.098185998707887

julia> @btime bhattacharyya($v1, $v2)  # after this PR
  240.100 ns (0 allocations: 0 bytes)
2.098185998707887

julia> n = 100; dens = 0.1; v1 = sprand(n, dens); v2 = sprand(n, dens);

julia> @btime bhattacharyya($v1, $v2) # before this PR
  714.130 ns (0 allocations: 0 bytes)
1.251741974341252

julia> @btime bhattacharyya($v1, $v2)  # after this PR
  45.974 ns (0 allocations: 0 bytes)
1.251741974341252

EDIT: The method _binary_map_reduce1 mentioned below has been removed. The implementation has been simplified.

  • The new method calls a method _binary_map_reduce1, which is also introduced in this PR. It is general enough to handle other similar use cases. I appended the 1 in anticipation of several such methods, in analogy to the numbered functions in SparseArrays in the stdlib. Perhaps it would be a good idea to drop the 1 from the name until it is needed.

  • The function _binary_map_reduce1 works correctly at least when f(0, 0) == 0 and op(v, 0) == v hold, as well as in some other cases noted in the code.

  • The numbered functions linked to above are for binary maps, not for binary mapreduce. The reasons for implementing variants (those with numbers appended) are different: in the former case, an output vector is built, and the functions are split according to which inputs can produce a zero.

  • Which functions like _binary_map_reduce1 to implement, and where to put them, is still to be decided. It may make sense to settle on some generally useful binary mapreduce helper functions and put them in SparseArrays. In the short term, I put one of them here. They are not part of the API, so they can easily be moved to another repo later.
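To make the discussion concrete, here is a hedged sketch of the kind of binary mapreduce helper described above. The name sparse_binary_mapreduce is hypothetical and this is not the PR's actual code; it walks the sorted stored-index lists of both vectors and skips indices stored in neither, which is valid when op(s, f(0, 0)) == s, and holds in particular if f(0, 0) == 0 and op(s, 0) == s.

```julia
using SparseArrays

# Hypothetical sketch (names are not the PR's): a binary mapreduce over two
# SparseVectors that never visits indices where both entries are implicit zeros.
function sparse_binary_mapreduce(f, op, s0, a::SparseVector, b::SparseVector)
    length(a) == length(b) || throw(DimensionMismatch("vector lengths differ"))
    ia, va = findnz(a)   # sorted stored indices and values of a
    ib, vb = findnz(b)
    za, zb = zero(eltype(a)), zero(eltype(b))
    s = s0
    i = j = 1
    while i <= length(ia) && j <= length(ib)
        if ia[i] == ib[j]                 # stored in both vectors
            s = op(s, f(va[i], vb[j])); i += 1; j += 1
        elseif ia[i] < ib[j]              # stored only in a
            s = op(s, f(va[i], zb)); i += 1
        else                              # stored only in b
            s = op(s, f(za, vb[j])); j += 1
        end
    end
    while i <= length(ia)                 # remaining tail of a
        s = op(s, f(va[i], zb)); i += 1
    end
    while j <= length(ib)                 # remaining tail of b
        s = op(s, f(za, vb[j])); j += 1
    end
    return s
end
```

For example, sparse_binary_mapreduce((x, y) -> sqrt(x * y), +, 0.0, v1, v2) computes the cross term of the Bhattacharyya coefficient, touching only the stored entries.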

I also tried this function

_bhattacharyya_coeff(x, y) = (sum(sqrt, x .* y), sum(x), sum(y))

This is also efficient for SparseVector because sum is efficient for a single SparseVector. Yesterday I thought I had found some cases in which it was faster than using _binary_map_reduce1, but I can't reproduce them now. However, it is often within 10 or 20 percent of the same speed. It is much less complex, which makes it attractive. Also, sum uses pairwise summation, which is more accurate in general. Has Distances.jl consciously opted for speed over accuracy? I can understand making that choice for image processing.
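For reference, here is a hedged sketch of how the triple returned by this one-liner assembles into a Bhattacharyya coefficient and distance. The name bhattacharyya_sketch and the assembly code are illustrative assumptions, not Distances.jl's actual internals.

```julia
# Illustrative only; the assembly below is an assumption about how the triple
# is used, not Distances.jl's actual internal code.
_bhattacharyya_coeff(x, y) = (sum(sqrt, x .* y), sum(x), sum(y))

function bhattacharyya_sketch(x, y)
    sqab, asum, bsum = _bhattacharyya_coeff(x, y)
    coeff = sqab / sqrt(asum * bsum)   # normalized Bhattacharyya coefficient
    return -log(coeff)                 # Bhattacharyya distance
end
```

Because sum and broadcasting both have sparse-aware methods, the same code serves dense vectors and SparseVectors.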

See also https://github.com/JuliaLang/julia/issues/43248

The new method calls a method _binary_map_reduce1 which is also introduced
in this PR. It is general enough to handle other similar use cases.
@jlapeyre
Contributor Author

In the original PR, two useless loops are executed in _binary_map_reduce1. I think it's better to use a more specialized mapreduce helper function in order to avoid this. The compiler has enough information to replace the repeated one_nonz_count += 1 with a single addition, but I doubt it does so (@code_llvm shows the loops and operations), so the loops likely run.

I'll simplify the function.
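The point about the redundant tail loops can be illustrated with a toy example (hypothetical code, not from the PR): a loop whose body only increments a counter is equivalent to a single closed-form addition, but the compiler is not guaranteed to perform that rewrite, so the simplified implementation should do it explicitly.

```julia
# Loop form: repeatedly executes count += 1, analogous to the tail loops
# in the original _binary_map_reduce1.
function count_tail_loop(i, n)
    count = 0
    while i <= n
        count += 1
        i += 1
    end
    return count
end

# Equivalent single addition, which the simplified code can use directly
# instead of relying on the compiler to collapse the loop.
count_tail_direct(i, n) = max(n - i + 1, 0)
```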

@jlapeyre jlapeyre changed the title Add efficient SparseVector method for _bhattacharyya_coefficient [WIP] Add efficient SparseVector method for _bhattacharyya_coefficient Nov 28, 2021
This method is only for the non-weighted metrics. It's not clear
what a sparse version of weighted metrics would look like.
@jlapeyre
Contributor Author

I added support for UnionMetrics. I tested most of the metrics at the cli for efficiency. I added a test in the test suite.

@jlapeyre jlapeyre changed the title [WIP] Add efficient SparseVector method for _bhattacharyya_coefficient [WIP] Add efficient SparseVector method for some metrics Nov 29, 2021
@codecov-commenter
codecov-commenter commented Nov 29, 2021

Codecov Report

Merging #235 (93fa138) into master (9e23809) will increase coverage by 0.17%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   97.39%   97.57%   +0.17%     
==========================================
  Files           8        8              
  Lines         806      865      +59     
==========================================
+ Hits          785      844      +59     
  Misses         21       21              
Impacted Files Coverage Δ
src/Distances.jl 100.00% <ø> (ø)
src/bhattacharyya.jl 98.21% <100.00%> (+0.99%) ⬆️
src/metrics.jl 96.90% <100.00%> (+0.29%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jlapeyre
Contributor Author

jlapeyre commented Nov 29, 2021

These tests just convert dense matrices to sparse. More tests are needed to exercise the code paths for different numbers of non-zeros. (EDIT: this has been done.)

@jlapeyre jlapeyre force-pushed the sparse-bhattacharyya branch from df239f4 to 1517138 Compare November 30, 2021 01:09
@jlapeyre jlapeyre changed the title [WIP] Add efficient SparseVector method for some metrics Add efficient SparseVector method for some metrics Nov 30, 2021
@jlapeyre jlapeyre changed the title Add efficient SparseVector method for some metrics Add efficient *SparseVector method for some metrics Nov 30, 2021
@jlapeyre jlapeyre changed the title Add efficient *SparseVector method for some metrics Add efficient SparseVector method for some metrics Nov 30, 2021
If you have a sparse matrix m and take @view m[:, i], the
result is a SparseVectorView, which also is made efficient
by the routines in this PR.
@dkarrasch
Member

dkarrasch commented Dec 2, 2021

For nice code coverage, can you add a quick test with two sparse vectors of (i) different lengths and (ii) both of length 0, please? That should yield 100% diff coverage, and then we're ready to go IMO.

@jlapeyre
Contributor Author

jlapeyre commented Dec 2, 2021

> different lengths

You mean different lengths so that an error is thrown, right?

Member

@dkarrasch dkarrasch left a comment


Just a few (minor style) comments.

jlapeyre and others added 3 commits December 2, 2021 08:29
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>
@dkarrasch
Member

Shall we leave a comment about assumptions underlying the sparse UnionMetric implementation? AFAIU, the only assumption is that eval_reduce(d, s, eval_op(d, 0, 0)) == s, right? Because you jump over zero pairs.

@jlapeyre
Contributor Author

jlapeyre commented Dec 2, 2021

> the only assumption is that eval_reduce(d, s, eval_op(d, 0, 0)) == s

Yes, I think this is correct.
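As an illustration of this assumption (eval_op and eval_reduce are internal Distances.jl helpers; the Euclidean-style stand-ins below are assumptions for illustration, not the package's actual code):

```julia
# Euclidean-style stand-ins for the internal Distances.jl helpers
# (assumed definitions, for illustration only).
eval_op_sketch(a, b) = abs2(a - b)    # per-element contribution
eval_reduce_sketch(s, v) = s + v      # accumulate contributions

# Skipping index pairs where both entries are zero is safe exactly when
# folding in eval_op(0, 0) leaves the accumulator unchanged:
is_skip_safe(s) = eval_reduce_sketch(s, eval_op_sketch(0.0, 0.0)) == s
```

Here eval_op_sketch(0.0, 0.0) == 0.0 and adding it leaves any accumulator unchanged, so skipping zero pairs is safe for this metric.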

@dkarrasch dkarrasch merged commit 91f51b5 into JuliaStats:master Dec 2, 2021