
Add efficient SparseVector method for some metrics #235

Merged
merged 12 commits into from
Dec 2, 2021

Conversation

jlapeyre
Contributor

@jlapeyre jlapeyre commented Nov 28, 2021

This PR makes the following efficient for SparseVector, partially addressing #5.

  • bhattacharyya and bhattacharyya_coeff
  • HellingerDist, which follows directly from the previous item
  • non-weighted UnionMetrics, for example euclidean(v1, v2). Specifically, a method _evaluate(d::UnionMetrics, a::SparseVector, b::SparseVector, ::Nothing) is included.
julia> n = 1000; dens = 0.1; v1 = sprand(n, dens); v2 = sprand(n, dens);

julia> @btime bhattacharyya($v1, $v2)  # before this PR
  14.877 μs (0 allocations: 0 bytes)
2.098185998707887

julia> @btime bhattacharyya($v1, $v2)  # after this PR
  240.100 ns (0 allocations: 0 bytes)
2.098185998707887

julia> n = 100; dens = 0.1; v1 = sprand(n, dens); v2 = sprand(n, dens);

julia> @btime bhattacharyya($v1, $v2) # before this PR
  714.130 ns (0 allocations: 0 bytes)
1.251741974341252

julia> @btime bhattacharyya($v1, $v2)  # after this PR
  45.974 ns (0 allocations: 0 bytes)
1.251741974341252

EDIT: The method _binary_map_reduce1 mentioned below has been removed. The implementation has been simplified.

  • The new method calls a method _binary_map_reduce1, which is also introduced in this PR. It is general enough to handle other similar use cases. I appended the 1 in anticipation of several such methods, in analogy to the numbered functions in SparseArrays in the stdlib. Perhaps it would be a good idea to drop the 1 from the name until it is needed.

  • The function _binary_map_reduce1 works correctly at least when f(0, 0) == 0 and op(v, 0) == v hold, as well as in some other cases noted in the code.

  • The numbered functions linked to above are for binary maps, not for binary mapreduce. The reasons for implementing variants (those with numbers appended) are different: in the former case, an output vector is built, and the functions are split according to which inputs can produce a zero.

  • Which functions like _binary_map_reduce1 to implement, and where to put them, is still to be decided. It may make sense to settle on some generally useful binary mapreduce helper functions and put them in SparseArrays. In the short term, I put one of them here. They are not part of the API, so they can easily be moved to another repo later.
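To make the discussion concrete, here is a hedged sketch of the kind of binary mapreduce helper described above. The name sparse_binary_mapreduce is hypothetical and this is not the PR's actual code; it walks the sorted stored-index lists of both vectors and skips indices stored in neither, which is valid when op(s, f(0, 0)) == s, and holds in particular if f(0, 0) == 0 and op(s, 0) == s.

```julia
using SparseArrays

# Hypothetical sketch (names are not the PR's): a binary mapreduce over two
# SparseVectors that never visits indices where both entries are implicit zeros.
function sparse_binary_mapreduce(f, op, s0, a::SparseVector, b::SparseVector)
    length(a) == length(b) || throw(DimensionMismatch("vector lengths differ"))
    ia, va = findnz(a)   # sorted stored indices and values of a
    ib, vb = findnz(b)
    za, zb = zero(eltype(a)), zero(eltype(b))
    s = s0
    i = j = 1
    while i <= length(ia) && j <= length(ib)
        if ia[i] == ib[j]                 # stored in both vectors
            s = op(s, f(va[i], vb[j])); i += 1; j += 1
        elseif ia[i] < ib[j]              # stored only in a
            s = op(s, f(va[i], zb)); i += 1
        else                              # stored only in b
            s = op(s, f(za, vb[j])); j += 1
        end
    end
    while i <= length(ia)                 # remaining tail of a
        s = op(s, f(va[i], zb)); i += 1
    end
    while j <= length(ib)                 # remaining tail of b
        s = op(s, f(za, vb[j])); j += 1
    end
    return s
end
```

For example, sparse_binary_mapreduce((x, y) -> sqrt(x * y), +, 0.0, v1, v2) computes the cross term of the Bhattacharyya coefficient, touching only the stored entries.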

I also tried this function

_bhattacharyya_coeff(x, y) = (sum(sqrt, x .* y), sum(x), sum(y))

This is also efficient for SparseVector because sum is efficient for a single SparseVector. Yesterday I thought I had found some cases in which it was faster than using _binary_map_reduce1, but I can't reproduce them now. However, it is often within 10 or 20 percent of the same speed. It is much less complex, which makes it attractive. Also, sum uses pairwise summation, which is more accurate in general. Has Distances.jl consciously opted for speed over accuracy? I can understand making that choice for image processing.
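For reference, here is a hedged sketch of how the triple returned by this one-liner assembles into a Bhattacharyya coefficient and distance. The name bhattacharyya_sketch and the assembly code are illustrative assumptions, not Distances.jl's actual internals.

```julia
# Illustrative only; the assembly below is an assumption about how the triple
# is used, not Distances.jl's actual internal code.
_bhattacharyya_coeff(x, y) = (sum(sqrt, x .* y), sum(x), sum(y))

function bhattacharyya_sketch(x, y)
    sqab, asum, bsum = _bhattacharyya_coeff(x, y)
    coeff = sqab / sqrt(asum * bsum)   # normalized Bhattacharyya coefficient
    return -log(coeff)                 # Bhattacharyya distance
end
```

Because sum and broadcasting both have sparse-aware methods, the same code serves dense vectors and SparseVectors.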

See also https://github.com/JuliaLang/julia/issues/43248

The new method calls a method _binary_map_reduce1 which is also introduced
in this PR. It is general enough to handle other similar use cases.
@jlapeyre
Contributor Author

In the original PR, two useless loops are executed in _binary_map_reduce1. I think it's better to use a more specialized mapreduce helper function in order to avoid this. The compiler has enough information to replace the repeated one_nonz_count += 1 with a single addition, but I doubt it does so (@code_llvm shows the loops and operations), so the loops likely run.

I'll simplify the function.
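The point about the redundant tail loops can be illustrated with a toy example (hypothetical code, not from the PR): a loop whose body only increments a counter is equivalent to a single closed-form addition, but the compiler is not guaranteed to perform that rewrite, so the simplified implementation should do it explicitly.

```julia
# Loop form: repeatedly executes count += 1, analogous to the tail loops
# in the original _binary_map_reduce1.
function count_tail_loop(i, n)
    count = 0
    while i <= n
        count += 1
        i += 1
    end
    return count
end

# Equivalent single addition, which the simplified code can use directly
# instead of relying on the compiler to collapse the loop.
count_tail_direct(i, n) = max(n - i + 1, 0)
```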

@jlapeyre jlapeyre changed the title Add efficient SparseVector method for _bhattacharyya_coefficient [WIP] Add efficient SparseVector method for _bhattacharyya_coefficient Nov 28, 2021
This method is only for the non-weighted metrics. It's not clear
what a sparse version of weighted metrics would look like.
@jlapeyre
Contributor Author

I added support for UnionMetrics. I tested most of the metrics at the cli for efficiency. I added a test in the test suite.

@jlapeyre jlapeyre changed the title [WIP] Add efficient SparseVector method for _bhattacharyya_coefficient [WIP] Add efficient SparseVector method for some metrics Nov 29, 2021
@codecov-commenter
codecov-commenter commented Nov 29, 2021

Codecov Report

Merging #235 (93fa138) into master (9e23809) will increase coverage by 0.17%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #235      +/-   ##
==========================================
+ Coverage   97.39%   97.57%   +0.17%     
==========================================
  Files           8        8              
  Lines         806      865      +59     
==========================================
+ Hits          785      844      +59     
  Misses         21       21              
Impacted Files Coverage Δ
src/Distances.jl 100.00% <ø> (ø)
src/bhattacharyya.jl 98.21% <100.00%> (+0.99%) ⬆️
src/metrics.jl 96.90% <100.00%> (+0.29%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jlapeyre
Contributor Author

jlapeyre commented Nov 29, 2021

These tests just convert dense matrices to sparse. More tests are needed to exercise the code paths for different numbers of non-zeros. (EDIT: this has been done.)

@jlapeyre jlapeyre force-pushed the sparse-bhattacharyya branch from df239f4 to 1517138 Compare November 30, 2021 01:09
@jlapeyre jlapeyre changed the title [WIP] Add efficient SparseVector method for some metrics Add efficient SparseVector method for some metrics Nov 30, 2021
@jlapeyre jlapeyre changed the title Add efficient SparseVector method for some metrics Add efficient *SparseVector method for some metrics Nov 30, 2021
@jlapeyre jlapeyre changed the title Add efficient *SparseVector method for some metrics Add efficient SparseVector method for some metrics Nov 30, 2021
If you have a sparse matrix m and take @view m[:, i], the
result is a SparseVectorView, which also is made efficient
by the routines in this PR.
@dkarrasch
Member

dkarrasch commented Dec 2, 2021

For nice code coverage, can you add a quick test with two sparse vectors of (i) different lengths and (ii) both of length 0, please? That should yield 100% diff coverage, and then we're ready to go IMO.

@jlapeyre
Contributor Author

jlapeyre commented Dec 2, 2021

> different lengths

You mean different lengths so that an error is thrown, right?

Member

@dkarrasch dkarrasch left a comment


Just a few (minor style) comments.

jlapeyre and others added 3 commits December 2, 2021 08:29
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>
@dkarrasch
Member

Shall we leave a comment about assumptions underlying the sparse UnionMetric implementation? AFAIU, the only assumption is that eval_reduce(d, s, eval_op(d, 0, 0)) == s, right? Because you jump over zero pairs.

@jlapeyre
Contributor Author

jlapeyre commented Dec 2, 2021

> the only assumption is that eval_reduce(d, s, eval_op(d, 0, 0)) == s

Yes, I think this is correct.
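As an illustration of this assumption (eval_op and eval_reduce are internal Distances.jl helpers; the Euclidean-style stand-ins below are assumptions for illustration, not the package's actual code):

```julia
# Euclidean-style stand-ins for the internal Distances.jl helpers
# (assumed definitions, for illustration only).
eval_op_sketch(a, b) = abs2(a - b)    # per-element contribution
eval_reduce_sketch(s, v) = s + v      # accumulate contributions

# Skipping index pairs where both entries are zero is safe exactly when
# folding in eval_op(0, 0) leaves the accumulator unchanged:
is_skip_safe(s) = eval_reduce_sketch(s, eval_op_sketch(0.0, 0.0)) == s
```

Here eval_op_sketch(0.0, 0.0) == 0.0 and adding it leaves any accumulator unchanged, so skipping zero pairs is safe for this metric.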

@dkarrasch dkarrasch merged commit 91f51b5 into JuliaStats:master Dec 2, 2021