RFC: sorting for SVectors #754

stev47 · 2020-03-21T10:55:44Z

I didn't find allocation-free, fast sorting for small SVectors, so I implemented bitonic sorting networks, which I'd like to contribute back. Sorting networks are static and thus generally make good use of modern processor magic (branch prediction, speculative execution, ...), they also allow for parallel execution.

julia> @btime sort!(Base.copymutable(x),lt=<) setup=(x=rand(SVector{12,Float64}));
  101.653 ns (1 allocation: 112 bytes)
julia> @btime sort!(x,lt=<) setup=(x=rand(MVector{12,Float64}));
  84.734 ns (0 allocations: 0 bytes)
julia> @btime sort(x,lt=<) setup=(x=rand(SVector{12,Float64}));
  11.820 ns (0 allocations: 0 bytes)

Implementation notes:

bitonic sorting networks are not asymptotically optimal in the number of comparisons, but reasonably simple and general
for lengths less that 10, sort(SVector()) seems to consistently outperform the allocation-free sort!(MVector())
i prefer inlining, but i found generated functions were able to scale to longer vectors
parallelism (simd, etc.) is possible but not included here

Questions:

do we instead want manually implemented minimal (size/depth) sorting networks (only known for small lengths)?
why is julia deciding to allocate something for larger SVectors (here 13 for Float64, 20 for Int), can somebody explain this? The generated code is completely static but @llvm_code shows unrelated allocations.
for large lengths, llvm takes a long time to compile the generated expression. what is a reasonable cutoff length?

tkf · 2020-03-21T21:16:19Z

I think it's better to define the sort algorithms for tuples and then just call it via dispatch for static arrays.

There are a few packages that have sort for tuples (though I guess there is no package that implements bitonic sorting networks?):
https://github.com/Jutho/TupleTools.jl
https://github.com/peterahrens/TupleSort.jl
https://github.com/JeffreySarnoff/SortingNetworks.jl
https://github.com/tkf/Baselet.jl

There is a PR to add sort for tuples in Base JuliaLang/julia#32710

stev47 · 2020-03-21T23:46:11Z

thanks, I totally missed the work ongoing for tuples.
I guess StaticArrays does not want to depend on those external packages, so do you plan to wait for tuple sort in Base or would you accept an tuple-based interim implementation as a pull-request?

tkf · 2020-03-22T00:02:03Z

In #703 (comment), @c42f was open to include StaticNumbers.jl as a dependency. So my guess is that it's OK to rely on one of those packages that can sort tuples. Though ideally it'd be nice if it is in somewhere like https://github.com/JuliaCollections/SortingAlgorithms.jl that can be maintained by multiple people.

stev47 · 2020-03-22T01:01:12Z

Thank you for the pointers, the addition of bitonic sort to SortingAlgorithms.jl is certainly a good idea. But isn't that package for algorithms on mutable containers? I see no straightforward way to reuse a mutable implementation for a static one (which involves a different metaprogramming implementation style), so code would probably be duplicated anyway.

There seem to be a number of approaches to sorting networks for tuples and benchmarking methods differ. Personally I'd be fine with hardcoded networks (as in JuliaLang/julia#32710), but they are even less general or maintainable. Really any simple non-allocating implementation will probably be better than what we have now though.

tkf · 2020-03-22T01:22:54Z

Yeah, we need an entry point for sorting immutables. It'd be nice if there is something like Base.sort(::Tuple, ::Algorithm, ::Ordering) as the overloadable API. We need to add this to Base, though.

andyferris · 2020-03-22T02:28:46Z

I don't see why we couldn't host some algorithms for sorting immutable containers here (for now, and switch over to Base if and when that makes sense - to me the whole concept of this package is to evaporate into Base one day :) ). So far we've managed to stay dependency-free, and while that's not essential it would seem strange to go out of our way to add functionality to another package just so we can then rely on it. (Noteable exceptions might be something like ArrayInterface.jl or StaticNumbers.jl).

andyferris

Personally, I kind of like this approach - thanks @stev47.

It would be really good to support StaticVector instead of SVector.

I'm not sure what's causing allocations on large vectors. It might be best to put in a size cut-off, like defalg(a::StaticArray) = length(a) < 13 || BitonicSort || SomeOtherSort, especially if bitonic sort isn't O(N log N) - we have a variety of cutoffs for matrix-matrix multiplication, for example. (We do get the occassional person using StaticArrays for very large arrays, even if only experimentally, and it's nice not to explode the compiler in these cases).

src/sort.jl

c42f · 2020-03-22T14:09:17Z

So far we've managed to stay dependency-free, and while that's not essential it would seem strange to go out of our way to add functionality to another package just so we can then rely on it.

Yes, I do agree with that. I feel like I'd be happy to add dependencies only if they provide some pretty core value proposition. I'm not sure sorting does so, but I'd be happy to host an implementation. This one looks quite neat at a quick glance.

tkf · 2020-03-22T23:48:01Z

I'm not against StaticArrays maintaining its own sort algorithm. The benefit I was considering was rather the other direction. The sort algorithm StaticArrays is going to use likely is going to be one of the most well-optimized sorting algorithms for small containers. It'd be nice to able to use it for other small immutable containers. But code duplication is not always bad. It's probably not so hard to copy code from StaticArrays and use it for tuples.

stev47 · 2020-03-23T21:50:59Z

Thank you all for the valuable feedback. I updated the pull-request with your recommendations and used an internal tuple interface as @tkf suggested so it can easily be reused in other code.
I added some tests and even managed to get rid of the allocations for larger vectors (forgot one @inline).
The cutoff lengh is now at 20 but up for debate.

src/sort.jl

stev47 · 2020-03-24T21:21:44Z

pushed a new update. @allocated since Julia 1.4 seems to be counting weird allocations that don't agree with @ballocated, so I disabled that test for now.

c42f · 2020-03-31T04:12:56Z

Looks great, thanks!

c42f · 2020-04-11T01:51:17Z

I did a little performance microbenchmarking. The version for Int is just surprisingly far faster than the Float64 version here, I'm not sure quite why!

julia> @btime sort(x) setup=(x=@SVector rand(Int,6););
  5.029 ns (0 allocations: 0 bytes)

julia> @btime sort(x) setup=(x=@SVector rand(Float64,6););
  27.424 ns (0 allocations: 0 bytes)

One possible reason is that the integer version seems to compile down to a lot of cmov and is branchless AFAICT from a quick look. Not so for the Float64 version which has a few branches, even though the compiler seems to be fairly clever at removing a lot of them with some SIMD comparisons and masking.

stev47 · 2020-04-11T06:06:19Z

as in my initial post, you may want to try sort(x, lt=<) since the default isless comparator for floats in julia has some additional handling for NaN values.
The normal julia sort for floats is actually special-cased (see fpsort! in base and the left, right orderings that reduce to slt_int) in order to achieve better performance. We would need to do something similar to optimize that.

c42f · 2020-04-11T06:59:57Z

Oh of course, that should teach me to reach straight for @code_native! It's a very unfortunate performance hit for a relatively unusual edge case. Using lt=< is 7x faster!

julia> @btime sort(x,lt=<) setup=(x=@SVector rand(Float64,6););
  3.700 ns (0 allocations: 0 bytes)

andyferris · 2020-04-11T09:50:15Z

Haha. Damn IEEE ordering. Sometimes I wonder if we should just give up and have -NaN... or is this branch a 0.0/-0.0 thing?

stev47 · 2020-04-11T13:17:49Z

the julia intrinsic for isless on floats goes like this (src/runtime_intrinsics.c)

#define fpislt_n(c_type, nbits)                                         \
    static inline int fpislt##nbits(c_type a, c_type b) JL_NOTSAFEPOINT \
    {                                                                   \
        bits##nbits ua, ub;                                             \
        ua.f = a;                                                       \
        ub.f = b;                                                       \
        if (!isnan(a) && isnan(b))                                      \
            return 1;                                                   \
        if (isnan(a) || isnan(b))                                       \
            return 0;                                                   \
        if (ua.d >= 0 && ua.d < ub.d)                                   \
            return 1;                                                   \
        if (ua.d < 0 && ua.ud > ub.ud)                                  \
            return 1;                                                   \
        return 0;                                                       \
    }

so I guess it is both. You can also compare sort([0., NaN, -0.], lt=<) vs sort([0., NaN, -0.]).

stev47 force-pushed the feature/sort branch 2 times, most recently from 2ca8a3e to 7a3ceed Compare March 21, 2020 18:38

andyferris requested changes Mar 22, 2020

View reviewed changes

src/sort.jl Outdated Show resolved Hide resolved

src/sort.jl Outdated Show resolved Hide resolved

src/sort.jl Outdated Show resolved Hide resolved

src/sort.jl Outdated Show resolved Hide resolved

stev47 force-pushed the feature/sort branch from 7a3ceed to 482af35 Compare March 23, 2020 21:50

andyferris approved these changes Mar 24, 2020

View reviewed changes

src/sort.jl Outdated Show resolved Hide resolved

c42f reviewed Mar 24, 2020

View reviewed changes

src/sort.jl Outdated Show resolved Hide resolved

tkf reviewed Mar 24, 2020

View reviewed changes

src/sort.jl Outdated Show resolved Hide resolved

tkf reviewed Mar 24, 2020

View reviewed changes

src/sort.jl Outdated Show resolved Hide resolved

stev47 force-pushed the feature/sort branch from 482af35 to ff27540 Compare March 24, 2020 21:17

implement bitonic sorting network for SVectors

ff27540

c42f merged commit f82bb09 into JuliaArrays:master Mar 31, 2020

tpapp mentioned this pull request Apr 24, 2020

implement sortperm #772

Closed

vtjnash mentioned this pull request Apr 13, 2021

Let's sort tuples! JuliaLang/julia#32710

Closed

vyu mentioned this pull request Aug 20, 2021

Improve BitonicSort performance for sorting floats #952

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: sorting for SVectors #754

RFC: sorting for SVectors #754

stev47 commented Mar 21, 2020

tkf commented Mar 21, 2020

stev47 commented Mar 21, 2020

tkf commented Mar 22, 2020

stev47 commented Mar 22, 2020

tkf commented Mar 22, 2020

andyferris commented Mar 22, 2020 •

edited

Loading

andyferris left a comment

c42f commented Mar 22, 2020

tkf commented Mar 22, 2020

stev47 commented Mar 23, 2020

stev47 commented Mar 24, 2020

c42f commented Mar 31, 2020

c42f commented Apr 11, 2020

stev47 commented Apr 11, 2020

c42f commented Apr 11, 2020

andyferris commented Apr 11, 2020 •

edited

Loading

stev47 commented Apr 11, 2020

RFC: sorting for SVectors #754

RFC: sorting for SVectors #754

Conversation

stev47 commented Mar 21, 2020

tkf commented Mar 21, 2020

stev47 commented Mar 21, 2020

tkf commented Mar 22, 2020

stev47 commented Mar 22, 2020

tkf commented Mar 22, 2020

andyferris commented Mar 22, 2020 • edited Loading

andyferris left a comment

Choose a reason for hiding this comment

c42f commented Mar 22, 2020

tkf commented Mar 22, 2020

stev47 commented Mar 23, 2020

stev47 commented Mar 24, 2020

c42f commented Mar 31, 2020

c42f commented Apr 11, 2020

stev47 commented Apr 11, 2020

c42f commented Apr 11, 2020

andyferris commented Apr 11, 2020 • edited Loading

stev47 commented Apr 11, 2020

andyferris commented Mar 22, 2020 •

edited

Loading

andyferris commented Apr 11, 2020 •

edited

Loading