Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: sorting for SVectors #754

Merged
merged 1 commit into from
Mar 31, 2020
Merged

RFC: sorting for SVectors #754

merged 1 commit into from
Mar 31, 2020

Conversation

stev47
Copy link
Contributor

@stev47 stev47 commented Mar 21, 2020

I didn't find allocation-free, fast sorting for small SVectors, so I implemented bitonic sorting networks, which I'd like to contribute back. Sorting networks are static and thus generally make good use of modern processor magic (branch prediction, speculative execution, ...), they also allow for parallel execution.

julia> @btime sort!(Base.copymutable(x),lt=<) setup=(x=rand(SVector{12,Float64}));
  101.653 ns (1 allocation: 112 bytes)
julia> @btime sort!(x,lt=<) setup=(x=rand(MVector{12,Float64}));
  84.734 ns (0 allocations: 0 bytes)
julia> @btime sort(x,lt=<) setup=(x=rand(SVector{12,Float64}));
  11.820 ns (0 allocations: 0 bytes)

Implementation notes:

  • bitonic sorting networks are not asymptotically optimal in the number of comparisons, but reasonably simple and general
  • for lengths less that 10, sort(SVector()) seems to consistently outperform the allocation-free sort!(MVector())
  • i prefer inlining, but i found generated functions were able to scale to longer vectors
  • parallelism (simd, etc.) is possible but not included here

Questions:

  • do we instead want manually implemented minimal (size/depth) sorting networks (only known for small lengths)?
  • why is julia deciding to allocate something for larger SVectors (here 13 for Float64, 20 for Int), can somebody explain this? The generated code is completely static but @llvm_code shows unrelated allocations.
  • for large lengths, llvm takes a long time to compile the generated expression. what is a reasonable cutoff length?

@stev47 stev47 force-pushed the feature/sort branch 2 times, most recently from 2ca8a3e to 7a3ceed Compare March 21, 2020 18:38
@tkf
Copy link
Member

tkf commented Mar 21, 2020

I think it's better to define the sort algorithms for tuples and then just call it via dispatch for static arrays.

There are a few packages that have sort for tuples (though I guess there is no package that implements bitonic sorting networks?):
https://github.com/Jutho/TupleTools.jl
https://github.com/peterahrens/TupleSort.jl
https://github.com/JeffreySarnoff/SortingNetworks.jl
https://github.com/tkf/Baselet.jl

There is a PR to add sort for tuples in Base JuliaLang/julia#32710

@stev47
Copy link
Contributor Author

stev47 commented Mar 21, 2020

thanks, I totally missed the work ongoing for tuples.
I guess StaticArrays does not want to depend on those external packages, so do you plan to wait for tuple sort in Base or would you accept an tuple-based interim implementation as a pull-request?

@tkf
Copy link
Member

tkf commented Mar 22, 2020

In #703 (comment), @c42f was open to include StaticNumbers.jl as a dependency. So my guess is that it's OK to rely on one of those packages that can sort tuples. Though ideally it'd be nice if it is in somewhere like https://github.com/JuliaCollections/SortingAlgorithms.jl that can be maintained by multiple people.

@stev47
Copy link
Contributor Author

stev47 commented Mar 22, 2020

Thank you for the pointers, the addition of bitonic sort to SortingAlgorithms.jl is certainly a good idea. But isn't that package for algorithms on mutable containers? I see no straightforward way to reuse a mutable implementation for a static one (which involves a different metaprogramming implementation style), so code would probably be duplicated anyway.

There seem to be a number of approaches to sorting networks for tuples and benchmarking methods differ. Personally I'd be fine with hardcoded networks (as in JuliaLang/julia#32710), but they are even less general or maintainable. Really any simple non-allocating implementation will probably be better than what we have now though.

@tkf
Copy link
Member

tkf commented Mar 22, 2020

Yeah, we need an entry point for sorting immutables. It'd be nice if there is something like Base.sort(::Tuple, ::Algorithm, ::Ordering) as the overloadable API. We need to add this to Base, though.

@andyferris
Copy link
Member

andyferris commented Mar 22, 2020

I don't see why we couldn't host some algorithms for sorting immutable containers here (for now, and switch over to Base if and when that makes sense - to me the whole concept of this package is to evaporate into Base one day :) ). So far we've managed to stay dependency-free, and while that's not essential it would seem strange to go out of our way to add functionality to another package just so we can then rely on it. (Noteable exceptions might be something like ArrayInterface.jl or StaticNumbers.jl).

Copy link
Member

@andyferris andyferris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I kind of like this approach - thanks @stev47.

It would be really good to support StaticVector instead of SVector.

I'm not sure what's causing allocations on large vectors. It might be best to put in a size cut-off, like defalg(a::StaticArray) = length(a) < 13 || BitonicSort || SomeOtherSort, especially if bitonic sort isn't O(N log N) - we have a variety of cutoffs for matrix-matrix multiplication, for example. (We do get the occassional person using StaticArrays for very large arrays, even if only experimentally, and it's nice not to explode the compiler in these cases).

@c42f
Copy link
Member

c42f commented Mar 22, 2020

So far we've managed to stay dependency-free, and while that's not essential it would seem strange to go out of our way to add functionality to another package just so we can then rely on it.

Yes, I do agree with that. I feel like I'd be happy to add dependencies only if they provide some pretty core value proposition. I'm not sure sorting does so, but I'd be happy to host an implementation. This one looks quite neat at a quick glance.

@tkf
Copy link
Member

tkf commented Mar 22, 2020

I'm not against StaticArrays maintaining its own sort algorithm. The benefit I was considering was rather the other direction. The sort algorithm StaticArrays is going to use likely is going to be one of the most well-optimized sorting algorithms for small containers. It'd be nice to able to use it for other small immutable containers. But code duplication is not always bad. It's probably not so hard to copy code from StaticArrays and use it for tuples.

@stev47
Copy link
Contributor Author

stev47 commented Mar 23, 2020

Thank you all for the valuable feedback. I updated the pull-request with your recommendations and used an internal tuple interface as @tkf suggested so it can easily be reused in other code.
I added some tests and even managed to get rid of the allocations for larger vectors (forgot one @inline).
The cutoff lengh is now at 20 but up for debate.

@stev47
Copy link
Contributor Author

stev47 commented Mar 24, 2020

pushed a new update. @allocated since Julia 1.4 seems to be counting weird allocations that don't agree with @ballocated, so I disabled that test for now.

@c42f c42f merged commit f82bb09 into JuliaArrays:master Mar 31, 2020
@c42f
Copy link
Member

c42f commented Mar 31, 2020

Looks great, thanks!

@c42f
Copy link
Member

c42f commented Apr 11, 2020

I did a little performance microbenchmarking. The version for Int is just surprisingly far faster than the Float64 version here, I'm not sure quite why!

julia> @btime sort(x) setup=(x=@SVector rand(Int,6););
  5.029 ns (0 allocations: 0 bytes)

julia> @btime sort(x) setup=(x=@SVector rand(Float64,6););
  27.424 ns (0 allocations: 0 bytes)

One possible reason is that the integer version seems to compile down to a lot of cmov and is branchless AFAICT from a quick look. Not so for the Float64 version which has a few branches, even though the compiler seems to be fairly clever at removing a lot of them with some SIMD comparisons and masking.

@stev47
Copy link
Contributor Author

stev47 commented Apr 11, 2020

as in my initial post, you may want to try sort(x, lt=<) since the default isless comparator for floats in julia has some additional handling for NaN values.
The normal julia sort for floats is actually special-cased (see fpsort! in base and the left, right orderings that reduce to slt_int) in order to achieve better performance. We would need to do something similar to optimize that.

@c42f
Copy link
Member

c42f commented Apr 11, 2020

Oh of course, that should teach me to reach straight for @code_native! It's a very unfortunate performance hit for a relatively unusual edge case. Using lt=< is 7x faster!

julia> @btime sort(x,lt=<) setup=(x=@SVector rand(Float64,6););
  3.700 ns (0 allocations: 0 bytes)

@andyferris
Copy link
Member

andyferris commented Apr 11, 2020

Haha. Damn IEEE ordering. Sometimes I wonder if we should just give up and have -NaN... or is this branch a 0.0/-0.0 thing?

@stev47
Copy link
Contributor Author

stev47 commented Apr 11, 2020

the julia intrinsic for isless on floats goes like this (src/runtime_intrinsics.c)

#define fpislt_n(c_type, nbits)                                         \
    static inline int fpislt##nbits(c_type a, c_type b) JL_NOTSAFEPOINT \
    {                                                                   \
        bits##nbits ua, ub;                                             \
        ua.f = a;                                                       \
        ub.f = b;                                                       \
        if (!isnan(a) && isnan(b))                                      \
            return 1;                                                   \
        if (isnan(a) || isnan(b))                                       \
            return 0;                                                   \
        if (ua.d >= 0 && ua.d < ub.d)                                   \
            return 1;                                                   \
        if (ua.d < 0 && ua.ud > ub.ud)                                  \
            return 1;                                                   \
        return 0;                                                       \
    }

so I guess it is both. You can also compare sort([0., NaN, -0.], lt=<) vs sort([0., NaN, -0.]).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants