Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Added timsort, plus misc. sort updates #1691

Merged
merged 25 commits into from
Dec 18, 2012

Conversation

kmsquire
Copy link
Member

@kmsquire kmsquire commented Dec 6, 2012

EDIT: (Original message is below)

This patch includes an implementation of timsort for julia, and an update of the sort functions. See also a comparison of the sort routines available in julia at https://github.com/kmsquire/SortPerf.jl (the pdf file contains relevant plots).

Patch summary:

  • Modularized sort functions (Base.Sort)
  • Moved each_row, each_col, and each_vec to AbstractArray.jl
  • Added timsort implementation (timsort.jl)
  • Renamed and exported insertionsort, quicksort, mergesort, and timsort (mutating and non-mutating variants).
  • For each of these sorts, added (but did not export) *sort_r (reverse) and *sort_by (sort by function).
  • For each function except quicksort, added/renamed (but did not export) versions which return the sort permutation: *sort_perm, *sort_perm_r, and *sort_perm_by
  • Updated search_sorted, search_sorted_first and search_sorted_last as in RFC: Clean up search_sorted* functions #1620
  • Added and exported additional variants for sort-related functions: issorted_r, issorted_by, search_sorted*_r, search_sorted*_by, select_r, and select_by.
  • Added tests for various sort algorithms/functions
  • Added documentation for (almost?) all sort and related functions.

Unexported functions can be accessed with Base.Sort.<function> or with using Base.Sort

This patch might be a little big, so I can probably break it up into multiple patches if needed. It does all kind of go together, though.

No decision was made regarding replacing mergesort with timsort. See the pdf file referenced at the top.

ORIGINAL MESSAGE:

See https://groups.google.com/forum/?fromgroups=#!searchin/julia-dev/timsort/julia-dev/bgFzFVT403s/tm1oe7vIVWAJ and https://gist.github.com/4168004

timsort! is almost complete, and is only missing the function versions which permute indices. Right now, it's in a separate file which is included in sort, simply because adding 500 lines to sort.jl didn't seem like a good idea.

In the original gist, there was some code generated in a loop, which I unrolled in this version because I didn't trust myself to get it right inside of a macro.

I'll retest performance once everything is ready.

Additional Miscellaneous updates:

  • removed _jl_ prefix from front of *sort function calls
  • removed _lt suffix from function variants

As a result, the *sort functions: quicksort!, mergesort!, insertionsort!, and timsort! are available in Base (not exported), along with *_r! variants for reverse sorting, and _by! variants for sorting by a function, (EDIT: and _lt variants for sorting with a custom comparison function) so they can be accessed with quicksort!(), insertionsort_r!(), etc.

@StefanKarpinski
Copy link
Member

This is great work. I really like having all of these available by name like that so easily. Of course we should generally choose good default sorts, but sometimes the programmer just knows better and want to use a specific sort. I wouldn't mind exporting these names either since they're unlikely to clash and the using mechanism means that people can still use the names if they really want to.

@kmsquire
Copy link
Member Author

kmsquire commented Dec 7, 2012

Thanks Stephan. I'll export those names and try to round out everything else soon.

@ViralBShah
Copy link
Member

I think it is ok to add this to sort.jl.

@kmsquire
Copy link
Member Author

kmsquire commented Dec 8, 2012

Timsort should be feature complete now.

The permutation-based functions introduced a lot of potentially unnecessary code duplication. In particular, merge_collapse() and merge() simply pass the parameter through. Is there a good way to eliminate this, without affecting performance?

I want to run performance tests to make sure I didn't screw anything up and to figure out which sorts are actually fastest in different situations (e.g., sortperm{}), then move timsort.jl into sort.jl if that makes sense.

Any other suggestions to DRY this out are welcome. timsort.jl is currently at 800 lines. In particular, I'm unclear on how to generate multiple versions of a function from within a loop (e.g., with the {r}gallop_* functions). Or should I bother?

Kevin

@kmsquire
Copy link
Member Author

kmsquire commented Dec 8, 2012

The build failure is caused by a missing extras/arpack.jl.

@ViralBShah
Copy link
Member

It is the recent updates to the paths for loading arpack.jl and suitesparse.jl. make testall on my setup used to work fine, but somehow travis-ci is unable to load.

@staticfloat Do we do something different on the travis-ci setup for tests? I suspect that the paths are different due to the tests running out of make install.

@staticfloat
Copy link
Member

It's because the relative path from the julia binary to extras and test etc. changes when you install. You shouldn't use require("$JULIA_HOME/../extras/arpack.jl"), you should use require("extras/arpack"). I've submitted a pull request fixing the paths for all the tests (It looks like suitesparse would have also failed, but its disabled).

* removed _jl_ prefix from front of function calls
* removed _lt suffix from function variants
* provided sort functions: quicksort!, mergesort!, insertionsort!,
  and timsort!, along with *_r! variants for reverse sorting,
  and _by! variants for sorting with a function.  These functions
  are not exported, but can easily be accessed with

    Base.quicksort!()
    Base.mergesort!()
    Base.insertionsort!()
    Base.timsort!()
so that their first parameter can be a lt comparison function.
Added _r, _by, function forms of select, search_sorted
Only exporting basic *sort functions for now;
 the rest may be accessed with Base.*sort and Base.*sort_perm
@kmsquire
Copy link
Member Author

Almost there. I've filled out reverse (*sort_r), component (*sort_by), and permutation (*sort_perm) for almost all functions. I think that all are potentially useful, but at this point, again, I'm only exporting the original sort functions in mutating and nonmutating forms (i.e., timsort(), timsort!(), quicksort(), quicksort!(), etc.) The rest are available via Base.sort

I have some comparison plots among the different sort methods that I'll post tomorrow, along with code for comparing them.

@johnmyleswhite
Copy link
Member

Sorry to hijack this thread, but since people are working on better sorting functionality, I'm wondering whether the following is a bug or just shows that I don't understand order:

julia> order([2, 4, 3, 3])
4-element Int64 Array:
 1
 3
 4
 2

I, for one, would put 4 at the end of a sorted version of that vector.

@pao
Copy link
Member

pao commented Dec 14, 2012

@johnmyleswhite I think those are indexes.

@johnmyleswhite
Copy link
Member

The indices of what? I thought that order would return the indices in the sorted array, but the indices in the sorted array should be [1, 4, 2, 3] or [1, 4, 3, 2], no?

@pao
Copy link
Member

pao commented Dec 14, 2012

The indices of the original array. Wouldn't the ordered indices of the sorted array just be [1,2,3,4] regardless?

@johnmyleswhite
Copy link
Member

Ah, I see. Right now, order produces indices from the original array such that original[order(original)] == sort(original). Based on R, I would assume that order produces sort(original)[order(original)] == original.

@johnmyleswhite
Copy link
Member

Put another way: the R approach transforms each element in the original array into its ascending rank order.

@kmsquire
Copy link
Member Author

Love the sortperf graphs. What would you recommend for the default sort?
Would I be correct in saying that timsort is fast for non-numeric arrays.
In the case of numeric arrays, except for totally random input, it seems
that timsort performs well, but is slow for small sized inputs.

All done in Julia! Regarding the default sort: Hmmm. It certainly wasn't
clear cut from the graphs. Easy stuff:

  • quicksort should continue to be used by default for BitTypes
  • quicksort might be best for ASCIIStrings
  • timsort should probably be used in DataFrames, where there is more likely
    to be ordered inputs, and stable sorting is required. But that's a package
    decision.

After that, I don't have a good feeling as to how often sorting is applied
to data which is partially sorted, and it really might depend on the
field/problem/situation. If we're being conservative, we leave things the
way they are. If we want to assume people often work with partially sorted
data, or resort the same data frequently after adding, removing, or making
small changes, then timsort is a good default for non-BitType data.

Questions

  • for unicode strings, are we sorting properly? (c.f.
    http://en.wikipedia.org/wiki/Unicode_collation_algorithm) A quick google
    search suggests that libicu has some functionality related to this. I'm
    depending too much on Wikipedia, but
    http://en.wikipedia.org/wiki/Unicode_equivalence suggests that a stable
    sort is required, although I don't understand the subtlties.
  • How are arrays of strings stored internally? Since insertion sort was
    slow even for short string arrays, I was wondering if the strings might be
    copied

Comments:

  • Quicksort overflows the stack for some inputs at large size (though I
    forget which tests). Introsort is a quicksort variant commonly used in STL
    library in which quicksort switches to a heapsort when the recursion depth
    becomes too deep. We might consider adding this variant.
  • While implementing timsort, I read about quicksort variants which also
    check for sorted inputs. This might be a win for BitTypes, but it would
    have to be tested.

Perhaps it would be good to include a function in Base.Sort that runs all

sort algorithms on a given input, and gives the fastest one.

That would probably be worthwhile.

For now, my request is to have this patch reviewed as is, without changing
current defaults yet, and then open up another issue (or rather, leave the
original one open, but bump it), and make the decision about changing
default sorts separately. After it's been committed, we can ask people on
julia-users or julia-dev to run tests on their favorite data sets and
report (for which a function which runs all sort algorithms on an input
would be useful).

Kevin

@ViralBShah
Copy link
Member

I am going to leave this open for a couple more days, before merging it, so that others can review it. I wish that we could have had a better way to deal with the _r and _by versions, but I do not yet have good suggestions.

@kmsquire
Copy link
Member Author

I agree--they're less than desirable.

As with the current sort functions, all of the "regular" *sort functions accept a "lessthan" function as an optional first argument, and can use that instead of the default (isless).  So the *sort_r versions, at least, could be specified by passing a "greaterthan" (>) function in.  I'm assuming that because those functions are not known at compile time, they are less efficient, but I haven't actually tested this yet.

@johnmyleswhite
Copy link
Member

Mulling over the rank-ordering issue, how about introducing a new function called ranks or ranked that returns an integral rank ordering? We could also have an additional argument that specifies alternative tie-breaking rules. R implements all of the following rules:

  • first: The rank values increase sequentially for each item in a set of ties
  • random: The ranks values are distinct, but arbitrary, within a set of ties
  • average: All items in a set of ties receive the mean rank across the set
  • max: All items in a set of ties receive the max rank across the set
  • min: All items in a set of ties receive the min rank across the set. This is how sports rankings behave

Looking into R's behavior more, I now realize that I was wrong about R's order definition because I had used examples in which the rank-orders coincided with the sortperm results. R's order is just like Julia's order. The only thing missing is some way of expressing rank-order.

@StefanKarpinski
Copy link
Member

All done in Julia!

So cool that the graphs are all done in Julia.

quicksort should continue to be used by default for BitTypes

Yes – for sorting random floating-point values, our default quicksort/insertionsort hybrid kills it. However, since timsort starts to really kill it on some data patterns, we should make sure people are aware they can explicitly call timsort instead, if they happen to expect one of those.

quicksort might be best for ASCIIStrings

I dunno – timsort look like a winner to me for strings. When it wins, it wins big, but when it loses, it loses by only a bit.

for unicode strings, are we sorting properly?

Unicode sorting is complicated and the required order is application-and-language/locale-specific, so we should rely on something like @nolta's ICU package when people have special needs for Unicode string sorting. Most string sorting, however, just needs some sane lexicographical ordering, which for UTF-8, is happily the same thing as lexicographical ordering of byte sequences (thank you, Ken Thompson and Rob Pike). That also means that the common pure-ASCII case doesn't have to pay a performance or complexity penalty at all.

How are arrays of strings stored internally? Since insertion sort was slow even for short string arrays, I was wondering if the strings might be copied

It's an array of pointers to heap-allocated String objects. I'm fairly certain there's no copying happening – although you can't mutate a string through any "official" interface, you can get at the underlying .data field and mutate it, allowing you to see which strings share the same data.

Quicksort overflows the stack for some inputs at large size (though I forget which tests). Introsort is a quicksort variant commonly used in STL library in which quicksort switches to a heapsort when the recursion depth becomes too deep. We might consider adding this variant.
While implementing timsort, I read about quicksort variants which also check for sorted inputs. This might be a win for BitTypes, but it would have to be tested.

We may even want to do introsort by default, and the quicksort variant you mention certainly sounds interesting. Do you have a link? I would be interested in looking at smoothsort too – although the implementation is kind of complex, I've had very good experience using it on data that's already partially sorted.

@StefanKarpinski
Copy link
Member

I would favor merging this sooner rather than later. On looking through the code, it looks good to me and the changes introduced don't break anything. The only questionable functions that are introduced are the *sort_r ones, which are kind of ugly. Those can all go away once specialization of higher-order functions on their function arguments is possible, since then that can be written only slightly less conveniently with (a,b)->isless(b,a) as a function argument. Currently, that introduces a lot of function call overhead since no inlining can happen.

@StefanKarpinski
Copy link
Member

@johnmyleswhite: regarding the ranking choices, one issue is that the average strategy doesn't necessarily preserve the array type, so that should possibly be a different function. This could also be done with a higher order function argument, although I'm not sure how happy type inference will be about that.

@johnmyleswhite
Copy link
Member

Buf. That's a bummer.

@kmsquire
Copy link
Member Author

quicksort should continue to be used by default for BitTypes

Yes – for sorting random floating-point values, our default quicksort/insertionsort hybrid kills it. However, since timsort starts to really kill it on some data patterns, we should make sure people are aware they can explicitly call timsort instead, if they happen to expect one of those.

quicksort might be best for ASCIIStrings

I dunno – timsort look like a winner to me for strings. When it wins, it wins big, but when it loses, it loses by only a bit.

The log plots are a little misleading here. For randomly ordered arrays of strings, quicksort is always about 2x as fast as timsort, although timsort wins big if there is any order to the strings.

for unicode strings, are we sorting properly?

Unicode sorting is complicated and the required order is application-and-language/locale-specific, so we should rely on something like @nolta's ICU package when people have special needs for Unicode string sorting. Most string sorting, however, just needs some sane lexicographical ordering, which for UTF-8, is happily the same thing as lexicographical ordering of byte sequences (thank you, Ken Thompson and Rob Pike). That also means that the common pure-ASCII case doesn't have to pay a performance or complexity penalty at all.

Cool.

How are arrays of strings stored internally? Since insertion sort was slow even for short string arrays, I was wondering if the strings might be copied

It's an array of pointers to heap-allocated String objects. I'm fairly certain there's no copying happening – although you can't mutate a string through any "official" interface, you can get at the underlying .data field and mutate it, allowing you to see which strings share the same data.

Okay. Insertionsort is sensitive to the cost of comparsions, so it's just the string comparison which is making it slower.

Quicksort overflows the stack for some inputs at large size (though I forget which tests). Introsort is a quicksort variant commonly used in STL library in which quicksort switches to a heapsort when the recursion depth becomes too deep. We might consider adding this variant.
While implementing timsort, I read about quicksort variants which also check for sorted inputs. This might be a win for BitTypes, but it would have to be tested.

We may even want to do introsort by default, and the quicksort variant you mention certainly sounds interesting. Do you have a link? I would be interested in looking at smoothsort too – although the implementation is kind of complex, I've had very good experience using it on data that's already partially sorted.

Introsort as default sounds good. It depends on heapsort, but it would take only a handful of changes to test and switch to mergesort.

I don't have a link to source for a quicksort which checks for sorted inputs, but the link I found is a discussion on stackoverflow:

http://stackoverflow.com/questions/6567326/does-stdsort-check-if-a-vector-is-already-sorted

Regarding smoothsort: it's quite a cool algorithm. Python's previous sort was based on it. I think it's worth exploring in the next round.

@StefanKarpinski
Copy link
Member

I wonder if it makes sense to split all this sorting code out into a package... If we're going to have implementations of lots of different sort algorithms, it starts to seem a bit crowded for Base. Although if our default sort ends up using insertion, quick, and merge/heap, then those all end up in Base anyway.

@StefanKarpinski
Copy link
Member

@johnmyleswhite, it may not be so bad. I'm thinking that there's two very different purposes here that should be different functions. In some cases, you always want indices into the original array, in which case Array{Int} should be returned. That's what the current order does. In other cases, you want to map elements to their ranks, in which case, you may want different strategies for breaking ties, but you can return Array{Float64} consistently, even in cases where the rank values will always be integral. (And since we allow indexing with integer-valued floats, those cases will still work even if you use one of the ranks to do indexing.)

@johnmyleswhite
Copy link
Member

That's a good point. I was thinking that you wouldn't be able to use the ranks as indices. Using Array{Float64} everywhere works for me.

@kmsquire
Copy link
Member Author

Just noticed that @JeffBezanson changed the sort macro to an eval call in c6a3c85

@kmsquire
Copy link
Member Author

Fixed up sort/timsort to match @JeffBezanson's recent macro removal.

Regarding @StefanKarpinski's last suggestion (on timsort),

  • I could easily move timsort.jl to a package, as it's the only sort not used by default right now (and it's still in its own file). Of course, this doesn't leave any default sort which handles sortperm has poor performance #939.
  • I think that exposing the other sorts by name and the addition of search_sorted* and issorted* variants are still worthwhile
  • I think that implementing a quicksort/heapsort (aka introsort) variant, as mentioned previously, would be worthwhile. There's a heap implementation by @timholy in examples that could be adapted for heapsort, though it might take some work to match the style of the current sort functions.

And for anyone interested:

I don't think I'll be implementing any other sorts anytime soon.

After this Friday, I also likely won't be available to do much to this patch for a few weeks, so if you want anything changed/fixed, please ask soon.

Cheers! :-)

StefanKarpinski added a commit that referenced this pull request Dec 18, 2012
Added timsort, plus misc. sort updates
@StefanKarpinski StefanKarpinski merged commit 5bd7ab0 into JuliaLang:master Dec 18, 2012
@StefanKarpinski
Copy link
Member

Lets keep the sorts in Base for now. Man, you've done a lot of research on sorting! @ViralBShah coauthored a paper with what is to the best of my knowledge the best distributed sort. We should implement that for DArrays (allowing different sequential sorts, of course) and we'll have a pretty sweet sorting story!

@StefanKarpinski
Copy link
Member

It's quite disappointing that the "dual pivot quicksort" paper doesn't provide any actual performance comparison, just a theoretical proof that it swaps fewer elements on average. That makes sense, but seems like it would continue to hold for 3, 4, 5, etc. pivots. So the real question is whether the additional complexity costs more in instruction cache misses, etc. than it saves in element swapping. And the paper leaves that question completely unanswered. (Also, who typesets a paper with that much math in Word? Not even using Microsoft's Equation Editor – just text.)

@kmsquire
Copy link
Member Author

It's interesting that "a lot of research on sorting" = "a lot of google searches". ;-)

Yeah, the "dual pivot quicksort" paper I pointed to is a little sketchy. But... for your reading pleasure, here's a paper with detailed analysis justifying the approach: Average Case Analysis of Java 7’s Dual Pivot Quicksort

There are also slides and audio of a talk by the same authors:

Seems worthwhile (and it's nice to have an algorithmic description which isn't subject to the GPL).

Other than that, @ViralBShah's distributed sort would be quite nice to have.

Cheers!

@StefanKarpinski
Copy link
Member

OMG, LaTeX!

@kmsquire
Copy link
Member Author

OMG LOL! ;-)

simonster added a commit to simonster/julia that referenced this pull request Dec 19, 2012
I could be misunderstanding the purpose of these functions, but I think
this was an oversight in JuliaLang#1691
@ViralBShah
Copy link
Member

This is the paper on distributed sorting that I co-authored:

http://dspace.mit.edu/bitstream/handle/1721.1/7418/CS008.pdf?sequence=1

Just like our julia paper, this one was too useful (it shipped as the default parallel sort on Matlab Star-P), and hence never got accepted. ;-)

@kmsquire
Copy link
Member Author

Looks interesting... I'm a little confused why it didn't get accepted,
though.

On Thu, Dec 20, 2012 at 5:38 AM, Viral B. Shah notifications@github.comwrote:

This is the paper on distributed sorting that I co-authored:

http://dspace.mit.edu/bitstream/handle/1721.1/7418/CS008.pdf?sequence=1

Just like our julia paper, this one was too useful (it shipped as the
default parallel sort on Matlab Star-P), and hence never got accepted. ;-)


Reply to this email directly or view it on GitHubhttps://github.com//pull/1691#issuecomment-11573588.

@kmsquire kmsquire mentioned this pull request Feb 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants