RFC: Added timsort, plus misc. sort updates #1691
This is great work. I really like having all of these available by name like that so easily. Of course we should generally choose good default sorts, but sometimes the programmer just knows better and wants to use a specific sort. I wouldn't mind exporting these names either since they're unlikely to clash and the |
Thanks Stephan. I'll export those names and try to round out everything else soon. |
I think it is ok to add this to |
Timsort should be feature complete now. The permutation-based functions introduced a lot of potentially unnecessary code duplication. In particular, I want to run performance tests to make sure I didn't screw anything up and to figure out which sorts are actually fastest in different situations (e.g., sortperm{}), then move timsort.jl into sort.jl if that makes sense. Any other suggestions to DRY this out are welcome. timsort.jl is currently at 800 lines. In particular, I'm unclear on how to generate multiple versions of a function from within a loop (e.g., with the Kevin |
The build failure is caused by a missing |
It is the recent updates to the paths for loading arpack.jl and suitesparse.jl. @staticfloat Do we do something different on the travis-ci setup for tests? I suspect that the paths are different due to the tests running out of |
It's because the relative path from the julia binary to extras and test etc. changes when you install. You shouldn't use |
* removed _jl_ prefix from front of function calls * removed _lt suffix from function variants * provided sort functions: quicksort!, mergesort!, insertionsort!, and timsort!, along with *_r! variants for reverse sorting, and _by! variants for sorting with a function. These functions are not exported, but can easily be accessed with Base.quicksort!() Base.mergesort!() Base.insertionsort!() Base.timsort!()
so that their first parameter can be a lt comparison function.
Added _r, _by, function forms of select, search_sorted
Only exporting basic *sort functions for now; the rest may be accessed with Base.*sort and Base.*sort_perm
Almost there. I've filled out reverse ( I have some comparison plots among the different sort methods that I'll post tomorrow, along with code for comparing them. |
Sorry to hijack this thread, but since people are working on better sorting functionality, I'm wondering whether the following is a bug or just shows that I don't understand
I, for one, would put 4 at the end of a sorted version of that vector. |
@johnmyleswhite I think those are indexes. |
The indices of what? I thought that |
The indices of the original array. Wouldn't the ordered indices of the sorted array just be |
Ah, I see. Right now, |
Put another way: the R approach transforms each element in the original array into its ascending rank order. |
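To make the distinction in the comments above concrete, here is a minimal Python sketch (not code from this patch). The function names `sort_perm` and `ranks` and the sample vector are hypothetical stand-ins, and indices are 0-based here whereas Julia's are 1-based. It assumes the two readings being discussed are "indices into the original array, in sorted order" and "the ascending rank of each element", which are inverse permutations of each other.

```python
def sort_perm(values):
    """Indices into `values` that would put it in ascending order (stable)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def ranks(values):
    """Ascending rank of each element: the inverse of the sort permutation."""
    perm = sort_perm(values)
    result = [0] * len(values)
    for position, index in enumerate(perm):
        result[index] = position
    return result


v = [30, 10, 40, 20]      # hypothetical data, not the vector from the thread
perm = sort_perm(v)       # [1, 3, 0, 2]: where each sorted element came from
rank = ranks(v)           # [2, 0, 3, 1]: where each original element ends up
sorted_v = [v[i] for i in perm]
```

Indexing the original array with `perm` yields the sorted array; indexing it with `rank` generally does not, which is the source of the confusion above.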
All done in Julia! Regarding the default sort: Hmmm. It certainly wasn't
After that, I don't have a good feeling as to how often sorting is applied Questions
Comments:
Perhaps it would be good to include a function in Base.Sort that runs all
For now, my request is to have this patch reviewed as is, without changing Kevin |
I am going to leave this open for a couple more days, before merging it, so that others can review it. I wish that we could have had a better way to deal with the |
I agree--they're less than desirable. As with the current |
Mulling over the rank-ordering issue, how about introducing a new function called
Looking into R's behavior more, I now realize that I was wrong about R's |
So cool that the graphs are all done in Julia.
Yes – for sorting random floating-point values, our default quicksort/insertionsort hybrid kills it. However, since timsort starts to really kill it on some data patterns, we should make sure people are aware they can explicitly call timsort instead, if they happen to expect one of those.
I dunno – timsort looks like a winner to me for strings. When it wins, it wins big, but when it loses, it loses by only a bit.
Unicode sorting is complicated and the required order is application-and-language/locale-specific, so we should rely on something like @nolta's ICU package when people have special needs for Unicode string sorting. Most string sorting, however, just needs some sane lexicographical ordering, which for UTF-8, is happily the same thing as lexicographical ordering of byte sequences (thank you, Ken Thompson and Rob Pike). That also means that the common pure-ASCII case doesn't have to pay a performance or complexity penalty at all.
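The UTF-8 property mentioned above can be checked with a short Python sketch. The word list is made up for illustration. Python compares `str` values by code point, so sorting by the encoded bytes should give the same order as sorting the strings directly.

```python
# Hypothetical sample covering ASCII, Latin-1, CJK, and a non-BMP character.
words = ["zebra", "Äpfel", "apple", "éclair", "日本語", "Zebra", "naïve", "😀"]

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Byte-wise order of the UTF-8 encodings matches code-point order.
same_order = by_code_point == by_utf8_bytes
```

This is plain code-point order, not a locale-aware collation; for that, something like the ICU package mentioned above is still needed.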
It's an array of pointers to heap-allocated String objects. I'm fairly certain there's no copying happening – although you can't mutate a string through any "official" interface, you can get at the underlying
We may even want to do introsort by default, and the quicksort variant you mention certainly sounds interesting. Do you have a link? I would be interested in looking at smoothsort too – although the implementation is kind of complex, I've had very good experience using it on data that's already partially sorted. |
I would favor merging this sooner rather than later. On looking through the code, it looks good to me and the changes introduced don't break anything. The only questionable functions that are introduced are the |
@johnmyleswhite: regarding the ranking choices, one issue is that the average strategy doesn't necessarily preserve the array type, so that should possibly be a different function. This could also be done with a higher order function argument, although I'm not sure how happy type inference will be about that. |
Buf. That's a bummer. |
The log plots are a little misleading here. For randomly ordered arrays of strings, quicksort is always about 2x as fast as timsort, although timsort wins big if there is any order to the strings.
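One way to see why timsort gains so much on ordered input is to count comparisons rather than time. The sketch below is an illustration only: it uses CPython's built-in sort (a timsort variant) rather than the Julia implementation in this patch, and the helper name `count_comparisons`, the array size, and the seed are arbitrary choices.

```python
import random
from functools import cmp_to_key


def count_comparisons(data):
    """Sort a copy of `data` with the built-in sort and count comparisons."""
    count = 0

    def compare(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)

    sorted(data, key=cmp_to_key(compare))
    return count


n = 10_000
rng = random.Random(0)
shuffled = [rng.random() for _ in range(n)]
presorted = sorted(shuffled)

random_count = count_comparisons(shuffled)    # grows roughly like n * log2(n)
sorted_count = count_comparisons(presorted)   # about n: one pass finds one run
```

Comparison counts are only a proxy for the timings discussed here, but they show the mechanism: a single pre-existing run is detected in one pass and needs no merging.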
Cool.
Okay. Insertionsort is sensitive to the cost of comparisons, so it's just the string comparison which is making it slower.
Introsort as default sounds good. It depends on heapsort, but it would take only a handful of changes to test and switch to mergesort. I don't have a link to source for a quicksort which checks for sorted inputs, but the link I found is a discussion on stackoverflow: http://stackoverflow.com/questions/6567326/does-stdsort-check-if-a-vector-is-already-sorted Regarding smoothsort: it's quite a cool algorithm. Python's previous sort was based on it. I think it's worth exploring in the next round. |
I wonder if it makes sense to split all this sorting code out into a package... If we're going to have implementations of lots of different sort algorithms, it starts to seem a bit crowded for Base. Although if our default sort ends up using insertion, quick, and merge/heap, then those all end up in Base anyway. |
@johnmyleswhite, it may not be so bad. I'm thinking that there's two very different purposes here that should be different functions. In some cases, you always want indices into the original array, in which case |
That's a good point. I was thinking that you wouldn't be able to use the ranks as indices. Using |
Just noticed that @JeffBezanson changed the sort macro to an eval call in c6a3c85 |
Minor speed improvement for sorted arrays.
Fixed up sort/timsort to match @JeffBezanson's recent macro removal. Regarding @StefanKarpinski's last suggestion (on timsort),
And for anyone interested:
I don't think I'll be implementing any other sorts anytime soon. After this Friday, I also likely won't be available to do much to this patch for a few weeks, so if you want anything changed/fixed, please ask soon. Cheers! :-) |
Added timsort, plus misc. sort updates
Let's keep the sorts in Base for now. Man, you've done a lot of research on sorting! @ViralBShah coauthored a paper with what is to the best of my knowledge the best distributed sort. We should implement that for DArrays (allowing different sequential sorts, of course) and we'll have a pretty sweet sorting story!
It's quite disappointing that the "dual pivot quicksort" paper doesn't provide any actual performance comparison, just a theoretical proof that it swaps fewer elements on average. That makes sense, but seems like it would continue to hold for 3, 4, 5, etc. pivots. So the real question is whether the additional complexity costs more in instruction cache misses, etc. than it saves in element swapping. And the paper leaves that question completely unanswered. (Also, who typesets a paper with that much math in Word? Not even using Microsoft's Equation Editor – just text.) |
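For readers unfamiliar with the algorithm under discussion, here is a simplified Python sketch of a dual-pivot partitioning scheme. It takes the first and last elements as pivots and omits the small-array cutoffs and pivot sampling that production versions use, so it should be read as an illustration of the idea rather than the paper's or Java's exact procedure.

```python
def dual_pivot_quicksort(a, lo=0, hi=None):
    """Sort the list `a` in place using two pivots per partitioning step."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    if a[hi] < a[lo]:
        a[lo], a[hi] = a[hi], a[lo]
    p, q = a[lo], a[hi]          # the two pivots, with p <= q
    lt, gt = lo + 1, hi - 1      # a[lo+1..lt-1] < p and a[gt+1..hi-1] > q
    i = lo + 1
    while i <= gt:
        if a[i] < p:
            a[i], a[lt] = a[lt], a[i]
            lt += 1
            i += 1
        elif q < a[i]:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1              # do not advance i: the new a[i] is unexamined
        else:
            i += 1
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]  # move pivots into their final positions
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)      # elements < p
    dual_pivot_quicksort(a, lt + 1, gt - 1)  # elements between p and q
    dual_pivot_quicksort(a, gt + 1, hi)      # elements > q
```

With these naive pivots, already-sorted input gives the worst-case recursion depth, so this sketch is only suitable for small inputs.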
It's interesting that "a lot of research on sorting" = "a lot of google searches". ;-) Yeah, the "dual pivot quicksort" paper I pointed to is a little sketchy. But... for your reading pleasure, here's a paper with detailed analysis justifying the approach: Average Case Analysis of Java 7’s Dual Pivot Quicksort There are also slides and audio of a talk by the same authors:
Seems worthwhile (and it's nice to have an algorithmic description which isn't subject to the GPL). Other than that, @ViralBShah's distributed sort would be quite nice to have. Cheers! |
OMG, LaTeX! |
OMG LOL! ;-) |
I could be misunderstanding the purpose of these functions, but I think this was an oversight in JuliaLang#1691
This is the paper on distributed sorting that I co-authored: http://dspace.mit.edu/bitstream/handle/1721.1/7418/CS008.pdf?sequence=1 Just like our julia paper, this one was too useful (it shipped as the default parallel sort on Matlab Star-P), and hence never got accepted. ;-) |
Looks interesting... I'm a little confused why it didn't get accepted,
On Thu, Dec 20, 2012 at 5:38 AM, Viral B. Shah notifications@github.com wrote:
EDIT: (Original message is below)
This patch includes an implementation of timsort for julia, and an update of the sort functions. See also a comparison of the sort routines available in julia at https://github.com/kmsquire/SortPerf.jl (the pdf file contains relevant plots).
Patch summary:
- `each_row`, `each_col`, and `each_vec` to AbstractArray.jl
- `insertionsort`, `quicksort`, `mergesort`, and `timsort` (mutating and non-mutating variants).
- `*sort_r` (reverse) and `*sort_by` (sort by function).
- `*sort_perm`, `*sort_perm_r`, and `*sort_perm_by`
- `search_sorted`, `search_sorted_first` and `search_sorted_last` as in RFC: Clean up search_sorted* functions #1620
- `issorted_r`, `issorted_by`, `search_sorted*_r`, `search_sorted*_by`, `select_r`, and `select_by`.
Unexported functions can be accessed with `Base.Sort.<function>` or with `using Base.Sort`
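As an illustration of the first/last search variants listed in the summary, the Python sketch below uses the standard `bisect` module. It assumes the intended semantics are "first position of a value" and "last position of a value" in a sorted array. The exact behaviour of the Julia functions is defined in #1620, not here, and the indices below are 0-based.

```python
from bisect import bisect_left, bisect_right

sorted_values = [1, 2, 2, 2, 5, 8]   # hypothetical sorted input

first = bisect_left(sorted_values, 2)        # 1: first index holding a 2
last = bisect_right(sorted_values, 2) - 1    # 3: last index holding a 2
count = last - first + 1                     # 3 occurrences
```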
This patch might be a little big, so I can probably break it up into multiple patches if needed. It does all kind of go together, though.
No decision was made regarding replacing mergesort with timsort. See the pdf file referenced at the top.
ORIGINAL MESSAGE:
See https://groups.google.com/forum/?fromgroups=#!searchin/julia-dev/timsort/julia-dev/bgFzFVT403s/tm1oe7vIVWAJ and https://gist.github.com/4168004
`timsort!` is almost complete, and is only missing the function versions which permute indices. Right now, it's in a separate file which is included in sort, simply because adding 500 lines to `sort.jl` didn't seem like a good idea.
In the original gist, there was some code generated in a loop, which I unrolled in this version because I didn't trust myself to get it right inside of a macro.
I'll retest performance once everything is ready.
Additional Miscellaneous updates:
- removed `_jl_` prefix from front of `*sort` function calls
- removed `_lt` suffix from function variants
As a result, the `*sort` functions: `quicksort!`, `mergesort!`, `insertionsort!`, and `timsort!` are available in Base (not exported), along with `*_r!` variants for reverse sorting, and `_by!` variants for sorting by a function, (EDIT: and `_lt` variants for sorting with a custom comparison function) so they can be accessed with `quicksort!()`, `insertionsort_r!()`, etc.
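The relationship between the plain, reverse, by-function, and custom-comparison variants can be sketched in Python with a single insertion sort that takes a less-than function. The names below only mirror the naming pattern in this message and are hypothetical; this is not how the patch generates its variants.

```python
def insertion_sort(values, lt=lambda a, b: a < b):
    """Sort `values` in place using the less-than function `lt`; returns it."""
    for i in range(1, len(values)):
        x = values[i]
        j = i - 1
        while j >= 0 and lt(x, values[j]):
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = x
    return values


def insertion_sort_r(values):
    """Reverse variant: flip the arguments of the comparison."""
    return insertion_sort(values, lambda a, b: b < a)


def insertion_sort_by(values, by):
    """By-function variant: compare transformed keys."""
    return insertion_sort(values, lambda a, b: by(a) < by(b))


ascending = insertion_sort([3, 1, 2])
descending = insertion_sort_r([3, 1, 2])
by_length = insertion_sort_by(["ccc", "a", "bb"], len)
```

Every variant reduces to the same loop with a different comparison, which is why generating them from one definition is attractive.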