-
Notifications
You must be signed in to change notification settings - Fork 997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
in some cases setkey could save some more sort operations #1321
Comments
I'm sorry I don't follow this FR. |
@arunsrinivasan I think the idea is to skip some steps in
(Not knowing the internals of |
I'm guessing his suggestion is to use the "smart" method? But how does one know the data is still sorted by require(data.table)
set.seed(1L);
dt = data.table(x=sample(3,10,TRUE), y=sample(2,10,TRUE))
setkey(dt, y, x) # sorted by 'y,x'
# perform some operations
dt = dt[, .(x=sample(x)), by=y]
# x isn't sorted within y anymore..
key(dt) # [1] y
setattr(setkey(dt, y), 'sorted', c('y', 'x')) # <~~~ wrong!
# setkey() does a first pass to see if columns are already sorted by the columns specified
# and if so, avoids computing order and reordering. It also warns if key is set on those
# columns but data isn't sorted.
setkey(dt, y,x)
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
# Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. |
@arunsrinivasan by checking the key. Your operation destroyed the key. |
Okay, I've: setkey(dt, y,x) How do I check that that's the key when doing: setkey(setkey(dt, y), y,x) ? |
I don't know how far this can be pushed, but what OP's example points out is the following - the first column (of a multicolumn) key never needs to be resorted - it will always be automatically correctly sorted. So if your current key is Hmm.. Correction - I'm not sure if the above is true - it's certainly true in the case where |
Yes but that's already taken care of internally. The order routine is only called on unsorted data. Okay, just saw the timings.. woah!?!? sorting (and reordering) thro' 100 million elements in 0.6s? What machine is that? These are the timings on my laptop. system.time(onUnsorted())
# user system elapsed
# 11.521 0.386 11.967
system.time(onSorted())
# user system elapsed
# 8.939 0.366 9.331
system.time(onSortedSmart())
# user system elapsed
# 0.287 0.090 0.378 Note that the 2nd timing is slightly quicker (only first column needs to be ordered, the 2nd column's check for sorted-ness returns TRUE and therefore never reordered). Also note that the 3rd timing doesn't do any ordering as both columns are already sorted. |
Okay, the OP's example is pretty confusing and seems to reach the wrong conclusion.
So |
@franknarf1 thanks for confirming. |
@franknarf1 I don't think those are the right timings to looks at. This is my understanding of the relevant timings: dts = setkey(copy(dtu),x,y)
setkey(dts, y, x, verbose = T)
# forder took 9.52 sec
# reorder took 6.64 sec
dts = setkey(copy(dtu),x,y)
setkey(dts, y, verbose = T)
# forder took 5.42 sec
# reorder took 7.11 sec
dts = setkey(copy(dtu),x,y)
system.time(data.table:::is.sorted(dts$x))
# user system elapsed
# 0.29 0.00 0.30 You can see that |
@eantonya Oh ok, I see what you mean. Whether that test is sufficient depends on knowledge of It's clear to me that the test |
The order algorithm is fairly trivial - iterate from last column of the key to the first and sort all of them. The result will be a key-sorted table (assuming the sorting algo keeps existing order in equivalent cases, which it does for us). The OP's observation is fairly simple - if that last column is already sorted - don't sort it again. This happens when e.g. old key = |
Wow. 😄 You are waaay quicker than my 1yr old Lenovo y510p ( @arunsrinivasan - maybe not a mac book next time (just guessing)) :D @eantonya hit the nail on the head. Through I think it is minimally more general than as I understood him: when setting a key the data does not not need to be sorted for the last new key columns (i.e. the first sorts can be omitted) as long as they have been part of the old key in the same order (allowing arbitrary gaps; columns removed from the key count as if they were in their order at the end of the key) See code below I case you are interested. Thanks for the discussion to everyone! I had a fundamentally wrong Idea of how data.table works - I thought it had indices on all key columns - now I see why joins on a secondary key are not that easy. I'll edit the title to match what's left of the issue. "Proof":
Example:
|
@eantonya we perform MSD radix sorting (cache efficient + benefits from partial ordering). See slide 19 here from Matt's UseR'15 talk. I think this does it: # handle identical case separately first
function(old, new) {
len = min(length(old), length(new))
ans = vapply(seq_len(len), function(i) identical(head(old, i), tail(new, i)), TRUE)
new_filtered = if (any(ans)) head(new, -which.max(ans)) else new
} Note that we'll have to verify that the |
@jan-glx just checked the configurations -- here and here. Still can't figure out the reason for speed difference to my Macbook Pro 13' with 4MB L3 cache, 256KB L2 cache, 16GB RAM 1600MHz DDR3, i7 3GHz. The only thing that stands out for the processor compared to mine is 6MB (smart?) cache... unless I'm missing something..
|
@jan-glx you seem to be right, omitting also works and I don't understand why, since sorting by @arunsrinivasan I didn't realize the algo was changed to MSD, thanks for pointing that out (I don't think that changes too much for this FR - the timings above are all vs that algo). I'm not sure why you'd need to check |
@jan-glx hm, I'm stumped as to why that works as well.
Not really. We already check when the columns to key by are identical to existing keys. It has helped catch bugs in places where we wrongly retain keys. Also |
@arunsrinivasan I'm still unconvinced. Sure it's nice for catching And as far as direct setting the attribute goes - I see no problems with complete failure if user decides to tinker with internals and does so incorrectly. |
@jan-glx Here's a counter-example that shows that omitting doesn't work: > dt = data.table(a = 1, b= c(1,1,2,2), c = c(3,4,1,2), e = c(4,1,4,1), key = c('a','b','c'))
> dt
a b c e
1: 1 1 3 4
2: 1 1 4 1
3: 1 2 1 4
4: 1 2 2 1
> setkey(copy(dt), e)[]
a b c e
1: 1 1 4 1
2: 1 2 2 1
3: 1 1 3 4
4: 1 2 1 4
> setkey(copy(dt), e, a, c)[]
a b c e
1: 1 2 2 1
2: 1 1 4 1
3: 1 2 1 4
4: 1 1 3 4 |
|
@arunsrinivasan Unfortunately I have to admit that your macbook is not worse than a cheap lenove - it was just a consequence of the stupid mistakes I made when trying to bring the observation into a presentable form...your initial response was absolute suitable... here is my correct timing for the initial data:(
@eantonya well that was what I meant with
But please stop tinkering about it there is not much to gain here and there are really more import thinks to do for this package. E.g. a nice documentation for this wonderful forderv you just told me about. I once unsuccessfully searched an hour for a R (C) implementation of "bucket sort" just to read now randomly that it is already in base as "sort.list"...sometimes one underestimates base... Thanks again for all that nice information! |
This should be nicely handled by #4386 |
When
setkey
is called on aDT
with sorted attribute, existing keys should be reused if possible.Most importantly in the
setkey(setkey(dt, x, y), x)
case but also in a little more advanced cases likesetkey(setkey(dt, x, y), y, x)
. If you per se do not want to trust thesorted
attribute at least check for the column being sorted before sorting if it is marked as sorted.-best, Jan
example:
The text was updated successfully, but these errors were encountered: