You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This question was originally posted on SO. Arun asked me to file an issue, so here we go. Not sure about the ethics on data.table github, if a 'SO link only' is OK? Anyway, here's the question, verbatim from SO.
I'm trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered within each level of the on-variable.
Just to make it clear from the start: I have no problem with the order itself, or to order the output in a desired way after the join. However, because I find the output from all other data.table operations highly consistent, I suspect there is a ordering pattern to be revealed in non-equi joins as well.
I will give two examples, where two different 'large' data sets are joined with a smaller. I have tried to describe the most obvious patterns in the output, as well as instances where the pattern differs between the joins of the two data sets.
# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
y = c(rep(c(1, 3, 6), 3), 6, 6),
id = 1:11) # to make it easier to track the original order in the output
# x y id
# 1: b 1 1
# 2: b 3 2
# 3: b 6 3
# 4: a 1 4
# 5: a 3 5
# 6: a 6 6
# 7: c 1 7
# 8: c 3 8
# 9: c 6 9
# 10: a 6 10
# 11: b 6 11
# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))
# id val
# 1: 1 4
# 2: 2 2
Non-equi join between the first large data set and the small, on = .(y >= val).
d1[d2, on = .(y >= val)]
# x y id i.id
# 1: b 4 3 1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2: a 4 6 1 # The rows within this match have the same order as the original data
# 3: c 4 9 1 # and runs consecutively from first to last match
# 4: a 4 10 1
# 5: b 4 11 1
# 6: b 2 2 2 # Row 6-13, second match: y >= val[2]; y >= 2
# 7: a 2 5 2 # The rows within this match do not have the same order as the original data
# 8: c 2 8 2 # Rather, they seem to be come in chunks (6-8, 9-11, 12-13)
# First chunk starts with the match with lowest index, y[2]
# 9: b 2 3 2
# 10: a 2 6 2
# 11: c 2 9 2
# 12: a 2 10 2
# 13: b 2 11 2
The second 'large' data set:
d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
y = c(6, 1, 3),
id = 1:9)
# x y id
# 1: a 6 1
# 2: a 1 2
# 3: a 3 3
# 4: b 6 4
# 5: b 1 5
# 6: b 3 6
# 7: c 6 7
# 8: c 1 8
# 9: c 3 9
Same non-equi join between the second large data set with the small:
d3[d2, on = .(y >= val)]
# x y id i.id
# 1: a 4 1 1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4 4 1
# 3: c 4 7 1
# 4: a 2 3 2 # Row 4-9, second match (y >= 2).
# 5: b 2 6 2 # Again, rows not consecutive.
# 6: c 2 9 2 # However, now the first chunk does not start with the match with lowest index,
# y[3] instead of y[1]
# 7: a 2 1 2 # y[1] appears after y[3]
# 8: b 2 4 2 # ditto
# 9: c 2 7 2
Can anyone explain the logic of (1) the order within each level of the on-variable, here especially within the second match, where original order of the data isn't kept in the result. And (2) why does the order between chunks within matches differ when the two different data sets are used?
An even smaller example which makes it easier to track the re-ordering spotted by @franknarf1; in the result of the join, x is sorted by its join variable:
d1 <- data.table(end = c(4, 1, 3), ix = 1:3)
d1
# end ix
# 1: 4 1
# 2: 1 2
# 3: 3 3
d2 <- data.table(start = 2)
d2
# start
# 1: 2
# result of join not in original order, but in order of "x.end"
d1[d2, .(x.end, ix), on = .(end > start)]
# x.end ix
# 1: 3 3
# 2: 4 1
The text was updated successfully, but these errors were encountered:
I cannot explain the logic, but the pattern is d1[order(y)][y > 2, id] and d3[order(y)][y > 2, id], right? (Or in place of 2, put whatever other value from i.)
This question was originally posted on SO. Arun asked me to file an issue, so here we go. Not sure about the ethics on
data.table
github, if a 'SO link only' is OK? Anyway, here's the question, verbatim from SO.I'm trying to understand the underlying logic of how the result of a non-equi join in
data.table
is ordered within each level of theon
-variable.Just to make it clear from the start: I have no problem with the order itself, or to order the output in a desired way after the join. However, because I find the output from all other
data.table
operations highly consistent, I suspect there is a ordering pattern to be revealed in non-equi joins as well.I will give two examples, where two different 'large' data sets are joined with a smaller. I have tried to describe the most obvious patterns in the output, as well as instances where the pattern differs between the joins of the two data sets.
Non-equi join between the first large data set and the small,
on = .(y >= val)
.The second 'large' data set:
Same non-equi join between the second large data set with the small:
Can anyone explain the logic of (1) the order within each level of the
on
-variable, here especially within the second match, where original order of the data isn't kept in the result. And (2) why does the order between chunks within matches differ when the two different data sets are used?An even smaller example which makes it easier to track the re-ordering spotted by @franknarf1; in the result of the join,
x
is sorted by its join variable:The text was updated successfully, but these errors were encountered: