-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547
Comments
First attempt at setdiff for DTs
|
Nice reference regarding that feature: http://www.postgresql.org/docs/9.5/static/queries-union.html |
basic benchmark on my machine. require(data.table)
set.seed(1L)
dt1 = setDT(lapply(1:3, function(x) sample(500:600, 5e7, TRUE)))
dt2 = setDT(lapply(1:3, function(x) sample(560:660, 1e6, TRUE)))
system.time(dt_s <- fsetdiff(dt1, dt2))
# 5.479 0.738 6.264
system.time(dt_i <- fintersect(dt1, dt2))
# 2.356 0.095 2.475
system.time(dt_u <- funion(dt1, dt2))
# 2.423 0.333 2.786
system.time(dt_s2 <- fsetdiff(dt1, dt2, all=TRUE))
# 7.280 0.733 8.080
system.time(dt_i2 <- fintersect(dt1, dt2, all=TRUE))
# 6.476 0.311 6.851
system.time(dt_u2 <- funion(dt1, dt2, all=TRUE))
# 0.166 0.163 0.336 |
another good reference presenting some graphs on set operators and corresponding SQL queries (sometimes SQL workarounds) for INTERSECT ALL and EXCEPT ALL: http://www.gplivna.eu/papers/sql_set_operators.htm |
Oh this one's much nicer indeed. |
Also leaving some benchmarks on latest data.table and dplyr: Set operators in R |
small volume data re-run: data 1e6 unique#DT1: 1000000 rows, 1000000 unq rows
#DT2: 1000000 rows, 1000000 unq rows
#DT1 intersect DT2: 499109 rows
#DT1 setdiff DT2: 500891 rows
#DT1 union DT2: 1500891 rows
## intersect
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 1599.0095 1651.5923 1743.7876 1714.8873 1826.2016 1929.4169 10
# dt 818.0777 846.5367 894.1579 902.0847 917.2623 989.7598 10
## setdiff
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 1610.5838 1648.2851 1862.0443 1771.8324 2054.5137 2331.562 10
# dt 689.9719 693.1796 722.6169 711.1247 725.3226 807.525 10
## union
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 2171.7295 2374.7261 2473.9191 2429.1520 2616.0950 2911.9051 10
# dt 418.4617 433.3666 487.3305 446.4286 573.3986 620.1645 10
## setequal
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 2083.8287 2125.593 2241.1061 2187.7803 2353.0298 2510.583 10
# dt 789.9313 795.676 837.7273 811.4293 841.5154 1011.968 10 data 1e6 duplicates#DT1: 1000000 rows, 432159 unq rows
#DT2: 1000000 rows, 432456 unq rows
#DT1 intersect DT2: 93474 rows
#DT1 setdiff DT2: 338685 rows
#DT1 union DT2: 771141 rows
## intersect
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 1282.7889 1295.7182 1361.1802 1336.6922 1413.4156 1485.1399 10
# dt 529.3002 534.2962 554.2797 539.0737 555.6799 668.7858 10
## setdiff
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 1510.8082 1514.792 1708.682 1776.9783 1821.2516 1912.1937 10
# dt 642.8345 670.856 710.395 707.8905 757.1905 771.2871 10
## union
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 1794.22 1795.561 2066.1633 2025.5487 2347.7738 2495.5978 10
# dt 336.82 357.294 391.5749 377.1024 407.0398 545.6484 10
## setequal
#Unit: milliseconds
# expr min lq mean median uq max neval
# df 1904.0811 1923.1564 1992.1805 1983.6733 2032.273 2122.502 10
# dt 929.4357 936.4791 969.8934 947.6446 1015.818 1056.489 10 |
question where set operators could benefit from |
Submitted by: Arun ; Assigned to: Nobody; R-Forge link
data.table
has already a very efficient way for gettingunique
andduplicated
. It takes 0.5 seconds where base's unique takes 8 seconds for example - check this benchmark on rpub.Similarly, why not use the internal engine to generate set operations on
data.table
?Condition: Number of columns and column classes should match.
The text was updated successfully, but these errors were encountered: