[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547

arunsrinivasan · 2014-06-08T13:15:44Z

Submitted by: Arun ; Assigned to: Nobody; R-Forge link

data.table has already a very efficient way for getting unique and duplicated. It takes 0.5 seconds where base's unique takes 8 seconds for example - check this benchmark on rpub.

Similarly, why not use the internal engine to generate set operations on data.table?

setdiff
union
intersect
setequal

Condition: Number of columns and column classes should match.

The text was updated successfully, but these errors were encountered:

First attempt at setdiff for DTs

jangorecki · 2015-01-10T10:20:50Z

setdiff function name might be replaced except (postgres, sqlserver naming convention), to not interfere with set* used by by reference functions.

jangorecki · 2016-03-04T00:34:07Z

Nice reference regarding that feature: http://www.postgresql.org/docs/9.5/static/queries-union.html
Set ops * ALL basic examples: http://gplivna.blogspot.co.uk/2008/01/minus-all-and-intersect-all-in-oracle.html
Set ops * ALL design description: https://db.apache.org/derby/papers/Intersect-design.html

arunsrinivasan · 2016-03-06T22:25:42Z

basic benchmark on my machine.

require(data.table)
set.seed(1L)
dt1 = setDT(lapply(1:3, function(x) sample(500:600, 5e7, TRUE)))
dt2 = setDT(lapply(1:3, function(x) sample(560:660, 1e6, TRUE)))

system.time(dt_s <- fsetdiff(dt1, dt2))
#   5.479   0.738   6.264 
system.time(dt_i <- fintersect(dt1, dt2))
#   2.356   0.095   2.475 
system.time(dt_u <- funion(dt1, dt2))
#   2.423   0.333   2.786 

system.time(dt_s2 <- fsetdiff(dt1, dt2, all=TRUE))
#   7.280   0.733   8.080 
system.time(dt_i2 <- fintersect(dt1, dt2, all=TRUE))
#   6.476   0.311   6.851 
system.time(dt_u2 <- funion(dt1, dt2, all=TRUE))
#   0.166   0.163   0.336

jangorecki · 2016-03-06T22:46:00Z

another good reference presenting some graphs on set operators and corresponding SQL queries (sometimes SQL workarounds) for INTERSECT ALL and EXCEPT ALL: http://www.gplivna.eu/papers/sql_set_operators.htm

arunsrinivasan · 2016-03-06T22:48:18Z

Oh this one's much nicer indeed.

jangorecki · 2016-03-17T01:02:04Z

Also leaving some benchmarks on latest data.table and dplyr: Set operators in R
source: https://gist.github.com/jangorecki/0a542fbb8157dec958f5

jangorecki · 2016-03-18T22:54:49Z

small volume data re-run:

data 1e6 unique

#DT1: 1000000 rows, 1000000 unq rows
#DT2: 1000000 rows, 1000000 unq rows
#DT1 intersect DT2: 499109 rows
#DT1 setdiff DT2: 500891 rows
#DT1 union DT2: 1500891 rows

## intersect
#Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
#   df 1599.0095 1651.5923 1743.7876 1714.8873 1826.2016 1929.4169    10
#   dt  818.0777  846.5367  894.1579  902.0847  917.2623  989.7598    10

## setdiff
#Unit: milliseconds
# expr       min        lq      mean    median        uq      max neval
#   df 1610.5838 1648.2851 1862.0443 1771.8324 2054.5137 2331.562    10
#   dt  689.9719  693.1796  722.6169  711.1247  725.3226  807.525    10

## union
#Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
#   df 2171.7295 2374.7261 2473.9191 2429.1520 2616.0950 2911.9051    10
#   dt  418.4617  433.3666  487.3305  446.4286  573.3986  620.1645    10

## setequal
#Unit: milliseconds
# expr       min       lq      mean    median        uq      max neval
#   df 2083.8287 2125.593 2241.1061 2187.7803 2353.0298 2510.583    10
#   dt  789.9313  795.676  837.7273  811.4293  841.5154 1011.968    10

data 1e6 duplicates

#DT1: 1000000 rows, 432159 unq rows
#DT2: 1000000 rows, 432456 unq rows
#DT1 intersect DT2: 93474 rows
#DT1 setdiff DT2: 338685 rows
#DT1 union DT2: 771141 rows

## intersect
#Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
#   df 1282.7889 1295.7182 1361.1802 1336.6922 1413.4156 1485.1399    10
#   dt  529.3002  534.2962  554.2797  539.0737  555.6799  668.7858    10

## setdiff
#Unit: milliseconds
# expr       min       lq     mean    median        uq       max neval
#   df 1510.8082 1514.792 1708.682 1776.9783 1821.2516 1912.1937    10
#   dt  642.8345  670.856  710.395  707.8905  757.1905  771.2871    10

## union
#Unit: milliseconds
# expr     min       lq      mean    median        uq       max neval
#   df 1794.22 1795.561 2066.1633 2025.5487 2347.7738 2495.5978    10
#   dt  336.82  357.294  391.5749  377.1024  407.0398  545.6484    10

## setequal
#Unit: milliseconds
# expr       min        lq      mean    median       uq      max neval
#   df 1904.0811 1923.1564 1992.1805 1983.6733 2032.273 2122.502    10
#   dt  929.4357  936.4791  969.8934  947.6446 1015.818 1056.489    10

jangorecki · 2016-04-01T23:49:20Z

question where set operators could benefit from by argument: http://stackoverflow.com/q/36112817/2490497 leaving just for future reference

arunsrinivasan added a commit that referenced this issue Aug 5, 2014

Implemented setdiff #547 task list 1. Not exported.

d4445b6

First attempt at setdiff for DTs

arunsrinivasan added this to the v1.9.6 milestone Sep 24, 2014

arunsrinivasan added the Low label Sep 25, 2014

arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Oct 10, 2014

arunsrinivasan mentioned this issue Oct 31, 2014

forderv should be able to deal with more than one data.table efficiently. #925

Open

jangorecki self-assigned this Mar 4, 2016

jangorecki closed this as completed in 5fafcc9 Mar 6, 2016

arunsrinivasan added a commit that referenced this issue Mar 6, 2016

Updated ?setops and added NEWS entry, #547.

9c93779

jangorecki added a commit that referenced this issue Mar 9, 2016

improvement for scalability: fintersect fsetdiff #547

d99c967

franknarf1 mentioned this issue Jul 31, 2017

Add %fin% set operator #2279

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547

[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547

arunsrinivasan commented Jun 8, 2014

jangorecki commented Jan 10, 2015

jangorecki commented Mar 4, 2016

arunsrinivasan commented Mar 6, 2016

jangorecki commented Mar 6, 2016

arunsrinivasan commented Mar 6, 2016

jangorecki commented Mar 17, 2016

jangorecki commented Mar 18, 2016

jangorecki commented Apr 1, 2016

[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547

[R-Forge #5248] implement "set" (as in setdiff, union etc..) operations for data.tables for convenience #547

Comments

arunsrinivasan commented Jun 8, 2014

jangorecki commented Jan 10, 2015

jangorecki commented Mar 4, 2016

arunsrinivasan commented Mar 6, 2016

jangorecki commented Mar 6, 2016

arunsrinivasan commented Mar 6, 2016

jangorecki commented Mar 17, 2016

jangorecki commented Mar 18, 2016

data 1e6 unique

data 1e6 duplicates

jangorecki commented Apr 1, 2016