fsetequal() fails for single character columns #2318

pschil · 2017-08-23T17:40:04Z

fsetequal() on 2 data.tables only containing a single character column fails to detect that they are not equal.

The following reprex:

dt.1 <- data.table(Id=(1:10))
dt.2 <- data.table(Id=(1:10))
dt.2[1, Id:=99]

# First row in dt.1 and dt.2 now are different

# This works as expected
setequal(dt.1$Id, dt.2$Id) # FALSE
fsetequal(dt.1, dt.2)      # FALSE

# ---- Inequality not detected when Id is character 
dt.1[, Id := as.character(Id)]
dt.2[, Id := as.character(Id)]
setequal(dt.1$Id, dt.2$Id) # FALSE
fsetequal(dt.1, dt.2)      # TRUE - should be FALSE as the first row is 1 vs 99

The issue seems to be with ignore.row.order in all.equal:

# In-equality detected
data.table:::all.equal.data.table(dt.1, dt.2, check.attributes = F)
# In-equality NOT detected
data.table:::all.equal.data.table(dt.1, dt.2, check.attributes = F, ignore.row.order = T)

When adding an additional column, numeric or character, fsetequal works as expected.

dt.1[, extracol1 := 1:10]
dt.2[, extracol1 := 1:10]
# Correct
fsetequal(dt.1, dt.2)    # FALSE

dt.1[, extracol1 := as.character(extracol1)]
dt.2[, extracol1 := as.character(extracol1)]
# Also correct
fsetequal(dt.1, dt.2)    # FALSE

data.table version: 1.10.4

Session Info

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1

The text was updated successfully, but these errors were encountered:

jangorecki · 2017-08-23T19:20:51Z

Thanks for report. I don't have R at the moment so cannot check... maybe it is related to the way that rolling join handle single character column? https://github.com/Rdatatable/data.table/blob/master/R/setops.R#L235
If you would have a chance please check also on latest dev.

pschil · 2017-08-23T20:51:35Z

Installing the latest dev version from the repo here (devtools::install_github("Rdatatable/data.table")) together with brew's g++7 for OpenMP support unfortunately results in the same issue, ie:

setequal(dt.1$Id, dt.2$Id) # FALSE
[1] FALSE
fsetequal(dt.1, dt.2)      # TRUE - should be FALSE as the first row is 1 vs 99
[1] TRUE

data.table version:
data.table 1.10.5 IN DEVELOPMENT built 2017-08-23 20:40:13 UTC

franknarf1 · 2018-12-13T21:52:48Z

The case is somewhat broader -- it happens whenever the final column is a string:

library(data.table)

DT1 = data.table(v = "foo", a = "my string")
DT2 = data.table(v = "foo", a = "not my string")
fsetequal(DT1, DT2) # TRUE, wrongly
fsetequal(rev(DT1), rev(DT2)) # FALSE, correctly

I'm on a months-old version: data.table 1.11.9 IN DEVELOPMENT built 2018-10-06 03:32:28 UTC

jangorecki · 2018-12-14T06:09:19Z

data.table/R/setops.R

Lines 233 to 244 in 5b16cc5

    
             ans1 = target[current, roll=tolerance, rollends=TRUE, which=TRUE, on=jn.on] 
        
             ans2 = target[current, roll=-tolerance, rollends=TRUE, which=TRUE, on=jn.on] 
        
             pmin(ans1, ans2, na.rm=TRUE) 
        
           } 
        
           if (any_na(as_list(ans))) { 
        
             msg = c(msg, sprintf("Dataset 'current' has rows not present in 'target'%s%s", if (target_dup || current_dup) " or present in different quantity" else "", tolerance.msg)) 
        
             return(msg) 
        
           } 
        
           ans = if (identical(tolerance, 0)) current[target, nomatch=NA, which=TRUE, on=jn.on] else { 
        
             ans1 = current[target, roll=tolerance, rollends=TRUE, which=TRUE, on=jn.on] 
        
             ans2 = current[target, roll=-tolerance, rollends=TRUE, which=TRUE, on=jn.on] 
        
             pmin(ans1, ans2, na.rm=TRUE)

those rolling joins reports matching rows for @franknarf1 example
I tried to play with rollends, below setup solve the particular issue but makes existing unit tests failing

ans1 = target[current, roll=tolerance, rollends=c(TRUE,FALSE), which=TRUE, on=jn.on]
ans2 = target[current, roll=-tolerance, rollends=c(TRUE,FALSE), which=TRUE, on=jn.on]
 #...
ans1 = current[target, roll=tolerance, rollends=c(FALSE,TRUE), which=TRUE, on=jn.on]
ans2 = current[target, roll=-tolerance, rollends=c(FALSE,TRUE), which=TRUE, on=jn.on]

If issue occurs only when there are character columns in a data.table then we don't have to do rolling join at all. It is there only to handle floating point tolerance.
For all character columns we could set tolerance=0 and then and regular join will be made instead of rolling.

jangorecki self-assigned this May 30, 2018

jangorecki added a commit that referenced this issue Dec 14, 2018

fsetequal handle last col a character, closes #2318

8d2867a

jangorecki mentioned this issue Dec 14, 2018

fsetequal handle last col a character #3219

Merged

jangorecki added this to the 1.12.0 milestone Dec 14, 2018

mattdowle closed this as completed in #3219 Dec 14, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fsetequal() fails for single character columns #2318

fsetequal() fails for single character columns #2318

pschil commented Aug 23, 2017

jangorecki commented Aug 23, 2017

pschil commented Aug 23, 2017

franknarf1 commented Dec 13, 2018

jangorecki commented Dec 14, 2018 •

edited

Loading

fsetequal() fails for single character columns #2318

fsetequal() fails for single character columns #2318

Comments

pschil commented Aug 23, 2017

jangorecki commented Aug 23, 2017

pschil commented Aug 23, 2017

franknarf1 commented Dec 13, 2018

jangorecki commented Dec 14, 2018 • edited Loading

jangorecki commented Dec 14, 2018 •

edited

Loading