Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsetequal() fails for single character columns #2318

Closed
pschil opened this issue Aug 23, 2017 · 4 comments
Closed

fsetequal() fails for single character columns #2318

pschil opened this issue Aug 23, 2017 · 4 comments
Assignees
Milestone

Comments

@pschil
Copy link

pschil commented Aug 23, 2017

fsetequal() on 2 data.tables only containing a single character column fails to detect that they are not equal.

The following reprex:

dt.1 <- data.table(Id=(1:10))
dt.2 <- data.table(Id=(1:10))
dt.2[1, Id:=99]

# First row in dt.1 and dt.2 now are different

# This works as expected
setequal(dt.1$Id, dt.2$Id) # FALSE
fsetequal(dt.1, dt.2)      # FALSE

# ---- Inequality not detected when Id is character 
dt.1[, Id := as.character(Id)]
dt.2[, Id := as.character(Id)]
setequal(dt.1$Id, dt.2$Id) # FALSE
fsetequal(dt.1, dt.2)      # TRUE - should be FALSE as the first row is 1 vs 99

The issue seems to be with ignore.row.order in all.equal:

# In-equality detected
data.table:::all.equal.data.table(dt.1, dt.2, check.attributes = F)
# In-equality NOT detected
data.table:::all.equal.data.table(dt.1, dt.2, check.attributes = F, ignore.row.order = T)

When adding an additional column, numeric or character, fsetequal works as expected.

dt.1[, extracol1 := 1:10]
dt.2[, extracol1 := 1:10]
# Correct
fsetequal(dt.1, dt.2)    # FALSE

dt.1[, extracol1 := as.character(extracol1)]
dt.2[, extracol1 := as.character(extracol1)]
# Also correct
fsetequal(dt.1, dt.2)    # FALSE

data.table version: 1.10.4

Session Info

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1   
@jangorecki
Copy link
Member

Thanks for report. I don't have R at the moment so cannot check... maybe it is related to the way that rolling join handle single character column? https://github.com/Rdatatable/data.table/blob/master/R/setops.R#L235
If you would have a chance please check also on latest dev.

@pschil
Copy link
Author

pschil commented Aug 23, 2017

Installing the latest dev version from the repo here (devtools::install_github("Rdatatable/data.table")) together with brew's g++7 for OpenMP support unfortunately results in the same issue, ie:

setequal(dt.1$Id, dt.2$Id) # FALSE
[1] FALSE
fsetequal(dt.1, dt.2)      # TRUE - should be FALSE as the first row is 1 vs 99
[1] TRUE

data.table version:
data.table 1.10.5 IN DEVELOPMENT built 2017-08-23 20:40:13 UTC

@jangorecki jangorecki self-assigned this May 30, 2018
@franknarf1
Copy link
Contributor

The case is somewhat broader -- it happens whenever the final column is a string:

library(data.table)

DT1 = data.table(v = "foo", a = "my string")
DT2 = data.table(v = "foo", a = "not my string")
fsetequal(DT1, DT2) # TRUE, wrongly
fsetequal(rev(DT1), rev(DT2)) # FALSE, correctly

I'm on a months-old version: data.table 1.11.9 IN DEVELOPMENT built 2018-10-06 03:32:28 UTC

@jangorecki
Copy link
Member

jangorecki commented Dec 14, 2018

data.table/R/setops.R

Lines 233 to 244 in 5b16cc5

ans1 = target[current, roll=tolerance, rollends=TRUE, which=TRUE, on=jn.on]
ans2 = target[current, roll=-tolerance, rollends=TRUE, which=TRUE, on=jn.on]
pmin(ans1, ans2, na.rm=TRUE)
}
if (any_na(as_list(ans))) {
msg = c(msg, sprintf("Dataset 'current' has rows not present in 'target'%s%s", if (target_dup || current_dup) " or present in different quantity" else "", tolerance.msg))
return(msg)
}
ans = if (identical(tolerance, 0)) current[target, nomatch=NA, which=TRUE, on=jn.on] else {
ans1 = current[target, roll=tolerance, rollends=TRUE, which=TRUE, on=jn.on]
ans2 = current[target, roll=-tolerance, rollends=TRUE, which=TRUE, on=jn.on]
pmin(ans1, ans2, na.rm=TRUE)

those rolling joins reports matching rows for @franknarf1 example
I tried to play with rollends, below setup solve the particular issue but makes existing unit tests failing

ans1 = target[current, roll=tolerance, rollends=c(TRUE,FALSE), which=TRUE, on=jn.on]
ans2 = target[current, roll=-tolerance, rollends=c(TRUE,FALSE), which=TRUE, on=jn.on]
 #...
ans1 = current[target, roll=tolerance, rollends=c(FALSE,TRUE), which=TRUE, on=jn.on]
ans2 = current[target, roll=-tolerance, rollends=c(FALSE,TRUE), which=TRUE, on=jn.on]      

If issue occurs only when there are character columns in a data.table then we don't have to do rolling join at all. It is there only to handle floating point tolerance.
For all character columns we could set tolerance=0 and then and regular join will be made instead of rolling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants