Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More weird behaviour with colnames and setcolorder #5494

Open
sch56 opened this issue Oct 20, 2022 · 3 comments
Open

More weird behaviour with colnames and setcolorder #5494

sch56 opened this issue Oct 20, 2022 · 3 comments

Comments

@sch56
Copy link

sch56 commented Oct 20, 2022

setcolorder messes with the names of a vector previously defined using colnames.
It appears that assigning names to a vector using colnames() is by reference, so reordering the columns of the table results in those vector names also being re-ordered, but the elements of the vector are not. This results in the vector elements no longer corresponding to the actual data columns of the table.

Is there a good reason that colnames should not provide a copy of the column names rather than a pointer?

library(data.table)
dt <- data.table(a=c(1,2,3), b=c('x','y','z'), c=as.POSIXct(Sys.time(), '2022-10-21 12:19:21'), d=as.integer(c(10,20,30)))
print(dt)
#>    a b                   c  d
#> 1: 1 x 2022-10-21 12:33:39 10
#> 2: 2 y 2022-10-21 12:33:39 20
#> 3: 3 z 2022-10-21 12:33:39 30
cl <- sapply(dt, function(x) class(x)[1])
names(cl) <- colnames(dt)
print(cl)
#>           a           b           c           d 
#>   "numeric" "character"   "POSIXct"   "integer"
str(cl)
#>  Named chr [1:4] "numeric" "character" "POSIXct" "integer"
#>  - attr(*, "names")= chr [1:4] "a" "b" "c" "d"

setcolorder(dt, c('b','c','d','a'))
print(cl)
#>           b           c           d           a 
#>   "numeric" "character"   "POSIXct"   "integer"
#The names of the vector cl have been re-order but the values not
#So now cl is completely erroneous even though it was not referenced in the setcolorder

Created on 2022-10-21 by the reprex package (v2.0.0)

@jangorecki

This comment was marked as off-topic.

@ben-schwen
Copy link
Member

Seems related to #5133

@cthombor
Copy link

cthombor commented Dec 17, 2022

I'm pretty sure that assigning new names using colnames() may invisibly corrupt a data.table. The damage will not become apparent until you notice that an object-level attribute is missing, or you get a warning from a subsequent [.data.table invocation: "Invalid .internal.selfref detected and fixed...". A possible start on an MRE is

dt <- data.table(exptID = matrix(character(0), ncol = 1))
colnames(dt) <- c("newColname")

Please note this isn't an MRE, because I lack a reliable way to determine whether or not I have actually corrupted dt at this point. I'm only "pretty sure" I have done so. Certainly dt[, .(newColname)] doesn't produce a warning message. So: perhaps this particular parameterisation of the [.data.table method doesn't cause it to check the validity of the .internal.selfref sentinel, or perhaps dt's copy-detection sentinel is still valid at this point and will only become invalid due to some other operation on dt? Anyway: I now know it is extremely hazardous to modify a DT using any base method, and I have rewritten my dt-creation code to use the (less mnemonic, but much safer) setnames(dt, ...)

I'm hoping the data.table team will consider adding a colnames() method to the data.table class, so that naive users will be a bit less likely to trigger the "Invalid .internal.selfref detected and fixed..." warning at a codepoint which is far distant from their hazardous use of colnames(). Also I think it'd be helpful if, in some future release, all of the methods in data.table (especially the set* methods) will test the .internal.selfref sentinel and issue a warning if copying is detected. My thinking is that this very inexpensive runtime check would make it quite likely that the user is warned, at a point near to where they had corrupted their data.table, about the hazard of any subsequent modification to its top-level structure. The corruption hazard is just a conjecture that's based on my (still quite incomplete and tentative!) understanding of the issues outlined in Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function. And ... my informal (developmental) testing log includes several cases in which one of my data tables has somehow "lost" an attribute that I'm almost-certain had been added accurately via setattr()... one of those cases was (tentatively) resolved by having a method return a data.table::copy() of the data.table it had produced, and now I'm hoping that the colnames() was the root cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants