Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setorder() on data.frame input order only the sort column #5379

Open
amitramon opened this issue May 9, 2022 · 5 comments
Open

setorder() on data.frame input order only the sort column #5379

amitramon opened this issue May 9, 2022 · 5 comments

Comments

@amitramon
Copy link

I'm passing a data.frame to a function that sort it using data.table::setorder. Before sorting, I'm calling data.table::setDT on the input (in actuality I need to later call some specific data.table functions on the input). This is the function I'm using:

test_order <- function(x) {
  data.table::setDT(x)
  data.table::setorder(x, score)
}

and this is how I'm calling it:

x0 <- data.frame(id = 1:3, score = c(20, 10, 30))
x1 <- test_order(x0)

The results seem strange:

  • The type of x0 is changed to a data.table - that seems expected;
  • The column x0$id is not modified - it's address (using data.table::address) remains the same, and it also remains in the original order;
  • The column x0$score also remains at the same address, but it is now sorted, so it's out of sync with the id column;
  • The resulting x1 has the rows sorted correctly; also, the score column is at the same address as the score column of x0, but the id column is, of course, at a different address. Here is what I get:

Minimal reproducible example

test_order <- function(x) {
  data.table::setDT(x)
  data.table::setorder(x, score)
}
> x0 <- data.frame(id = 1:3, score = c(20, 10, 30))
> x0
  id score
1  1    20
2  2    10
3  3    30
> x1 <- test_order(x0)
> x0 # note that the id column is in the original order, but the score column has been re-ordered
   id score
1:  1    10
2:  2    20
3:  3    30
> x1
   id score
1:  2    10
2:  1    20
3:  3    30

Output of sessionInfo()

> sessionInfo( )
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux bookworm/sid

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.0.5    data.table_1.14.2
@amitramon amitramon changed the title setorder on data.frame input order only the sort column setorder() on data.frame input order only the sort column May 9, 2022
@avimallu
Copy link
Contributor

avimallu commented May 9, 2022

Cluing onto why:

test_order_mod <- function(x) {
  data.table::setDT(x)
  x[, score:=score*1]
  data.table::setorder(x, score)
}
x0 <- data.frame(id = 1:3, score = c(20, 10, 30))
x1 <- test_order_mod(x0)

keeps x0 unchanged.

From the reference semantics vignette, and my understanding of how R stores columns in a data.frame or derived object from Hadley's Advanced R book, it looks like R is initially being "smart" by re-using common columns across two data.frame objects, unless they are modified.

On a limb, I'd say that the advice from the vignette applies for the set* functions as well:

we used := for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a data.table object to a function, and might want to use the := operator, but wouldn’t want to update the original object. We can accomplish this using the function copy().

@tlapak
Copy link
Contributor

tlapak commented May 9, 2022

That's not a fix, unfortunately:

x0 <- data.frame(id = 1:3, score = c(20, 10, 30), letters = c('a', 'b', 'c'))
x1 <- test_order_mod(x0)
x0
#    id score letters
# 1:  1    10       b
# 2:  2    20       a
# 3:  3    30       c

The issue arises because 1:3 is stored as an ALTREP vector and setDT() does ALTREP expansion to fix a different bug, see #2882. setDT() tries to modify the input object by reference, but it hast to do a shallow copy to over allocate the columns. The shallow copy is then assigned back to the original symbol setDT() was called on. Since there is no mechanism to tell you which symbols are bound to a given object, this doesn't propagate out to any other symbol. Especially, it doesn't propagate to other environments/out of function calls.

One might envision a mechanism to track down such changes, but I'm not sure it's worth the overhead and complexity.

More immediate, maybe we can move up the altrep expansion or delay the over allocation. But neither of those would alleviate the issue that you can't add or remove columns and have the result visible outside the function environment.

For now, the best way around this would be to either use setDT() once in the beginning outside of any functions, or only use as.data.table() inside of functions. The latter materializes a deep copy so be aware of that if memory is an issue.

TL,DR: Don't use setDT() inside functions. Use as.data.table() instead.

@OfekShilon
Copy link
Contributor

This seems the same root cause as #4783, #4816, #5330 and some others too. I submitted suggested fixes in PR #4978 and #4901 1Y+ ago, but failed to draw attention to them. @tlapak perhaps you'd consider merging them now?

@avimallu
Copy link
Contributor

I think only Matt and Arun can merge on the master repository.

@Waldi73
Copy link

Waldi73 commented Nov 30, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants