Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using a function inside by causes troubles #5583

Closed
AbrJA opened this issue Jan 24, 2023 · 5 comments · Fixed by #6708
Closed

Using a function inside by causes troubles #5583

AbrJA opened this issue Jan 24, 2023 · 5 comments · Fixed by #6708
Labels
Milestone

Comments

@AbrJA
Copy link

AbrJA commented Jan 24, 2023

I'm facing an issue when I summarize a data.table using a function inside the "by" clause

Here is an example:

> library(data.table)
> dt <- data.table(x = c(4, 5, 1, 3, 2), y = 1L, key = "x")
> dt
   x y
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
> str(dt)
Classes ‘data.table’ and 'data.frame':	5 obs. of  2 variables:
 $ x: num  1 2 3 4 5
 $ y: int  1 1 1 1 1
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "x"

> dt_sum <- dt[, .(.N), by = .(round(2 / x))]
> dt_sum
   round N
1:     2 1
2:     1 2
3:     0 2
> str(dt_sum)
Classes ‘data.table’ and 'data.frame':	3 obs. of  2 variables:
 $ round: num  2 1 0
 $ N    : int  1 2 2
 - attr(*, "sorted")= chr "round"
 - attr(*, ".internal.selfref")=<externalptr> 

> dt_sum[round == 0] # ERROR
Empty data.table (0 rows and 2 cols): round,N
> dt_sum[round == 1] # CORRECT
   round N
1:     1 2
> dt_sum[round == 2] # ERROR
Empty data.table (0 rows and 2 cols): round,N

I think the issue is here - attr(*, "sorted")= chr "round" because dt_sum isn't already sorted. I don't know if it's a known issue and there's documentation about it, I didn't find anything. Gretings!

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=es_MX.UTF-8       LC_NUMERIC=C               LC_TIME=es_MX.UTF-8        LC_COLLATE=es_MX.UTF-8     LC_MONETARY=es_MX.UTF-8   
 [6] LC_MESSAGES=es_MX.UTF-8    LC_PAPER=es_MX.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.6

loaded via a namespace (and not attached):
[1] compiler_4.2.2 tools_4.2.2   
@jangorecki
Copy link
Member

jangorecki commented Jan 25, 2023

Thank you for the report. Is it reproducible on master branch? 1.14.6 is far behind master.

@ben-schwen
Copy link
Member

ben-schwen commented Jan 25, 2023

Can reproduce on current master and is similar to #5361. At least the experienced problem is the same

AFAIR the fast subsetting problem always arises when a data.table has a key attribute although it isn't really sorted according to that key.

As already pointed out by @AbrJA, using x in by shouldn't automatically set the by column as key.

@AbrJA
Copy link
Author

AbrJA commented Jan 25, 2023

As a temporal solution (if someone is facing the same issue), using "keyby" instead of "by" solves the problem

> library(data.table)
> 
> dt <- data.table(x = c(4, 5, 1, 3, 2), y = 1L, key = "x")
> dt
   x y
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
> str(dt)
Classes ‘data.table’ and 'data.frame':	5 obs. of  2 variables:
 $ x: num  1 2 3 4 5
 $ y: int  1 1 1 1 1
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "x"

> dt_sum <- dt[, .(.N), keyby = .(round(2 / x))]
> dt_sum
   round N
1:     0 2
2:     1 2
3:     2 1
> str(dt_sum)
Classes ‘data.table’ and 'data.frame':	3 obs. of  2 variables:
 $ round: num  0 1 2
 $ N    : int  2 2 1
 - attr(*, "sorted")= chr "round"
 - attr(*, ".internal.selfref")=<externalptr> 

> dt_sum[round == 0]
   round N
1:     0 2
> dt_sum[round == 1]
   round N
1:     1 2
> dt_sum[round == 2]
   round N
1:     2 1

@jangorecki jangorecki added this to the 1.14.11 milestone Jul 30, 2023
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico
Copy link
Member

MichaelChirico commented Jan 4, 2025

OK I think the key is the function has to return a strictly decreasing ordering of the key column -- not just any function in by= triggers the bug:

DT=data.table(a=-1:2, key='a')
key(DT[,.N, by=.(a2=a**2)])
# NULL

here a2 = c(1L, 0L, 4L).

That would mean this line is to blame, since a strictly decreasing input to forderv() returns a 0-length o__

bysameorder = orderedirows && !length(o__)

forderv(list(b=2:1), sort=FALSE, retGrp=TRUE)
# integer(0)
# attr(,"starts")
# [1] 1 2
# ... other attr...

@jangorecki is this possibly related to the forder change to re-use sorting? no, confirmed the same issue on 1.15.0

@MichaelChirico
Copy link
Member

Some related faulty logic determining the key of the result:

DT=data.table(a=1:3, b=3:1, key='a,b')
key(DT[,.N, by=.(a^b, rep(1L, 3L))])
# [1] "a"   "rep"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants