shift() in data.table v1.9.6 is slow for many groups #1534

Closed
nachti opened this issue Feb 11, 2016 · 3 comments · Fixed by #5205
Labels
GForce issues relating to optimized grouping calculations (GForce) performance

@nachti

nachti commented Feb 11, 2016

Hi there!
For many different groups in by, shift is much slower than manual shifting.
See: http://stackoverflow.com/questions/35179911/shift-in-data-table-v1-9-6-is-slow-for-many-groups
and https://github.com/nachti/datatable_test/blob/master/leadtest.R for a detailed example.
Cheers,
Gerhard
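
For readers without access to the linked script, here is a minimal sketch of what "manual shifting" can look like next to `shift()`. The toy table and the column names `lag_shift` and `lag_manual` are assumptions for illustration, not taken from the linked example, which may do the manual shift differently.

```r
library(data.table)

# Toy data (hypothetical): 3 groups of 4 rows each
dt <- data.table(Grp = rep(1:3, each = 4L), Value = 1:12)

# Lag by group using shift()
dt[, lag_shift := shift(Value, n = 1L, type = "lag"), by = Grp]

# The same lag written by hand: prepend NA and drop the last element of each group
dt[, lag_manual := c(NA_integer_, Value[-.N]), by = Grp]

# Both columns should agree
stopifnot(identical(dt$lag_shift, dt$lag_manual))
```

Both forms are evaluated once per group, so this only illustrates that the results are equivalent; it says nothing about the timings reported in this issue.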

@arunsrinivasan

That's not surprising. This'll go away when gforce is optimised for :=. It's on the list for this release, I believe.

@ben519

ben519 commented Nov 10, 2018

+1 for this performance enhancement. shift() is the main bottleneck in a lot of my code. Seems that for a fixed number of rows, the time it takes to run shift() is proportional to the number of groups in the data.

library(data.table)

# Build table to store timings
timings <- CJ(RowCount = 10^7, Groups = 10^c(0:7))
timings[, SizePerGroup := RowCount/Groups]

# Loop through each experiment
for (i in seq_len(nrow(timings))) {
  print(paste0("Iteration: ", i))
  
  # Build dataset
  timings_i <- timings[i]
  dt <- data.table(Grp = rep(seq_len(timings_i$Groups), each = timings_i$SizePerGroup))
  dt[, Value := sample(100, size = .N, replace = TRUE)]
  
  # Measure the time it takes to insert a column indicating the previous value by group
  elapsed <- system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp])["elapsed"]
  timings[i, Elapsed := elapsed]
}

library(ggplot2)
ggplot(timings, aes(x = Groups, y = Elapsed))+geom_line()+geom_point()

[Screenshot: plot of Elapsed against Groups produced by the code above]

@franknarf1

@ben519 Fyi, for the special case of when your code looks like that, there's a shortcut:

library(data.table)
dt <- data.table(Grp = rep(seq_len(1e6), each=10L))
dt[, Value := sample(100L, size = .N, replace = TRUE)]

system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][])
#    user  system elapsed 
#   19.50    0.80   20.34
system.time(dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][])
#    user  system elapsed 
#    1.00    0.87    1.25 

dt[, all.equal(v, PrevValueByGrp)]
# [1] TRUE
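
One caveat on the shortcut above, sketched on toy data: it appears to rely on each group occupying one contiguous run of rows, as in the `rep(..., each = ...)` construction. The tables `dt_contig` and `dt_mixed` and the column names are hypothetical, chosen only to show the difference.

```r
library(data.table)

# Contiguous groups: each Grp is one unbroken run of rows
dt_contig <- data.table(Grp = rep(1:3, each = 3L), Value = 1:9)
dt_contig[, by_grp := shift(Value, type = "lag"), by = Grp]
dt_contig[, shortcut := shift(Value, type = "lag")][rowid(Grp) == 1L, shortcut := NA]
stopifnot(identical(dt_contig$by_grp, dt_contig$shortcut))

# Interleaved groups: the previous row belongs to a different group,
# so the ungrouped shift picks up the wrong value
dt_mixed <- data.table(Grp = rep(1:3, times = 3L), Value = 1:9)
dt_mixed[, by_grp := shift(Value, type = "lag"), by = Grp]
dt_mixed[, shortcut := shift(Value, type = "lag")][rowid(Grp) == 1L, shortcut := NA]
stopifnot(!identical(dt_mixed$by_grp, dt_mixed$shortcut))
```

If the groups are not contiguous, sorting by the grouping column first (for example with `setorder(dt_mixed, Grp)`) should make the shortcut applicable, at the cost of changing the row order.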
