Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging by character columns is very slow #4733

Closed
matthiasgomolka opened this issue Oct 2, 2020 · 13 comments
Closed

Merging by character columns is very slow #4733

matthiasgomolka opened this issue Oct 2, 2020 · 13 comments
Labels
High joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins performance regression
Milestone

Comments

@matthiasgomolka
Copy link

I noticed that merging data.tables on a character column is surprisingly slow (compared to dplyr). See the reprex below.

Am I missing something obvious or is there room for improvement? Don't get me wrong: I love data.table and use it daily, but I was very surprised that this kind of merge is so comparatively slow.

Minimal reproducible example

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2

dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))

microbenchmark(
  data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>  data.table 1774.640 1774.640 1774.640 1774.640 1774.640 1774.640     1
#>       dplyr   35.852   35.852   35.852   35.852   35.852   35.852     1

setkey(dfa_char, id)
setkey(dfb_char, id)

microbenchmark(
  data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr     min      lq    mean  median      uq     max neval
#>  data.table  6.3679  6.3679  6.3679  6.3679  6.3679  6.3679     1
#>       dplyr 56.9153 56.9153 56.9153 56.9153 56.9153 56.9153     1

dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))

microbenchmark(
  data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_int, dfb_int, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr     min      lq    mean  median      uq     max neval
#>  data.table  5.4953  5.4953  5.4953  5.4953  5.4953  5.4953     1
#>       dplyr 14.6955 14.6955 14.6955 14.6955 14.6955 14.6955     1

Created on 2020-10-02 by the reprex package (v0.3.0)

Output of sessionInfo()

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2020-10-02                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date       lib source        
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
#>  backports        1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>  callr            3.4.3   2020-03-28 [1] CRAN (R 3.6.3)
#>  cli              2.0.2   2020-02-28 [1] CRAN (R 3.6.3)
#>  crayon           1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
#>  data.table     * 1.13.0  2020-07-24 [1] CRAN (R 3.6.3)
#>  desc             1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
#>  devtools         2.3.0   2020-04-10 [1] CRAN (R 3.6.3)
#>  digest           0.6.25  2020-02-23 [1] CRAN (R 3.6.3)
#>  dplyr          * 1.0.0   2020-05-29 [1] CRAN (R 3.6.3)
#>  ellipsis         0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
#>  evaluate         0.14    2019-05-28 [1] CRAN (R 3.6.2)
#>  fansi            0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
#>  fs               1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
#>  generics         0.0.2   2018-11-29 [1] CRAN (R 3.6.2)
#>  glue             1.4.1   2020-05-13 [1] CRAN (R 3.6.3)
#>  highr            0.8     2019-03-20 [1] CRAN (R 3.6.2)
#>  htmltools        0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
#>  knitr            1.28    2020-02-06 [1] CRAN (R 3.6.3)
#>  lifecycle        0.2.0   2020-03-06 [1] CRAN (R 3.6.3)
#>  magrittr         1.5     2014-11-22 [1] CRAN (R 3.6.2)
#>  memoise          1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
#>  microbenchmark * 1.4-7   2019-09-24 [1] CRAN (R 3.6.2)
#>  pillar           1.4.4   2020-05-05 [1] CRAN (R 3.6.3)
#>  pkgbuild         1.0.8   2020-05-07 [1] CRAN (R 3.6.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 3.6.2)
#>  pkgload          1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
#>  prettyunits      1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
#>  processx         3.4.1   2019-07-18 [1] CRAN (R 3.6.2)
#>  ps               1.3.0   2018-12-21 [1] CRAN (R 3.6.2)
#>  purrr            0.3.3   2019-10-18 [1] CRAN (R 3.6.2)
#>  R6               2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
#>  Rcpp             1.0.3   2019-11-08 [1] CRAN (R 3.6.2)
#>  remotes          2.1.1   2020-02-15 [1] CRAN (R 3.6.3)
#>  rlang            0.4.6   2020-05-02 [1] CRAN (R 3.6.3)
#>  rmarkdown        2.1     2020-01-20 [1] CRAN (R 3.6.3)
#>  rprojroot        1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
#>  sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
#>  stringi          1.4.4   2020-01-09 [1] CRAN (R 3.6.2)
#>  stringr          1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
#>  testthat         2.3.2   2020-03-02 [1] CRAN (R 3.6.3)
#>  tibble           3.0.1   2020-04-20 [1] CRAN (R 3.6.3)
#>  tidyselect       1.1.0   2020-05-11 [1] CRAN (R 3.6.3)
#>  usethis          1.6.0   2020-04-09 [1] CRAN (R 3.6.3)
#>  vctrs            0.3.1   2020-06-05 [1] CRAN (R 3.6.3)
#>  withr            2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
#>  xfun             0.12    2020-01-13 [1] CRAN (R 3.6.2)
#>  yaml             2.2.1   2020-02-01 [1] CRAN (R 3.6.2)
#> 
#> [1] C:/Program Files/R/library

PS: Not sure if this is already adressed by #4538.

@jangorecki jangorecki self-assigned this Oct 2, 2020
@matthiasgomolka
Copy link
Author

Interestingly, this issue does not occur on my Linux machine (see below). I'll investigate further.

library(data.table)
library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)

dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))

microbenchmark(
  data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>  data.table 34.59378 34.59378 34.59378 34.59378 34.59378 34.59378     1
#>       dplyr 34.98623 34.98623 34.98623 34.98623 34.98623 34.98623     1

setkey(dfa_char, id)
setkey(dfb_char, id)

microbenchmark(
  data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq       max neval
#>  data.table  3.570949  3.570949  3.570949  3.570949  3.570949  3.570949     1
#>       dplyr 14.952365 14.952365 14.952365 14.952365 14.952365 14.952365     1

dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))

microbenchmark(
  data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
  dplyr = inner_join(dfa_int, dfb_int, by = "id"),
  times = 1L
)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>  data.table 2.410635 2.410635 2.410635 2.410635 2.410635 2.410635     1
#>       dplyr 8.584089 8.584089 8.584089 8.584089 8.584089 8.584089     1

Created on 2020-10-02 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       elementary OS 5.1.7 Hera    
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language de_DE                       
#>  collate  de_DE.UTF-8                 
#>  ctype    de_DE.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2020-10-02                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version date       lib source        
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  backports        1.1.10  2020-09-15 [1] CRAN (R 4.0.2)
#>  callr            3.4.4   2020-09-07 [1] CRAN (R 4.0.2)
#>  cli              2.0.2   2020-02-28 [1] CRAN (R 4.0.2)
#>  crayon           1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
#>  data.table     * 1.13.0  2020-07-24 [1] CRAN (R 4.0.2)
#>  desc             1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
#>  devtools         2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
#>  digest           0.6.25  2020-02-23 [1] CRAN (R 4.0.2)
#>  dplyr          * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
#>  ellipsis         0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate         0.14    2019-05-28 [1] CRAN (R 4.0.2)
#>  fansi            0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
#>  fs               1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics         0.0.2   2018-11-29 [1] CRAN (R 4.0.2)
#>  glue             1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr            0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools        0.5.0   2020-06-16 [1] CRAN (R 4.0.2)
#>  knitr            1.30    2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle        0.2.0   2020-03-06 [1] CRAN (R 4.0.2)
#>  magrittr         1.5     2014-11-22 [1] CRAN (R 4.0.2)
#>  memoise          1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
#>  microbenchmark * 1.4-7   2019-09-24 [1] CRAN (R 4.0.2)
#>  pillar           1.4.6   2020-07-10 [1] CRAN (R 4.0.2)
#>  pkgbuild         1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  pkgload          1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
#>  prettyunits      1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
#>  processx         3.4.4   2020-09-03 [1] CRAN (R 4.0.2)
#>  ps               1.3.4   2020-08-11 [1] CRAN (R 4.0.2)
#>  purrr            0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R6               2.4.1   2019-11-12 [1] CRAN (R 4.0.2)
#>  remotes          2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang            0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
#>  rmarkdown        2.3     2020-06-18 [1] CRAN (R 4.0.2)
#>  rprojroot        1.3-2   2018-01-03 [1] CRAN (R 4.0.2)
#>  sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi          1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr          1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  testthat         2.3.2   2020-03-02 [1] CRAN (R 4.0.2)
#>  tibble           3.0.3   2020-07-10 [1] CRAN (R 4.0.2)
#>  tidyselect       1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
#>  usethis          1.6.3   2020-09-17 [1] CRAN (R 4.0.2)
#>  vctrs            0.3.4   2020-08-29 [1] CRAN (R 4.0.2)
#>  withr            2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun             0.18    2020-09-29 [1] CRAN (R 4.0.2)
#>  yaml             2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#> 
#> [1] /home/matthias/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

@jangorecki
Copy link
Member

Could you use times=2? It may be caused by #2912 which on Windows can be even more severe.

@jangorecki
Copy link
Member

Also note that merge.data.table just wraps around [ so for benchmarking the latter should be preferred.

dfa_char[dfb_char, on = "id", nomatch = NULL]

@matthiasgomolka
Copy link
Author

Reinstalling all packages did not have an effect on Windows. Is this related to the old R version (3.6.1)?

@jangorecki
Copy link
Member

Reinstalling won't make difference.
Did you try times=2L, does the timings differ?

Also note that there is an open issue that your issue would probably be a duplicate of #3928

@matthiasgomolka
Copy link
Author

matthiasgomolka commented Oct 2, 2020

Sorry, somehow missed your first comment before. Here are the benchmarks with times = 2:

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2
n = 2L # run benchmarks twice

dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))

microbenchmark(
  data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = n
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq       max neval
#>  data.table 1698.4893 1698.4893 1758.1653 1758.1653 1817.8414 1817.8414     2
#>       dplyr   41.1121   41.1121   44.3698   44.3698   47.6275   47.6275     2

Created on 2020-10-02 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2020-10-02                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package        * version date       lib source        
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
#>  backports        1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>  callr            3.4.3   2020-03-28 [1] CRAN (R 3.6.3)
#>  cli              2.0.2   2020-02-28 [1] CRAN (R 3.6.3)
#>  crayon           1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
#>  data.table     * 1.13.0  2020-07-24 [1] CRAN (R 3.6.3)
#>  desc             1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
#>  devtools         2.3.0   2020-04-10 [1] CRAN (R 3.6.3)
#>  digest           0.6.25  2020-02-23 [1] CRAN (R 3.6.3)
#>  dplyr          * 1.0.0   2020-05-29 [1] CRAN (R 3.6.3)
#>  ellipsis         0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
#>  evaluate         0.14    2019-05-28 [1] CRAN (R 3.6.2)
#>  fansi            0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
#>  fs               1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
#>  generics         0.0.2   2018-11-29 [1] CRAN (R 3.6.2)
#>  glue             1.4.1   2020-05-13 [1] CRAN (R 3.6.3)
#>  highr            0.8     2019-03-20 [1] CRAN (R 3.6.2)
#>  htmltools        0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
#>  knitr            1.28    2020-02-06 [1] CRAN (R 3.6.3)
#>  lifecycle        0.2.0   2020-03-06 [1] CRAN (R 3.6.3)
#>  magrittr         1.5     2014-11-22 [1] CRAN (R 3.6.2)
#>  memoise          1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
#>  microbenchmark * 1.4-7   2019-09-24 [1] CRAN (R 3.6.2)
#>  pillar           1.4.4   2020-05-05 [1] CRAN (R 3.6.3)
#>  pkgbuild         1.0.8   2020-05-07 [1] CRAN (R 3.6.3)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 3.6.2)
#>  pkgload          1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
#>  prettyunits      1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
#>  processx         3.4.1   2019-07-18 [1] CRAN (R 3.6.2)
#>  ps               1.3.0   2018-12-21 [1] CRAN (R 3.6.2)
#>  purrr            0.3.3   2019-10-18 [1] CRAN (R 3.6.2)
#>  R6               2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
#>  Rcpp             1.0.5   2020-07-06 [1] CRAN (R 3.6.3)
#>  remotes          2.1.1   2020-02-15 [1] CRAN (R 3.6.3)
#>  rlang            0.4.7   2020-07-09 [1] CRAN (R 3.6.3)
#>  rmarkdown        2.1     2020-01-20 [1] CRAN (R 3.6.3)
#>  rprojroot        1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
#>  sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
#>  stringi          1.4.6   2020-02-17 [1] CRAN (R 3.6.2)
#>  stringr          1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
#>  testthat         2.3.2   2020-03-02 [1] CRAN (R 3.6.3)
#>  tibble           3.0.1   2020-04-20 [1] CRAN (R 3.6.3)
#>  tidyselect       1.1.0   2020-05-11 [1] CRAN (R 3.6.3)
#>  usethis          1.6.0   2020-04-09 [1] CRAN (R 3.6.3)
#>  vctrs            0.3.1   2020-06-05 [1] CRAN (R 3.6.3)
#>  withr            2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
#>  xfun             0.12    2020-01-13 [1] CRAN (R 3.6.2)
#>  yaml             2.2.1   2020-02-01 [1] CRAN (R 3.6.2)
#> 
#> [1] C:/Program Files/R/library

I reinstalled the packages because I once had an issue with packages compiled for another R version...

@ColeMiller1
Copy link
Contributor

I have Windows 10 / data.table 1.13.0 / R 4.0.2. dplyr is somewhat faster for character fields on my machine but not the large difference that @plebejer is experiencing.

library(data.table)
library(dplyr)
library(microbenchmark)

n = 10L # run benchmarks twice

dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))

microbenchmark(
  data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = n
)
#> Unit: milliseconds
#>        expr     min      lq     mean   median      uq     max neval
#>  data.table 55.1987 59.1335 64.47342 60.95335 63.0049 86.6976    10
#>       dplyr 36.0711 36.8935 43.36965 42.81990 48.2768 54.2212    10

I also tried it with varying threads between 1-8. There were no large effects on timings.

@shrektan
Copy link
Member

shrektan commented Oct 4, 2020

According to my test:

  1. On R3.6.1 / Win10 / data.table1.13.0, @plebejer 's example does show that dplyr is about 5 to 10 times faster than data.table.
  2. It's irrelevant to the number of threads used.
  3. It is not reproducible on R4.0.1. On R4.0.1, dplyr and data.table have close speed with the example code.

@George9000
Copy link

Using setkey on both dfa and dfb changes the dynamic and makes data.table about 3x faster than dplyr on macOS.

library(data.table)
library(dplyr)
library(microbenchmark)
n = 2L
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))

microbenchmark(
  data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = n
)
# Unit: milliseconds
#        expr      min       lq     mean   median       uq      max neval
#  data.table 36.99652 36.99652 42.68058 42.68058 48.36465 48.36465     2
#       dplyr 16.11978 16.11978 29.98812 29.98812 43.85645 43.85645     2

setkey(dfa_char, id)
setkey(dfb_char, id)

microbenchmark(
  data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
  dplyr = inner_join(dfa_char, dfb_char, by = "id"),
  times = n
)

# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#  data.table  5.618789  5.618789  6.102372  6.102372  6.585954  6.585954     2
#       dplyr 15.181681 15.181681 17.706647 17.706647 20.231614 20.231614     2

sessionInfo

R version 4.0.2 Patched (2020-09-21 r79235)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7 dplyr_1.0.0          data.table_1.13.1   
[4] pacman_0.5.1        

loaded via a namespace (and not attached):
 [1] crayon_1.3.4     R6_2.4.1         lifecycle_0.2.0  magrittr_1.5    
 [5] pillar_1.4.6     rlang_0.4.7      vctrs_0.3.2      generics_0.0.2  
 [9] ellipsis_0.3.1   glue_1.4.1       purrr_0.3.4      compiler_4.0.2  
[13] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3    

@jangorecki
Copy link
Member

It is because we don't have to find order of a character column anymore. Issue I linked describes the problem. I might have found a root cause of it already and shared that to Matt over slack. If we will not have it resolved for the release that is coming now, then I hope the one after will have this fix.

@jangorecki
Copy link
Member

I described root cause in linked issue. While I think this issue is a duplicate of the other one, it is more safe to keep it open to ensure it will be resolved as well.

@jangorecki
Copy link
Member

Also interesting finding by @shrektan is that R must have improved character handling on windows in R 4.0.

@jangorecki jangorecki added this to the 1.13.3 milestone Oct 30, 2020
@jangorecki jangorecki removed their assignment Oct 30, 2020
@mattdowle mattdowle modified the milestones: 1.13.3, 1.13.5 Dec 1, 2020
@jangorecki jangorecki added the High label Dec 4, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico MichaelChirico added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Nov 27, 2024
@MichaelChirico
Copy link
Member

I will close this in favor of #3928. If you have an example that's still biting on current {data.table} and R >= 4.4.0, please file a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
High joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins performance regression
Projects
None yet
Development

No branches or pull requests

7 participants