-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merging by character columns is very slow #4733
Comments
Interestingly, this issue does not occur on my Linux machine (see below). I'll investigate further. library(data.table)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 34.59378 34.59378 34.59378 34.59378 34.59378 34.59378 1
#> dplyr 34.98623 34.98623 34.98623 34.98623 34.98623 34.98623 1
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 3.570949 3.570949 3.570949 3.570949 3.570949 3.570949 1
#> dplyr 14.952365 14.952365 14.952365 14.952365 14.952365 14.952365 1
dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
dplyr = inner_join(dfa_int, dfb_int, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 2.410635 2.410635 2.410635 2.410635 2.410635 2.410635 1
#> dplyr 8.584089 8.584089 8.584089 8.584089 8.584089 8.584089 1 Created on 2020-10-02 by the reprex package (v0.3.0) Session infodevtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os elementary OS 5.1.7 Hera
#> system x86_64, linux-gnu
#> ui X11
#> language de_DE
#> collate de_DE.UTF-8
#> ctype de_DE.UTF-8
#> tz Europe/Berlin
#> date 2020-10-02
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> data.table * 1.13.0 2020-07-24 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> microbenchmark * 1.4-7 2019-09-24 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#> xfun 0.18 2020-09-29 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/matthias/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library |
Could you use |
Also note that dfa_char[dfb_char, on = "id", nomatch = NULL] |
Reinstalling all packages did not have an effect on Windows. Is this related to the old R version (3.6.1)? |
Reinstalling won't make difference. Also note that there is an open issue that your issue would probably be a duplicate of #3928 |
Sorry, somehow missed your first comment before. Here are the benchmarks with library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2
n = 2L # run benchmarks twice
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 1698.4893 1698.4893 1758.1653 1758.1653 1817.8414 1817.8414 2
#> dplyr 41.1121 41.1121 44.3698 44.3698 47.6275 47.6275 2 Created on 2020-10-02 by the reprex package (v0.3.0) Session infodevtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2020-10-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
#> data.table * 1.13.0 2020-07-24 [1] CRAN (R 3.6.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.2)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.2)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.2)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.2)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.2)
#> microbenchmark * 1.4-7 2019-09-24 [1] CRAN (R 3.6.2)
#> pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.2)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.2)
#> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 3.6.3)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.3)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 3.6.3)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> usethis 1.6.0 2020-04-09 [1] CRAN (R 3.6.3)
#> vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.2)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] C:/Program Files/R/library I reinstalled the packages because I once had an issue with packages compiled for another R version... |
I have Windows 10 / data.table 1.13.0 / R 4.0.2. library(data.table)
library(dplyr)
library(microbenchmark)
n = 10L # run benchmarks twice
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 55.1987 59.1335 64.47342 60.95335 63.0049 86.6976 10
#> dplyr 36.0711 36.8935 43.36965 42.81990 48.2768 54.2212 10 I also tried it with varying threads between 1-8. There were no large effects on timings. |
According to my test:
|
Using
|
It is because we don't have to find order of a character column anymore. Issue I linked describes the problem. I might have found a root cause of it already and shared that to Matt over slack. If we will not have it resolved for the release that is coming now, then I hope the one after will have this fix. |
I described root cause in linked issue. While I think this issue is a duplicate of the other one, it is more safe to keep it open to ensure it will be resolved as well. |
Also interesting finding by @shrektan is that R must have improved character handling on windows in R 4.0. |
I will close this in favor of #3928. If you have an example that's still biting on current {data.table} and R >= 4.4.0, please file a new issue. |
I noticed that merging data.tables on a
character
column is surprisingly slow (compared todplyr
). See the reprex below.Am I missing something obvious or is there room for improvement? Don't get me wrong: I love
data.table
and use it daily, but I was very surprised that this kind of merge is so comparatively slow.Minimal reproducible example
Created on 2020-10-02 by the reprex package (v0.3.0)
Output of
sessionInfo()
PS: Not sure if this is already adressed by #4538.
The text was updated successfully, but these errors were encountered: