fast droplevels.data.table #5116

ben-schwen · 2021-08-27T12:50:10Z

Closes #647

…oplevels

ben-schwen · 2021-08-27T23:14:15Z

Benchmarks comparing to base and as_factor

as_factor = data.table:::as_factor

N=1e5
x=as_factor(paste0("id",1:N))
y1 = sample(x, N/10)
c1 = as.character(y1)

microbenchmark(times = 10L,
  droplevels(y1),
  factor(y1),
  as_factor(c1),
  fdroplevels(y1)
)

N=1e4
Unit: microseconds
            expr      min       lq      mean    median       uq      max neval
  droplevels(y1) 2585.601 2589.071 2731.8154 2611.5795 3024.333 3034.324    10
      factor(y1) 2564.567 2581.956 2678.3417 2601.1030 2625.355 3025.394    10
   as_factor(c1)  237.806  241.225  251.0525  249.1410  260.717  266.856    10
 fdroplevels(y1)  249.849  255.186  271.8364  262.0545  268.260  377.832    10

N=1e5
Unit: milliseconds
            expr       min        lq      mean    median        uq       max
  droplevels(y1) 37.289807 37.637694 38.418891 38.232925 39.325983 39.985931
      factor(y1) 37.144093 37.456095 37.926517 37.754123 38.112053 39.092934
   as_factor(c1)  3.908248  3.958273  4.117178  4.133674  4.230654  4.282865
 fdroplevels(y1)  3.133957  3.172253  3.270275  3.228315  3.263345  3.572649

N=1e6
Unit: milliseconds
            expr       min        lq      mean    median        uq      max
  droplevels(y1) 501.56759 505.81500 549.33246 519.68250 590.62264 664.8970
      factor(y1) 492.99682 503.09449 555.90634 567.84369 576.47122 677.7081
   as_factor(c1)  80.93633  81.63352  89.66339  82.68304  85.85953 149.0377
 fdroplevels(y1)  29.65646  29.89897  47.06573  29.96462  30.36279 136.1561

N=1e7
Unit: milliseconds
            expr       min        lq      mean    median        uq        max
  droplevels(y1) 6381.0480 6983.8366 7326.4234 7346.7880 7553.3920  8262.7974
      factor(y1) 6507.0006 6853.9014 7494.4574 7231.9087 7532.3646 10000.5792
   as_factor(c1) 1142.4705 1144.2451 1147.8972 1145.8006 1151.7073  1157.0637
 fdroplevels(y1)  418.3393  446.8463  626.6891  708.7769  738.7044   884.4214

mattdowle · 2021-08-28T00:56:45Z

NAMESPACE

@@ -58,6 +58,8 @@ export(.Last.updated)
 export(fcoalesce)
 export(substitute2)
 export(DT)  # mtcars |> DT(i,j,by)  #4872
+export(fdroplevels)
+export(droplevels)


as droplevels is in base and generic, all we need, iiuc, is the S3method line below you've got already. Putting export(droplevels) here as well will generate a warning on loading the package that droplevels is masked? But I'm writing from memory so I might have that wrong.

I see and did not notice before because no warning was generated. According to Writing R Extensions putting export(droplevels) is not necessary but possible.

Example:

S3method(print, foo)
Since the generic print is defined in base it does not need to be imported explicitly.

I removed it now.

codecov · 2021-08-30T21:07:42Z

Codecov Report

Merging #5116 (50039e9) into master (5c6bee3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5116   +/-   ##
=======================================
  Coverage   99.38%   99.38%           
=======================================
  Files          76       77    +1     
  Lines       14489    14506   +17     
=======================================
+ Hits        14400    14417   +17     
  Misses         89       89

Impacted Files	Coverage Δ
R/fdroplevels.r	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5c6bee3...50039e9. Read the comment docs.

Kamgang-B · 2021-09-13T15:13:47Z

Hi @mattdowle and @ben-schwen

Thank you for this new function.

There is something that I find confusing or misleading: the in.place argument.
Because droplevels.data.table goal is to drop unused levels and this applies only to factors, in.place intuitively suggests to users to choose whether levels should be modified in place or not as it is also said in the documentation HERE

If TRUE levels of factors of data.table are modified in-place.

This is, for example, what came to my mind when I saw the in.place argument. But note that it is not what in.place actually does. After looking at the source code of droplevels.data.table, I realized that the intention was to specified whether the input data.table as a whole (and so all columns, whether they are factors or not) should be copied or not. So, in.place is not strictly related to factors but refers to the data.table input. If the intention is really to copy the whole original data.table, then I think that in.place should be named differently. A name like copy.x (since argument x refers to a data.table) sounds more intuitive and coherent with what actually happens when in.place is set TRUE.
An aspect of fdroplevels that, I think, should be modified is the class handling. The actual implementation does not handle factor subclasses properly: It drops subclasses; e.g. it drops the inherent ordering of ordered factors. But this is likely quite a simple issue because it can be solved by simply changing the last line before returning the output of fdroplevels function from setattr(ans, "class", "factor") to setattr(ans, "class", class(x)).
```
  options(datatable.print.class=TRUE)

  D1 = data.table(
    V1 = factor(c("a", "b"), c("a", "b", "c"), ordered=TRUE),
    V2 = factor(c("b", "a"), c("a", "b", "c")))

        V1     V2
     <ord> <fctr>
  1:     a      b
  2:     b      a

  data.table:::droplevels.data.table(D1)  # column V1 is no longer an ordered factor
        V1     V2
    <fctr> <fctr>
  1:     a      b
  2:     b      a
```

ben-schwen · 2021-09-20T12:15:59Z

@Kamgang-B
Ty for your thoughts.

I agree that the class attribute should be handled like you suggest with setattr(ans, "class", class(x)), not primarily because of own subclassing of factors but because ordered factors and current implementation loses "ordered" factors at dropping levels.
Regarding argument names, I have no preference about whether it should be in.place or copy (I only slightly prefer copy over copy.x albeit the name ambiguity with the copy function). I also checked with previous DT functions and none of them has an in.place argument yet, but we have coerceAs with copy as argument. So the decision of the argument name should be in concordance with future plans.

Kamgang-B · 2021-09-20T13:55:56Z

@ben-schwen thanks for your reply.
I think that copy is definitely better than in.place.

init droplevels

0ec79c1

ben-schwen marked this pull request as draft August 27, 2021 12:50

ben-schwen added 9 commits August 27, 2021 14:53

move droplevels to own file

ca1ad65

move droplevels to own file

78d454b

Merge branch 'droplevels' of github.com:Rdatatable/data.table into dr…

83b4a61

…oplevels

added doc

027ffe8

run test

64712ca

renumbered tests

8b70582

added tests

3e641a4

added news

7c886d3

added tests

db93ba1

ben-schwen marked this pull request as ready for review August 27, 2021 22:27

added test empty factor

10519cd

mattdowle reviewed Aug 28, 2021

View reviewed changes

mattdowle added this to the 1.14.1 milestone Aug 28, 2021

ben-schwen and others added 3 commits August 28, 2021 12:23

fixed namespace

a49cf80

added issues reference to tests

d3de56d

merge master and tweak news item

50039e9

mattdowle merged commit 42b492e into master Aug 30, 2021

mattdowle deleted the droplevels branch August 30, 2021 21:21

mattdowle added a commit that referenced this pull request Aug 30, 2021

#5116: pass R 3.1.0

1a6bcdc

mattdowle added a commit that referenced this pull request Aug 30, 2021

#5116: .r to .R

891fba9

mattdowle added a commit that referenced this pull request Aug 30, 2021

#5116: .rd to .Rd

0192172

ben-schwen mentioned this pull request Sep 29, 2021

droplevels for empty table and ordered factors #5185

Merged

jangorecki removed this from the 1.14.9 milestone Oct 29, 2023

jangorecki added this to the 1.15.0 milestone Oct 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fast droplevels.data.table #5116

fast droplevels.data.table #5116

ben-schwen commented Aug 27, 2021 •

edited

Loading

ben-schwen commented Aug 27, 2021

mattdowle Aug 28, 2021

ben-schwen Aug 28, 2021

codecov bot commented Aug 30, 2021

Kamgang-B commented Sep 13, 2021 •

edited

Loading

ben-schwen commented Sep 20, 2021

Kamgang-B commented Sep 20, 2021

fast droplevels.data.table #5116

fast droplevels.data.table #5116

Conversation

ben-schwen commented Aug 27, 2021 • edited Loading

ben-schwen commented Aug 27, 2021

mattdowle Aug 28, 2021

Choose a reason for hiding this comment

ben-schwen Aug 28, 2021

Choose a reason for hiding this comment

codecov bot commented Aug 30, 2021

Codecov Report

Kamgang-B commented Sep 13, 2021 • edited Loading

ben-schwen commented Sep 20, 2021

Kamgang-B commented Sep 20, 2021

ben-schwen commented Aug 27, 2021 •

edited

Loading

Kamgang-B commented Sep 13, 2021 •

edited

Loading