Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fast droplevels.data.table #5116

Merged
merged 14 commits into from
Aug 30, 2021
Merged

fast droplevels.data.table #5116

merged 14 commits into from
Aug 30, 2021

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Aug 27, 2021

Closes #647

  • fdroplevels for factors
  • S3method for data.table (based on Jan's initial proposal but with exclude argument as data.frame)
  • manpage
  • tests
  • NEWS

@ben-schwen ben-schwen marked this pull request as draft August 27, 2021 12:50
@ben-schwen ben-schwen marked this pull request as ready for review August 27, 2021 22:27
@ben-schwen
Copy link
Member Author

Benchmarks comparing to base and as_factor

as_factor = data.table:::as_factor

N=1e5
x=as_factor(paste0("id",1:N))
y1 = sample(x, N/10)
c1 = as.character(y1)

microbenchmark(times = 10L,
  droplevels(y1),
  factor(y1),
  as_factor(c1),
  fdroplevels(y1)
)
N=1e4
Unit: microseconds
            expr      min       lq      mean    median       uq      max neval
  droplevels(y1) 2585.601 2589.071 2731.8154 2611.5795 3024.333 3034.324    10
      factor(y1) 2564.567 2581.956 2678.3417 2601.1030 2625.355 3025.394    10
   as_factor(c1)  237.806  241.225  251.0525  249.1410  260.717  266.856    10
 fdroplevels(y1)  249.849  255.186  271.8364  262.0545  268.260  377.832    10
N=1e5
Unit: milliseconds
            expr       min        lq      mean    median        uq       max
  droplevels(y1) 37.289807 37.637694 38.418891 38.232925 39.325983 39.985931
      factor(y1) 37.144093 37.456095 37.926517 37.754123 38.112053 39.092934
   as_factor(c1)  3.908248  3.958273  4.117178  4.133674  4.230654  4.282865
 fdroplevels(y1)  3.133957  3.172253  3.270275  3.228315  3.263345  3.572649
N=1e6
Unit: milliseconds
            expr       min        lq      mean    median        uq      max
  droplevels(y1) 501.56759 505.81500 549.33246 519.68250 590.62264 664.8970
      factor(y1) 492.99682 503.09449 555.90634 567.84369 576.47122 677.7081
   as_factor(c1)  80.93633  81.63352  89.66339  82.68304  85.85953 149.0377
 fdroplevels(y1)  29.65646  29.89897  47.06573  29.96462  30.36279 136.1561
N=1e7
Unit: milliseconds
            expr       min        lq      mean    median        uq        max
  droplevels(y1) 6381.0480 6983.8366 7326.4234 7346.7880 7553.3920  8262.7974
      factor(y1) 6507.0006 6853.9014 7494.4574 7231.9087 7532.3646 10000.5792
   as_factor(c1) 1142.4705 1144.2451 1147.8972 1145.8006 1151.7073  1157.0637
 fdroplevels(y1)  418.3393  446.8463  626.6891  708.7769  738.7044   884.4214

NAMESPACE Outdated
@@ -58,6 +58,8 @@ export(.Last.updated)
export(fcoalesce)
export(substitute2)
export(DT) # mtcars |> DT(i,j,by) #4872
export(fdroplevels)
export(droplevels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as droplevels is in base and generic, all we need, iiuc, is the S3method line below you've got already. Putting export(droplevels) here as well will generate a warning on loading the package that droplevels is masked? But I'm writing from memory so I might have that wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see and did not notice before because no warning was generated. According to Writing R Extensions putting export(droplevels) is not necessary but possible.

Example:

S3method(print, foo)
Since the generic print is defined in base it does not need to be imported explicitly.

I removed it now.

@mattdowle mattdowle added this to the 1.14.1 milestone Aug 28, 2021
@codecov
Copy link

codecov bot commented Aug 30, 2021

Codecov Report

Merging #5116 (50039e9) into master (5c6bee3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5116   +/-   ##
=======================================
  Coverage   99.38%   99.38%           
=======================================
  Files          76       77    +1     
  Lines       14489    14506   +17     
=======================================
+ Hits        14400    14417   +17     
  Misses         89       89           
Impacted Files Coverage Δ
R/fdroplevels.r 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5c6bee3...50039e9. Read the comment docs.

@mattdowle mattdowle merged commit 42b492e into master Aug 30, 2021
@mattdowle mattdowle deleted the droplevels branch August 30, 2021 21:21
mattdowle added a commit that referenced this pull request Aug 30, 2021
mattdowle added a commit that referenced this pull request Aug 30, 2021
mattdowle added a commit that referenced this pull request Aug 30, 2021
@Kamgang-B
Copy link
Contributor

Kamgang-B commented Sep 13, 2021

Hi @mattdowle and @ben-schwen

Thank you for this new function.

  • There is something that I find confusing or misleading: the in.place argument.
    Because droplevels.data.table goal is to drop unused levels and this applies only to factors, in.place intuitively suggests to users to choose whether levels should be modified in place or not as it is also said in the documentation HERE

    If TRUE levels of factors of data.table are modified in-place.

    This is, for example, what came to my mind when I saw the in.place argument. But note that it is not what in.place actually does. After looking at the source code of droplevels.data.table, I realized that the intention was to specified whether the input data.table as a whole (and so all columns, whether they are factors or not) should be copied or not. So, in.place is not strictly related to factors but refers to the data.table input. If the intention is really to copy the whole original data.table, then I think that in.place should be named differently. A name like copy.x (since argument x refers to a data.table) sounds more intuitive and coherent with what actually happens when in.place is set TRUE.

  • An aspect of fdroplevels that, I think, should be modified is the class handling. The actual implementation does not handle factor subclasses properly: It drops subclasses; e.g. it drops the inherent ordering of ordered factors. But this is likely quite a simple issue because it can be solved by simply changing the last line before returning the output of fdroplevels function from setattr(ans, "class", "factor") to setattr(ans, "class", class(x)).

      options(datatable.print.class=TRUE)
    
      D1 = data.table(
        V1 = factor(c("a", "b"), c("a", "b", "c"), ordered=TRUE),
        V2 = factor(c("b", "a"), c("a", "b", "c")))
    
            V1     V2
         <ord> <fctr>
      1:     a      b
      2:     b      a
    
      data.table:::droplevels.data.table(D1)  # column V1 is no longer an ordered factor
            V1     V2
        <fctr> <fctr>
      1:     a      b
      2:     b      a
    

@ben-schwen
Copy link
Member Author

@Kamgang-B
Ty for your thoughts.

  • I agree that the class attribute should be handled like you suggest with setattr(ans, "class", class(x)), not primarily because of own subclassing of factors but because ordered factors and current implementation loses "ordered" factors at dropping levels.
  • Regarding argument names, I have no preference about whether it should be in.place or copy (I only slightly prefer copy over copy.x albeit the name ambiguity with the copy function). I also checked with previous DT functions and none of them has an in.place argument yet, but we have coerceAs with copy as argument. So the decision of the argument name should be in concordance with future plans.

@Kamgang-B
Copy link
Contributor

@ben-schwen thanks for your reply.
I think that copy is definitely better than in.place.

@jangorecki jangorecki added this to the 1.15.0 milestone Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R-Forge #1841] Implement a fast droplevels.data.table method.
4 participants