Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement frev as fast base::rev alternative #5907

Open
wants to merge 73 commits into
base: master
Choose a base branch
from
Open

implement frev as fast base::rev alternative #5907

wants to merge 73 commits into from

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Jan 13, 2024

Closes #5885

  • implement
  • test
  • docs
  • atime benchmark looks great!

@ben-schwen
Copy link
Member Author

The goal is to export right? Should we also support data.frame and data.table? Matrix?

Copy link

codecov bot commented Jan 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.51%. Comparing base (6f3fc8d) to head (7d6aea9).
Report is 49 commits behind head on master.

Current head 7d6aea9 differs from pull request most recent head cabddd2

Please upload reports for the commit cabddd2 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5907   +/-   ##
=======================================
  Coverage   97.51%   97.51%           
=======================================
  Files          80       80           
  Lines       14979    14957   -22     
=======================================
- Hits        14607    14586   -21     
+ Misses        372      371    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ben-schwen

This comment was marked as resolved.

@jangorecki

This comment was marked as resolved.

MichaelChirico and others added 3 commits January 14, 2024 15:12
* Fix 5492 by limiting the costly deparse to `nlines=1`

* Implementing PR feedbacks

* Added  inside

* Fix typo in name

* Idiomatic use of  inside

* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
* Added my improvements to the intro vignette

* Removed two lines I added extra as a mistake earlier

* Requested changes
# attributes
x = structure(1:10, class = c("IDate", "Date"), att = 1L)
test(2249.51, attr(frev(x, copy=TRUE), "att"), 1L)
test(2249.52, attr(frev(x, copy=FALSE), "att"), 1L)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will break e.g. the gregexpr() output, where the attributes are created to match the actual object order. Nothing we can do here besides document the potential issue, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but this problem also appears with base::rev?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's just document that we copy base::rev behavior for attributes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but rev() drops attributes?

names(attributes(rev(structure(1:10, class = c("IDate", "Date"), att = 1L))))
# [1] "class"

names(attributes(rev(regexpr("[a-m]", letters))))
# NULL

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot, I just checked with names and class but apparently those are special attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added details in the docs about attributes being retained.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rather match base::rev() behavior and strip (non-"special") attributes. As mentioned this can have subtly broken implications -- user wanting to keep attributes around should be responsible for how to do so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm on the same side here. Retaining attributes with all operations on a data.table is an often mentioned feature, users would like.

I know that it's not user-facing now, but what would be the behavior for setrev for stripping attributes? Strip them as a side effect or keep them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We drop attributes now on frev. For setrev we still keep them but can change if wanted.

For the special case of zero length vectors, attributes are still copied (rev behaves here the same).

@ben-schwen
Copy link
Member Author

One problem I see with this new export: it's the first function from data.table with a copy= argument. This violates our usual pattern of by-reference functionality being restricted to set* functions.

AFAIR I copied the argument name from coerceAs = function(x, as, copy=TRUE). The only other case that comes to my mind immediately is droplevels.data.table = function(x, except = NULL, exclude, in.place = FALSE, ...) which has an in.place argument.

The pattern chosen for nafill() was different: nafill() always copies, and by-reference behavior was exposed via setnafill(). I think by-reference behavior causes enough headaches for users that the paradigm "only in set* functions" is useful, especially for code reviewers who may be less familiar with data.table.

So what about frev and setrev similar to fcoalesce and setcoalesce?

Should we do the same here? frev() always copies; setrev() is by reference. It's also a bit out-of-pattern since set* functions also accept data.tables (frev() doesn't, as of now).

That's exactly the reason copy=TRUE is the default. Users being smart enough to use an argument with a default parameter should be smart enough to understand the argument.

@MichaelChirico
Copy link
Member

That's exactly the reason copy=TRUE is the default. Users being smart enough to use an argument with a default parameter should be smart enough to understand the argument.

Possibly, but it's easy to think "faster --> unequivocally better" and use it somewhat carelessly / forgetting the full implications of the choice. And that's just on the part of the code author. We also need to consider code reviewers and future readers/maintainers.

It's with that (and more) in mind that we really want to limit the API surface of in-place operations of data.table IMO. Anxiety about getting bitten by mutability issues is quite a common reason to avoid touching {data.table}; this problem gets worse if readers/writers of code using {data.table} need to be wary of reference semantics in "new"/"unexpected" places. By sticking to a simple rule: "By-reference updates can only happen when you see := or a set* function", we can ease this burden as much as possible.

I missed droplevels(in.place=TRUE) -- wish I'd given the same feedback there :)

So what about frev and setrev similar to fcoalesce and setcoalesce?

Good catch, I'd forgotten about setcoalesce(). I believe @jangorecki originally requested that. To clarify, you means not exposing a copy=FALSE version to users, and only using that ourselves internally, but both are just thin wrappers to the C routine with a different flag passed? If so then yes, that looks like a good direction to me.

fcoalesce = function(...) .Call(Ccoalesce, list(...), FALSE)
setcoalesce = function(...) .Call(Ccoalesce, list(...), TRUE)

@jangorecki
Copy link
Member

jangorecki commented Mar 19, 2024

Let's add setdroplevels then. We could have dot prefixed non exported function that had copy art and then exported ones that goes via this .droplevels

@TysonStanley
Copy link
Member

Totally agree, @MichaelChirico. Having a clear set of expectations for mutability is going to make adoption for new users much easier.

@jangorecki
Copy link
Member

I think calling by reference operation on a vector alone is more likely to cause troubles, that possibly could be mentioned with some example, like changing R constant value

@MichaelChirico
Copy link
Member

I think calling by reference operation on a vector alone is more likely to cause troubles, that possibly could be mentioned with some example, like changing R constant value

Note that in the current PR, there's no user-facing way to call setrev(), I think that's for the best.

Copy link

github-actions bot commented May 18, 2024

Comparison Plot

Generated via commit cabddd2

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 12 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 17 seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement frev() - base::rev() that allocates less