Inconsistent semantics after setDT #4783
One way out of this would be to disable the modify-in-place optimization. To us the little extra memory consumption is certainly worth the gain in consistency.
@MichaelChirico Thank you. I'm well aware of the alternatives - but I'm not doing exploratory interactive work. I work in a team of R developers with a large R codebase riddled with data.tables and setDT, and still hope this fundamental bug can be solved and not worked around at the user side. |
@OfekShilon Note also that Note 2: Maybe you shall reformulate your issue as a feature request instead of a "bug report". |
@tdeenes I know how to work around this behaviour, with the advice above and in other ways. This does not make this reported behaviour not a bug. Is the expectation for consistency really a 'feature request'? Both these behaviours seem reasonable design choices - and usually are. Specifically, when the data.table was created either by
Not sure what made you say that I want to rely on undocumented and unexported implementation details. I don't. My examples were entirely public and very basic (even fundamental) data.table interfaces, and the results are inconsistent. |
This is the crucial point I guess; my interpretation of the documentation is that you shall not ask these questions when using
If you think your use case does not fit into 1) or 2), please provide us with a minimal reproducible example. Note that your current example belongs to 2) unless you wanted to keep |
Sorry, I failed to follow the link which points to the original issue (#4589) in which @mattdowle and others gave you a pretty exhaustive explanation for the proper use of |
The code example is of course simplified, but lots of very real use cases exist. A prominent one is using setDT inside a function - in that discussion a data.table maintainer (Arun) expressed the will to have such cases resolved. |
I think I just hit a very related point, with a similar use case here? #4816 (comment)
We keep getting bit by this. Perhaps the original example (
I think the solution to this might just be to warn users (via ?setDT) with something along the lines of:
"Use of setDT inside of function definitions, especially on objects that were passed as arguments, may cause unpredictable modification of objects outside the scope of the function. When writing a function, we recommend either A) using as.data.table (not setDT) inside functions, which guarantees a side-effect-free ("pure") function, or B) writing functions that expect (and are documented as expecting) data.tables as inputs. This allows creating "functions" which are pass-by-reference (i.e., they avoid copying) but behave like procedures (in that there may be side effects)."
As a bit of an aside, I'll add that I have some experience with a third approach, in the package intervalaverage, where data.tables are explicitly required as inputs (and thus are passed by reference) but the function is carefully written to restore any changes on.exit(). This results in a pure function that benefits from pass-by-reference speed without any side effects (which are unpredictable to the typical R user who expects functions to behave like pure functions). The approach used in intervalaverage (pure functions using pass-by-reference under the hood) only makes sense if you want to return an entirely new table. If you want to modify the original table, approach B is "best" (although potentially unfamiliar to R users).
Related post I made nearly a decade ago on SO: https://stackoverflow.com/questions/13756178/writings-functions-procedures-for-data-table-objects
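To make the three approaches in this comment concrete, here is a minimal sketch. All function and column names (add_total_pure, add_total_by_ref, mean_total_restoring, total, tmp_total) are hypothetical, and the on.exit() variant is only an illustration of the idea, not the actual intervalaverage implementation.

```r
library(data.table)

# Approach A: convert with as.data.table() (not setDT), so the function
# works on its own table and the caller's object is left alone.
add_total_pure <- function(x) {
  dt <- as.data.table(x)
  dt[, total := a + b]
  dt[]
}

# Approach B: require a data.table and modify it by reference.
# The side effect on the caller's table is the documented purpose.
add_total_by_ref <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, total := a + b]
  invisible(dt)
}

# Third approach: make a temporary by-reference change and undo it on exit,
# so the caller's table ends up with its original columns.
mean_total_restoring <- function(dt) {
  stopifnot(is.data.table(dt))
  on.exit(if ("tmp_total" %in% names(dt)) dt[, tmp_total := NULL], add = TRUE)
  dt[, tmp_total := a + b]
  dt[, mean(tmp_total)]
}

df <- data.frame(a = 1:3, b = 4:6)
res_a <- add_total_pure(df)       # df keeps its two columns

dt_b <- data.table(a = 1:3, b = 4:6)
add_total_by_ref(dt_b)            # dt_b gains a 'total' column

dt_c <- data.table(a = 1:3, b = 4:6)
m <- mean_total_restoring(dt_c)   # tmp_total is removed again on exit
```

The trade-off is the one described above: approach A pays for a copy, approach B exposes the side effect to the caller, and the restoring variant needs care so that every temporary change is undone even when the function fails.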
@myoung3 this is pretty much what we try to do now. However, in an enterprise-size codebase like ours (~600K R LOC, in ~12 large in-house packages), if you can't transition to data.table gradually or use it to eliminate specific bottlenecks in a pipeline, it's very hard to use it at all. We tried various techniques and conventions, but this inconsistent data.table behavior is a major, major pain for us (and I suspect for others).
One technical solution might be for |
@OfekShilon with an R footprint that large, it would certainly make sense for some engineering effort to be "donated" to support us in fixing |
@MichaelChirico I can try - but do you guys now agree that this is a problem to be solved? Seems most of this thread doesn't.
I mean, will you merge such a PR?
@MichaelChirico If there are further comments, rejections, or suggested improvements, I'd be more than happy to discuss and probably apply them. Perhaps it was so long ago that merging is now painful? (I can re-apply the fixes and tests on current main, in a new PR.) Is there any other reason not to apply this fix? What more would it take to get some attention to it?
Hey @OfekShilon really do appreciate your efforts here. We are simply bottlenecked on reviewer time. Appreciate your patience 🙌 I see Jan added this to our 1.14.3 milestone -- I believe we are prioritizing a release in the near future, so that should mean this gets eyes soon. |
@jangorecki @MichaelChirico 1.14.3 flew by and again this PR+bugs are ignored. What can be done to get this to be discussed? |
Hi Ofek, 1.14.4 wound up being a patch release to stay on CRAN, almost no recent work (including merged PRs) was included |
@MichaelChirico can this be added maybe to 1.14.5?
(This is a cleanup and improvement of some of the #4589 discussion.)
Take this code:
Do modifications to d2 impact d1? We could live with both 'yes' or 'no', but the answer is sometimes:
In cases 1 & 2, d2 'plunks' the full columns into itself and d1 isn't affected. In cases 3 & 4 it seems that operation-in-place optimization kicks in (address(d2$b) is unchanged), so there is no copy-on-write and data still pointed to by d1 is overwritten.
These semantic discrepancies make (the otherwise great) setDT unusable to us except in the most trivial scripts.
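The two mechanisms named above (a whole column being replaced versus a column being updated in place) can be observed on a single standalone data.table with address(). This is only an illustrative sketch with made-up data; it does not reproduce the four cases of the original report, and it deliberately leaves out the d1/d2 aliasing after setDT.

```r
library(data.table)

dt <- data.table(a = c(1L, 2L, 3L), b = c(4L, 5L, 6L))
addr_before <- address(dt$b)

# Sub-assignment: only row 2 of b is written, so the existing column
# vector is updated in place and its address does not change.
dt[2L, b := 0L]
addr_after_sub <- address(dt$b)

# Whole-column assignment: a new vector replaces the column, so the
# column now lives at a different address.
dt[, b := c(10L, 11L, 12L)]
addr_after_full <- address(dt$b)

same_after_sub  <- identical(addr_before, addr_after_sub)   # in place
same_after_full <- identical(addr_before, addr_after_full)  # replaced
```

Any other object that still points at the old column vector would see the in-place write but not the whole-column replacement, which is the discrepancy this report is about.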
Output of sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default