
Inconsistent semantics after setDT #4783

Open
OfekShilon opened this issue Oct 27, 2020 · 20 comments · May be fixed by #4978

@OfekShilon (Contributor) commented Oct 27, 2020

(This is a cleanup and improvement of some of the #4589 discussion.)
Take this code:

> d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
> d2 <- d1
> setDT(d2)  # At this point d2 is a shallow copy of d1, pointing to the same columns

Do modifications to d2 impact d1? We could live with either 'yes' or 'no', but the answer is 'sometimes':

d2[, b:=3:7]         # (1) impacts only d2
d2[, c:=4:8]         # (2) impacts only d2
d2[!is.na(a), b:=5:9] # (3) impacts both
d2[, b:=30]          # (4) impacts both

In cases 1 & 2, d2 'plunks' whole new columns into itself, so d1 isn't affected. In cases 3 & 4 it seems that the modify-in-place optimization kicks in (address(d2$b) is unchanged), so there is no copy-on-write and the data still pointed to by d1 is overwritten.
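One way to see which case applies is to compare column addresses with data.table's exported address(); a minimal sketch, relying only on the behaviour shown above:

library(data.table)
d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- d1
setDT(d2)
address(d1$b) == address(d2$b)  # TRUE: after setDT the column is shared
d2[!is.na(a), b:=5:9]           # case (3): sub-assignment writes in place
address(d1$b) == address(d2$b)  # still TRUE, so d1$b now holds 5:9 as well
d2[, b:=3:7]                    # case (1): a whole new column is planted in d2
address(d1$b) == address(d2$b)  # FALSE: from here on the columns diverge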

These semantic discrepancies make (the otherwise great) setDT unusable to us except in the most trivial scripts.

# Output of sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default

> packageVersion("data.table")
[1] ‘1.13.0’
@OfekShilon (Contributor Author)

One way out of this would be to disable the modify-in-place optimization. To us, the small amount of extra memory is certainly worth the gain in consistency.
I think it's in everyone's best interest to disable it unconditionally, but if not, at least add an option to do so.

@MichaelChirico (Member)

as.data.table is guaranteed to copy, and of course there's copy. If you are trying to get around modify-in-place, either of these is fine:

d2 = copy(d1)
d2 = as.data.table(d1)
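For example, a minimal sketch of the as.data.table route:

library(data.table)
d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- as.data.table(d1)   # deep copy: d2 owns its own columns
d2[!is.na(a), b:=5:9]     # writes in place, but only into d2's private copy
d1$b                      # unchanged: 2 3 4 5 6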

@OfekShilon (Contributor Author)

@MichaelChirico Thank you. I'm well aware of the alternatives, but I'm not doing exploratory interactive work. I work in a team of R developers with a large R codebase riddled with data.tables and setDT, and I still hope this fundamental bug can be solved rather than worked around on the user side.

@tdeenes (Member) commented Oct 29, 2020

@OfekShilon
I think it is a bit harsh to describe this behaviour as a "fundamental bug". The setDT() manual states absolutely clearly that it modifies its input by reference. So what do you want to achieve in your example? If you want to modify d1 by reference, why do you need d2? If you want to modify d2 without modifying d1, you have to follow the advice above and use copy or as.data.table. It seems you want to rely on some consequences of (undocumented and unexported) implementation details. Not a good idea. If your codebase contains such setDT calls, you have to fix them instead of offending the authors of data.table.

Note also that data.table:::shallow() is not (yet) exported (see also #2323), probably not by accident.

Note 2: Maybe you should reformulate your issue as a feature request instead of a "bug report".

@OfekShilon (Contributor Author) commented Oct 29, 2020

@tdeenes I know how to work around this behaviour, with the advice above and in other ways. That does not make the reported behaviour any less of a bug. Is an expectation of consistency really a 'feature request'?
I'm afraid the usage of 'by reference' in the documentation and discussions on data.table is often ambiguous, and careful distinction is needed. It might mean two different things:
(1) A data table object (pointing to multiple column data) is not copy-on-write. Therefore, when multiple names are bound to the same data.table - every modification to the data.table via one name manifests in all the names.
(2) The column data is sometimes modified in place (cases 3 & 4 in the original report); perhaps this is also what you mean when you write 'by reference'. When no data.frames are involved, this is not a problem.
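To illustrate the two meanings with a small sketch (the table and column names are arbitrary):

library(data.table)
# (1) no copy-on-write between names bound to the same data.table:
x <- data.table(a=1:3)
y <- x                 # two names, one table
y[, b:=10L]            # adding a column through one name...
"b" %in% names(x)      # ...is visible through the other: TRUE
# (2) column data modified in place:
old <- address(x$a)
x[a > 1, a:=0L]        # sub-assignment overwrites the existing vector
old == address(x$a)    # TRUE: same memory, new values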

Both of these behaviours seem like reasonable design choices, and usually they are; specifically, when the data.table was created either by data.table(...) or by as.data.table. Where things go awry is when a data.table is generated by setDT.
The documentation for setDT does indeed state clearly that it operates by reference, but this report is about the subsequent operations on the resulting dt, and whether those are by reference or not.

dt <- df; setDT(dt) performs a shallow copy, so now dt and df are distinct objects pointing to the same columns. What does 'by reference' even mean in this context? Would you expect modifications to dt or to its columns (these are different things!) to manifest in df?
Whichever you choose, 'yes' or 'no', the current implementation will sometimes match your expectation and sometimes not. This cannot be accepted as 'by design'.

I'm not sure what made you say that I want to rely on undocumented and unexported implementation details. I don't. My examples use only public and very basic (even fundamental) data.table interfaces, and the results are inconsistent.

@tdeenes (Member) commented Oct 29, 2020

> dt <- df; setDT(dt) performs a shallow copy, so now dt and df are distinct objects pointing to the same columns. What does 'by reference' even mean in this context? Would you expect modifications to dt or to its columns (these are different things!) to manifest in df?

This is the crucial point, I guess; my interpretation of the documentation is that you should not ask these questions when using setDT on data.frames (or lists). Just accept that you cannot know which modifications of dt or its columns will manifest in df. I really cannot imagine a legitimate use case where one wants to use a data.frame while breaking the standard copy-on-write semantics of R. So you have only two options:

  1. You want to keep df as it is -> you have to protect it before calling setDT on it or on any object that keeps a reference to it.
  2. You do not care what happens to df and you just need a data.table -> you call setDT on it and continue with the return value of setDT.

setDT, :=, and the other set* functions give you great power, which comes with great responsibility. So when you use them, as you did in "the following operations on the resulting dt", you have to be sure that you have not left behind any list or data.frame objects which you want to re-use later and whose elements are shared with dt.

If you think your use case does not fit into 1) or 2), please provide us with a minimal reproducible example. Note that your current example belongs to 2) unless you wanted to keep df as it is; in the latter case it is an example of misusing setDT and :=.
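Concretely, the two options might look like this (a minimal sketch):

library(data.table)
df <- data.frame(a=1:5)
# option 1: protect df by taking a deep copy before any by-reference work
dt <- as.data.table(df)
dt[, a:=a*10L]
df$a                     # still 1 2 3 4 5
# option 2: you only need a data.table -> convert in place and keep using the result
dt2 <- setDT(df)
dt2[, b:=a+1L]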

@tdeenes (Member) commented Oct 29, 2020

Sorry, I failed to follow the link that points to the original issue (#4589), in which @mattdowle and others gave you a pretty exhaustive explanation of the proper use of setDT. On a quick check of your comments there and in this issue, it seems you want to get data.table features but keep your object as a data.frame. I still do not get the idea behind this, but I definitely find it inappropriate to refer to the current behaviour of setDT as a "fundamental bug" after the first author of the package gave you a clear description of why this is not a bug, and a major contributor of the package improved the documentation based on your original issue.

@OfekShilon (Contributor Author)

The code example is of course simplified, but plenty of very real use cases exist. A prominent one is using setDT inside a function; in that discussion a data.table maintainer (Arun) expressed a desire to have such cases resolved.

@MatthieuStigler

I think I just hit a very related point, with a similar use case here? #4816 (comment)

@OfekShilon (Contributor Author) commented Jan 14, 2021

We keep getting bitten by this. Perhaps the original example (d1 <- data.frame(a=1); d2 <- d1; setDT(d2)) seemed contrived? It is just an extra simplification of real life, where the issue often surfaces when trying to operate on function arguments:

> df <- data.frame(a=1:2)
> f <- function(x) {
+  setDT(x)
+  x[ , b := 5:6]   # doesn't leak to df
+  x[!is.na(a), a:=3:4]  # leaks to df :(
+  setDF(x)
+ }
>   
> f(df)
> df
   a
1: 3
2: 4
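For comparison, a version of f without the leak has to take a private copy up front (a sketch; f_safe is just an illustrative name):

f_safe <- function(x) {
  x <- as.data.table(x)   # deep copy: nothing below can reach the caller's df
  x[, b := 5:6]
  x[!is.na(a), a := 3:4]
  setDF(x)
  x
}
df <- data.frame(a=1:2)
res <- f_safe(df)
df$a                      # unchanged: 1 2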

@myoung3 (Contributor) commented Jan 28, 2021

I think the solution to this might just be to warn users (via ?setDT) with something along the lines of: "Use of setDT inside function definitions, especially on objects that were passed as arguments, may cause unpredictable modification of objects outside the scope of the function. When writing a function, we recommend either A) using as.data.table (not setDT) inside the function, which guarantees a side-effect-free ("pure") function, or B) writing functions that expect (and are documented as expecting) data.tables as inputs. The latter allows creating "functions" which are pass-by-reference (i.e., they avoid copying) but behave like procedures (in that there may be side effects)."
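Sketched out, the two recommendations might look like this (function and column names are purely illustrative):

library(data.table)
# A) pure function: take a private deep copy, no side effects on the caller
summarise_pure <- function(df) {
  dt <- as.data.table(df)
  dt[, .(mean_a = mean(a))]
}
# B) documented to require a data.table; modifies it by reference on purpose
add_flag <- function(dt) {
  stopifnot(is.data.table(dt))   # refuse data.frames rather than converting silently
  dt[, flag := a > 0]
  invisible(dt)
}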

As a bit of an aside, I'll add that I have some experience with a third approach, in the package intervalaverage, where data.tables are explicitly required as inputs (and thus are passed by reference) but the function is carefully written to restore any changes via on.exit(). This results in a pure function that benefits from pass-by-reference speed without any side effects (which are unpredictable to the typical R user, who expects functions to behave like pure functions). The approach used in intervalaverage (pure functions using pass-by-reference under the hood) only makes sense if you want to return an entirely new table. If you want to modify the original table, approach B is "best" (although potentially unfamiliar to R users).
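The restore-on-exit pattern is roughly the following (an illustrative sketch of the idea, not code taken from intervalaverage; the function and column names are assumed):

library(data.table)
order_with_index <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, tmp_row__ := .I]                          # temporary by-reference change...
  on.exit(dt[, tmp_row__ := NULL])               # ...undone whenever the function exits
  dt[order(a), .(a, original_row = tmp_row__)]   # return an entirely new table
}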

related post I made nearly a decade ago on SO: https://stackoverflow.com/questions/13756178/writings-functions-procedures-for-data-table-objects

@OfekShilon (Contributor Author)

@myoung3 this is pretty much what we try to do now. However, in an enterprise-size codebase like ours (~600K R LOC across ~12 large in-house packages), if you can't transition to data.table gradually or use it to eliminate specific bottlenecks in a pipeline, it's very hard to use it at all. We have tried various techniques and conventions, but this inconsistent data.table behavior is a major, major pain for us (and I suspect for others).
Also, warnings such as you suggest don't solve other, similar situations.

d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- d1
setDT(d2)  
d2[, b:=3:7]         # impacts only d2
d2[!is.na(a), b:=5:9] # leaks to d1

One technical solution might be for setDT to mark the input columns as 'setDT generated', and then use this marking to bypass the memrecycle call in the C function assign.
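At the R level, the idea might look roughly like this (purely illustrative: the real change would live in data.table's C code, the attribute name is hypothetical, and for simplicity the sketch marks the table rather than the individual columns):

library(data.table)
d1 <- data.frame(a=1:5, b=2:6)
d2 <- d1
setDT(d2)
setattr(d2, "setDT.shared.cols", names(d2))   # remember which columns came in shared

# before sub-assigning a marked column, un-share it first:
if ("b" %in% attr(d2, "setDT.shared.cols")) d2[, b := copy(b)]
d2[!is.na(a), b := 5:9]   # now writes into d2's private copy
d1$b                      # still 2 3 4 5 6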

@MichaelChirico (Member)

@OfekShilon with an R footprint that large, it would certainly make sense for some engineering effort to be "donated" to support us in fixing setDT. We accept PRs.

@OfekShilon (Contributor Author)

@MichaelChirico I can try, but do you guys now agree that this is a problem to be solved? It seems most of this thread doesn't.

@OfekShilon (Contributor Author) commented Jan 29, 2021 via email

@OfekShilon (Contributor Author)

@MichaelChirico
Per your request I submitted a PR (#4978) resolving this ~1.2 years ago. It solves this issue and at least three others with the same root cause (#4816, #5330, #5379). @mattdowle had comments, which I believe were all addressed. I updated the PR several times so it would merge cleanly with main, and tried reminding and nudging several other times. I also reached out by mail and got no response.

If there are further comments, objections, or suggested improvements, I'd be more than happy to discuss them and most likely apply them.

Perhaps it was so long ago that now merging is painful? (I can re-apply the fixes and tests on current main, in a new PR)

Is there any other reason not to apply this fix? What more would it take to get some attention to it?

@jangorecki added this to the 1.14.3 milestone Jul 24, 2022
@MichaelChirico (Member)

Hey @OfekShilon, we really do appreciate your efforts here. We are simply bottlenecked on reviewer time. Appreciate your patience 🙌

I see Jan added this to our 1.14.3 milestone -- I believe we are prioritizing a release in the near future, so that should mean this gets eyes soon.

@OfekShilon (Contributor Author)

@jangorecki @MichaelChirico 1.14.3 flew by and again this PR and the bugs it fixes were ignored. What can be done to get this discussed?

@MichaelChirico (Member)

Hi Ofek, 1.14.4 wound up being a patch release to stay on CRAN; almost no recent work (including merged PRs) was included.

@OfekShilon (Contributor Author)

@MichaelChirico can this maybe be added to 1.14.5?

@jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki removed this from the 1.15.0 milestone Nov 6, 2023
@jangorecki added this to the 1.15.1 milestone Nov 6, 2023
@MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico modified the milestones: 1.17.0, 1.18.0 Dec 10, 2024