
Inconsistent semantics after setDT #4783

Open
OfekShilon opened this issue Oct 27, 2020 · 20 comments · May be fixed by #4978

@OfekShilon (Contributor) commented Oct 27, 2020

(This is a cleanup and improvement of some of the #4589 discussion.)
Take this code:

> d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
> d2 <- d1
> setDT(d2)  # At this point d2 is a shallow copy of d1, pointing to the same columns

Do modifications to d2 impact d1? We could live with either 'yes' or 'no', but the answer is 'sometimes':

d2[, b:=3:7]         # (1) impacts only d2
d2[, c:=4:8]         # (2) impacts only d2
d2[!is.na(a), b:=5:9] # (3) impacts both
d2[, b:=30]          # (4) impacts both

In cases 1 & 2, d2 'plunks' whole new columns into itself, so d1 isn't affected. In cases 3 & 4 it seems that the modify-in-place optimization kicks in (address(d2$b) is unchanged), so there is no copy-on-write and the data still pointed to by d1 is overwritten.
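One way to see which case applies is to compare column addresses with data.table's exported address(); a minimal sketch, relying only on the behaviour shown above:

library(data.table)
d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- d1
setDT(d2)
address(d1$b) == address(d2$b)  # TRUE: after setDT the column is shared
d2[!is.na(a), b:=5:9]           # case (3): sub-assignment writes in place
address(d1$b) == address(d2$b)  # still TRUE, so d1$b now holds 5:9 as well
d2[, b:=3:7]                    # case (1): a whole new column is planted in d2
address(d1$b) == address(d2$b)  # FALSE: from here on the columns diverge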

These semantic discrepancies make (the otherwise great) setDT unusable to us except in the most trivial scripts.

# Output of sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default

> packageVersion("data.table")
[1] ‘1.13.0’
@OfekShilon (Contributor Author)

One way out of this would be to disable the modify-in-place optimization. To us, the small amount of extra memory is certainly worth the gain in consistency.
I think it's in everyone's best interest to disable it unconditionally, but if not, at least add an option to do so.

@MichaelChirico (Member)

as.data.table is guaranteed to copy, and of course there's copy. If you are trying to get around modify-in-place, either of these is fine:

d2 = copy(d1)
d2 = as.data.table(d1)
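For example, a minimal sketch of the as.data.table route:

library(data.table)
d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- as.data.table(d1)   # deep copy: d2 owns its own columns
d2[!is.na(a), b:=5:9]     # writes in place, but only into d2's private copy
d1$b                      # unchanged: 2 3 4 5 6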

@OfekShilon (Contributor Author)

@MichaelChirico Thank you. I'm well aware of the alternatives, but I'm not doing exploratory interactive work. I work in a team of R developers with a large R codebase riddled with data.tables and setDT, and I still hope this fundamental bug can be solved rather than worked around on the user side.

@tdeenes (Member) commented Oct 29, 2020

@OfekShilon
I think it is a bit harsh to describe this behaviour as a "fundamental bug". The setDT() manual states absolutely clearly that it modifies its input by reference. So what do you want to achieve in your example? If you want to modify d1 by reference, why do you need d2? If you want to modify d2 without modifying d1, you have to follow the advice above and use copy or as.data.table. It seems you want to rely on some consequences of (undocumented and unexported) implementation details. Not a good idea. If your codebase contains such setDT calls, you have to fix them instead of offending the authors of data.table.

Note also that data.table:::shallow() is not (yet) exported (see also #2323), probably not by accident.

Note 2: Maybe you should reformulate your issue as a feature request instead of a "bug report".

@OfekShilon (Contributor Author) commented Oct 29, 2020

@tdeenes I know how to work around this behaviour, with the advice above and in other ways. That does not make the reported behaviour any less of a bug. Is an expectation of consistency really a 'feature request'?
I'm afraid the usage of 'by reference' in the documentation and discussions on data.table is often ambiguous, and careful distinction is needed. It might mean two different things:
(1) A data table object (pointing to multiple column data) is not copy-on-write. Therefore, when multiple names are bound to the same data.table - every modification to the data.table via one name manifests in all the names.
(2) The column data is sometimes modified in place (cases 3 & 4 in the original report); perhaps this is also what you mean when you write 'by reference'. When no data.frames are involved, this is not a problem.
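To illustrate the two meanings with a small sketch (the table and column names are arbitrary):

library(data.table)
# (1) no copy-on-write between names bound to the same data.table:
x <- data.table(a=1:3)
y <- x                 # two names, one table
y[, b:=10L]            # adding a column through one name...
"b" %in% names(x)      # ...is visible through the other: TRUE
# (2) column data modified in place:
old <- address(x$a)
x[a > 1, a:=0L]        # sub-assignment overwrites the existing vector
old == address(x$a)    # TRUE: same memory, new values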

Both of these behaviours seem like reasonable design choices, and usually they are; specifically, when the data.table was created either by data.table(...) or by as.data.table. Where things go awry is when a data.table is generated by setDT.
The documentation for setDT does indeed state clearly that it operates by reference, but this report is about the subsequent operations on the resulting dt, and whether those are by reference or not.

dt <- df; setDT(dt) performs a shallow copy, so now dt and df are distinct objects pointing to the same columns. What does 'by reference' even mean in this context? Would you expect modifications to dt or to its columns (these are different things!) to manifest in df?
Whichever you choose, 'yes' or 'no', the current implementation will sometimes match your expectation and sometimes not. This cannot be accepted as 'by design'.

I'm not sure what made you say that I want to rely on undocumented and unexported implementation details. I don't. My examples use only public and very basic (even fundamental) data.table interfaces, and the results are inconsistent.

@tdeenes (Member) commented Oct 29, 2020

> dt <- df; setDT(dt) performs a shallow copy, so now dt and df are distinct objects pointing to the same columns. What does 'by reference' even mean in this context? Would you expect modifications to dt or to its columns (these are different things!) to manifest in df?

This is the crucial point, I guess; my interpretation of the documentation is that you should not ask these questions when using setDT on data.frames (or lists). Just accept that you cannot know which modifications of dt or its columns will manifest in df. I really cannot imagine a legitimate use case where one wants to use a data.frame while breaking the standard copy-on-write semantics of R. So you have only two options:

  1. You want to keep df as it is -> you have to protect it before calling setDT on it or on any object that keeps a reference to it.
  2. You do not care what happens to df and you just need a data.table -> you call setDT on it and continue with the return value of setDT.

setDT, :=, and the other set* functions give you great power, which comes with great responsibility. So when you use them, as you did in "the following operations on the resulting dt", you have to be sure that you have not left behind any list or data.frame objects which you want to re-use later and whose elements are shared with dt.

If you think your use case does not fit into 1) or 2), please provide us with a minimal reproducible example. Note that your current example belongs to 2) unless you wanted to keep df as it is; in the latter case it is an example of misusing setDT and :=.
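Concretely, the two options might look like this (a minimal sketch):

library(data.table)
df <- data.frame(a=1:5)
# option 1: protect df by taking a deep copy before any by-reference work
dt <- as.data.table(df)
dt[, a:=a*10L]
df$a                     # still 1 2 3 4 5
# option 2: you only need a data.table -> convert in place and keep using the result
dt2 <- setDT(df)
dt2[, b:=a+1L]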

@tdeenes (Member) commented Oct 29, 2020

Sorry, I failed to follow the link that points to the original issue (#4589), in which @mattdowle and others gave you a pretty exhaustive explanation of the proper use of setDT. On a quick check of your comments there and in this issue, it seems you want to get data.table features but keep your object as a data.frame. I still do not get the idea behind this, but I definitely find it inappropriate to refer to the current behaviour of setDT as a "fundamental bug" after the first author of the package gave you a clear description of why this is not a bug, and a major contributor of the package improved the documentation based on your original issue.

@OfekShilon (Contributor Author)

The code example is of course simplified, but plenty of very real use cases exist. A prominent one is using setDT inside a function; in that discussion a data.table maintainer (Arun) expressed a desire to have such cases resolved.

@MatthieuStigler

I think I just hit a very related point, with a similar use case here? #4816 (comment)

@OfekShilon (Contributor Author) commented Jan 14, 2021

We keep getting bitten by this. Perhaps the original example (d1 <- data.frame(a=1); d2 <- d1; setDT(d2)) seemed contrived? It is just an extra simplification of real life, where the issue often surfaces when trying to operate on function arguments:

> df <- data.frame(a=1:2)
> f <- function(x) {
+  setDT(x)
+  x[ , b := 5:6]   # doesn't leak to df
+  x[!is.na(a), a:=3:4]  # leaks to df :(
+  setDF(x)
+ }
>   
> f(df)
> df
   a
1: 3
2: 4
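For comparison, a version of f without the leak has to take a private copy up front (a sketch; f_safe is just an illustrative name):

f_safe <- function(x) {
  x <- as.data.table(x)   # deep copy: nothing below can reach the caller's df
  x[, b := 5:6]
  x[!is.na(a), a := 3:4]
  setDF(x)
  x
}
df <- data.frame(a=1:2)
res <- f_safe(df)
df$a                      # unchanged: 1 2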

@myoung3 (Contributor) commented Jan 28, 2021

I think the solution to this might just be to warn users (via ?setDT) with something along the lines of: "Use of setDT inside function definitions, especially on objects that were passed as arguments, may cause unpredictable modification of objects outside the scope of the function. When writing a function, we recommend either A) using as.data.table (not setDT) inside the function, which guarantees a side-effect-free ("pure") function, or B) writing functions that expect (and are documented as expecting) data.tables as inputs. The latter allows creating "functions" which are pass-by-reference (i.e., they avoid copying) but behave like procedures (in that there may be side effects)."
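Sketched out, the two recommendations might look like this (function and column names are purely illustrative):

library(data.table)
# A) pure function: take a private deep copy, no side effects on the caller
summarise_pure <- function(df) {
  dt <- as.data.table(df)
  dt[, .(mean_a = mean(a))]
}
# B) documented to require a data.table; modifies it by reference on purpose
add_flag <- function(dt) {
  stopifnot(is.data.table(dt))   # refuse data.frames rather than converting silently
  dt[, flag := a > 0]
  invisible(dt)
}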

As a bit of an aside, I'll add that I have some experience with a third approach, in the package intervalaverage, where data.tables are explicitly required as inputs (and thus are passed by reference) but the function is carefully written to restore any changes via on.exit(). This results in a pure function that benefits from pass-by-reference speed without any side effects (which are unpredictable to the typical R user, who expects functions to behave like pure functions). The approach used in intervalaverage (pure functions using pass-by-reference under the hood) only makes sense if you want to return an entirely new table. If you want to modify the original table, approach B is "best" (although potentially unfamiliar to R users).
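The restore-on-exit pattern is roughly the following (an illustrative sketch of the idea, not code taken from intervalaverage; the function and column names are assumed):

library(data.table)
order_with_index <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, tmp_row__ := .I]                          # temporary by-reference change...
  on.exit(dt[, tmp_row__ := NULL])               # ...undone whenever the function exits
  dt[order(a), .(a, original_row = tmp_row__)]   # return an entirely new table
}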

related post I made nearly a decade ago on SO: https://stackoverflow.com/questions/13756178/writings-functions-procedures-for-data-table-objects

@OfekShilon (Contributor Author)

@myoung3 this is pretty much what we try to do now. However, in an enterprise-size codebase like ours (~600K R LOC across ~12 large in-house packages), if you can't transition to data.table gradually or use it to eliminate specific bottlenecks in a pipeline, it's very hard to use it at all. We have tried various techniques and conventions, but this inconsistent data.table behavior is a major, major pain for us (and I suspect for others).
Also, warnings such as you suggest don't solve other, similar situations.

d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- d1
setDT(d2)  
d2[, b:=3:7]         # impacts only d2
d2[!is.na(a), b:=5:9] # leaks to d1

One technical solution might be for setDT to mark the input columns as 'setDT generated', and then use this marking to bypass the memrecycle call in the C function assign.
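At the R level, the idea might look roughly like this (purely illustrative: the real change would live in data.table's C code, the attribute name is hypothetical, and for simplicity the sketch marks the table rather than the individual columns):

library(data.table)
d1 <- data.frame(a=1:5, b=2:6)
d2 <- d1
setDT(d2)
setattr(d2, "setDT.shared.cols", names(d2))   # remember which columns came in shared

# before sub-assigning a marked column, un-share it first:
if ("b" %in% attr(d2, "setDT.shared.cols")) d2[, b := copy(b)]
d2[!is.na(a), b := 5:9]   # now writes into d2's private copy
d1$b                      # still 2 3 4 5 6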

@MichaelChirico (Member)

@OfekShilon with an R footprint that large, it would certainly make sense for some engineering effort to be "donated" to support us in fixing setDT. We accept PRs.

@OfekShilon (Contributor Author)

@MichaelChirico I can try, but do you guys now agree that this is a problem to be solved? It seems most of this thread doesn't.

@OfekShilon (Contributor Author) commented Jan 29, 2021 via email

@OfekShilon (Contributor Author)

@MichaelChirico
Per your request I submitted a PR (#4978) resolving this ~1.2 years ago. It solves this issue and at least three others with the same root cause (#4816, #5330, #5379). @mattdowle had comments, which I believe were all addressed. I updated the PR several times so it would merge cleanly with main, and tried reminding and nudging several other times. I also reached out by mail and got no response.

If there are further comments, objections, or suggested improvements, I'd be more than happy to discuss them and most likely apply them.

Perhaps it was so long ago that now merging is painful? (I can re-apply the fixes and tests on current main, in a new PR)

Is there any other reason not to apply this fix? What more would it take to get some attention to it?

@jangorecki added this to the 1.14.3 milestone Jul 24, 2022
@MichaelChirico (Member)

Hey @OfekShilon, we really do appreciate your efforts here. We are simply bottlenecked on reviewer time. Appreciate your patience 🙌

I see Jan added this to our 1.14.3 milestone -- I believe we are prioritizing a release in the near future, so that should mean this gets eyes soon.

@OfekShilon (Contributor Author)

@jangorecki @MichaelChirico 1.14.3 flew by and again this PR and the bugs it fixes were ignored. What can be done to get this discussed?

@MichaelChirico (Member)

Hi Ofek, 1.14.4 wound up being a patch release to stay on CRAN; almost no recent work (including merged PRs) was included.

@OfekShilon (Contributor Author)

@MichaelChirico can this maybe be added to 1.14.5?

@jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki removed this from the 1.15.0 milestone Nov 6, 2023
@jangorecki added this to the 1.15.1 milestone Nov 6, 2023
@MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico modified the milestones: 1.17.0, 1.18.0 Dec 10, 2024