-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
names<- retain key and option to turn new warning on #5133
base: master
Are you sure you want to change the base?
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
@mattdowle retaining key is possible with defining `colnames<-.data.table` = `names<-.data.table` = function(x, value) {
if (!is.null(value) && length(names(x))!=0) setnames(x, value)
else { setattr(x, "names", value); invisible(x) }
}
colnames.data.table = names.data.table = function(x) copy(attr(x, "names")) and registering them as I can push to this branch with a build which passes all tests but makes the following changes to
I just didn't want to hijack your PR without asking first ;) |
@ben-schwen Thanks for looking and asking. We already have
So I'm thinking we have to continue to re-over-allocate inside There is no known way to avoid R's |
Thanks for the explanation. Removing
|
Yes, good point. Adding those tests good idea. Thanks. It seems that R's You can see it happening using
If I got that right, it feels to me as though R's internals go through an extra step or function call when a subset of names is changed, to construct the new names vector to pass to the method, and that extra step is where the copy comes from. Strange. Just a guess. Would be interesting to look at R's internals on this to see if that's the case. Maybe doing so would reveal a way to avoid the copy, or something else unexpected. |
int protecti=0; | ||
const int n = (isNull(cols) ? length(dt) : length(cols)) + (nSpare<0 ? allocColOpt() : nSpare); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could move const int l = isNull(cols) ? LENGTH(dt) : length(cols);
up from line from line 180 and already use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ben. I have to admit I struggled in this PR so it's great to have more eyes on it helping out. Will come back to this line.
const bool verbose = LOGICAL(verboseArg)[0]; | ||
if (length(nSpareArg)!=1 || (!isInteger(nSpareArg) && !isReal(nSpareArg))) | ||
error(_("%s should be a single number but it is length %d and type '%s'"), "getOption('datatable.alloccol') (or an argument with this default)", length(nSpareArg), type2char(TYPEOF(nSpareArg))); | ||
int nSpare = isInteger(nSpareArg) ? INTEGER(nSpareArg)[0] : (int)REAL(nSpareArg)[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't work for bit64
integer such as in alloc.col(DT, as.integer64(100))
(quite edgy though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. We probably have this problem throughout many arguments in many functions then? On the one hand I agree it's edgy but now you've raised it, ugh, yep. Maybe we need a new helper to deal with this and sweep the whole codebase in a separate PR for int/real arguments that could be int64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coerceAs helper was added to handle this kind of coercion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see coerceAs
defined in wrappers.R
but it doesn't seem in use at R level yet. Yes it would be good to start using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR I added R wrapper for easier tests coverage.
Hey Matt, given the subtlety of the issues here, it would help to add a verbal summary of the PR to the main commit. Right now it's a bit scattered among your responses, the code comments, and the referenced issues -- it would be easier to follow the PR in this case if you could provide such an overview of the reasoning. |
Yes I see. But if I reach that stage to make it easy for you, then it will be done, and I won't need any help. Kind of the very point is that it is scattered and hard to follow, and that's what I need help with. Perhaps you could ask on specific lines for example. |
That makes sense, but even an encapsulation of that in the main post would be helpful, along the lines of
Something to communicate the current status of the PR in words & not just code... it will be an added friction for you as author but will substantially help the reviewer. |
For
I'm also not sure if this PR is a direct result of #5084 or is independent of it. Test 104.2 error# 104.2
library(data.table)
dt <- data.table(A = rep(1:3, each=4), B = rep(11:14, each=3), C = rep(21:22, 6), key = "A,B")
dt1 <- dt2 <- dt
dt1[,"A"] <- 3L;
## test(104.2, dt[, transform(.SD, B=sum(C)), by="A"], data.table(A=rep(1:3,each=4), B=86L, C=21:22, key="A"))
## While test expects A to have 1:3, dt1 was modified which impacts dt.
dt
## A B C
## 1: 3 11 21
## 2: 3 11 22
## 3: 3 11 21
## 4: 3 12 22
## 5: 3 12 21
## 6: 3 12 22
## 7: 3 13 21
## 8: 3 13 22
## 9: 3 13 21
## 10: 3 14 22
## 11: 3 14 21
## 12: 3 14 22 |
This comment was marked as resolved.
This comment was marked as resolved.
SEXP copy(SEXP x) | ||
{ | ||
return INHERITS(x, char_datatable) ? _shallow(x, R_NilValue, -1) : duplicate(x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is responsible for 10 failed tests. Previously, copy()
only returned duplicate(x)
. An additional flag to guarantee copying or maybe even a different function may be helpful, e.g. SEXP copy(SEXP x, bool doCopy) { return doCopy || !INHERITS(x, char_datatable) ? duplicate(x) : _shallow(x, R_NilValue, -1);}
One way the tests fail is that the shallow copy returned does not always copy when assigned correctly when i
is provided. E.g., copy(dt)[, x := new_val]
would not affect the original dt whereas copy(dt)[1L, x:= new_val]
would affect the original dt and it would be modified. These examples would be helpful to have a flag to guarantee a copy - here are the 8 tests that fail (980.20, 982.20, 1419.03, 1419.05, 1419.07, 1419.10, 2068.20, 2233.36).
A second way the tests failed is that now [<-
behaves more similarly to dt[, x:=newval]
in that other objects pointing to the same address gets similar change when the replacement vector is a different length (i.e., assigning a scalar to a column that is more than 1 value). There are 2 tests like this (104.2 and 113 )
Just as interesting to note is that when reverting copy
back to simply return duplicate(x);
there are 3 new errors produced that aren't produced in the original PR (510, 2016.6, 2049.4). The first and last test are due to #5084 detecting invalid self ref. Specifically, making <-
assignments and then subsequent assignment calls don't build on one another. e.g., dt = data.table(x = 1L, y = 1L); key(dt) <- "x"; dt[, y:=2L];
results in data.table(x = 1L, y = 1L)
. In that example, the first test would typically be correct but then the next text would expect the data.table to be modified but it is still the original.
2016.6, is also related to the invalid self ref, but the cause is because the data.table is loaded instead of modifying data.tables in a non-canonical way. I'll look at this more next weekend.
## Tests 980.20, 982.20, 1419.03, 1419.05, 1419.07, 1419.10, 2068.20, 2233.36
dt = data.table(x = 1L)
copy(dt)[, x:= x+1L]
dt
## x
## <int>
## 1: 1
copy(dt)[1L, x:=x+1L] ## affects original dt. Any subsequent calls that assume unmodified dt cause errors
dt
## x
## <int>
## 1: 2
## Tests 104.20 113.00
dt = data.table(a = 1:2)
dt1 <- dt
dt1[, "A"] <- 3L ## if this were the same length as initial vector, a copy would be made and match original tests
dt
## a
## <int>
## 1: 3
## 2: 3
## Reverting copy behavior to always duplicate **change to PR code on local**
## Tests 510, 2049.4
dt = data.table(x = 1L, y = 1L)
key(dt) <- "x"
dt[, y:= 2L] ## Warning about invalid ref
dt ## No change
## Key: <x>
## x y
## <int> <int>
## 1: 1 1
## Similar issue when loading data from storage
## This still modified to return duplicate copy
## 2016.6
DT = data.table(a=1:3)
save(list="DT", file=tt<-tempfile())
rm(DT)
name = load(tt)
DT[, a:=4L] ## warning due to invalid self ref.
DT
## a
## <int>
## 1: 1
## 2: 2
## 3: 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this
whereas
copy(dt)[1L, x:= new_val]
would affect the original dt and it would be modified.
This indeed seems to be unexpected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added tests 492.* to ensure about that
This issue is currently main blocker for releasing master branch to CRAN. If anyone would like to take over this PR you are welcome and encouraged to do so. |
I still am a bit lost on this PR, can anyone at least confirm that we have good tests in place such that once tests go green the PR is ready to go? |
This issue came up from revdeps testing so even if current tests won't be enough then revdeps will be enough. |
I guess in this case yes, where our main concern is un-breaking regressed functionality. but in general it's much preferable to have minimal tests written by authors who understand the issue more deeply. |
I don't think merging to master is good at that point. IMO better to make PR complete, and then merge to master. Eventually there might have been an interference in recent changes in the master that could impact the code in not very predictable way. Then it is not only finishing PR but also resolving new issue that could have sneaked in during merge. |
#5125 gives the greatest insight of the issue from Matt outside of the notes in this PR:
Jan - could we roll back #5084 and plan to release 1.15.0? 5084 had breaking changes and I am not sure we can resolve the changes. Alternatively, it may be better to include a way for users to opt in with options while we figure this out. To me, it seemed like there was interest from Matt to not have DTs be more respectful of modification by reference, mainly based on his interest with #4978 and 5084. But even in this PR, there are non-idiomatic items (e.g., |
Rolling back is some idea, but problem that PR fixed was quite severe, silently returning wrong answer. Moreover it may not be limited to dplyr only but to whoever subsets DT without [ dispatch. There is huge amount of work already invested in resolving that, and I am worrying if we roll it back, it will slowly get abandoned. Now feels the best time to work that out. I cannot imagine documentation entry that warns about subset avoiding [ that would not sound scary to 95% of readers. |
That is fair - it might still be good to have a general timeline (e.g., 3 months) where if this still isn't resolved, other options are looked at. Do you have thoughts on failed test 104.2? The existing test and behavior in 1.14.10 expect this: library(data.table)
dt = data.table(A = 1:2, key = "A")
dt1 <- dt
dt1[, "A"] <- 3L
dt1
#> A
#> 1: 3
#> 2: 3
## original DT unaffected
dt
#> A
#> 1: 1
#> 2: 2 In this PR, the original DT is also updated on the |
I think this is what Matt was getting at with If |
Reading a bit more carefully, I see Lines 2139 to 2146 in 6782251
In fact, several methods do: Lines 2202 to 2206 in 6782251
Lines 2234 to 2239 in 6782251
Lines 2243 to 2245 in 6782251
Lines 2265 to 2269 in 6782251
Lines 2290 to 2293 in 6782251
Lines 2253 to 2263 in 6782251
|
* names<- retains sorting attributes * Add tests from #5133 * Add colnames<- method, setalloccol() output to enable set()
Follow up to #5084
Related to #5085
Closes #5125
Closes #5126
Closes #5127
Closes #5128
The existing
names<-
test didn't check retaining key, fixed. This PR will fail until WIP removed.