Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for index in setkey #3582

Merged
merged 21 commits into from
May 28, 2019
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 23 additions & 5 deletions R/setkey.R
Original file line number Diff line number Diff line change
Expand Up @@ -86,12 +86,30 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU
if (!typeof(.xi) %chin% c("integer","logical","character","double")) stop("Column '",i,"' is type '",typeof(.xi),"' which is not supported as a key column type, currently.")
}
if (!is.character(cols) || length(cols)<1L) stop("Internal error. 'cols' should be character at this point in setkey; please report.") # nocov
if (verbose) {
tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=FALSE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R
# suppress needed for tests 644 and 645 in verbose mode
cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n")

# get existing index name if any
found_index = NULL
if(!is.null(indices(x))) found_index <- names(attributes(attributes(x)$index))
new_possible_index = paste0("__", cols, collapse="")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 lines would be simpler as just one line :
index = paste0(cols, collapse="__")
See the next 3 comments in combination below ...


# forder only if index is not present
if(!any(new_possible_index == found_index)){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then this if() would be :
if (!any(index == indices(x))
If x doesn't have any indices, indices(x) returns NULL and any("anything"==NULL) is FALSE in R (which is nice).
The idea is just to save repeating variable names and use fewer lines of code so we have less to maintain in future. Not too few lines to the extent of making it hard to understand. But in this case, what I'm suggesting is simpler and worth doing and easier to read and check.
Using indices() is cleaner than fetching the attributes() directly, because we have an isolating interface. If we ever change the attribute structure or names in future, we only need to change the code inside indices() not try and find everywhere that reads the attributes directly.

if (verbose) {
tt = suppressMessages(system.time(o <- forderv(x, cols, sort=TRUE, retGrp=FALSE))) # system.time does a gc, so we don't want this always on, until refcnt is on by default in R
# suppress needed for tests 644 and 645 in verbose mode
cat("forder took", tt["user.self"]+tt["sys.self"], "sec\n")
} else {
o <- forderv(x, cols, sort=TRUE, retGrp=FALSE)
}
} else {
o = forderv(x, cols, sort=TRUE, retGrp=FALSE)
# find the matching index
ix = found_index[which(found_index == new_possible_index)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then this line can be removed.

if (verbose){
cat("using existing index for", gsub("^__","", ix), "\n")
o <- attr(attributes(x)$index, which=ix, exact = TRUE)
} else {
o <- attr(attributes(x)$index, which=ix, exact = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And these two which=ix become which=index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattdowle I've made the changes. Correct me if I am wrong, I think we need ix here because, let's take a scenario where a user set multiple indexes. So when we do indices(DT), we get a, b, __a__b, if the user does setkey(DT, a), setkey(DT, b), setkey(DT, a, b) for a table DT = data.table(a = c(...), b = c(...), c = c(...)) . In this case, we need to find the ix index of which index is matched.

Copy link
Member

@mattdowle mattdowle May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - thanks again. Almost there!
Look at line 104 though. I commented before that line 104 could be removed. Line 104 is, in the end, just renaming index as idx, iiuc. This which= is looking up the item by name anyway. So line 104 can be removed and then replace this which=idx with which=index.
Once that's done, perhaps rename the local variable index to thiskey or newkey? That way it reads a little betters and avoids the same name "index" as the attribute name. For example, line 94 will probably read a little better as if (!any(indices(x)==newkey)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for your patience.

}
}
if (!physical) {
if (is.null(attr(x,"index",exact=TRUE))) setattr(x, "index", integer())
Expand Down
13 changes: 13 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -6031,6 +6031,19 @@ thisDT <- copy(DT)[, c("aaa", "b") := 2]
test(1419.58, indices(thisDT), c("a", "ab"))
test(1419.59, allIndicesValid(thisDT), TRUE)

## setkey on same col as index before
DT <- data.table(a = c(1,1,1,1,1,2,2,2,2,2),
aaa = c(1,1,2,2,2,1,1,2,2,2))
setindex(DT, a)
test(1419.60, allIndicesValid(DT), TRUE)
test(1419.61, setkey(DT, a, verbose=TRUE), output="using existing index for a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check on output= is necessary and good but not quite sufficient. It's possible that setkey() prints this message but then doesn't set the key properly. So there needs to be another test just after this one that makes sure the key has been changed. Unfortunately the test data chosen doesn't make this a good test because the input data is trivially sorted already. So my first thought is to change the input data so it's random e.g. DT = data.table(a=c(2,3,2,1,3,2,1), aaa=...) then add test 1419.62 to check that DT$a is c(1,1,2,2,2,3,3) afterwards. Or something like that. Then it is checking both that i) the setkey() has changed the physical order and ii) the output= also checks that it changed the order for the correct reason (using the existing index).

DT <- data.table(a = c(1,1,1,1,1,2,2,2,2,2),
aaa = c(1,1,2,2,2,1,1,2,2,2),
bbb = c(1,1,2,0,1,1,1,0,1,1))
setindex(DT, a)
setindex(DT, aaa)
test(1419.62, allIndicesValid(DT), TRUE)
test(1419.63, setkey(DT, aaa, verbose=TRUE), output="using existing index for aaa")
Copy link
Member

@mattdowle mattdowle May 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same request here as above for 1419.61. This test needs to check the result of this setkey() command has done the correct thing (e.g. changed the physical row order properly) as well as doing that for the right reason (output=).
Since setkey() returns DT invisibly, the easiest way is just to pass an appropriate y=data.table(...) into this test(). See other tests() in this file for examples for single calls to test() which use x, y and output= too.


# setnames updates secondary key
DT = data.table(a=1:5,b=10:6)
Expand Down