unique can be optimized on keyed data.tables #2947
Comments
Good idea, but it should be implemented without making a new class; just checking attributes should be enough to detect and apply the optimization.
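As a rough illustration of the attribute-based dispatch idea, here is a hypothetical C++ sketch (not data.table's actual internals; `Column`, `sorted_by`, and `unique_vals` are invented names): a container carries a "sorted" attribute, mirroring how a keyed/indexed data.table records its sort order in attributes rather than in its class, and `unique` inspects that attribute to choose a fast path.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: a column that carries a "sorted" attribute,
// analogous to a keyed data.table recording its sort order in an
// attribute rather than in a dedicated class.
struct Column {
    std::vector<int> values;
    std::string sorted_by; // empty when no key/index is known
};

// unique() dispatch: check the attribute and pick the O(n) adjacent-scan
// path when the data is already known sorted; otherwise fall back to
// hashing, preserving first-appearance order.
std::vector<int> unique_vals(const Column& col) {
    std::vector<int> out;
    if (!col.sorted_by.empty()) {
        // fast path: sorted input, keep the first of each run of duplicates
        for (std::size_t i = 0; i < col.values.size(); ++i)
            if (i == 0 || col.values[i] != col.values[i - 1])
                out.push_back(col.values[i]);
    } else {
        // general path: hash set, first-appearance order
        std::unordered_set<int> seen;
        for (int v : col.values)
            if (seen.insert(v).second) out.push_back(v);
    }
    return out;
}
```

The point of the sketch is that no S3/S4 class is needed: the dispatch is a plain attribute check at the top of one function.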
That said, I can't reproduce the numbers I got back in 2018. Further, here is a direct way in C++ that I wrote last year that seems to produce faster results (for integer vectors at least):

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector do_unique_sorted(IntegerVector x) {
  int n = x.length();
  if (n == 0) return IntegerVector(0); // guard: avoid reading x[0] of an empty vector
  int i = 0;
  std::vector<int> o;
  o.reserve(n);
  o.push_back(x[i]);
  while (++i < n) {
    int xi = x[i];
    if (xi != x[i - 1]) {
      o.push_back(xi);
    }
  }
  return wrap(o);
}
```

```r
bench::mark(DT[, TRUE, keyby = "V1"][["V1"]],
            unique(DT$V1),
            unique(DT, by = "V1")[["V1"]],
            do_unique_sorted(DT$V1))
```
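The adjacent-comparison scan above is essentially what `std::unique` implements in the C++ standard library; a minimal standalone sketch of the same technique (plain STL, no Rcpp; `unique_sorted` is an invented name):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Same idea as do_unique_sorted: on sorted input, duplicates form
// contiguous runs, so a single adjacent-comparison pass suffices
// (O(n), no hashing). std::unique compacts the runs in place; erase
// trims the leftover tail.
std::vector<int> unique_sorted(std::vector<int> x) {
    x.erase(std::unique(x.begin(), x.end()), x.end());
    return x;
}
```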
`DT[ , unique(V1)]` is tricky to optimize because base R …
I think we all agree that optimizing with index rather than key is much more useful. Here are the timings I am getting now using a merge of #4370 and #4386. There could be some speed-up after #4467 as well. Verbose output is there to explain whenever index optimization was attempted. If there are two lines, it means that two calls to forder were made; the second, AFAIK, reverts to the original order, which is very cheap because it operates on an integer vector with length equal to the number of unique values.

```r
library(data.table)
fdistinct = data.table:::fdistinct
NN = 1e8
set.seed(13013)
DT = data.table(sample(1e5, NN, TRUE)) # 400MB
setindexv(DT, "V1")
system.time(a1 <- unique(DT$V1))
#   user  system elapsed
#  2.952   0.592   3.544
system.time(a2 <- DT[ , unique(V1)])
#   user  system elapsed
#  3.062   0.984   4.047
system.time(a3 <- DT[ , TRUE, by = V1][["V1"]])
#   user  system elapsed
# 10.642   2.513   1.053
## Finding groups using forderv ... forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
system.time(a4 <- DT[ , TRUE, keyby = V1][["V1"]]) ## incompatible: it sorts here, but included for comparison
#   user  system elapsed
#  1.903   0.371   2.275
## Finding groups using uniqlist on index 'V1' ... 1.914s elapsed (1.586s cpu)
system.time(a5 <- unique(DT, by = "V1")[["V1"]])
#   user  system elapsed
# 10.153   1.184   0.769
## forderLazy: opt not possible: is.data.table(DT)=1, sortGroups=0, all1(ascArg)=1
## forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
system.time(a6 <- fdistinct(DT, "V1")[["V1"]])
#   user  system elapsed
#  0.040   0.044   0.075
## forderLazy: using existing index: __V1
## forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
all.equal(a1, a2) && all.equal(a1, a3) && all.equal(sort(a1), a4) && all.equal(a1, a5) && all.equal(a1, a6)
# [1] TRUE
```

The first two are slowest because they dispatch to base R. It is nice to see how the relevant code works (lines 53 to 56 at commit 2884b29). Normally, when computing unique, we would call `forderv(sort=FALSE)`, which cannot use an index because indices are `sort=TRUE`. For unique we don't need the order, only the `starts` attribute. Those few lines of code check whether an index is available and, if it is, use it. Then it has to re-order at the end as well, which is handled by the following line:

```r
if (sort && length(o <- forderv(f))) f = f[o] ## this rolls back to original order
```

The same optimization could be applied to …
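The mechanics of that index-based unique can be sketched outside R as well. This is a hypothetical C++ sketch, not data.table's actual C code (`unique_via_index` is an invented name): given a precomputed sort permutation (the "index"), the group starts point at the first row of each distinct value, and sorting those row positions ascending rolls the result back to original appearance order, analogous to `f = f[o]` above. That final sort is cheap because it touches only one position per unique value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Unique via a precomputed sort permutation `ord` (the "index"):
//  1. walk the data in sorted order; a value change marks a group start,
//     and ord[i] is the original row of that group's first occurrence;
//  2. sort those few row positions ascending to restore the original
//     appearance order (the "rollback" step).
std::vector<int> unique_via_index(const std::vector<int>& x,
                                  const std::vector<std::size_t>& ord) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < ord.size(); ++i)
        if (i == 0 || x[ord[i]] != x[ord[i - 1]])
            starts.push_back(ord[i]); // first row of each group, in sorted order
    std::sort(starts.begin(), starts.end()); // cheap: length = #unique values
    std::vector<int> out;
    out.reserve(starts.size());
    for (std::size_t s : starts) out.push_back(x[s]);
    return out;
}
```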
To actually optimize calls like …
For one-key tables, users will soon benefit from fast unique/duplicated on ALTREP objects: https://twitter.com/groundwalkergmb/status/1389685109909975040. This changes the calculus for this FR a bit. Related: #4697
For what it is worth, I think fdistinct would be among the functions we would use the most. From what I understand there are two orders of magnitude of speed-up to be had, which would be amazing.
fdistinct is not exported in the mergelist PR. I agree it makes sense to export it as well. Once that is resolved, this question can be nicely addressed.
Keyed tables are already known to be sorted, so finding unique values is much easier than it is in the general case. Compare:

It seems to me we should be able to match (or exceed) the final time in the second call to `unique` (i.e. within `[]`).

If we were willing to do something like add a `dt_primary_key` class to the primary key, we could also achieve this speed in the first approach by writing a `unique.dt_primary_key` method, but I'm not sure how extensible this is to multiple keys (`S4`?).