Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dcast only computes default fill if necessary #5549

Merged
merged 32 commits into from
Mar 14, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
2886c4f
delete old commented code
tdhock Nov 30, 2022
90f0647
new test for no warning fails
tdhock Nov 30, 2022
26745f4
only compute default fill if missing cells present
tdhock Nov 30, 2022
03dc91d
any_NA_int helper
tdhock Nov 30, 2022
258befb
bugfix #5512
tdhock Dec 1, 2022
360ba9d
Update src/fcast.c
tdhock Dec 4, 2022
75102bf
Update src/fcast.c
tdhock Dec 4, 2022
6225799
mention warning text
tdhock Dec 5, 2022
5055306
const int args
tdhock Dec 5, 2022
6a93cb1
add back ithiscol
tdhock Dec 5, 2022
a40d969
get pointer before for loop
tdhock Dec 7, 2022
2019a5c
Merge branch 'master' into fix5512
tdhock Feb 15, 2024
c46cfaa
Merge branch 'master' into fix5512
tdhock Mar 4, 2024
1a8ba9c
add test case from Michael
tdhock Mar 4, 2024
47d735e
Merge branch 'fix5512' of https://github.com/Rdatatable/data.table in…
tdhock Mar 4, 2024
7198d08
merge
tdhock Mar 8, 2024
02f2c3a
test min(dbl) and no warning when fill specified
tdhock Mar 8, 2024
fc542ec
Revert "delete old commented code"
tdhock Mar 10, 2024
6ae4c76
use suggestions from Michael
tdhock Mar 10, 2024
eb95ab8
rm inline any_NA_int since that causes install to fail
tdhock Mar 10, 2024
6d8f614
clarify comment
tdhock Mar 10, 2024
3c7fb24
link 5390
tdhock Mar 11, 2024
dcb51ed
mymin test fails
tdhock Mar 13, 2024
83b0cf5
compute some_fill using anyNA in R then pass to C
tdhock Mar 13, 2024
6f4b711
Update R/fcast.R
tdhock Mar 13, 2024
ee93c5f
Update R/fcast.R
tdhock Mar 13, 2024
747c76c
dat_for_default_fill is zero-row dt
tdhock Mar 13, 2024
07c6838
merge
tdhock Mar 13, 2024
4d6c0e1
!length instead of length==0
tdhock Mar 13, 2024
359c3c3
new dcast tests with fill=character
tdhock Mar 14, 2024
4b96d35
dat_for_default_fill is dat again, not 0-row, because that causes som…
tdhock Mar 14, 2024
4ca3736
Merge branch 'master' into fix5512
MichaelChirico Mar 14, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@

3. Optimized `shift` per group produced wrong results when simultaneously subsetting, for example, `DT[i==1L, shift(x), by=group]`, [#5962](https://github.com/Rdatatable/data.table/issues/5962). Thanks to @renkun-ken for the report and Benjamin Schwendinger for the fix.

4. `dcast(fill=NULL)` only computes default fill value if necessary, which eliminates some previous warnings (for example, when fun.aggregate=min or max, warning was NAs introduced by coercion to integer range) which were potentially confusing, [#5512](https://github.com/Rdatatable/data.table/issues/5512), [#5390](https://github.com/Rdatatable/data.table/issues/5390). Thanks to Toby Dylan Hocking for the fix.

## NOTES

1. `transform` method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.
Expand Down
22 changes: 13 additions & 9 deletions R/fcast.R
Original file line number Diff line number Diff line change
Expand Up @@ -152,23 +152,22 @@ dcast.data.table = function(data, formula, fun.aggregate = NULL, sep = "_", ...,
dat = .Call(CsubsetDT, dat, idx, seq_along(dat))
}
fun.call = m[["fun.aggregate"]]
fill.default = NULL
if (is.null(fun.call)) {
oo = forderv(dat, by=varnames, retGrp=TRUE)
if (attr(oo, 'maxgrpn', exact=TRUE) > 1L) {
messagef("'fun.aggregate' is NULL, but found duplicate row/column combinations, so defaulting to length(). That is, the variables %s used in 'formula' do not uniquely identify rows in the input 'data'. In such cases, 'fun.aggregate' is used to derive a single representative value for each combination in the output data.table, for example by summing or averaging (fun.aggregate=sum or fun.aggregate=mean, respectively). Check the resulting table for values larger than 1 to see which combinations were not unique. See ?dcast.data.table for more details.", brackify(varnames))
fun.call = quote(length)
}
}
if (!is.null(fun.call)) {
dat_for_default_fill = dat
run_agg_funs = !is.null(fun.call)
if (run_agg_funs) {
fun.call = aggregate_funs(fun.call, lvals, sep, ...)
errmsg = gettext("Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.")
if (is.null(fill)) {
fill.default = suppressWarnings(dat[0L][, eval(fun.call)])
# tryCatch(fill.default <- dat[0L][, eval(fun.call)], error = function(x) stopf(errmsg))
if (nrow(fill.default) != 1L) stopf(errmsg)
maybe_err = function(list.of.columns) {
if (any(lengths(list.of.columns) != 1L)) stopf("Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.")
list.of.columns
}
dat = dat[, eval(fun.call), by=c(varnames)]
dat = dat[, maybe_err(eval(fun.call)), by=c(varnames)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could possibly affect the code path in [.
We check for eval in j to handle that specially, not sure if we are checking for nested eval.
I would go with new env argument, or eventually modify fun.call by prefixing it with maybe_err call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking for efficiency? Otherwise passing tests ensures your concern is moot right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly for efficiency but just touching the edge like that raises some concerns

}
order_ = function(x) {
o = forderv(x, retGrp=TRUE, sort=TRUE)
Expand Down Expand Up @@ -211,7 +210,12 @@ dcast.data.table = function(data, formula, fun.aggregate = NULL, sep = "_", ...,
}
maplen = vapply_1i(mapunique, length)
idx = do.call("CJ", mapunique)[map, 'I' := .I][["I"]] # TO DO: move this to C and avoid materialising the Cross Join.
ans = .Call(Cfcast, lhs, val, maplen[[1L]], maplen[[2L]], idx, fill, fill.default, is.null(fun.call))
some_fill = anyNA(idx)
fill.default = if (run_agg_funs && is.null(fill) && some_fill) dat_for_default_fill[, maybe_err(eval(fun.call))]
if (run_agg_funs && is.null(fill) && some_fill) {
fill.default = dat_for_default_fill[0L][, maybe_err(eval(fun.call))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

}
ans = .Call(Cfcast, lhs, val, maplen[[1L]], maplen[[2L]], idx, fill, fill.default, is.null(fun.call), some_fill)
allcols = do.call("paste", c(rhs, sep=sep))
if (length(valnames) > 1L)
allcols = do.call("paste", if (identical(".", allcols)) list(valnames, sep=sep)
Expand Down
15 changes: 15 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -3729,6 +3729,21 @@ test(1100, dt1[dt2,roll=-Inf,rollends=c(FALSE,TRUE)]$ind, INT(NA,NA,1,2,2,2,2,2,
DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE), z=sample(letters[1:2],20,TRUE), d1=runif(20), d2=1L)
test(1102.38, names(dcast(DT, x ~ y + z, fun.aggregate=length, value.var = "d2", sep=".")),
c("x", "1.a", "1.b", "2.a", "2.b"))

# test for #5512, only compute default fill if needed.
DT = data.table(chr=c("a","b","b"), int=1:3, dbl=as.double(4:6))
mymin <- function(x){
if (!length(x)) stop("calling mymin on vector of length 0")
min(x)
}
test(1102.39, dcast(DT, . ~ chr, mymin, value.var="int"), data.table(.=".",a=1L,b=2L,key=".")) # fill not used in output, so default fill not computed.
ans <- data.table(int=1:3, a=c(1L,NA,NA), b=c(NA,2L,3L), key="int")
test(1102.40, dcast(DT, int ~ chr, min, value.var="int"), ans, warning=c("no non-missing arguments to min; returning Inf", "inf (type 'double') at RHS position 1 out-of-range(NA) or truncated (precision lost) when assigning to type 'integer' (target vector)")) # warning emitted when coercing default fill since as.integer(min(integer()) is Inf) is NA.
test(1102.41, dcast(DT, int ~ chr, mymin, value.var="int", fill=NA), ans) # because fill=NA is provided by user, no need to call mymin(integer()).
test(1102.42, dcast(DT, int ~ chr, min, value.var="dbl"), data.table(int=1:3, a=c(4,Inf,Inf), b=c(Inf,5,6), key="int"), warning="no non-missing arguments to min; returning Inf") # only one warning, because no coercion.
test(1102.43, dcast(DT, int ~ chr, min, value.var="dbl", fill="coerced to NA"), data.table(int=1:3, a=c(4,NA,NA), b=c(NA,5,6), key="int"), warning=c("Coercing 'character' RHS to 'double' to match the type of target vector.", "NAs introduced by coercion"))
test(1102.44, dcast(DT, int ~ ., value.var="dbl", fill="ignored"), data.table(int=1:3, .=c(4,5,6), key="int"))

}

# test for freading commands
Expand Down
2 changes: 1 addition & 1 deletion man/dcast.data.table.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
\item{\dots}{Any other arguments that may be passed to the aggregating function.}
\item{margins}{Not implemented yet. Should take variable names to compute margins on. A value of \code{TRUE} would compute all margins.}
\item{subset}{Specified if casting should be done on a subset of the data. Ex: \code{subset = .(col1 <= 5)} or \code{subset = .(variable != "January")}.}
\item{fill}{Value with which to fill missing cells. If \code{fun.aggregate} is present, takes the value by applying the function on a 0-length vector.}
\item{fill}{Value with which to fill missing cells. If \code{fill=NULL} and missing cells are present, then \code{fun.aggregate} is used on a 0-length vector to obtain a fill value.}
\item{drop}{\code{FALSE} will cast by including all missing combinations.

\code{c(FALSE, TRUE)} will only include all missing combinations of formula \code{LHS}; \code{c(TRUE, FALSE)} will only include all missing combinations of formula RHS. See Examples.}
Expand Down
2 changes: 1 addition & 1 deletion src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -289,7 +289,7 @@ SEXP setlistelt(SEXP, SEXP, SEXP);
SEXP address(SEXP);
SEXP expandAltRep(SEXP);
SEXP fmelt(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP fcast(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP fcast(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP issorted(SEXP, SEXP);
SEXP gforce(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP gsum(SEXP, SEXP);
Expand Down
28 changes: 17 additions & 11 deletions src/fcast.c
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
// raise(SIGINT);

// TO DO: margins
SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fill, SEXP fill_d, SEXP is_agg) {
SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fill, SEXP fill_d, SEXP is_agg, SEXP some_fillArg) {
int nrows=INTEGER(nrowArg)[0], ncols=INTEGER(ncolArg)[0];
int nlhs=length(lhs), nval=length(val), *idx = INTEGER(idxArg);
SEXP target;
Expand All @@ -15,24 +15,28 @@ SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fil
SET_VECTOR_ELT(ans, i, VECTOR_ELT(lhs, i));
}
// get val cols
bool some_fill = LOGICAL(some_fillArg)[0];
for (int i=0; i<nval; ++i) {
const SEXP thiscol = VECTOR_ELT(val, i);
SEXP thisfill = fill;
const SEXPTYPE thistype = TYPEOF(thiscol);
int nprotect = 0;
if (isNull(fill)) {
if (LOGICAL(is_agg)[0]) {
thisfill = PROTECT(allocNAVector(thistype, 1)); nprotect++;
} else thisfill = VECTOR_ELT(fill_d, i);
}
if (isVectorAtomic(thiscol)) { // defer error handling to below, but also skip on list
thisfill = PROTECT(coerceAs(thisfill, thiscol, /*copyArg=*/ScalarLogical(false))); nprotect++;
if(some_fill){
if (isNull(fill)) {
if (LOGICAL(is_agg)[0]) {
thisfill = PROTECT(allocNAVector(thistype, 1)); nprotect++;
} else thisfill = VECTOR_ELT(fill_d, i);
}
if (isVectorAtomic(thiscol)) { // defer error handling to below, but also skip on list
thisfill = PROTECT(coerceAs(thisfill, thiscol, /*copyArg=*/ScalarLogical(false))); nprotect++;
}
}
switch (thistype) {
case INTSXP:
case LGLSXP: {
const int *ithiscol = INTEGER(thiscol);
const int *ithisfill = INTEGER(thisfill);
const int *ithisfill = 0;
if (some_fill) ithisfill = INTEGER(thisfill);
for (int j=0; j<ncols; ++j) {
SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
int *itarget = INTEGER(target);
Expand All @@ -45,7 +49,8 @@ SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fil
} break;
case REALSXP: {
const double *dthiscol = REAL(thiscol);
const double *dthisfill = REAL(thisfill);
const double *dthisfill = 0;
if (some_fill) dthisfill = REAL(thisfill);
for (int j=0; j<ncols; ++j) {
SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
double *dtarget = REAL(target);
Expand All @@ -58,7 +63,8 @@ SEXP fcast(SEXP lhs, SEXP val, SEXP nrowArg, SEXP ncolArg, SEXP idxArg, SEXP fil
} break;
case CPLXSXP: {
const Rcomplex *zthiscol = COMPLEX(thiscol);
const Rcomplex *zthisfill = COMPLEX(thisfill);
const Rcomplex *zthisfill = 0;
if (some_fill) zthisfill = COMPLEX(thisfill);
for (int j=0; j<ncols; ++j) {
SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
Rcomplex *ztarget = COMPLEX(target);
Expand Down
Loading