alloc.col now also as setalloccol, closes #3475
jangorecki committed Aug 27, 2019
1 parent 81af9b6 commit 94ff977
Showing 6 changed files with 14 additions and 9 deletions.
2 changes: 1 addition & 1 deletion NAMESPACE
@@ -11,7 +11,7 @@ export(set2key, set2keyv, key2) # deprecated with helpful error; remove after Ma
export(as.data.table,is.data.table,test.data.table)
export(last,first,like,"%like%","%ilike%","%flike%",between,"%between%",inrange,"%inrange%")
export(timetaken)
export(truelength, alloc.col, ":=")
export(truelength, setalloccol, alloc.col, ":=")
export(setattr, setnames, setcolorder, set, setDT, setDF)
export(setorder, setorderv)
export(setNumericRounding, getNumericRounding)
2 changes: 2 additions & 0 deletions NEWS.md
@@ -274,6 +274,8 @@

16. Rolling functions (`?froll`) coerce `logical` input to `numeric` (instead of failing) to mimic the behavior of `integer` input.

17. Function `alloc.col` has a new name, `setalloccol`, for consistency with the `set*` prefix used by functions that operate in-place. The name `alloc.col` is not going to be deprecated, but we recommend using `setalloccol`, [#3475](https://github.com/Rdatatable/data.table/issues/3475).


### Changes in [v1.12.2](https://github.com/Rdatatable/data.table/milestone/14?closed=1) (07 Apr 2019)

2 changes: 1 addition & 1 deletion R/data.table.R
@@ -2381,7 +2381,7 @@ shallow = function(x, cols=NULL) {
ans
}

alloc.col = function(DT, n=getOption("datatable.alloccol"), verbose=getOption("datatable.verbose"))
setalloccol = alloc.col = function(DT, n=getOption("datatable.alloccol"), verbose=getOption("datatable.verbose"))
{
name = substitute(DT)
if (identical(name,quote(`*tmp*`))) stop("alloc.col attempting to modify `*tmp*`")
2 changes: 1 addition & 1 deletion man/as.data.table.Rd
@@ -51,7 +51,7 @@ is.data.table(x)
\code{keep.rownames} argument can be used to preserve the (row)names attribute in the resulting \code{data.table}.
}
\seealso{
\code{\link{data.table}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{J}}, \code{\link{SJ}}, \code{\link{CJ}}, \code{\link{merge.data.table}}, \code{\link{:=}}, \code{\link{alloc.col}}, \code{\link{truelength}}, \code{\link{rbindlist}}, \code{\link{setNumericRounding}}, \code{\link{datatable-optimize}}
\code{\link{data.table}}, \code{\link{setDT}}, \code{\link{setDF}}, \code{\link{copy}}, \code{\link{setkey}}, \code{\link{J}}, \code{\link{SJ}}, \code{\link{CJ}}, \code{\link{merge.data.table}}, \code{\link{:=}}, \code{\link{setalloccol}}, \code{\link{truelength}}, \code{\link{rbindlist}}, \code{\link{setNumericRounding}}, \code{\link{datatable-optimize}}
}
\examples{
nn = c(a=0.1, b=0.2, c=0.3, d=0.4)
13 changes: 8 additions & 5 deletions man/truelength.Rd
@@ -1,12 +1,16 @@
\name{truelength}
\alias{truelength}
\alias{setalloccol}
\alias{alloc.col}
\title{ Over-allocation access }
\description{
These functions are experimental and somewhat advanced. By \emph{experimental} we mean their names might change and perhaps the syntax, argument names and types. So if you write a lot of code using them, you have been warned! They should work and be stable, though, so please report problems with them.
These functions are experimental and somewhat advanced. By \emph{experimental} we mean their names might change and perhaps the syntax, argument names and types. So if you write a lot of code using them, you have been warned! They should work and be stable, though, so please report problems with them. \code{alloc.col} is just an alias for \code{setalloccol}. We recommend using \code{setalloccol} from now on. \code{alloc.col} is not going to be deprecated, but the \code{set*} prefix in the name \code{setalloccol} makes it clear that the input argument is modified in-place.
}
\usage{
truelength(x)
setalloccol(DT,
n = getOption("datatable.alloccol"), # default: 1024L
verbose = getOption("datatable.verbose")) # default: FALSE
alloc.col(DT,
n = getOption("datatable.alloccol"), # default: 1024L
verbose = getOption("datatable.verbose")) # default: FALSE
@@ -20,25 +24,24 @@ alloc.col(DT,
\details{
When adding columns by reference using \code{:=}, we \emph{could} simply create a new column list vector (one longer) and memcpy over the old vector, with no copy of the column vectors themselves. That requires negligible use of space and time, and is what v1.7.2 did. However, that copy of the list vector of column pointers only (but not the columns themselves), a \emph{shallow copy}, resulted in inconsistent behaviour in some circumstances. So, as from v1.7.3 data.table over allocates the list vector of column pointers so that columns can be added fully by reference, consistently.

When the allocated column pointer slots are used up, to add a new column \code{data.table} must reallocate that vector. If two or more variables are bound to the same data.table this shallow copy may or may not be desirable, but we don't think this will be a problem very often (more discussion may be required on data.table issue tracker). Setting \code{options(datatable.verbose=TRUE)} includes messages if and when a shallow copy is taken. To avoid shallow copies there are several options: use \code{\link{copy}} to make a deep copy first, use \code{alloc.col} to reallocate in advance, or, change the default allocation rule (perhaps in your .Rprofile); e.g., \code{options(datatable.alloccol=10000L)}.
When the allocated column pointer slots are used up, to add a new column \code{data.table} must reallocate that vector. If two or more variables are bound to the same data.table this shallow copy may or may not be desirable, but we don't think this will be a problem very often (more discussion may be required on data.table issue tracker). Setting \code{options(datatable.verbose=TRUE)} includes messages if and when a shallow copy is taken. To avoid shallow copies there are several options: use \code{\link{copy}} to make a deep copy first, use \code{setalloccol} to reallocate in advance, or, change the default allocation rule (perhaps in your .Rprofile); e.g., \code{options(datatable.alloccol=10000L)}.
Please note : over allocation of the column pointer vector is not for efficiency \emph{per se}; it is so that \code{:=} can add columns by reference without a shallow copy.
}
\value{
\code{truelength(x)} returns the length of the vector allocated in memory. \code{length(x)} of those items are in use. Currently, it is just the list vector of column pointers that is over-allocated (i.e. \code{truelength(DT)}), not the column vectors themselves, which would in future allow fast row \code{insert()}. For tables loaded from disk however, \code{truelength} is 0 in \R 2.14.0+ (and random in \R <= 2.13.2), which is perhaps unexpected. \code{data.table} detects this state and over-allocates the loaded \code{data.table} when the next column addition occurs. All other operations on \code{data.table} (such as fast grouping and joins) do not need \code{truelength}.
\code{alloc.col} \emph{reallocates} \code{DT} by reference. This may be useful for efficiency if you know you are about to going to add a lot of columns in a loop. It also returns the new \code{DT}, for convenience in compound queries.
\code{setalloccol} \emph{reallocates} \code{DT} by reference. This may be useful for efficiency if you know you are about to add a lot of columns in a loop. It also returns the new \code{DT}, for convenience in compound queries.
}
\seealso{ \code{\link{copy}} }
\examples{
DT = data.table(a=1:3,b=4:6)
length(DT) # 2 column pointer slots used
truelength(DT) # 1026 column pointer slots allocated
alloc.col(DT,2048)
setalloccol(DT, 2048)
length(DT) # 2 used
truelength(DT) # 2050 allocated, 2048 free
DT[,c:=7L] # add new column by assigning to spare slot
truelength(DT)-length(DT) # 2047 slots spare
}
\keyword{ data }
2 changes: 1 addition & 1 deletion vignettes/datatable-faq.Rmd
@@ -573,7 +573,7 @@ DT[ , b := rnorm(5)] # 'replace' integer column with a numeric column

## Reading data.table from RDS or RData file

`*.RDS` and `*.RData` are file types which can store in-memory R objects on disk efficiently. However, storing data.table into the binary file loses its column over-allocation. This isn't a big deal -- your data.table will be copied in memory on the next _by reference_ operation and throw a warning. Therefore it is recommended to call `alloc.col()` on each data.table loaded with `readRDS()` or `load()` calls.
`*.RDS` and `*.RData` are file types which can store in-memory R objects on disk efficiently. However, storing a data.table in such a binary file loses its column over-allocation. This isn't a big deal -- your data.table will be copied in memory on the next _by reference_ operation, which throws a warning. Therefore it is recommended to call `setalloccol()` on each data.table loaded with `readRDS()` or `load()` calls.

# General questions about the package


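For reference, a minimal sketch of how the renamed function might be used after this commit, following the FAQ recommendation in the last hunk. It is not part of the commit itself. It assumes a data.table build that exports `setalloccol`; the temporary file path `f` and the object name `DT2` are hypothetical and used only for illustration.

```r
library(data.table)  # assumption: a build that includes this commit

# The chained assignment in R/data.table.R binds both names to one function,
# so this comparison checks that the alias and the new name are the same object.
identical(setalloccol, alloc.col)

DT = data.table(a = 1:3, b = 4:6)
f = tempfile(fileext = ".rds")  # hypothetical temporary path
saveRDS(DT, f)

DT2 = readRDS(f)
truelength(DT2)   # over-allocation is not stored on disk
setalloccol(DT2)  # re-allocate spare column pointer slots by reference
truelength(DT2)   # now includes the spare slots
DT2[, c := 7L]    # add a column by reference into a spare slot
unlink(f)
```

Calling `setalloccol(DT2)` with no `n` uses `getOption("datatable.alloccol")`, as shown in the usage section of `man/truelength.Rd`.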