Closes #686. Implemented 'rleid()', a convenience function.

Rdatatable · Jan 7, 2015 · b8c1b01
1 parent c54cb93
commit b8c1b01
Showing 5 changed files with 63 additions and 0 deletions.
diff --git a/NAMESPACE b/NAMESPACE
@@ -22,6 +22,7 @@ export(frank)
 export(frankv)
 export(address)
 export(.SD,.N,.I,.GRP,.BY,.EACHI)
+export(rleid)
 
 S3method("[", data.table)
 S3method("[<-", data.table)

diff --git a/R/data.table.R b/R/data.table.R
@@ -2345,6 +2345,31 @@ setDT <- function(x, giveNames=TRUE, keep.rownames=FALSE) {
     invisible(x)
 }
 
+# FR #686
+rleid <- function(x, cols=seq_along(x)) {
+    as_list <- function(x) {
+        xx = vector("list", 1L)
+        .Call(Csetlistelt, xx, 1L, x)
+        xx
+    }
+    if (is.atomic(x)) {
+        if (!missing(cols) && !is.null(cols)) 
+            stop("x is a single vector, non-NULL 'cols' doesn't make sense")
+        cols = 1L
+        x = as_list(x)
+    } else {
+        if (!length(cols))
+            stop("x is a list, 'cols' can not be 0-length")
+        if (is.character(cols)) 
+            cols = chmatch(cols, names(x))
+        cols = as.integer(cols)
+    }
+    x = .shallow(x, cols) # shallow copy even if list..
+    setDT(x)
+    ulist = uniqlist(x)
+    rep.int(seq_along(ulist), uniqlengths(ulist, nrow(x)))
+}
+
 gsum <- function(x, na.rm=FALSE) .Call(Cgsum, x, na.rm)
 gmean <- function(x, na.rm=FALSE) .Call(Cgmean, x, na.rm)
 gmin <- function(x, na.rm=FALSE) .Call(Cgmin, x, na.rm)

diff --git a/README.md b/README.md
@@ -26,6 +26,8 @@
 
   6. `frank()` is now implemented. It's much faster than `base::rank` and does more. It accepts *vectors*, *lists* with all elements of equal lengths, *data.frames* and *data.tables*, and optionally takes a `cols` argument. In addition to implementing all the `ties.method` methods available from `base::rank`, it also implements *dense rank*. See `?frank` for more. Closes [#760](https://github.com/Rdatatable/data.table/issues/760) and [#771](https://github.com/Rdatatable/data.table/issues/771)
 
+  7. `rleid()`, a convenience function for generating a run-length type id column to be used in grouping operations, is now implemented. Closes [#686](https://github.com/Rdatatable/data.table/issues/686). See the examples section of `?rleid` for usage scenarios.
+
 #### BUG FIXES
 
   1. `if (TRUE) DT[,LHS:=RHS]` no longer prints, [#869](https://github.com/Rdatatable/data.table/issues/869). Tests added. To get this to work we've had to live with one downside: if a `:=` is used inside a function with no `DT[]` before the end of the function, then the next time `DT` is typed at the prompt, nothing will be printed. A repeated `DT` will print. To avoid this: include a `DT[]` after the last `:=` in your function. If that is not possible (e.g., it's not a function you can change) then `print(DT)` and `DT[]` at the prompt are guaranteed to print. As before, adding an extra `[]` on the end of `:=` query is a recommended idiom to update and then print; e.g. `> DT[,foo:=3L][]`. Thanks to Jureiss for reporting.

diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -5716,6 +5716,11 @@ test(1463.24, shift(x,1L, 0L, type="lead"), list(as.character(c(2:5, 0L))))
 
 # add tests for date and factor?
 
+# FR #686
+DT = data.table(grp=rep(c("A", "B", "C", "A", "B"), 
+          c(2,2,3,1,2)), value=1:10)
+test(1464, rleid(DT, "grp"), c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 5L, 5L))
+
 ##########################
 
 

diff --git a/man/rleid.Rd b/man/rleid.Rd
@@ -0,0 +1,30 @@
+\name{rleid}
+\alias{rleid}
+\title{ Generate run-length type group id}
+\description{
+   A convenience function for generating a \emph{run-length} type \emph{id} column to be used in grouping operations. It accepts atomic vectors, lists, data.frames or data.tables as input.
+}
+\usage{
+rleid(x, cols=seq_along(x))
+}
+\arguments{
+  \item{x}{ A vector, list, data.frame or data.table. }
+  \item{cols}{ Only meaningful for lists, data.frames or data.tables. A character vector of column names, or a vector of column numbers, of \code{x}. }
+}
+\details{
+    At times, aggregation (or grouping) operations need to treat consecutive runs of identical values as belonging to the same group (see \code{\link[base]{rle}}). The need for such a function has come up repeatedly on StackOverflow; see the \code{See Also} section. This function generates \emph{"run-length"} groups directly.
+}
+\value{
+    An integer vector with the same length as \code{NROW(x)}.
+}
+\examples{
+DT = data.table(grp=rep(c("A", "B", "C", "A", "B"), c(2,2,3,1,2)), value=1:10)
+rleid(DT, "grp") # get run-length ids
+# get sum of value over run-length groups
+DT[, sum(value), by=.(grp, rleid(grp))]
+
+}
+\seealso{
+  \code{\link{data.table}}, \url{http://stackoverflow.com/q/21421047/559784}
+}
+\keyword{ data }
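
For readers without the package at hand, the ids that test 1464 expects can be reproduced in base R. This is only a rough sketch for a single atomic vector: `rleid_sketch` is a hypothetical helper built on `base::rle`, not the implementation in this commit (which goes through `uniqlist` and `uniqlengths`), and its treatment of `NA` values may differ from `rleid()`.

```r
# Hypothetical base-R helper: one id per run of consecutive identical values.
# Assumption: x is a single atomic vector (no 'cols' handling, no lists).
rleid_sketch <- function(x) {
  r <- rle(x)                               # run lengths of consecutive values
  rep.int(seq_along(r$lengths), r$lengths)  # repeat each run's index by its length
}

# Same data as in the test and the Rd example above.
grp   <- rep(c("A", "B", "C", "A", "B"), c(2, 2, 3, 1, 2))
value <- 1:10

ids  <- rleid_sketch(grp)
# Sum of 'value' within each run, analogous to grouping by the run-length id.
sums <- as.vector(tapply(value, ids, sum))
```

Here `ids` has one entry per element of `grp`, so it can be used directly as a grouping variable. The second run of "A" gets its own id rather than being merged with the first.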