rdatatable-community · kbodwin · Sep 27, 2024 · Sep 21, 2024 · Sep 21, 2024 · Sep 21, 2024
diff --git a/posts/2024-09-21-seal_of_approval-collapse/hex.png b/posts/2024-09-21-seal_of_approval-collapse/hex.png
diff --git a/posts/2024-09-21-seal_of_approval-collapse/index.qmd b/posts/2024-09-21-seal_of_approval-collapse/index.qmd
@@ -0,0 +1,128 @@
+---
+title: "Seal of Approval: collapse"
+author: "Sebastian Krantz"
+date: "Sept 21, 2024"
+categories: [seal of approval, partner package]
+image: "hex.png"
+draft: true
+---
+
+## [`collapse`](https://github.com/SebKrantz/collapse)
+
+*Author(s):*	Sebastian Krantz
+
+*Maintainer:*	Sebastian Krantz (sebastian.krantz@graduateinstitute.ch)
+
+![collapse hex sticker](hex.png)
+
+`collapse` is a large C/C++-based infrastructure package facilitating complex statistical computing, data transformation, and exploration tasks in R - at outstanding levels of performance and memory efficiency. It also implements a class-agnostic approach to R programming supporting vector, matrix and data frame-like objects (including *xts*, *tibble*, ***data.table***, and *sf*). It has a stable API, depends on *Rcpp*, and supports R versions >= 3.4.0. 
+
+## Relationship with `data.table`
+
+At the C-level, `collapse` took much inspiration from `data.table`, and leverages some of its core algorithms like radixsort, while adding significant [statistical functionality](https://sebkrantz.github.io/collapse/reference/collapse-documentation.html) and [new algorithms](https://sebkrantz.github.io/collapse/reference/fast-grouping-ordering.html) within a [class-agnostic programming framework](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html) that seamlessly supports `data.table`. Notably, [`collapse::qDT()`](https://sebkrantz.github.io/collapse/reference/quick-conversion.html) is a highly efficient anything to `data.table` converter, and all manipulation functions in `collapse` return a valid `data.table` object when a `data.table` is passed, enabling subsequent reference operations (`:=`).
+
+Its added functionality includes a rich set of [Fast Statistical Functions](https://sebkrantz.github.io/collapse/reference/fast-statistical-functions.html) supporting vectorized (grouped, weighted) statistical operations on matrix-like objects. These are integrated with fast [data manipulation functions](https://sebkrantz.github.io/collapse/reference/fast-data-manipulation.html) in a way that also [more complex statistical expressions can be vectorized across groups](https://andrewghazi.github.io/posts/collapse_is_sick/sick.html). It also adds [flexible time series functions and classes](https://sebkrantz.github.io/collapse/reference/time-series-panel-series.html) supporting irregular series and panels, [(panel-)data transformations](https://sebkrantz.github.io/collapse/reference/data-transformations.html), [vectorized hash-joins](https://sebkrantz.github.io/collapse/reference/join.html), [fast aggregation and recast pivots](https://sebkrantz.github.io/collapse/reference/pivot.html), [(internal) support for variable labels](https://sebkrantz.github.io/collapse/reference/small-helpers.html), [powerful descriptive tools](https://sebkrantz.github.io/collapse/reference/summary-statistics.html), [memory efficient programming tools](https://sebkrantz.github.io/collapse/reference/efficient-programming.html), and [recursive tools for heterogeneous nested data](https://sebkrantz.github.io/collapse/reference/list-processing.html). 
+
+It is [highly and interactively configurable](https://sebkrantz.github.io/collapse/reference/collapse-options.html). A navigable [internal documentation/overview](https://sebkrantz.github.io/collapse/reference/collapse-documentation.html) facilitates its use.
+
+## Overview
+
+The easiest way to load `collapse` and `data.table` together is via the [`fastverse` package](https://github.com/fastverse/fastverse): 
+
+```{r, include = FALSE}
+options(fastverse.styling = FALSE, width = 120)
+```
+
+```{r}
+library(fastverse)
+```
+
+This demonstrates `collapse`'s *deep integration* with `data.table`.
+
+```{r}
+mtcarsDT <- qDT(mtcars)                # This creates a valid data.table (no deep copy)
+mtcarsDT[, new := mean(mpg), by = cyl] # Proof: no warning here
+```
+
+There are many reasons to use `collapse`, e.g., to compute advanced statistics very fast:
+
+```{r}
+# Fast tidyverse-like functions: one of the ways to code with collapse
+mtcDTagg <- mtcarsDT |> 
+  fgroup_by(cyl, vs, am) |> 
+  fsummarise(mpg_wtd_median = fmedian(mpg, wt),             # Weighted median
+             mpg_wtd_p90 = fnth(mpg, 0.9, wt, ties = "q8"), # Weighted 90% quantile type 8
+             mpg_wtd_mode = fmode(mpg, wt, ties = "max"),   # Weighted maximum mode 
+             mpg_range = fmax(mpg) %-=% fmin(mpg),          # Range: vectorized and memory efficient   
+             lm_mpg_carb = fsum(mpg, W(carb)) %/=% fsum(W(carb)^2)) # coef(lm(mpg ~ carb)): vectorized
+# Note: for increased parsimony, can abbreviate fgroup_by -> gby, fsummarise -> smr
+mtcDTagg[, new2 := 1][1:3] # Still a data.table
+```
+
+Or simply, convenience functions like `collap()` for fast multi-type aggregation:
+
+```{r}
+# World Development Dataset (see ?wlddev)
+head(wlddev, 3) 
+# Population weighted mean for numeric and mode for non-numeric columns (multithreaded and 
+# vectorized across groups and columns, the default in statistical functions is na.rm = TRUE)
+wlddev |> collap(~ year + income, fmean, fmode, w = ~ POP, nthreads = 4) |> ss(1:3)
+```
+
+We can also use the low-level API for statistical programming:
+
+```{r}
+# Grouped mean
+fmean(mtcars$mpg, mtcars$g) 
+# Grouping object from multiple columns
+g <- GRP(mtcars, c("cyl", "vs", "am"))
+fmean(mtcars$mpg, g)
+vars <- c("carb", "hp", "qsec") # columns to aggregate
+# Aggregating: weighted mean - vectorized across groups and columns 
+add_vars(g$groups, # Grouping columns
+  fmean(get_vars(mtcars, vars), g, 
+        w = mtcars$wt, use.g.names = FALSE)
+)
+# Let's aggregate a matrix 
+m <- matrix(abs(rnorm(32^2)), 32)
+m |> fmean(g) |> t() |> fmean(g) |> t()
+# Normalizing the columns, by reference
+fsum(m, TRA = "/", set = TRUE)
+fsum(m) # Check
+# Multiply the rows with a vector (by reference)
+setop(m, "*", mtcars$mpg, rowwise = TRUE)
+# Replace some elements with a number
+setv(m, 3:40, 5.76) # Could also use a vector to copy from
+whichv(m, 5.76) # get the indices back...
+```
+
+It is also fairly easy to do more involved data exploration and manipulation:
+
+```{r}
+# Groningen Growth and Development Center 10 Sector Database (see ?GGDC10S)
+namlab(GGDC10S, N = TRUE, Ndistinct = TRUE, class = TRUE)
+# Describe total Employment and Value-Added
+descr(GGDC10S, SUM ~ Variable)
+# Compute growth rate (Employment and VA, all sectors)
+GGDC10S_growth <- tfmv(GGDC10S, AGR:SUM, fgrowth, # tfmv = transform variables. Alternatively: fmutate(across(...))
+                       g = list(Country, Variable), t = Year, # Internal grouping and ordering, passed to fgrowth()
+                       apply = FALSE) # apply = FALSE ensures we call fgrowth.data.frame
+
+# Recast the dataset, median growth rate across years, taking along variable labels 
+GGDC_med_growth <- pivot(GGDC10S_growth,
+  ids = c("Country", "Regioncode", "Region"),
+  values = slt(GGDC10S, AGR:SUM, return = "names"), # slt = shorthand for fselect()
+  names = list(from = "Variable", to = "Sectorcode"),
+  labels = list(to = "Sector"), 
+  FUN = fmedian,  # Fast function = vectorized
+  how = "recast"  # Recast (transposition) method
+) |> qDT()
+GGDC_med_growth[1:3]
+
+# Finally, lets just join this to wlddev, enabling multiple matches (cartesian product)
+# -> on average 61 years x 11 sectors = 671 records per unique (country) match
+join(wlddev, GGDC_med_growth, on = c("iso3c" = "Country"), 
+     how = "inner", multiple = TRUE) |> ss(1:3)
+```
+
+**In summary:** `collapse` provides flexible high-performance statistical and data manipulation tools, which extend and seamlessly integrate with `data.table`. The package follows a similar development philosophy emphasizing API stability, parsimonious syntax, and zero dependencies (apart from `Rcpp`). `data.table` users may wish to employ `collapse` for some of the advanced statistical and manipulation functionality showcased above, but also to efficiently manipulate other data frame-like objects, such as [`sf` data frames](https://sebkrantz.github.io/collapse/articles/collapse_and_sf.html).