Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Submitting collapse for the Seal of Approval: Fix Typos. #47

Merged
merged 8 commits into from
Sep 27, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
128 changes: 128 additions & 0 deletions posts/2024-09-21-seal_of_approval-collapse/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
---
title: "Seal of Approval: collapse"
author: "Sebastian Krantz"
date: "Sept 21, 2024"
categories: [seal of approval, partner package]
image: "hex.png"
draft: true
---

## [`collapse`](https://github.com/SebKrantz/collapse)

*Author(s):* Sebastian Krantz

*Maintainer:* Sebastian Krantz (sebastian.krantz@graduateinstitute.ch)

![collapse hex sticker](hex.png)

`collapse` is a large C/C++-based infrastructure package facilitating complex statistical computing, data transformation, and exploration tasks in R - at outstanding levels of performance and memory efficiency. It also implements a class-agnostic approach to R programming supporting vector, matrix and data frame-like objects (including *xts*, *tibble*, ***data.table***, and *sf*). It has a stable API, depends on *Rcpp*, and supports R versions >= 3.4.0.

## Relationship with `data.table`

At the C-level, `collapse` took much inspiration from `data.table`, and leverages some of its core algorithms like radixsort, while adding significant [statistical functionality](https://sebkrantz.github.io/collapse/reference/collapse-documentation.html) and [new algorithms](https://sebkrantz.github.io/collapse/reference/fast-grouping-ordering.html) within a [class-agnostic programming framework](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html) that seamlessly supports `data.table`. Notably, [`collapse::qDT()`](https://sebkrantz.github.io/collapse/reference/quick-conversion.html) is a highly efficient anything to `data.table` converter, and all manipulation functions in `collapse` return a valid `data.table` object when a `data.table` is passed, enabling subsequent reference operations (`:=`).

Its added functionality includes a rich set of [Fast Statistical Functions](https://sebkrantz.github.io/collapse/reference/fast-statistical-functions.html) supporting vectorized (grouped, weighted) statistical operations on matrix-like objects. These are integrated with fast [data manipulation functions](https://sebkrantz.github.io/collapse/reference/fast-data-manipulation.html) in a way that also [more complex statistical expressions can be vectorized across groups](https://andrewghazi.github.io/posts/collapse_is_sick/sick.html). It also adds [flexible time series functions and classes](https://sebkrantz.github.io/collapse/reference/time-series-panel-series.html) supporting irregular series and panels, [(panel-)data transformations](https://sebkrantz.github.io/collapse/reference/data-transformations.html), [vectorized hash-joins](https://sebkrantz.github.io/collapse/reference/join.html), [fast aggregation and recast pivots](https://sebkrantz.github.io/collapse/reference/pivot.html), [(internal) support for variable labels](https://sebkrantz.github.io/collapse/reference/small-helpers.html), [powerful descriptive tools](https://sebkrantz.github.io/collapse/reference/summary-statistics.html), [memory efficient programming tools](https://sebkrantz.github.io/collapse/reference/efficient-programming.html), and [recursive tools for heterogeneous nested data](https://sebkrantz.github.io/collapse/reference/list-processing.html).

It is [highly and interactively configurable](https://sebkrantz.github.io/collapse/reference/collapse-options.html). A navigable [internal documentation/overview](https://sebkrantz.github.io/collapse/reference/collapse-documentation.html) facilitates its use.

## Overview

The easiest way to load `collapse` and `data.table` together is via the [`fastverse` package](https://github.com/fastverse/fastverse):

```{r, include = FALSE}
options(fastverse.styling = FALSE, width = 120)
```

```{r}
library(fastverse)
```

This demonstrates `collapse`'s *deep integration* with `data.table`.

```{r}
mtcarsDT <- qDT(mtcars) # This creates a valid data.table (no deep copy)
mtcarsDT[, new := mean(mpg), by = cyl] # Proof: no warning here
```

There are many reasons to use `collapse`, e.g., to compute advanced statistics very fast:

```{r}
# Fast tidyverse-like functions: one of the ways to code with collapse
mtcDTagg <- mtcarsDT |>
fgroup_by(cyl, vs, am) |>
fsummarise(mpg_wtd_median = fmedian(mpg, wt), # Weighted median
mpg_wtd_p90 = fnth(mpg, 0.9, wt, ties = "q8"), # Weighted 90% quantile type 8
mpg_wtd_mode = fmode(mpg, wt, ties = "max"), # Weighted maximum mode
mpg_range = fmax(mpg) %-=% fmin(mpg), # Range: vectorized and memory efficient
lm_mpg_carb = fsum(mpg, W(carb)) %/=% fsum(W(carb)^2)) # coef(lm(mpg ~ carb)): vectorized
# Note: for increased parsimony, can abbreviate fgroup_by -> gby, fsummarise -> smr
mtcDTagg[, new2 := 1][1:3] # Still a data.table
```

Or simply, convenience functions like `collap()` for fast multi-type aggregation:

```{r}
# World Development Dataset (see ?wlddev)
head(wlddev, 3)
# Population weighted mean for numeric and mode for non-numeric columns (multithreaded and
# vectorized across groups and columns, the default in statistical functions is na.rm = TRUE)
wlddev |> collap(~ year + income, fmean, fmode, w = ~ POP, nthreads = 4) |> ss(1:3)
```

We can also use the low-level API for statistical programming:

```{r}
# Grouped mean
fmean(mtcars$mpg, mtcars$g)
# Grouping object from multiple columns
g <- GRP(mtcars, c("cyl", "vs", "am"))
fmean(mtcars$mpg, g)
vars <- c("carb", "hp", "qsec") # columns to aggregate
# Aggregating: weighted mean - vectorized across groups and columns
add_vars(g$groups, # Grouping columns
fmean(get_vars(mtcars, vars), g,
w = mtcars$wt, use.g.names = FALSE)
)
# Let's aggregate a matrix
m <- matrix(abs(rnorm(32^2)), 32)
m |> fmean(g) |> t() |> fmean(g) |> t()
# Normalizing the columns, by reference
fsum(m, TRA = "/", set = TRUE)
fsum(m) # Check
# Multiply the rows with a vector (by reference)
setop(m, "*", mtcars$mpg, rowwise = TRUE)
# Replace some elements with a number
setv(m, 3:40, 5.76) # Could also use a vector to copy from
whichv(m, 5.76) # get the indices back...
```

It is also fairly easy to do more involved data exploration and manipulation:

```{r}
# Groningen Growth and Development Center 10 Sector Database (see ?GGDC10S)
namlab(GGDC10S, N = TRUE, Ndistinct = TRUE, class = TRUE)
# Describe total Employment and Value-Added
descr(GGDC10S, SUM ~ Variable)
# Compute growth rate (Employment and VA, all sectors)
GGDC10S_growth <- tfmv(GGDC10S, AGR:SUM, fgrowth, # tfmv = transform variables. Alternatively: fmutate(across(...))
g = list(Country, Variable), t = Year, # Internal grouping and ordering, passed to fgrowth()
apply = FALSE) # apply = FALSE ensures we call fgrowth.data.frame

# Recast the dataset, median growth rate across years, taking along variable labels
GGDC_med_growth <- pivot(GGDC10S_growth,
ids = c("Country", "Regioncode", "Region"),
values = slt(GGDC10S, AGR:SUM, return = "names"), # slt = shorthand for fselect()
names = list(from = "Variable", to = "Sectorcode"),
labels = list(to = "Sector"),
FUN = fmedian, # Fast function = vectorized
how = "recast" # Recast (transposition) method
) |> qDT()
GGDC_med_growth[1:3]

# Finally, lets just join this to wlddev, enabling multiple matches (cartesian product)
# -> on average 61 years x 11 sectors = 671 records per unique (country) match
join(wlddev, GGDC_med_growth, on = c("iso3c" = "Country"),
how = "inner", multiple = TRUE) |> ss(1:3)
```

**In summary:** `collapse` provides flexible high-performance statistical and data manipulation tools, which extend and seamlessly integrate with `data.table`. The package follows a similar development philosophy emphasizing API stability, parsimonious syntax, and zero dependencies (apart from `Rcpp`). `data.table` users may wish to employ `collapse` for some of the advanced statistical and manipulation functionality showcased above, but also to efficiently manipulate other data frame-like objects, such as [`sf` data frames](https://sebkrantz.github.io/collapse/articles/collapse_and_sf.html).