index.Rmd

---
title: "Tooling update"
author: "Slides: Logan Brooks, Daniel McDonald, Ryan Tibshirani"
#
# NOTE KEEP THESE LINKS; they may be required here by upstream GPL-3 slides, CC attribution
date: "2022-12-13 <br /> <small> View online: <https://brookslogan.github.io/tooling-slides-for-modelers/> <br /> Source: <https://github.com/brookslogan/tooling-slides-for-modelers/> </small>"
#
output:
  revealjs::revealjs_presentation:
    theme: night
    highlight: zenburn
    transition: convex
    # Nicer, but takes more disk space:
    self_contained: false
    reveal_plugins: ["notes","search","menu"]
    #
    # # Self-contained html:
    # self_contained: true
---

# Context

Last year
: software community working group scoped out initial ~~`epitools`~~ `epiprocess` package

This year
: Delphi iterated on / developed `epiprocess`, `epipredict`, `epidat{r,py}`

This presentation
: Focus: capabilities of `epiprocess`

Note
: We're in the middle of changes to the `epiprocess` interface to improve
  first-time use. We hope to have these done by the beginning of next semester.

<!-- Some front matter is placed here rather than above in order to avoid
generating an empty slide between the title slide and the first real non-title
slide: -->

<!-- Show & update continuously with
`xaringan::inf_mr("index.Rmd")`.
-->

<!-- Add scroll bars within slides: -->
```{css, echo=FALSE}
.slide {
    height: 750px;
    overflow-y: auto !important;
}
```

```{r, echo=FALSE, message=FALSE}
library(epiprocess)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(ggplot2)

theme_set(theme_bw())
```

<!-- Place plots on a separate second-lever slide from plotting code,
     repeating the most recent second-level slide title: -->
```{r, include=FALSE}
# knitr::knit_hooks$restore("text") # encounters error looking up default
default_text_hook = knitr::knit_hooks$get("text")
previous_l2_heading = NULL
knitr::knit_hooks$set(text = function(x) {
  previous_l2_heading <<- stringi::stri_match_last_regex(x, pattern="\n##.*\n")
  default_text_hook(x)
})
# knitr::knit_hooks$restore("plot") # encounters error looking up default
default_plot_hook = knitr::knit_hooks$get("plot")
knitr::knit_hooks$set(plot = function(x, options) {
  if (identical(knitr::opts_current$get("echo"), FALSE)) {
    default_plot_hook(x, options)
  } else {
    paste0(previous_l2_heading, default_plot_hook(x, options))
  }
})
```

<!-- TODO change this back to have epidatr, epidatpy below -->

## {data-background-image=`r xfun::base64_uri("resources/ToolingOverview.png")`}

# `{epiprocess}`

<https://cmu-delphi.github.io/epiprocess>

## `epiprocess::epi_df`

Snapshot of an epi surveillance data set; a specialized data frame.

```{r}
jhu_csse_daily_subset
```

```{r, echo=FALSE, results="asis"}
withr::with_options(
  code={
    cat("<details><summary>Data object documentation, license, attribution</summary>")
    print(help("jhu_csse_daily_subset", package="epiprocess", help_type="text"))
    cat("</details>")
  },
  list(pager=function(files, header, title, delete.file) {
    on.exit({
      unlink(files)
    })
    cat(paste(c("<pre>",purrr::reduce(purrr::map(files, function(file) {
      # gsub("</?u>","_",gsub("</u>( *)<u>","\\1",
      gsub("_\b(.)", "<u>\\1</u>", readLines(file))
      # ))
    }), function(x, y) c(x,"\n\n\n",y)), "</pre>"), collapse="\n"))
  })
)
```

::: notes
- Some methods, like `epi_cor` and `epi_slide` (next), operate on `epi_df`s, while others produce them.
:::

## Preprocessing

```{r}
completed_jhu_csse_daily_subset =
  jhu_csse_daily_subset %>%
  complete(geo_value, time_value = full_seq(time_value, period=1)) %>%
  as_epi_df(as_of = attr(jhu_csse_daily_subset, "metadata")$as_of)
```

<section style="text-align: left">
makes:

 - `time_value`s evenly spaced
 - `time_value`s the same for every geo
 - filled-in measurements (new rows) have NA values

(these are unchecked requirements of some functions in `dplyr`, `epiprocess`)

</section>

```{r}
alternative_completed_jhu_csse_daily_subset =
  jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  complete(time_value = full_seq(time_value, period=1),
           fill=list(cases=0), explicit=FALSE) %>%
  ungroup() %>%
  as_epi_df(as_of = attr(jhu_csse_daily_subset, "metadata")$as_of)
```

::: notes
Alternative approach:

- `time_value` range can differ by geo
- filled-in measurements (rows) have NA values except `cases` uses 0s

Have `complete.epi_df` method on our list of todos.

Actually, both of these are identical to the original, since the original already was complete in the first sense, and was arranged in the same way.

Another option would be to convert to&from tsibble and use `tsibble::fill_gaps`.
:::

## `epiprocess::growth_rate`

::: notes
- Several methods and parameter settings for defining growth rates are provided
- E.g., gr1 here is using a relative change of counts in a small window, which is more heavily impacted by the negative blip around July 2021 than gr2.  We might try a different change rate calculation like gr2, or outlier removal.
:::

```{r, message=FALSE, warning=FALSE}
with_growth_rates =
  completed_jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  mutate(
    cases_gr1 = growth_rate(time_value, cases, method = "rel_change", h=14L),
    cases_gr2 = growth_rate(time_value, cases, method = "linear_reg", h=14L)
  ) %>%
  ungroup() %>%
  # Don't include partial window calculations or extreme Mar 2020 growth rates
  # from near-0 cases:
  mutate(include_gr =
           time_value >= min(time_value) + 13L & time_value <= max(time_value) - 14L &
           time_value >= as.Date("2020-04-01"),
         cases_gr1 = if_else(include_gr, cases_gr1, NA_real_),
         cases_gr2 = if_else(include_gr, cases_gr2, NA_real_))
```

`r previous_l2_heading`

Plot for CA:

```{r, echo=FALSE}
with_growth_rates %>%
  filter(geo_value == "ca") %>%
  pivot_longer(c(cases, cases_gr1, cases_gr2), names_to="signal", values_to="value") %>%
  mutate(facet = dplyr::recode_factor(signal,
                                      cases = "(CA) Cases",
                                      cases_gr1 = "(CA) Case Growth Rate",
                                      cases_gr2 = "(CA) Case Growth Rate")) %>%
  ggplot(aes(time_value, value, colour=signal)) +
  facet_wrap(~ facet, ncol=1L, scales="free_y") +
  geom_line(na.rm=TRUE)
```

<!-- TODO add direction annotations derived from these growth rates? -->

## `epiprocess::detect_outlr*`

Contributed by Evan Ray.

```{r}
with_outlier_info =
  completed_jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  mutate(cases_outlier_info = detect_outlr_rm(time_value, cases)) %>%
  ungroup()
```

* `_rm` = "rolling median"
* Additional methods\&settings (including ensemble approaches) available.

::: notes
- Several methods and parameter settings for outlier detection are provided
- Ensembling
- Correction (just simple replacement correction scheme right now, not backdistribution, iterative, etc.)
- This plot shows retrospective corrections for the time series, not real-time corrections; to see what the real-time calculations would have looked like, we'd use `epix_slide`, discussed later
:::

`r previous_l2_heading`

Plot for CA:

```{r, echo=FALSE}
outlier_facets =
  c("Points outside bands are marked as outliers",
    "Replacement values are available for simple outlier corrections") %>%
  {factor(., levels=.)} # (to avoid facet alphabetization)
with_outlier_info %>%
  unpack(cases_outlier_info) %>%
  filter(geo_value == "ca") %>%
  mutate(facet = outlier_facets[[1L]]) %>%
  ggplot(aes(time_value, cases, ymin=lower, ymax=upper)) +
  geom_line() +
  geom_ribbon(colour = "red", linetype="dashed", fill = NA) +
  facet_wrap(~ facet, ncol=1L) +
  geom_point(data = function(df) df %>% filter(cases != replacement) %>%
                                   mutate(facet = outlier_facets[[2L]]),
             colour="orange") +
  geom_line(aes(y = replacement),
            data = function(df) df %>% mutate(facet = outlier_facets[[2L]]),
            colour = "blue")
```

<!-- TODO point to / mention full vignettes for each topic? -->

<!-- TODO crediting Evan & others -->

## `epiprocess::epi_cor`

```{r}
cors =
  completed_jhu_csse_daily_subset %>%
  epi_cor(death_rate_7d_av, case_rate_7d_av,
          dt2 = -21L,
          cor_by = "time_value",
          method = "kendall")
```

<section style="text-align: left">

Kendall correlations

- between death rates and case rates 21 days prior
- across `geo_value`s (only 6 of them in this sample data)
- by `time_value`

</section>

::: notes
- Note there are only 6 geos in this example data object.
- Roughly, do weekly case rates 21 days ago seem helpful in determining whether one state will have higher weekly death rates than another, and does that usefulness vary across time?
- Since this is only using a few locations, extreme values are expected to be more common.
:::

`r previous_l2_heading`

```{r, echo=FALSE}
cor_facets =
  c("Kendall correlations between deaths & 21d-lagged cases, across geos, by time",
    "Death rates",
    "21d-lagged case rates") %>%
  {factor(., levels=.)}
cors %>%
  mutate(facet = cor_facets[[1L]]) %>%
  ggplot(aes(x = time_value, y = cor)) +
  geom_line(na.rm=TRUE) +
  geom_line(aes(y=death_rate_7d_av, colour=geo_value),
            completed_jhu_csse_daily_subset %>% mutate(facet = cor_facets[[2L]])) +
  geom_line(aes(y=case_rate_7d_av, colour=geo_value),
            completed_jhu_csse_daily_subset %>%
              group_by(geo_value) %>%
              mutate(case_rate_7d_av = lag(case_rate_7d_av, 21L)) %>%
              ungroup() %>%
              mutate(facet = cor_facets[[3L]]),
            na.rm=TRUE) +
  facet_wrap(~ facet, ncol=1L, scales="free_y") +
  scale_x_date(minor_breaks = "month", date_labels = "%b '%y") +
  labs(x = "Date", y = "Correlation")
```

## `epiprocess::epi_slide`

Key functionality: optionally-grouped, rolling time window calculations:
```{r}
sum_of_k_else_na = function(x, k) {
  if (length(x) == k) sum(x) else NA
}
n_geos = nrow(distinct(completed_jhu_csse_daily_subset, geo_value))
with_sum_features =
  completed_jhu_csse_daily_subset %>%
  select(geo_value, time_value, cases) %>%
  group_by(geo_value) %>%
  epi_slide(same_geo_7d_sum = sum_of_k_else_na(cases, 7L), n=7L) %>%
  ungroup() %>%
  epi_slide(cross_geo_7d_sum = sum_of_k_else_na(cases, 7L*n_geos), n=7L)
```

`r previous_l2_heading`

Plot for FL:

```{r, echo=FALSE}
with_sum_features %>%
  filter(geo_value == "fl") %>%
  pivot_longer(c(cases, same_geo_7d_sum, cross_geo_7d_sum), names_to="signal", values_to="value") %>%
  ggplot(aes(time_value, value, colour=signal)) +
  geom_line(na.rm=TRUE)
```

<!-- TODO list perceived use cases? -->
<!-- https://github.com/cmu-delphi/epiprocess/issues/256 -->

## `epiprocess::epi_archive`

Version history (and present) of an epi surveillance data set.
```{r}
archive_cases_dv_subset
```

```{r, echo=FALSE, results="asis"}
withr::with_options(
  code={
    cat("<details><summary>Data object documentation, license, attribution</summary>")
    print(help("archive_cases_dv_subset", package="epiprocess", help_type="text"))
    cat("</details>")
  },
  list(pager=function(files, header, title, delete.file) {
    on.exit({
      unlink(files)
    })
    cat(paste(c("<pre>",purrr::reduce(purrr::map(files, function(file) {
      # gsub("</?u>","_",gsub("</u>( *)<u>","\\1",
      gsub("_\b(.)", "<u>\\1</u>", readLines(file))
      # ))
    }), function(x, y) c(x,"\n\n\n",y)), "</pre>"), collapse="\n"))
  })
)
```

::: notes
* Easy to inspect revision behaviour
* Can use it for pseudo-prospective forecasts
* Compare to our procedure for the PNAS article &mdash; store every version as a `.csv` and load them individually
:::

##

Underlying version data:
```{r}
archive_cases_dv_subset$DT
```

<!-- TODO an epix_as_of slide by itself -->

<!-- TODO note stuff about we provide an interface that removes the extra worries when dealing directly with `data.table`s -->

<!-- FIXME move notes so that they render on the right slides -->

<!-- TODO direction slide -->

## Examining an `epi_archive`

Some snapshots of an outpatient %CLI signal for FL:

```{r, echo=FALSE, message=FALSE, warning=FALSE}
x <- archive_cases_dv_subset
x_latest <- epix_as_of(x, max_version = x$versions_end)
versions = seq(as.Date("2020-06-01"), x$versions_end - 1, by = "1 month")
snapshots <- map_dfr(versions, function(v) {
  epix_as_of(x, max_version = v) %>% mutate(version = v)}) %>%
  bind_rows(x_latest %>% mutate(version = x$versions_end)) %>%
  mutate(latest = version == x$versions_end) %>%
  filter(geo_value == "fl")

ggplot(snapshots %>% filter(!latest),
            aes(x = time_value, y = percent_cli)) +
  geom_line(aes(colour = version, group = version), na.rm=TRUE) +
  geom_vline(aes(colour = version, xintercept = version), lty = 2) +
  khroma::scale_colour_smoothrainbow(range=c(0.15, 1), trans="date") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  labs(x = "Date", y = "% of doctor's visits with CLI") +
  theme(legend.position = "none") +
  geom_line(data = snapshots %>% filter(latest),
               aes(x = time_value, y = percent_cli),
            inherit.aes = FALSE, color = "black", na.rm=TRUE)
```

<!-- TODO combine with the below -->

## `epiprocess::epix_slide`

Key functionality: version-aware counterpart to `epi_slide`; combines time windowing and as-of-ing.
```{r}
slide_result =
  archive_cases_dv_subset %>%
  epix_slide(~ tibble(
    group_snapshot = list(.x),
    coefs = .x %>%
      complete(time_value = full_seq(time_value, 1L)) %>%
      lm(formula = case_rate_7d_av ~
           lag(case_rate_7d_av, 7L) +
           lag(case_rate_7d_av, 14L) +
           0) %>%
      coef() %>%
      list()
  ),
  n=90L,
  group_by=geo_value,
  ref_time_values = seq(as.Date("2020-06-01")+89L, as.Date("2021-11-30"), by="week")) %>%
  rename(as_of = time_value)
```

<!-- NOTE want these notes on the `slide_result` printing slide, but can't seem to get it there, even by moving the notes farther up to unrelated slides. -->

::: notes
- (Note that data latency cuts into the amount of data available in the window.  For this JHU-CSSE case data, it's only 1 day out of the 90 day window, but we must be careful with indexing.)
:::

## `epiprocess::epix_slide`

```{r}
slide_result
```

## `epiprocess::epix_slide`

Plot for NY:

```{r, echo=FALSE}
slide_result_facets = c(
  "Coefficients by date of (pseudoprospective) fit",
  "Signal data by time_value"
) %>%
  {factor(., levels=.)}
slide_result %>%
  unnest_longer(slide_value_coefs, indices_to="feature", values_to="coefficient") %>%
  filter(geo_value == "ny") %>%
  mutate(facet = slide_result_facets[[1L]]) %>%
  ggplot() +
  facet_wrap(~ facet, ncol=1L, scales="free_y") +
  geom_line(
    aes(as_of, coefficient, linetype=feature)
  ) +
  geom_line(
    aes(time_value, case_rate_7d_av, colour=as_of, group=as_of),
    slide_result %>%
      transmute(as_of = as_of, slide_value_group_snapshot) %>%
      unnest(slide_value_group_snapshot) %>%
      filter(geo_value == "ny") %>%
      mutate(facet = slide_result_facets[[2L]])
  ) +
  khroma::scale_colour_smoothrainbow(range=c(0.15, 1), trans="date") +
  guides(linetype = guide_legend(order=1L),
         colour = guide_legend(order=2L)) +
  xlab("as_of for coefficients / time_value for signals") +
  ylab("coefficient / (unlagged) value")
```

::: notes
- note that revisions and latency aren't very prominent in this toy data set,
but are in other disease forecasting targets and predictors
:::

<!-- TODO move `epidatr` material here and segue from above note? -->

## `epiprocess` overview

| Vectors         | `epi_df`s         |              `epi_archive`s |
|:---------------:|:-----------------:|----------------------------:|
|                 |                   |                             |
|                 |                   |                             |
|                 |                   |                             |
| `growth_rate`   | `epi_cor`         |                             |
| `detect_outlr*` |                   |                             |
|                 |                   |                             |
|                 |                   |                             |
|                 |                   |                             |
|                 | `dplyr::*`        |                             |
|                 | `tsibble` interop |                `epix_as_of` |
|                 | `epi_slide`       |                `epix_slide` |
|                 |                   |                             |
|                 |                   |                             |
|                 |                   |                             |
|                 | sample data sets  |             sample data set |
|                 |                   |                             |
|                 |                   |                             |
|                 |                   |                             |
|                 |                   | `epix_fill_through_version` |
|                 |                   |                `epix_merge` |
|                 |                   |                             |
|                 |                   |                             |

# Next presentations

## `{epipredict}`

<https://cmu-delphi.github.io/epipredict>

<section style="text-align: left">

General idea: provide

* a set of basic, easy-to-use forecasters that work out of the box, with reasonably limited customization, e.g.:

  * baseline flat-line forecaster
  * autoregressive forecaster
  * autoregressive classifier

* a framework for creating custom forecasters out of modular components

</section>

(See some slides prepared in September [here](https://dajmcdon.github.io/delphi-research-tooling/#7).)

## `{epidatr}`, `{epidatpy}`

<https://github.com/cmu-delphi/epidatr>

<https://github.com/cmu-delphi/epidatpy>


<section style="text-align: left">

Successors to Delphi Epidata API functionality in `delphi_epidata`, `covidcast` R and Python packages.

* faster downloads
* more consistent interfaces across endpoints
* friendlier interfaces (longer-term)

</section>