-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.Rmd
564 lines (454 loc) · 17.9 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
---
title: "Tooling update"
author: "Slides: Logan Brooks, Daniel McDonald, Ryan Tibshirani"
#
# NOTE KEEP THESE LINKS; they may be required here by upstream GPL-3 slides, CC attribution
date: "2022-12-13 <br /> <small> View online: <https://brookslogan.github.io/tooling-slides-for-modelers/> <br /> Source: <https://github.com/brookslogan/tooling-slides-for-modelers/> </small>"
#
output:
revealjs::revealjs_presentation:
theme: night
highlight: zenburn
transition: convex
# Nicer, but takes more disk space:
self_contained: false
reveal_plugins: ["notes","search","menu"]
#
# # Self-contained html:
# self_contained: true
---
# Context
Last year
: software community working group scoped out initial ~~`epitools`~~ `epiprocess` package
This year
: Delphi iterated on / developed `epiprocess`, `epipredict`, `epidat{r,py}`
This presentation
: Focus: capabilities of `epiprocess`
Note
: We're in the middle of changes to the `epiprocess` interface to improve
first-time use. We hope to have these done by the beginning of next semester.
<!-- Some front matter is placed here rather than above in order to avoid
generating an empty slide between the title slide and the first real non-title
slide: -->
<!-- Show & update continuously with
`xaringan::inf_mr("index.Rmd")`.
-->
<!-- Add scroll bars within slides: -->
```{css, echo=FALSE}
.slide {
height: 750px;
overflow-y: auto !important;
}
```
```{r, echo=FALSE, message=FALSE}
library(epiprocess)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(ggplot2)
theme_set(theme_bw())
```
<!-- Place plots on a separate second-lever slide from plotting code,
repeating the most recent second-level slide title: -->
```{r, include=FALSE}
# knitr::knit_hooks$restore("text") # encounters error looking up default
default_text_hook = knitr::knit_hooks$get("text")
previous_l2_heading = NULL
knitr::knit_hooks$set(text = function(x) {
previous_l2_heading <<- stringi::stri_match_last_regex(x, pattern="\n##.*\n")
default_text_hook(x)
})
# knitr::knit_hooks$restore("plot") # encounters error looking up default
default_plot_hook = knitr::knit_hooks$get("plot")
knitr::knit_hooks$set(plot = function(x, options) {
if (identical(knitr::opts_current$get("echo"), FALSE)) {
default_plot_hook(x, options)
} else {
paste0(previous_l2_heading, default_plot_hook(x, options))
}
})
```
<!-- TODO change this back to have epidatr, epidatpy below -->
## {data-background-image=`r xfun::base64_uri("resources/ToolingOverview.png")`}
# `{epiprocess}`
<https://cmu-delphi.github.io/epiprocess>
## `epiprocess::epi_df`
Snapshot of an epi surveillance data set; a specialized data frame.
```{r}
jhu_csse_daily_subset
```
```{r, echo=FALSE, results="asis"}
withr::with_options(
code={
cat("<details><summary>Data object documentation, license, attribution</summary>")
print(help("jhu_csse_daily_subset", package="epiprocess", help_type="text"))
cat("</details>")
},
list(pager=function(files, header, title, delete.file) {
on.exit({
unlink(files)
})
cat(paste(c("<pre>",purrr::reduce(purrr::map(files, function(file) {
# gsub("</?u>","_",gsub("</u>( *)<u>","\\1",
gsub("_\b(.)", "<u>\\1</u>", readLines(file))
# ))
}), function(x, y) c(x,"\n\n\n",y)), "</pre>"), collapse="\n"))
})
)
```
::: notes
- Some methods, like `epi_cor` and `epi_slide` (next), operate on `epi_df`s, while others produce them.
:::
## Preprocessing
```{r}
completed_jhu_csse_daily_subset =
jhu_csse_daily_subset %>%
complete(geo_value, time_value = full_seq(time_value, period=1)) %>%
as_epi_df(as_of = attr(jhu_csse_daily_subset, "metadata")$as_of)
```
<section style="text-align: left">
makes:
- `time_value`s evenly spaced
- `time_value`s the same for every geo
- filled-in measurements (new rows) have NA values
(these are unchecked requirements of some functions in `dplyr`, `epiprocess`)
</section>
```{r}
alternative_completed_jhu_csse_daily_subset =
jhu_csse_daily_subset %>%
group_by(geo_value) %>%
complete(time_value = full_seq(time_value, period=1),
fill=list(cases=0), explicit=FALSE) %>%
ungroup() %>%
as_epi_df(as_of = attr(jhu_csse_daily_subset, "metadata")$as_of)
```
::: notes
Alternative approach:
- `time_value` range can differ by geo
- filled-in measurements (rows) have NA values except `cases` uses 0s
Have `complete.epi_df` method on our list of todos.
Actually, both of these are identical to the original, since the original already was complete in the first sense, and was arranged in the same way.
Another option would be to convert to&from tsibble and use `tsibble::fill_gaps`.
:::
## `epiprocess::growth_rate`
::: notes
- Several methods and parameter settings for defining growth rates are provided
- E.g., gr1 here is using a relative change of counts in a small window, which is more heavily impacted by the negative blip around July 2021 than gr2. We might try a different change rate calculation like gr2, or outlier removal.
:::
```{r, message=FALSE, warning=FALSE}
with_growth_rates =
completed_jhu_csse_daily_subset %>%
group_by(geo_value) %>%
mutate(
cases_gr1 = growth_rate(time_value, cases, method = "rel_change", h=14L),
cases_gr2 = growth_rate(time_value, cases, method = "linear_reg", h=14L)
) %>%
ungroup() %>%
# Don't include partial window calculations or extreme Mar 2020 growth rates
# from near-0 cases:
mutate(include_gr =
time_value >= min(time_value) + 13L & time_value <= max(time_value) - 14L &
time_value >= as.Date("2020-04-01"),
cases_gr1 = if_else(include_gr, cases_gr1, NA_real_),
cases_gr2 = if_else(include_gr, cases_gr2, NA_real_))
```
`r previous_l2_heading`
Plot for CA:
```{r, echo=FALSE}
with_growth_rates %>%
filter(geo_value == "ca") %>%
pivot_longer(c(cases, cases_gr1, cases_gr2), names_to="signal", values_to="value") %>%
mutate(facet = dplyr::recode_factor(signal,
cases = "(CA) Cases",
cases_gr1 = "(CA) Case Growth Rate",
cases_gr2 = "(CA) Case Growth Rate")) %>%
ggplot(aes(time_value, value, colour=signal)) +
facet_wrap(~ facet, ncol=1L, scales="free_y") +
geom_line(na.rm=TRUE)
```
<!-- TODO add direction annotations derived from these growth rates? -->
## `epiprocess::detect_outlr*`
Contributed by Evan Ray.
```{r}
with_outlier_info =
completed_jhu_csse_daily_subset %>%
group_by(geo_value) %>%
mutate(cases_outlier_info = detect_outlr_rm(time_value, cases)) %>%
ungroup()
```
* `_rm` = "rolling median"
* Additional methods\&settings (including ensemble approaches) available.
::: notes
- Several methods and parameter settings for outlier detection are provided
- Ensembling
- Correction (just simple replacement correction scheme right now, not backdistribution, iterative, etc.)
- This plot shows retrospective corrections for the time series, not real-time corrections; to see what the real-time calculations would have looked like, we'd use `epix_slide`, discussed later
:::
`r previous_l2_heading`
Plot for CA:
```{r, echo=FALSE}
outlier_facets =
c("Points outside bands are marked as outliers",
"Replacement values are available for simple outlier corrections") %>%
{factor(., levels=.)} # (to avoid facet alphabetization)
with_outlier_info %>%
unpack(cases_outlier_info) %>%
filter(geo_value == "ca") %>%
mutate(facet = outlier_facets[[1L]]) %>%
ggplot(aes(time_value, cases, ymin=lower, ymax=upper)) +
geom_line() +
geom_ribbon(colour = "red", linetype="dashed", fill = NA) +
facet_wrap(~ facet, ncol=1L) +
geom_point(data = function(df) df %>% filter(cases != replacement) %>%
mutate(facet = outlier_facets[[2L]]),
colour="orange") +
geom_line(aes(y = replacement),
data = function(df) df %>% mutate(facet = outlier_facets[[2L]]),
colour = "blue")
```
<!-- TODO point to / mention full vignettes for each topic? -->
<!-- TODO crediting Evan & others -->
## `epiprocess::epi_cor`
```{r}
cors =
completed_jhu_csse_daily_subset %>%
epi_cor(death_rate_7d_av, case_rate_7d_av,
dt2 = -21L,
cor_by = "time_value",
method = "kendall")
```
<section style="text-align: left">
Kendall correlations
- between death rates and case rates 21 days prior
- across `geo_value`s (only 6 of them in this sample data)
- by `time_value`
</section>
::: notes
- Note there are only 6 geos in this example data object.
- Roughly, do weekly case rates 21 days ago seem helpful in determining whether one state will have higher weekly death rates than another, and does that usefulness vary across time?
- Since this is only using a few locations, extreme values are expected to be more common.
:::
`r previous_l2_heading`
```{r, echo=FALSE}
cor_facets =
c("Kendall correlations between deaths & 21d-lagged cases, across geos, by time",
"Death rates",
"21d-lagged case rates") %>%
{factor(., levels=.)}
cors %>%
mutate(facet = cor_facets[[1L]]) %>%
ggplot(aes(x = time_value, y = cor)) +
geom_line(na.rm=TRUE) +
geom_line(aes(y=death_rate_7d_av, colour=geo_value),
completed_jhu_csse_daily_subset %>% mutate(facet = cor_facets[[2L]])) +
geom_line(aes(y=case_rate_7d_av, colour=geo_value),
completed_jhu_csse_daily_subset %>%
group_by(geo_value) %>%
mutate(case_rate_7d_av = lag(case_rate_7d_av, 21L)) %>%
ungroup() %>%
mutate(facet = cor_facets[[3L]]),
na.rm=TRUE) +
facet_wrap(~ facet, ncol=1L, scales="free_y") +
scale_x_date(minor_breaks = "month", date_labels = "%b '%y") +
labs(x = "Date", y = "Correlation")
```
## `epiprocess::epi_slide`
Key functionality: optionally-grouped, rolling time window calculations:
```{r}
sum_of_k_else_na = function(x, k) {
if (length(x) == k) sum(x) else NA
}
n_geos = nrow(distinct(completed_jhu_csse_daily_subset, geo_value))
with_sum_features =
completed_jhu_csse_daily_subset %>%
select(geo_value, time_value, cases) %>%
group_by(geo_value) %>%
epi_slide(same_geo_7d_sum = sum_of_k_else_na(cases, 7L), n=7L) %>%
ungroup() %>%
epi_slide(cross_geo_7d_sum = sum_of_k_else_na(cases, 7L*n_geos), n=7L)
```
`r previous_l2_heading`
Plot for FL:
```{r, echo=FALSE}
with_sum_features %>%
filter(geo_value == "fl") %>%
pivot_longer(c(cases, same_geo_7d_sum, cross_geo_7d_sum), names_to="signal", values_to="value") %>%
ggplot(aes(time_value, value, colour=signal)) +
geom_line(na.rm=TRUE)
```
<!-- TODO list perceived use cases? -->
<!-- https://github.com/cmu-delphi/epiprocess/issues/256 -->
## `epiprocess::epi_archive`
Version history (and present) of an epi surveillance data set.
```{r}
archive_cases_dv_subset
```
```{r, echo=FALSE, results="asis"}
withr::with_options(
code={
cat("<details><summary>Data object documentation, license, attribution</summary>")
print(help("archive_cases_dv_subset", package="epiprocess", help_type="text"))
cat("</details>")
},
list(pager=function(files, header, title, delete.file) {
on.exit({
unlink(files)
})
cat(paste(c("<pre>",purrr::reduce(purrr::map(files, function(file) {
# gsub("</?u>","_",gsub("</u>( *)<u>","\\1",
gsub("_\b(.)", "<u>\\1</u>", readLines(file))
# ))
}), function(x, y) c(x,"\n\n\n",y)), "</pre>"), collapse="\n"))
})
)
```
::: notes
* Easy to inspect revision behaviour
* Can use it for pseudo-prospective forecasts
* Compare to our procedure for the PNAS article — store every version as a `.csv` and load them individually
:::
##
Underlying version data:
```{r}
archive_cases_dv_subset$DT
```
<!-- TODO an epix_as_of slide by itself -->
<!-- TODO note stuff about we provide an interface that removes the extra worries when dealing directly with `data.table`s -->
<!-- FIXME move notes so that they render on the right slides -->
<!-- TODO direction slide -->
## Examining an `epi_archive`
Some snapshots of an outpatient %CLI signal for FL:
```{r, echo=FALSE, message=FALSE, warning=FALSE}
x <- archive_cases_dv_subset
x_latest <- epix_as_of(x, max_version = x$versions_end)
versions = seq(as.Date("2020-06-01"), x$versions_end - 1, by = "1 month")
snapshots <- map_dfr(versions, function(v) {
epix_as_of(x, max_version = v) %>% mutate(version = v)}) %>%
bind_rows(x_latest %>% mutate(version = x$versions_end)) %>%
mutate(latest = version == x$versions_end) %>%
filter(geo_value == "fl")
ggplot(snapshots %>% filter(!latest),
aes(x = time_value, y = percent_cli)) +
geom_line(aes(colour = version, group = version), na.rm=TRUE) +
geom_vline(aes(colour = version, xintercept = version), lty = 2) +
khroma::scale_colour_smoothrainbow(range=c(0.15, 1), trans="date") +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
labs(x = "Date", y = "% of doctor's visits with CLI") +
theme(legend.position = "none") +
geom_line(data = snapshots %>% filter(latest),
aes(x = time_value, y = percent_cli),
inherit.aes = FALSE, color = "black", na.rm=TRUE)
```
<!-- TODO combine with the below -->
## `epiprocess::epix_slide`
Key functionality: version-aware counterpart to `epi_slide`; combines time windowing and as-of-ing.
```{r}
slide_result =
archive_cases_dv_subset %>%
epix_slide(~ tibble(
group_snapshot = list(.x),
coefs = .x %>%
complete(time_value = full_seq(time_value, 1L)) %>%
lm(formula = case_rate_7d_av ~
lag(case_rate_7d_av, 7L) +
lag(case_rate_7d_av, 14L) +
0) %>%
coef() %>%
list()
),
n=90L,
group_by=geo_value,
ref_time_values = seq(as.Date("2020-06-01")+89L, as.Date("2021-11-30"), by="week")) %>%
rename(as_of = time_value)
```
<!-- NOTE want these notes on the `slide_result` printing slide, but can't seem to get it there, even by moving the notes farther up to unrelated slides. -->
::: notes
- (Note that data latency cuts into the amount of data available in the window. For this JHU-CSSE case data, it's only 1 day out of the 90 day window, but we must be careful with indexing.)
:::
## `epiprocess::epix_slide`
```{r}
slide_result
```
## `epiprocess::epix_slide`
Plot for NY:
```{r, echo=FALSE}
slide_result_facets = c(
"Coefficients by date of (pseudoprospective) fit",
"Signal data by time_value"
) %>%
{factor(., levels=.)}
slide_result %>%
unnest_longer(slide_value_coefs, indices_to="feature", values_to="coefficient") %>%
filter(geo_value == "ny") %>%
mutate(facet = slide_result_facets[[1L]]) %>%
ggplot() +
facet_wrap(~ facet, ncol=1L, scales="free_y") +
geom_line(
aes(as_of, coefficient, linetype=feature)
) +
geom_line(
aes(time_value, case_rate_7d_av, colour=as_of, group=as_of),
slide_result %>%
transmute(as_of = as_of, slide_value_group_snapshot) %>%
unnest(slide_value_group_snapshot) %>%
filter(geo_value == "ny") %>%
mutate(facet = slide_result_facets[[2L]])
) +
khroma::scale_colour_smoothrainbow(range=c(0.15, 1), trans="date") +
guides(linetype = guide_legend(order=1L),
colour = guide_legend(order=2L)) +
xlab("as_of for coefficients / time_value for signals") +
ylab("coefficient / (unlagged) value")
```
::: notes
- note that revisions and latency aren't very prominent in this toy data set,
but are in other disease forecasting targets and predictors
:::
<!-- TODO move `epidatr` material here and segue from above note? -->
## `epiprocess` overview
| Vectors | `epi_df`s | `epi_archive`s |
|:---------------:|:-----------------:|----------------------------:|
| | | |
| | | |
| | | |
| `growth_rate` | `epi_cor` | |
| `detect_outlr*` | | |
| | | |
| | | |
| | | |
| | `dplyr::*` | |
| | `tsibble` interop | `epix_as_of` |
| | `epi_slide` | `epix_slide` |
| | | |
| | | |
| | | |
| | sample data sets | sample data set |
| | | |
| | | |
| | | |
| | | `epix_fill_through_version` |
| | | `epix_merge` |
| | | |
| | | |
# Next presentations
## `{epipredict}`
<https://cmu-delphi.github.io/epipredict>
<section style="text-align: left">
General idea: provide
* a set of basic, easy-to-use forecasters that work out of the box, with reasonably limited customization, e.g.:
* baseline flat-line forecaster
* autoregressive forecaster
* autoregressive classifier
* a framework for creating custom forecasters out of modular components
</section>
(See some slides prepared in September [here](https://dajmcdon.github.io/delphi-research-tooling/#7).)
## `{epidatr}`, `{epidatpy}`
<https://github.com/cmu-delphi/epidatr>
<https://github.com/cmu-delphi/epidatpy>
<section style="text-align: left">
Successors to Delphi Epidata API functionality in `delphi_epidata`, `covidcast` R and Python packages.
* faster downloads
* more consistent interfaces across endpoints
* friendlier interfaces (longer-term)
</section>