data prep to translate target forecasts to submission file format #6

Closed
vpnagraj opened this issue Dec 22, 2020 · 30 comments

@vpnagraj
Contributor

the COVID-19 forecast hub has strict requirements for the forecast submission format:

https://github.com/reichlab/covid19-forecast-hub/blob/master/data-processed/README.md#forecast-file-format

once we generate point / quantile forecasts for targets we need to execute some post-processing to wrangle the data into the required format

this could include:

  • pivoting wide quantile predictions to long format
  • translating week/year to epiweek
  • getting a target_end_date from week
  • formatting the "n week ahead" target text (e.g. "1 wk ahead inc death")
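A minimal base R sketch of three of the steps listed above (pivoting, target_end_date, target text), not the code that ended up in the repo. The `wide` data frame, its column names, and its values are made up for illustration, and the week/year to epiweek translation is left out:

```r
# Hypothetical wide forecast output: one row per horizon, one column per quantile.
# All column names and values here are made up for illustration.
wide <- data.frame(
  horizon    = 1:2,
  week_start = as.Date(c("2020-12-27", "2021-01-03")),  # Sunday that starts each epiweek
  q0.025     = c(100, 110),
  q0.5       = c(150, 160),
  q0.975     = c(200, 220)
)

quantile_cols <- grep("^q", names(wide), value = TRUE)

# Pivot wide -> long with base R: stack one block of rows per quantile column.
long <- do.call(rbind, lapply(quantile_cols, function(col) {
  data.frame(
    target          = paste(wide$horizon, "wk ahead inc death"),
    target_end_date = wide$week_start + 6,  # Saturday that ends the epiweek
    type            = "quantile",
    quantile        = as.numeric(sub("^q", "", col)),
    value           = wide[[col]]
  )
}))

long
```

The same reshape could be done with tidyr; base R is used here only to keep the sketch dependency-free.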

once we have the submission file format prepped, we can validate locally:

@vpnagraj
Contributor Author

NOTE the data prep of forecast output will depend on the forecast method used ... but given that we are likely starting with a time series model (and using the fable framework) then the prep should work for whatever method we land on for initial implementation (ARIMA, ETS, etc)

stephenturner added a commit that referenced this issue Dec 22, 2020
@stephenturner
Contributor

Initial shot at this in 6bbd2fc. Lots of outstanding issues here.

  • Needs modularity to take different targets and create the output accordingly.
  • Something's really, really off with dates. It's some combination of epiweeks to date to yearweek and back to date that's getting weird. I can demonstrate or you can run through that last pipeline sticking a %>% tail() in there to take a look at what I mean.
forecast_date target target_end_date location type quantile value
2020-12-22 1 wk ahead inc case 2020-12-12 US point NA 1595791
2020-12-22 2 wk ahead inc case 2020-12-19 US point NA 1717279
2020-12-22 3 wk ahead inc case 2020-12-26 US point NA 1874981
2020-12-22 4 wk ahead inc case 2021-01-02 US point NA 1992527
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.025 1518044
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.025 1477875
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.025 1502830
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.025 1521864
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.100 1539116
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.100 1615797
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.100 1725278
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.100 1784281
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.250 1578533
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.250 1670508
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.250 1805991
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.250 1894356
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.500 1590791
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.500 1712935
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.500 1876995
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.500 1995373
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.750 1614622
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.750 1777245
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.750 1965876
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.750 2119008
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.900 1656737
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.900 1843507
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.900 2060315
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.900 2236633
2020-12-22 1 wk ahead inc case 2020-12-12 US quantile 0.975 1717527
2020-12-22 2 wk ahead inc case 2020-12-19 US quantile 0.975 1911562
2020-12-22 3 wk ahead inc case 2020-12-26 US quantile 0.975 2134267
2020-12-22 4 wk ahead inc case 2021-01-02 US quantile 0.975 2374745

stephenturner added a commit that referenced this issue Dec 22, 2020
stephenturner added a commit that referenced this issue Dec 22, 2020
@stephenturner
Contributor

I have this working in some code at f2c3e91

  1. Fit separate models for each outcome (inc cases, inc deaths, cum deaths). I tried fitting multiple models in the same fit object with different dependent variables, but fable complains: you can't have a mable (model table) with different Y vars. So, for now, different model objects.
  2. Pass them to the format_fit_for_submission() function. This produces the forecast at the desired horizon, bootstraps each model fit 1000 times, gets the quibble (quantile tibble) for each fit using 23 quantiles, then restricts down to the smaller subset of quantiles if you're looking at inc cases.
  3. Bind rows from each of these function calls together from each metric you're looking at to create the final submission.
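The steps above can be sketched roughly as follows. This is not the format_fit_for_submission() code: `summarise_sims()` is a hypothetical helper, the simulated values are made-up stand-ins for the bootstrapped fable forecasts, and using the mean of the simulations as the point forecast is an assumption:

```r
set.seed(2020)

# The 23 quantile levels requested for death targets, and the 7-level
# subset used for incident cases.
q_all   <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
q_cases <- c(0.025, 0.100, 0.250, 0.500, 0.750, 0.900, 0.975)

# Stand-in for 1000 bootstrapped forecasts at one horizon (made-up numbers);
# in the thread these come from the fitted fable model.
sims <- rnorm(1000, mean = 1600000, sd = 50000)

# Hypothetical helper: one point row plus one row per quantile level.
summarise_sims <- function(sims, target, probs) {
  data.frame(
    target   = target,
    type     = c("point", rep("quantile", length(probs))),
    quantile = c(NA, probs),
    value    = c(round(mean(sims)),
                 round(quantile(sims, probs = probs, names = FALSE)))
  )
}

# Bind the per-target pieces into one submission-shaped data frame.
submission <- rbind(
  summarise_sims(sims, "1 wk ahead inc case", q_cases),
  summarise_sims(sims / 50, "1 wk ahead inc death", q_all)
)
```

Defining the 7-level subset explicitly avoids matching floating-point quantile levels against the output of seq().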

fable-submission-mockup-allmetrics.csv.txt (github doesn't let you upload .csv extensions, remove the .txt)

Notes / known issues:

  • There's still an issue with the date conversion. This should probably be its own issue together with a reprex.
  • There's some hard-coding of the text formatting to convert "icases" or "cdeaths" into "inc cases" or "cum deaths", etc, happening around here. This is a gross hack, and would probably be better solved by 👇 or pretty much anything besides this method, which depends on the name of the variable you store the model objects in!
  • There's some redundancy and creation of separate objects for each outcome. This could probably be cleaned up by creating a single fit object, which is a list, with the names of that list being inc cases, inc deaths, and cum deaths. This would potentially also allow for more strict names(fit) %in% ... checking.
  • Haven't done this yet with anything lower than US-level data.

@vpnagraj run through this code a pipe at a time, see if you have any suggestions.

@vpnagraj
Contributor Author

vpnagraj commented Dec 22, 2020

stepped through what you have (using the *-allmetrics version of the script)

pushed up some edits:

344d712

i think i have a candidate fix for the text formatting conversion of "icases" to "inc cases" ... just pass in a new argument for target_name ? seems to be working

also played around with the dates a little bit. agreed that something is way off. i reworked your code, thought i had fixed the issue (to get the epiweek date starting on sunday instead of monday) but now that i'm looking at this issue again it looks the same as your comment above (#6 (comment))

🤔

im wondering if get_cases() and our exclusion of last week (because it's incomplete) is throwing a wrench here ...

@vpnagraj
Contributor Author

@stephenturner FYI looks like get_cases() and get_deaths() did include logic to remove the current week. that same (or similar) logic was implemented in the TS modeling code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R#L29

i think it's better to handle it that way ^ ... ie let's drop the current week exclusion from get_cases() and get_deaths()

done in 4aa7bdb

so that saved us one week of data. we're still bumping into the issue with horizon being k + 1 week (current week that we can't/shouldn't use in modeling because it is incomplete)

need to keep thinking on this ...

@stephenturner
Contributor

I'm still cracking at this. I think the problem comes in with mmwrweeks being converted to dates, then to yearweeks, then back to dates.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Saturday
MMWRweek::MMWRweek("2020-12-26")$MMWRweek
#> [1] 52
# Sunday
MMWRweek::MMWRweek("2020-12-27")$MMWRweek
#> [1] 53
# Monday
MMWRweek::MMWRweek("2020-12-28")$MMWRweek
#> [1] 53

# Sunday
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1)
#> [1] "2020-12-27"
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1) %>% tsibble::yearweek()
#> <yearweek[1]>
#> [1] "2020 W52"
#> # Week starts on: Monday
MMWRweek::MMWRweek2Date(2020, 53, MMWRday = 1) %>% tsibble::yearweek() %>% lubridate::as_date()
#> [1] "2020-12-21"

Still trying to craft that reprex.

@stephenturner
Contributor

stephenturner commented Dec 28, 2020

I pushed some code in a new script at 6ff70b4. Run through that. I think the best oversimplified approach here might be simply adding a +6 or -7 or whatever somewhere.

stephenturner added a commit that referenced this issue Dec 29, 2020
…th yweek based on that monday to index the tsibble #3 #6 cc @vpnagraj
stephenturner added a commit that referenced this issue Dec 29, 2020
@stephenturner
Contributor

I think the date issue is fixed now. I'm creating the tsibble with a function that adds a monday column, which is the Monday of that epiweek, and builds the yearweek index column of the tsibble from that Monday. Later, after modeling/forecasting, I take the as_date() of that yweek, which returns the Monday of that (1, 2, 3, or 4) week ahead forecast, and add days(5) to get the Saturday that ends that epiweek.
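The date arithmetic described here can be sketched in base R. `epiweek_saturday()` is a hypothetical helper, not the function in the repo, and it skips the yearweek step entirely. The point is that an epiweek runs Sunday to Saturday, so for Sunday 2020-12-27 the Monday of the epiweek is 2020-12-28, whereas the Monday-based week containing that Sunday starts on 2020-12-21:

```r
# Saturday that ends the epiweek (Sunday-Saturday) containing `date`,
# reached via the Monday of that epiweek.
epiweek_saturday <- function(date) {
  date   <- as.Date(date)
  wday   <- as.integer(format(date, "%w"))  # 0 = Sunday, ..., 6 = Saturday
  sunday <- date - wday                     # Sunday that starts the epiweek
  monday <- sunday + 1                      # Monday used to index the series
  monday + 5                                # Saturday that ends the epiweek
}

epiweek_saturday("2020-12-27")  # Sunday of 2020 epiweek 53
epiweek_saturday("2020-12-28")  # Monday of the same epiweek
```

Both calls should land on the same Saturday, which is the property the round trip through yearweek was breaking.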

From the https://github.com/reichlab/covid19-forecast-hub#ensemble-model section:

For inclusion in the ensemble, we additionally require that forecasts include a full set of 23 quantiles to be submitted for each of the one through four week ahead values for forecasts of deaths, and a full set of 7 quantiles for the one through four week ahead values for forecasts of cases (see technical README for details), and that the 10th quantile of the predictive distribution for a 1 week ahead forecast of cumulative deaths is not below the most recently observed data.

I don't think the current forecasts based on the auto ARIMA models are doing this, but we should probably add a check/correction for this case, that if the 10th quantile of any cumulative forecast is below the most recently observed data, then make it equal to the most recent observed data, at a minimum.
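A minimal sketch of that correction, assuming the forecast values and the most recent observation are already in hand. `floor_at_observed()` is a hypothetical name and the numbers are made up. It floors every value rather than only the 10th quantile, which is an assumption made here so that the quantiles stay ordered:

```r
# Raise any forecast value below the latest observed cumulative count up to that count.
floor_at_observed <- function(value, last_observed) {
  pmax(value, last_observed)
}

# Made-up quantile values for a "1 wk ahead cum death" forecast.
forecast_values <- c(349500, 350200, 351000, 352400)
last_observed   <- 350000

floor_at_observed(forecast_values, last_observed)
```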

@stephenturner
Contributor

stephenturner commented Dec 29, 2020

This check that forecasts for cumulative deaths are not below the current week's value is now implemented in ae43487. But I haven't yet figured out the best place for this to reside, functionally. format_fit_for_submission() takes as input the model table (output from model()), and doesn't actually take any data as input. The current week's cumulative death value actually resides in the data. If we wrote one monster function that did modeling, forecasting, and formatting, we could do this there, because that function would have to take the data as input, not the models. But I kind of like keeping them separate for now, because it makes tinkering around with the modeling a bit easier, doing it outside of some monster function call. Anyway, for now, the bolt-on fix in ae43487 works, and we can sort out how best to modularize/functionalize this later. @vpnagraj if you wouldn't mind, run through the script https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R to see if this all looks legit to you.

@stephenturner
Contributor

I added some code in cfdd1e8 to use the script added in 63a2ff9 to validate the submission.

> forecast_filename <- here::here("scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv")
> validate_file(forecast_filename)


 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
Warning in verify_targets(entry) :
  ERROR: Some entries in `targets` do not correspond to standards:1 wk ahead cum deaths, 1 wk ahead inc cases, 1 wk ahead inc deaths, 2 wk ahead cum deaths, 2 wk ahead inc cases, 2 wk ahead inc deaths, 3 wk ahead cum deaths, 3 wk ahead inc cases, 3 wk ahead inc deaths, 4 wk ahead cum deaths, 4 wk ahead inc cases, 4 wk ahead inc deaths
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
VALIDATED: no quantile crossing
VALIDATED: temporal monotonicity
VALIDATED: cum geq inc
VALIDATED: entries of `quantile`

So, everything seems to look okay except for the targets.

The code https://github.com/reichlab/covid19-forecast-hub/blob/68df08d9e6e19d55fddab4bd5abb505202023ecb/code/validation/R-scripts/functions_plausibility.R#L169-L186, checks for 1, 2, 3, 4 wk ahead inc death and cum deaths, but doesn't allow for inc cases:

#' Checking that all entries in `target` correspond to standards
#'
#' @param entry the data.frame
#'
#' @return invisibly returns TRUE if problems detected, FALSE otherwise
verify_targets <- function(entry){
  allowed_targets <- c(
    paste(0:130, "day ahead inc death"),
    paste(0:130, "day ahead cum death"),
    paste(0:20, "wk ahead inc death"),
    paste(0:20, "wk ahead cum death"),
    paste(0:130, "day ahead inc hosp")
  )
  targets_in_entry <- unique(entry$target)
  if(!all(targets_in_entry %in% allowed_targets)){
    warning("ERROR: Some entries in `targets` do not correspond to standards:",
            paste0(targets_in_entry[!(targets_in_entry %in% allowed_targets)], collapse = ", "))
    return(invisible(FALSE))
  }else{
    cat("VALIDATED: targets\n")
    return(invisible(TRUE))
  }
}

This doesn't jibe with what I thought was required here to be included in the ensemble forecast (https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed#target). Perhaps this R code is no longer maintained. According to the documentation at https://github.com/reichlab/covid19-forecast-hub/blob/master/data-processed/R_forecast_file_validation.md,

For those familiar with R (but not python), there is a separate set of tests that may be useful to diagnose data formatting issues in functions_plausibility.R. We have tried to keep these in sync with the python checks automatically run during a pull request, but have now stopped maintaining the checks in R. They are kept in the repository merely as an additional resource for teams who work exclusively with R. If you discover major discrepancies, you can nonetheless let us know and we may address them as time permits.

... in fact, after digging around a little bit, it seems like this is the case!

That R script, https://github.com/reichlab/covid19-forecast-hub/blob/master/code/validation/R-scripts/functions_plausibility.R, was last updated in May. According to the README, https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed#removed-targets, N day ahead inc cases was removed in June.

@stephenturner
Contributor

To the script in our utils/ folder, I added N wk ahead inc case to the allowed targets in af56e43.
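For illustration only, a sketch of what extending the allowed targets and normalizing the target labels could look like. The 0:20 range for the case target simply mirrors the existing death entries and is an assumption, as is the sub() call; neither is taken from the commits referenced in this comment:

```r
allowed_targets <- c(
  paste(0:20, "wk ahead inc death"),
  paste(0:20, "wk ahead cum death"),
  paste(0:20, "wk ahead inc case")   # hypothetical addition mirroring the entries above
)

# Target labels as first produced in this thread (plural) ...
targets <- c("1 wk ahead inc cases", "1 wk ahead inc deaths", "1 wk ahead cum deaths")
# ... normalized to the singular form the validator expects.
targets_fixed <- sub("s$", "", targets)

all(targets %in% allowed_targets)        # plural labels are rejected
all(targets_fixed %in% allowed_targets)  # singular labels are accepted
```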

This let the results pass that validation check (after changing 'deaths' to 'death' and 'cases' to 'case' in 5de29cd). But another validation effort failed:

> validate_file(forecast_filename)


 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
VALIDATED: targets
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
 Error in if (any(is_crossing)) { : missing value where TRUE/FALSE needed 

I dug into the validation scripts and there's a place right around here https://github.com/reichlab/covid19-forecast-hub/blob/68df08d9e6e19d55fddab4bd5abb505202023ecb/code/validation/R-scripts/functions_plausibility.R#L259-L282 where it checks for "quantile crossing". I'm not exactly sure what this is doing yet, but I think what's causing a problem here is that some targets have different quantiles required than others. inc deaths and cum deaths require a larger set of quantiles, while N wk ahead inc case (the newly added target) requires only a subset of those quantiles. This is spelled out in the data submission readme here.

I think this causes a problem with this old legacy code because one of the operations it performs is a widening reshape, and when there are some targets with a subset of quantiles compared to other targets, you end up with NAs in the wide matrix. I still don't fully understand what this check is looking for, but I silenced this validation problem in a2cd111 by omitting NAs from this crossing check. All the others were FALSE. This obviates the Error seen above, and all validation checks pass.

> validate_file(forecast_filename)


 Validating /Users/sturner/sigsci/irad/focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv ...
VALIDATED: filename 
VALIDATED: column names
VALIDATED: no NA values
VALIDATED: targets
VALIDATED: date format 
VALIDATED: forecast_date, target_end_date
VALIDATED: no quantile crossing
VALIDATED: temporal monotonicity
VALIDATED: cum geq inc
VALIDATED: entries of `quantile`
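The NA problem in the quantile-crossing check described above can be reproduced in a few lines. This is a toy reconstruction with made-up values, not the upstream functions_plausibility.R code and not necessarily what a2cd111 does:

```r
# Wide layout: one row per target, one column per quantile level.
# The case target only reports a subset of levels, so widening leaves NAs.
wide <- rbind(
  "1 wk ahead inc death" = c(q0.025 = 10,  q0.25 = 14, q0.5 = 15,   q0.75 = 18, q0.975 = 25),
  "1 wk ahead inc case"  = c(q0.025 = 900, q0.25 = NA, q0.5 = 1000, q0.75 = NA, q0.975 = 1200)
)

# A crossing is a value that decreases as the quantile level increases.
is_crossing <- apply(wide, 1, function(x) any(diff(x) < 0))

any(is_crossing)                # NA, so `if (any(is_crossing))` errors
any(is_crossing, na.rm = TRUE)  # NAs dropped from the final check

# Alternative: drop missing levels within each row before differencing,
# so targets with fewer quantiles are still checked.
is_crossing_robust <- apply(wide, 1, function(x) any(diff(x[!is.na(x)]) < 0))
```

Dropping NAs only at the final any() silences the error but leaves the rows with missing levels unchecked; the per-row variant still tests them.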

CAVEAT: This works, but given the hacks I had to put into place to get this working, I'd recommend we either:

  1. Switch to the officially supported instructions for validating locally, https://github.com/reichlab/covid19-forecast-hub/wiki/Running-Checks-Locally
  2. Or else look around to see if someone else has forked and kept this R code up to date.

If we can find #-2 above, it sure would be more lightweight than going the #-1 route, which requires updating the upstream of the fork, installing some python pkgs, etc. Perhaps it isn't as burdensome as I think. I'll give it a spin on darwin if I can before our meeting today.

@stephenturner
Contributor

Follow up -- #-1 is pretty trivial. I set up a new conda environment, and followed the instructions at https://github.com/reichlab/covid19-forecast-hub/wiki/Running-Checks-Locally to install requirements and validate a single forecast file.

On darwin:

(focus) sturner@darwin:/data/projects/focus/covid19-forecast-hub$ python3 code/validation/validate_single_forecast_file.py ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv

VALIDATING ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv
✓ ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-04-sigsci-arima.csv is valid with no errors

🎉 🥳 🌟 ✔️

@vpnagraj
Contributor Author

vpnagraj commented Jan 4, 2021

@stephenturner parallel thought here ...

what if we put the python validation script / pkgs in a docker image ... and wrapped a call to that docker image in an R function (i.e. using something like stevedore) ?

i can help with that if we want to pursue. shouldn't be too big of a lift. BUT we'd obviously still need to make sure that validation code stays current

@stephenturner
Contributor

I'd almost always prefer to call an R function than issue a python command/script at the bash shell. Looks like the requirements are pretty minimal.

https://github.com/reichlab/covid19-forecast-hub/blob/master/visualization/requirements.txt

@vpnagraj
Contributor Author

vpnagraj commented Jan 4, 2021

agreed. see #9

@vpnagraj
Contributor Author

vpnagraj commented Jan 5, 2021

@stephenturner heads up i've heavily refactored the scratch submission mockup code:

https://github.com/signaturescience/focustools/blob/master/scratch/fable-submission-mockup-allmetrics.R

things to note:

  • the new ts_forecast() function now sits outside of format_fit_for_submission() (i think we can be a little more nimble this way)
  • ts_forecast() accepts horizon AND "new_data" args ... new_data is what fable needs for forecasts that require other covariates ... if NULL (default) then the new_data will be ignored. the way it's written now, ts_forecast() should work for forecasting either with or without new_data
  • i added a "seed" argument to ts_forecast() so that i could validate that the forecasts matched what you were generating previously (checked before i changed the ideaths to be predicted by lagged cases). probably a good idea to keep that in there
  • cdeaths is still being forecast using an ARIMA ... so we still need to figure out a way to use the ideaths forecast to arrive at cdeaths (AND still get the quibble format)
  • i ran into an issue with validating the submission file generated with the current date (2021-01-05) ... see message below. the workaround was to force the forecast date and filename to use yesterday (2021-01-04), after which validation succeeded.
VALIDATING ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-05-sigsci-ts.csv
✘ Error in ../focustools/scratch/fable-submission-mockup-allmetrics-forecasts/2021-01-05-sigsci-ts.csv. Error(s):
 ["target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'point', 'NA', '369373']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.01', '367031']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.025', '367118']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.05', '367358']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.1', '367969']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.15', '368250']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.2', '368334']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.25', '368688']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. 
exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.3', '369017']", "target_end_date was not the expected Saturday. forecast_date=2021-01-05, target_end_date=2021-01-09. exp_target_end_date=2021-01-16, row=['2021-01-05', '1 wk ahead cum death', '2021-01-09', 'US', 'quantile', '0.35', '369038']", 'target_end_date was ...']

any thoughts on that ^ ?

@stephenturner
Contributor

I don't know, unless it expected the target end date for 1 week ahead to end on the following saturday if you're dating the forecast after monday? I feel like I've seen something to this effect in the docs. Let me dig.

@vpnagraj
Contributor Author

vpnagraj commented Jan 5, 2021

sheesh.

well maybe thats OK? i mean im working on writing the validation wrapper for the python method now. we can stick to validating only before we are ready to submit on the sunday or monday. so as long as we generate the forecasts/validate on sunday or monday (before deadline) it should be fine? i think?
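The rule implied by the error messages above (a forecast dated Sunday or Monday targets the Saturday of the same epiweek for 1 wk ahead, while a later forecast date rolls to the following Saturday) can be sketched as below. `expected_target_end_date()` is a hypothetical helper inferred from those messages, not the validator's own code, and it handles one forecast date at a time:

```r
expected_target_end_date <- function(forecast_date, horizon = 1) {
  forecast_date <- as.Date(forecast_date)
  wday <- as.integer(format(forecast_date, "%w"))  # 0 = Sunday, ..., 6 = Saturday
  this_saturday <- forecast_date + (6 - wday)
  # Sunday/Monday forecasts: week 1 ends this Saturday; later days roll to the next one.
  first_saturday <- if (wday <= 1) this_saturday else this_saturday + 7
  first_saturday + 7 * (horizon - 1)
}

expected_target_end_date("2021-01-04")  # Monday
expected_target_end_date("2021-01-05")  # Tuesday
```

Under this reading, the Monday call gives the 2021-01-09 date that validated, and the Tuesday call gives the exp_target_end_date reported in the error.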

@stephenturner
Contributor

See #26. Reopening because there's currently a line hard-coding "US" as the location:

dplyr::mutate(location="US", forecast_date=lubridate::today()) %>%


This will not allow for state or county-level granularity.

@vpnagraj
Contributor Author

@stephenturner see https://github.com/signaturescience/focustools/blob/state-level-ts/R/submission.R#L72

i removed the location="US" that was hardcoded in there. the forecast object should include a location column generated by get_cases() / get_deaths():

  • granularity="national" the value will be "US"
  • granularity="state" the value will be full state name
  • granularity="county" the value will be county fips code

we do need to convert the state/territory name to appropriate FIPS:

https://github.com/signaturescience/focustools/blob/state-level-ts/R/submission.R#L72

i think that will be a simple join to focustools:::locations somewhere in format_for_submission() ?

@vpnagraj
Contributor Author

sorry to steamroll you here @stephenturner but i'm cooking on this state level stuff!

i just pushed up an edit to format_for_submission() that addresses the location join

that piece seems to be working now. mostly.

i'm seeing the following issues in validate_forecast() (full output at bottom of this comment):

  • "entries in the value column must be non-negative": these are state/territory forecasts that come through as negative. my guess is that most of the negative values are from models of territories where there are few cases (for example, location '78' is Virgin Islands and is one that has negative values predicted). so it's really an issue with models themselves, not necessarily formatting (State level forecasts #26 ). unless we want to stick a condition in format_for_submission() that bounds all values at min 0? i kind of think that should go elsewhere ...
  • "invalid location for target. location='11001'": that's the location code for DC. we need to figure out what the correct code should be
  • "target_end_date was not the expected Saturday." : that's because i'm running on a tuesday ...
[1] "entries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '02', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '66', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.01', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '1 wk ahead inc death', '2021-01-23', '78', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '02', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '66', 'quantile', '0.05', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '78', 'quantile', '0.01', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '2 wk ahead inc death', '2021-01-30', '78', 'quantile', '0.025', '-1']\nentries in the `value` column must be non-negative. value='-1'. row=['2021-01-19', '3 wk ahead inc death', '2021-02-06', '02', 'quantile', '0.025', '-1']\nentries in the `valu...\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'point', 'NA', '1426']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.01', '877']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. 
row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.025', '881']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.05', '885']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.1', '887']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.15', '888']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.2', '889']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.25', '890']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.3', '890']\ninvalid location for target. location='11001', target='1 wk ahead cum death'. row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '11001', 'quantile', '0.35', '892']\ninvalid location for...\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'point', 'NA', '6533']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.01', '6249']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. 
exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.025', '6249']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.05', '6391']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.1', '6401']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.15', '6408']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.2', '6432']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.25', '6445']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.3', '6446']\ntarget_end_date was not the expected Saturday. forecast_date=2021-01-19, target_end_date=2021-01-23. exp_target_end_date=2021-01-30, row=['2021-01-19', '1 wk ahead cum death', '2021-01-23', '01', 'quantile', '0.35', '6446']\ntarget_end_date was ..."

@stephenturner
Contributor

unless we want to stick a condition in format_for_submission() that bounds all values at min 0? i kind of think that should go elsewhere ...

Bound it at zero for now. We could get more sophisticated... for cum deaths we would bound point and all quantiles at no less than the last week's current data. Inc death/cases: it seems reasonable that the +1wk ahead should be no less than 2x the difference between 0 and -1wk. Or +2wk ahead should be no less than 2x the difference between 0 and -2wk. And still bounded at zero. I.e., enforcing that you can't drop incident cases/deaths more than twice as much as they changed in a previous horizon backward?

Where to do it? Agree doesn't really belong in a formatting script. But the ts_forecast doesn't yet track the data (#17), so you'd have to supply that as an arg there. Perhaps some final thing after formatting for submission, something like bound_submission(submission, data)? Although that could get tricky with submissions with multiple location granularities (eg from a bind_rows on a US level forecast with a state-level forecast) from different data objects with different location granularity?
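A rough sketch of the bound_submission(submission, data) idea floated above, keyed on location code so that mixed granularities can share one call. The column names (`target`, `location`, `value`, `cdeaths`) and all numbers are assumptions, and only the zero floor and the cumulative floor are implemented, not the 2x rule:

```r
bound_submission <- function(submission, latest) {
  # Zero floor for every target.
  submission$value <- pmax(submission$value, 0)
  # Cumulative targets: floor at the latest observed count for the same location.
  is_cum      <- grepl("cum death", submission$target, fixed = TRUE)
  floor_value <- latest$cdeaths[match(submission$location, latest$location)]
  use_floor   <- is_cum & !is.na(floor_value)
  submission$value[use_floor] <- pmax(submission$value[use_floor], floor_value[use_floor])
  submission
}

# Made-up submission rows and latest observed cumulative deaths per location.
submission <- data.frame(
  target   = c("1 wk ahead inc death", "1 wk ahead cum death", "1 wk ahead cum death"),
  location = c("78", "US", "01"),
  value    = c(-1, 349000, 6249),
  stringsAsFactors = FALSE
)
latest <- data.frame(
  location = c("US", "01"),
  cdeaths  = c(350000, 6200),
  stringsAsFactors = FALSE
)

bounded <- bound_submission(submission, latest)
```

Locations missing from `latest` only get the zero floor, which keeps the function usable when the observed data covers fewer locations than the submission.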

@stephenturner
Contributor

  • "invalid location for target. location='11001'": that's the location code for DC. we need to figure out what the correct code should be

"District of Columbia" is 11 right

DC,11,District of Columbia,705749.0

@vpnagraj
Contributor Author

ahh DC is both:

,11001,District of Columbia,705749.0

11001 must be the county FIPS

need to make a special case to handle that somehow
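One possible way to special-case this, sketched with a tiny stand-in for the locations lookup. The two DC rows follow the lines quoted above and the Virginia row is added for contrast; the column names are assumptions, and this is not necessarily the fix that was pushed to the branch:

```r
# Hypothetical slice of a locations lookup.
locations <- data.frame(
  abbreviation  = c("DC", NA, "VA"),
  location      = c("11", "11001", "51"),
  location_name = c("District of Columbia", "District of Columbia", "Virginia"),
  stringsAsFactors = FALSE
)

# Made-up state-level forecast rows keyed by full state name.
forecast <- data.frame(
  location_name = c("District of Columbia", "Virginia"),
  value         = c(1426, 5000),
  stringsAsFactors = FALSE
)

# Naive join: DC matches both its state-level and county-level codes.
naive <- merge(forecast, locations, by = "location_name")

# State-level join: keep only two-character codes before joining.
state_codes <- locations[nchar(locations$location) == 2, ]
fixed <- merge(forecast, state_codes, by = "location_name")
```

Filtering on code length relies on state-level FIPS codes being two characters and county codes five, so the lookup has to store them as strings with leading zeros intact.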

@stephenturner
Contributor

Looks like there are lots of counties with the same name in different states (Washington, Jefferson, Franklin, no surprise). DC looks like the only non-county dupe.

> focustools:::locations %>% 
+   count(location_name, sort=TRUE) %>% 
+   filter(n>1)
# A tibble: 441 x 2
   location_name         n
   <chr>             <int>
 1 Washington County    31
 2 Jefferson County     26
 3 Franklin County      25
 4 Jackson County       24
 5 Lincoln County       24
 6 Madison County       20
 7 Clay County          18
 8 Montgomery County    18
 9 Union County         18
10 Marion County        17
# … with 431 more rows
> focustools:::locations %>% 
+   count(location_name, sort=TRUE) %>% 
+   filter(n>1) %>% 
+   filter(!grepl("County", location_name))
# A tibble: 1 x 2
  location_name            n
  <chr>                <int>
1 District of Columbia     2

I was worried about eg Hawaii (county) vs Hawaii (state) but no problem there.

@vpnagraj
Contributor Author

heads up i think i have a solution for this. pushing up soon ...

@vpnagraj
Contributor Author

edits pushed up to state-level-ts branch to address the location code issues:

@vpnagraj
Contributor Author

i think we're good with the data prep for the state forecasts. just need to make some decisions about which states/territories to submit (#26) and make some minor edits to the pipeline function (#16 )

closing this one for now.

@stephenturner
Contributor

This one will probably get reopened from work in #26 if getting quantiles via hilo

@stephenturner stephenturner reopened this Jan 26, 2021
@stephenturner
Contributor

Actually, handling this in the forecast function so won't have to change this.
