Project_plan.txt

February 2, 2018::

Accomplished at NCEAS:

1.) Data pipeline for NASS yield data established
2.) All SSURGO and STATSGO files downloaded
3.) CDL crop frequency layer downloaded

To do: 

1.) Make decisions re: filtering NASS yield data
	X Availability of CV metrics - 2/27/18: no CV metrics available for corn yield
	- Include/exclude counties with little corn production prior to ethanol legislation? Or truncate entire time series at the point the legislation was introduced (circa 2001)?
	  - To consider: consistency of data reporting. Need contiguous data.
	  - Alternatively, can use the crop frequency layer -- check metadata for time period and changes in classification methods
	  - Corn yield data goes back really far, though. Is there a trade-off in cutting things off at such a recent time point?
	  - Judging by yields, corn prior to WWII was basically a different crop, so I think ~1950 is the farthest back we could go.
	- How to use irrigation numbers? Take average from 3 census years to get thresholds?
	  - Ag census only began collecting info on irrigation starting in 1997 --> USE 1997 as yield data cut off
	  - In some counties, irrigation numbers change frequently (surprising)
	  - Suggest subsetting based on both % irrigated and some measure of the variance in % irrigated
	    - Keep track of these smaller analyses used to make filtering choices throughout...
2.) Decide how to use SSURGO and STATSGO data
	- Which variables?
	- Whether to add measurement error?
	- Distinguish between modeled and empirical measures
	- Whether and how to aggregate data across spatial scales (i.e. horizon -> component -> mukey -> county)
3.) Give Steve skeleton dataset to begin testing Bayesian modeling framework
4.) Clean/consolidate yield and soils data pipelines
5.) Decide how to use CDL crop layer to clip soils data (current thinking: use crop frequency layer to clip grids that are used for corn production)
6.) Model with final dataset
	- Decide on variability metrics, methods
	- Use Bayesian modeling? Or ML?
7.) Figures 
	- Coefficient strengths or variable importance
	- Maps of variability across corn growing areas
8.) Explore possibility of isolating drought year data...
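
Sketch for the aggregation question in item 2 (horizon -> component -> mukey). This is only one possible approach, not a decision: thickness-weighted mean within each component, then component-percentage-weighted mean within each map unit. The field names (top_cm, bottom_cm, pct, om), the 30 cm depth cutoff, and the example values are all placeholders made up for illustration, not actual SSURGO columns or data.

```python
from collections import defaultdict


def depth_weighted_mean(horizons, value_key, max_depth_cm=30.0):
    """Thickness-weighted mean of one property over the top max_depth_cm.

    horizons: list of dicts with hypothetical keys top_cm, bottom_cm, <value_key>.
    Returns None if no horizon contributes.
    """
    total = 0.0
    weight = 0.0
    for h in horizons:
        # Only count the part of the horizon above the depth cutoff.
        thickness = min(h["bottom_cm"], max_depth_cm) - h["top_cm"]
        if thickness <= 0 or h[value_key] is None:
            continue
        total += h[value_key] * thickness
        weight += thickness
    return total / weight if weight > 0 else None


def aggregate_to_mukey(components, value_key, max_depth_cm=30.0):
    """Component-percentage-weighted mean of the property per map unit."""
    sums = defaultdict(float)
    weights = defaultdict(float)
    for c in components:
        value = depth_weighted_mean(c["horizons"], value_key, max_depth_cm)
        if value is None:
            continue
        sums[c["mukey"]] += value * c["pct"]
        weights[c["mukey"]] += c["pct"]
    return {mukey: sums[mukey] / weights[mukey] for mukey in sums}


# Made-up example: one map unit with two components.
example_components = [
    {"mukey": "A", "pct": 60, "horizons": [
        {"top_cm": 0, "bottom_cm": 20, "om": 4.0},
        {"top_cm": 20, "bottom_cm": 50, "om": 1.0},
    ]},
    {"mukey": "A", "pct": 40, "horizons": [
        {"top_cm": 0, "bottom_cm": 30, "om": 2.0},
    ]},
]
mukey_means = aggregate_to_mukey(example_components, "om")
```

The mukey -> county step would need map unit areas (or clipped cell counts) as weights, which is still an open choice above.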

March 2, 2018::

Decisions made:
- Use 1997 as a cutoff for yield data given that the data on irrigated acres starts then
- Don't try to parse out yield data based on different ag census periods. All yield data should be across the same time period and number of years. -->
- Take means of % irrigated acreage across the 4 census years, as well as some measure of variation. Only select counties that have had a consistent number of irrigated acres since 1997 and a % irrigated that falls in the bottom of the distribution (TBD, but likely around 5% or less).
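
Rough sketch of the irrigation filter described in the last decision. Assumptions, loudly: the input is a dict of county -> list of % irrigated values (one per census year), the 5% mean threshold is the tentative figure from the note above, and the standard deviation cutoff of 1 percentage point is a pure placeholder since the measure of variation has not been chosen yet.

```python
from statistics import mean, pstdev


def select_rainfed_counties(pct_irrigated, max_mean_pct=5.0, max_sd_pct=1.0, n_years=4):
    """Return counties with low and stable % irrigated acreage.

    pct_irrigated: {county_id: [pct irrigated in each census year]}
    max_mean_pct:  tentative threshold (notes say "likely around 5% or less")
    max_sd_pct:    placeholder stability cutoff, not a decided value
    n_years:       number of census years required per county
    """
    selected = []
    for county, values in pct_irrigated.items():
        if len(values) != n_years:
            continue  # incomplete census record, cannot judge consistency
        if mean(values) <= max_mean_pct and pstdev(values) <= max_sd_pct:
            selected.append(county)
    return sorted(selected)


# Made-up values for illustration only.
example_pct = {
    "county_a": [1.0, 1.5, 0.5, 1.0],      # low and stable -> kept
    "county_b": [60.0, 65.0, 70.0, 62.0],  # heavily irrigated -> dropped
    "county_c": [1.0, 9.0, 2.0, 8.0],      # low mean but erratic -> dropped
    "county_d": [2.0, 2.5],                # missing census years -> dropped
}
rainfed = select_rainfed_counties(example_pct)
```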

To do:

1.) To do items 2 through 7 from 2/2/2018
2.) Check that data subset from NASS is contiguous across time period 
3.) Provide Steve with toy dataset to test out his Stan code (subset 1 state from 1997- period) DONE
4.) Make pull requests for code review to Emma once NASS data ETL is finished
5.) Begin Markdown file that includes smaller analyses used to make data filtering decisions. Will help to construct arguments in future paper. DONE
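
Possible shape for the contiguity check in item 2: list, for each county, the years in the study window with no reported yield. The record layout (county, year, yield) and the end year are assumptions for illustration; only the 1997 start comes from the decisions above.

```python
def counties_with_gaps(records, start_year, end_year):
    """Map each county with an incomplete series to its sorted missing years.

    records: iterable of (county, year, yield_value); a yield_value of None
             is treated as missing.
    """
    expected = set(range(start_year, end_year + 1))
    seen = {}
    for county, year, value in records:
        years = seen.setdefault(county, set())
        if value is not None:
            years.add(year)
    return {
        county: sorted(expected - years)
        for county, years in seen.items()
        if expected - years
    }


# Made-up records for illustration only.
example_records = [
    ("county_a", 1997, 120.0), ("county_a", 1998, 125.0), ("county_a", 1999, 130.0),
    ("county_b", 1997, 100.0), ("county_b", 1999, None),
]
gaps = counties_with_gaps(example_records, 1997, 1999)
```

Counties that appear in the result would either be dropped or flagged for the interpolation question.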


March 21, 2018::

To date:
- Completed NASS data pipeline and filtering. Still some decisions to be made, and some questions popped up for me while looking at the data.
- Developing script to load SSURGO data. Encountering a problem I remember encountering before - the valu1.gdb database has no spatial features and is meant to be joined in a GIS with the MU polygons in SSURGO. Going to see if gSSURGO includes soil C data and would circumvent this problem, but not sure.

To do:
1.) Test different ways to filter out counties where corn isn't a priority crop
2.) Check out droughtmonitor.unl.edu
3.) Plot irrigated acres against total corn acreage to look for red flags
4.) Interpolate missing yield data by looking at % drop...
5.) Connect with Brent Myers?
  
April 23, 2018::

- Check when classification scheme on crop frequency data layer changed. 2010?
- Use all of crop frequency raster, weight by number of years in corn per county. 

- Temporal yield with weather variables--need to generate reliable drought severity index
- All models with latent soil variables
- Check for need for spatial error modeling?
- Need for more explicit temporal model, like specific autoregressive model or temporal error structure?
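
One reading of "weight by number of years in corn per county", sketched under assumptions: each raster cell carries a county, a count of years in corn from the crop frequency layer, and a soil value, and the county-level soil value is the mean weighted by years in corn. The tuple layout and numbers are invented for illustration; other readings of the note are possible.

```python
from collections import defaultdict


def corn_weighted_county_means(cells):
    """County mean of a soil value, weighted by each cell's years in corn.

    cells: iterable of (county, years_in_corn, soil_value).
    Cells never in corn (weight 0) or with a missing value are ignored.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for county, years_in_corn, soil_value in cells:
        if years_in_corn <= 0 or soil_value is None:
            continue
        numerator[county] += years_in_corn * soil_value
        denominator[county] += years_in_corn
    return {county: numerator[county] / denominator[county] for county in numerator}


# Made-up cells for illustration only.
example_cells = [
    ("county_a", 8, 3.0),
    ("county_a", 2, 1.0),
    ("county_a", 0, 10.0),  # never in corn, so it has no influence
    ("county_b", 5, 2.0),
]
county_means = corn_weighted_county_means(example_cells)
```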

January 2019:
How we dealt with drought.
1. drought-data-API-call.R: pre-selects counties and pulls from the API.
2. drought-data-processing.R: main script to process the data.

March 7, 2019:

To do:
1. Subset out 50% of data and use model to predict other 50%. Want to make sure the subsetting doesn't drop areas that are key to variation in the data, like the high SOC counties. Is random the right way?
2. Remove the part of the model with SOC (and the interaction) for the 50% of the data and re-predict whole data set. How much does predictive capacity change?
3. Run model of yield ~ SOC + soil covariates (no drought, no interaction) and plot coefficient (and error bars) of SOC on yield by year (year on x axis, coefficient on y axis)
4. Do logistic regression where response variable is 1/0 if yield in year X dropped Y% from 5-year moving average. Then plot coefficients by year, like above?
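
Sketch of how the 1/0 response in item 4 could be built for one county. Assumptions: the 5-year moving average is taken over the five years before year X (not including X itself), the series is sorted by year with no gaps, and the first five years get no flag because they lack a full baseline. The drop threshold Y is left as a parameter; the 20% in the example is arbitrary.

```python
def yield_drop_flags(yields, drop_pct, window=5):
    """Flag years whose yield fell at least drop_pct below the trailing mean.

    yields:   list of (year, yield_value), sorted by year, no gaps
    drop_pct: the Y in "dropped Y%" (e.g. 20 for a 20% drop)
    window:   number of preceding years in the moving average
    Returns {year: 1 or 0} for years with a full preceding window.
    """
    flags = {}
    for i in range(window, len(yields)):
        year, current = yields[i]
        # Baseline uses only the `window` years before the current one.
        baseline = sum(value for _, value in yields[i - window:i]) / window
        threshold = baseline * (1 - drop_pct / 100.0)
        flags[year] = 1 if current <= threshold else 0
    return flags


# Made-up series for illustration only.
example_yields = [
    (2000, 100.0), (2001, 100.0), (2002, 100.0), (2003, 100.0), (2004, 100.0),
    (2005, 70.0), (2006, 95.0),
]
drop_flags = yield_drop_flags(example_yields, drop_pct=20)
```

Note that a bad year pulls down the baseline for the following years, which may or may not be desirable.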


June 2019:

- Incorporating one last risk metric using the RMA data that are available at https://swclimatehub.info/rma/rma-data-viewer.html

August 7, 2019:

- RMA data results are interesting: as SOC increases, RMA loss cost decreases, which indicates that SOC protects against losses that would otherwise require insurance payouts.