effective data management

author:
date: 2017-11-17
autosize: true
font-family: 'Arial'
width: 1920
height: 1080

workflow

title: false

research data workflow: this is your world (or it will be soon)

modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/); Bryan adapted it from Roger Peng (biostat.jhsph.edu/~rpeng/)

challenges

title: true

how do you keep that up-to-date?
what if something changes, what if something needs to be redone - how do you manage that?
why do the results in table 1 not seem to correspond to the results in figure 1?
why were those particular samples omitted?
where did I get these data?
how did I make that figure?

basically, if the thought of redoing your analyses is terrifying then you are doing it wrong (paraphrasing Jenny Bryan)

workflow with data

title: false

research data workflow: this is your world (or it will be soon)

now add to that publishing your data and code

modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/); Bryan adapted it from Roger Peng (biostat.jhsph.edu/~rpeng/)

reproducible research

title: false

reproducible: the calculation of quantitative scientific results by independent researchers using the original data and methods (National Science Foundation Subcommittee on Replicability in Science)

not as stringent as replicable: can someone repeat the experiment and get the same result?

Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016

what we will cover today

title: true

best practices
- naming
- organization
- scripting
spreadsheets
literate programming
version control
getting started
data management plans & publishing your data

BP (Best Practice): naming - principles for file names

title: true

machine readable
human readable
plays well with default ordering

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - machine readable

title: true

regular expression and globbing friendly
- avoid spaces, punctuation, accented characters, case sensitivity
easy to compute on with deliberate use of delimiters

example: "2017-11-17_berneilwash_oxygen_day_1.csv"

underscores allow us to delimit units of metadata and facilitate searching

* easy to search for files later
* easy to narrow file lists based on names
* easy to extract information from the file names, e.g., by splitting
* avoiding spaces, punctuation, accented characters, and case sensitivity will make your life much easier
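
As a minimal sketch of why deliberate delimiters pay off, the Python snippet below splits the example name on underscores and narrows a file list with a glob-style pattern. The field meanings (date, site, description) and the two extra file names are assumptions made for illustration; only the first name comes from the slide.

```python
import fnmatch
from pathlib import PurePath


def parse_name(filename):
    """Split a delimited file name into metadata fields.

    Assumed (hypothetical) layout: date_site_description.ext
    """
    stem = PurePath(filename).stem                 # drop the ".csv" extension
    date, site, description = stem.split("_", 2)   # split on the first two underscores only
    return {"date": date, "site": site, "description": description}


# The first name is the slide's example; the other two are invented for illustration.
names = [
    "2017-11-17_berneilwash_oxygen_day_1.csv",
    "2017-11-17_berneilwash_nitrate_day_1.csv",
    "2017-11-18_othersite_oxygen_day_2.csv",
]

fields = parse_name(names[0])
oxygen_files = fnmatch.filter(names, "*_oxygen_*.csv")  # glob-style narrowing

print(fields)
print(oxygen_files)
```

Because the name has no spaces or stray punctuation, neither the split nor the pattern needs any quoting or special-case handling.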

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - human readable

title: true

names contain info about the content
easy to figure out what something is based on the name

for example 2016_salmon_counts.csv actually conveys a lot of information about the object, and has a whole lot more meaning than fishData.csv

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - names matter in times of stress

title: true

which set of file(name)s would you prefer at 3 a.m. before a deadline?

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - plays well with default ordering

title: true

put something numeric first
use the ISO 8601 standard for dates (YYYY-MM-DD) everywhere, always, without exception, ever
left pad numbers with zeroes as needed
without padding, 11 sorts before 2:
- 1_file_name.csv
- 11_file_name.csv
- 2_file_name.csv
with padding, the order is as intended:
- 01_file_name.csv
- 02_file_name.csv
- 11_file_name.csv
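
The effect of left padding can be checked with a short Python sketch. The helper `pad_prefix` is a hypothetical name introduced here, and it assumes the number is the part of the name before the first underscore. The exact position of `1_` relative to `11_` depends on the sorting rules of the tool in use, but in a plain text sort `11` always lands ahead of `2` until the numbers are padded.

```python
unpadded = ["1_file_name.csv", "11_file_name.csv", "2_file_name.csv"]


def pad_prefix(filename, width=2):
    """Left-pad the numeric prefix (text before the first underscore) with zeroes."""
    prefix, rest = filename.split("_", 1)
    return f"{prefix.zfill(width)}_{rest}"


padded = [pad_prefix(name) for name in unpadded]

print(sorted(unpadded))  # text sort: file 11 comes before file 2
print(sorted(padded))    # text sort now matches numeric order
```

Choose the padding width from the largest number you expect: two digits cover up to 99 files, three digits up to 999.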

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: dates matter (a lot!)

title: true

comprehensive map of all countries in the world that use MMDDYYYY format

use the ISO 8601 standard for dates (YYYY-MM-DD) everywhere, always, without exception, ever
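
One practical reason for ISO 8601 is that sorting the dates as plain text also sorts them chronologically. The sketch below illustrates this in Python; the three month-first dates are invented for the example, and `to_iso` is a hypothetical helper name.

```python
from datetime import datetime


def to_iso(us_date):
    """Convert an MM/DD/YYYY string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()


# Hypothetical dates written month-first.
us_dates = ["11/17/2017", "02/03/2018", "12/01/2016"]
iso_dates = [to_iso(d) for d in us_dates]

print(sorted(us_dates))   # text order, which is not chronological here
print(sorted(iso_dates))  # text order equals chronological order
```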

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: employ sound project organization

title: true
left: 80%

your closest collaborator is you six months ago, but you do not reply to emails (Karl Broman paraphrasing Mark Holder)

make the project understandable to others, where "others" includes your future self
segregate all the materials for a project in one directory
separate raw from processed data; put code in a separate directory
include README files
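
A project skeleton along these lines can be created with a few lines of Python. This is only a sketch: the directory names (`data/raw`, `data/processed`, `code`) and the placeholder README text are assumptions, not a prescribed layout.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: raw and processed data kept apart, code in its own directory.
SUBDIRS = ["data/raw", "data/processed", "code"]


def create_project(root):
    """Create a project skeleton with a README in each directory listed in SUBDIRS."""
    root = Path(root)
    for sub in SUBDIRS:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():  # never overwrite notes that already exist
            readme.write_text(f"Describe the contents of {sub}/ here.\n")
    top_readme = root / "README.md"
    if not top_readme.exists():
        top_readme.write_text("Project overview, data sources, and how to run the code.\n")
    return root


# Build the skeleton in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my_project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print("\n".join(created))
```

For a real project, call `create_project` with the path where the project should live instead of a temporary directory.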

Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016

BP: research compendium

title: true

...a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research

general principles:

organize according to prevailing conventions (e.g., R package structure)
maintain a clear separation of data, method, and output, while unambiguously expressing the relationship between the three
specify the computational environment used for the original analysis
organize such that another person can know what to expect from the plain meaning of the file and directory names
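
As one illustration of specifying the computational environment, the Python sketch below collects basic interpreter and operating system details that could be saved next to the analysis. The function name `describe_environment` and the choice of fields are assumptions. A full record would also list package versions, and in R `sessionInfo()` plays a comparable role.

```python
import json
import platform
import sys


def describe_environment():
    """Collect basic facts about the interpreter and operating system."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "executable": sys.executable,
    }


environment = describe_environment()
print(json.dumps(environment, indent=2))
```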

Marwick B, Boettiger C, Mullen L. (2017) Packaging data analytical work reproducibly using R (and friends) PeerJ Preprints 5:e3192v1 https://doi.org/10.7287/peerj.preprints.3192v1

BP: keep the raw data raw

title: true

save the raw data