author: date: 2017-11-17 autosize: true font-family: 'Arial' width: 1920 height: 1080
title: false
research data workflow: this is your world (or it will be soon)
modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/), which she adapted from Roger Peng (biostat.jhsph.edu/~rpeng/)
title: true
- how do you keep that up-to-date?
- what if something changes, what if something needs to be redone - how do you manage that?
- why do the results in table 1 not seem to correspond to the results in figure 1?
- why were those particular samples omitted?
- where did I get these data?
- how did I make that figure?
basically, if the thought of redoing your analyses is terrifying, then you are doing it wrong (paraphrasing Jenny Bryan)
title: false
now add to that publishing your data and code
title: false
reproducible: the calculation of quantitative scientific results by independent researchers using the original data and methods (National Science Foundation Subcommittee on Replicability in Science)
not as stringent as replicable: can someone repeat the experiment and get the same result
Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016
title: true
- best practices
- naming
- organization
- scripting
- spreadsheets
- literate programming
- version control
- getting started
- data management plans & publishing your data
title: true
- machine readable
- human readable
- plays well with default ordering
"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files
title: true
- regular expression and globbing friendly
- avoid spaces, punctuation, accented characters, case sensitivity
- easy to compute on with deliberate use of delimiters
example: "2017-11-17_berneilwash_oxygen_day_1.csv"
underscores allow us to delimit units of metadata and facilitate searching
- easy to search for files later
- easy to narrow file lists based on names
- easy to extract information from the file names, e.g., by splitting
- avoiding spaces, punctuation, accented characters, and case sensitivity will make your life much easier
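A minimal sketch of computing on file names, using the example name above and base R only. The second file name and the column labels (`date`, `site`, `analyte`, `day`) are assumptions made for illustration.

```r
# File names following the pattern in the example above; the second name is
# hypothetical and added only so there is something to filter.
files <- c(
  "2017-11-17_berneilwash_oxygen_day_1.csv",
  "2017-11-18_berneilwash_nitrate_day_2.csv"
)

# Narrow the list with a regular expression on one metadata field.
oxygen_files <- grep("_oxygen_", files, value = TRUE)

# Drop the extension, then split on the underscore delimiter.
stems <- tools::file_path_sans_ext(files)
fields <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

# The column labels are assumptions, not part of the original example.
metadata <- data.frame(
  date = as.Date(fields[, 1]),
  site = fields[, 2],
  analyte = fields[, 3],
  day = as.integer(fields[, 5])
)
print(metadata)
```

Because the date uses hyphens and the metadata units use underscores, one split on `_` recovers every field without touching the date.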
title: true
- names contain info about the content
- easy to figure out what something is based on the name
for example 2016_salmon_counts.csv actually conveys a lot of information about the object, and has a whole lot more meaning than fishData.csv
title: true
which set of file(name)s would you prefer at 3 a.m. before a deadline?
title: true
- put something numeric first
- use the ISO 8601 standard for dates (YYYY-MM-DD) everywhere, always, without exception, ever
- left pad numbers with zeroes as needed
without padding, file names sort out of numeric order:
- 1_file_name.csv
- 11_file_name.csv
- 2_file_name.csv
with padding, they sort as intended:
- 01_file_name.csv
- 02_file_name.csv
- 11_file_name.csv
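A short sketch of left padding in base R. The sequence numbers are hypothetical, and `method = "radix"` is used only so that the sort does not depend on the machine's locale; the exact unpadded order can differ between systems, but it will not be numeric.

```r
# Hypothetical sequence numbers for a set of files.
ids <- c(1L, 2L, 11L)

unpadded <- paste0(ids, "_file_name.csv")
# sprintf("%02d", ...) left-pads each number with zeroes to width 2.
padded <- sprintf("%02d_file_name.csv", ids)

# method = "radix" sorts by character code regardless of locale.
sorted_unpadded <- sort(unpadded, method = "radix")
sorted_padded <- sort(padded, method = "radix")

print(sorted_unpadded)
print(sorted_padded)
```

Choose the padding width from the largest number you expect, e.g. `%03d` if there may be more than 99 files.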
title: true
comprehensive map of all countries in the world that use MMDDYYYY format
use the ISO 8601 standard for dates (YYYY-MM-DD) everywhere, always, without exception, ever
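One way to convert month-first dates to ISO 8601 in base R is sketched here. The input dates are made up for illustration; the point is that ISO strings sort chronologically even when treated as plain text.

```r
# Hypothetical dates written month-first, as they might arrive.
us_dates <- c("11/17/2017", "02/03/2018", "12/01/2016")

# Parse with an explicit format, then write back out as YYYY-MM-DD.
parsed <- as.Date(us_dates, format = "%m/%d/%Y")
iso <- format(parsed, "%Y-%m-%d")

# Sorting ISO 8601 strings as plain text also sorts them chronologically.
iso_sorted <- sort(iso, method = "radix")
print(iso_sorted)
```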
title: true left: 80%
your closest collaborator is you six months ago, but you do not reply to emails (Karl Broman paraphrasing Mark Holder)
- make the project understandable to others, where others includes your future self
- segregate all the materials for a project in one directory
- separate raw from processed data; put code in a separate directory
- include README files
Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016
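The layout described above could be set up with a few lines of base R. This is only a sketch: the project name, folder names, and README wording are assumptions rather than a fixed standard, and everything is created under `tempdir()` so nothing real is touched.

```r
# Build the skeleton in a temporary directory so nothing real is touched.
# The project and folder names below are assumptions, not a fixed standard.
project <- file.path(tempdir(), "my_project")
folders <- c("data_raw", "data_processed", "code", "output")

for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# A short README telling a future reader what lives where.
writeLines(
  c(
    "my_project",
    "data_raw/: original files, never edited by hand",
    "data_processed/: files written by scripts in code/",
    "code/: scripts that turn raw data into results",
    "output/: figures and tables"
  ),
  file.path(project, "README.txt")
)

print(list.files(project))
```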
title: true
...a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research
general principles:
- organize according to prevailing conventions (e.g., R package structure)
- maintain a clear separation of data, method, and output, while unambiguously expressing the relationship between the three
- specify the computational environment used for the original analysis
- organize such that another person can know what to expect from the plain meaning of the file and directory names
Marwick B, Boettiger C, Mullen L. (2017) Packaging data analytical work reproducibly using R (and friends) PeerJ Preprints 5:e3192v1 https://doi.org/10.7287/peerj.preprints.3192v1
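As one small step toward specifying the computational environment, the R version, platform, and loaded packages can be written to a text file that travels with the project. The file name here is an assumption, and this is a sketch rather than a complete solution.

```r
# Write the R version, platform, and attached packages to a text file.
# The file name is an assumption; tempdir() keeps the example side-effect free.
env_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), env_file)

# Show the first few recorded lines.
print(head(readLines(env_file)))
```

Run this at the end of an analysis script so the record reflects the packages that were actually loaded.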
title: true
- save the raw data
Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
title: true
- curate your data in a way that you would like to receive it
title: true
- use open file formats (e.g., csv not xlsx)
title: true
- create analysis-friendly data:
- each column a variable
- each row an observation
title: false
- each column a variable
- each row an observation
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
title: false
- each column a variable
- each row an observation
Species | metric | value |
---|---|---|
setosa | Sepal.Length | 5.1 |
setosa | Sepal.Length | 4.9 |
setosa | Sepal.Length | 4.7 |
setosa | Sepal.Length | 4.6 |
setosa | Sepal.Length | 5.0 |
setosa | Sepal.Length | 5.4 |
setosa | Sepal.Width | 3.5 |
setosa | Sepal.Width | 3.0 |
setosa | Sepal.Width | 3.2 |
setosa | Sepal.Width | 3.1 |
setosa | Sepal.Width | 3.6 |
setosa | Sepal.Width | 3.9 |
setosa | Petal.Length | 1.4 |
setosa | Petal.Length | 1.4 |
setosa | Petal.Length | 1.3 |
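The two tables above hold the same measurements in wide and long layouts. A base R sketch of going from one to the other, using the `iris` data set that ships with R (packages such as tidyr offer dedicated functions for this; base R is used here only to keep the example dependency-free):

```r
wide <- head(iris)  # iris ships with R; first six rows, as in the wide table
measures <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Stack the measurement columns: one row per species/metric/value combination.
long <- data.frame(
  Species = rep(wide$Species, times = length(measures)),
  metric = rep(measures, each = nrow(wide)),
  value = unlist(wide[measures], use.names = FALSE)
)
print(head(long))
```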
title: true
- record all the steps of the data process
title: true
The most basic principle for reproducible research is: do everything via code
- downloading data from the web,
- converting an Excel file to CSV,
- renaming columns or variables,
- omitting bad samples or data points
- ...do all of these with scripts
You will be tempted to open up a data file and hand-edit. But if you get a revised version of that file, you will need to do it again, and it will be harder to figure out what it was that you did.
Some things are more cumbersome via code but you will save time in the long run.
Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016
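A minimal sketch of what "do everything via code" can look like for renaming columns and omitting samples. All file names, column names, values, and the reason for exclusion are made up for illustration; the raw file is written to `tempdir()` only so the example is self-contained.

```r
# Stand-in for a raw file; in practice this would already exist on disk.
# All file names, column names, and values here are made up for illustration.
raw_file <- file.path(tempdir(), "2017-11-17_chemistry_raw.csv")
writeLines(
  c("Sample ID,O2 (mg/L)", "1,8.1", "2,7.9", "3,8.4", "4,2.0", "5,8.0"),
  raw_file
)

# Read the raw data without editing the file itself.
chemistry <- read.csv(raw_file, check.names = FALSE)

# Rename columns to machine-friendly names.
names(chemistry) <- c("sample_id", "oxygen_mg_l")

# Omit a bad sample, recording the reason next to the code that does it.
# (Hypothetical reason: sample 4 was flagged in the field notes.)
bad_samples <- c(4)
chemistry_clean <- chemistry[!chemistry$sample_id %in% bad_samples, ]

# Write the result to a new file; the raw file stays untouched.
clean_file <- file.path(tempdir(), "2017-11-17_chemistry_clean.csv")
write.csv(chemistry_clean, clean_file, row.names = FALSE)
```

If a revised raw file arrives, rerunning the script repeats every step exactly, which is the payoff for the extra effort up front.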
title: true
title: true
title: true
- data in merged cells
- data in formatting
- small multiples
- data in formulas
use open file formats (e.g., CSV, not XLSX)
"spreadsheets" by Jenny Bryan https://speakerdeck.com/jennybc/spreadsheets
title: false
consider: we want to remove samples that we feel may have been contaminated...
in an R script
library(dplyr)  # provides %>% and filter()
...some work...
# remove samples 4, 5, and 6, which may
# have been compromised due to
# wading upstream during sampling
chemistry_data %>%
  filter(!sample_id %in% c(4, 5, 6))
...more work...
title: true
a common approach
"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"
title: true
- the problem is that the gaps are particularly prone to errors
- literate programming is essentially the integration of code and text
Knuth, D. E. (1992), Literate programming, CSLI Lecture Notes, Stanford, CA: Center for the Study of Language and Information (CSLI), 1992
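The idea behind integrating code and text can be sketched in plain R: build the sentence from the computed values instead of typing the numbers by hand. This is not full literate programming (tools such as knitr and R Markdown weave code and prose into one document); the measurements and wording below are hypothetical.

```r
# Hypothetical measurements; in a real report these would come from the data.
oxygen_mg_l <- c(8.1, 7.9, 8.4, 8.0)

# Build the sentence from the computed values so text and numbers cannot drift.
summary_sentence <- sprintf(
  "Mean dissolved oxygen was %.2f mg/L across %d samples.",
  mean(oxygen_mg_l), length(oxygen_mg_l)
)
cat(summary_sentence, "\n")
```

If the data change, the sentence changes with them, which closes one of the gaps where copy-and-paste errors creep in.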
title: true
title: true
- Git watches repositories (like a directory) for changes
- It asks that you describe changes when they are made
- It remembers old versions if you need them
- It also keeps an eye out for conflicts, and forces you to resolve them
- It (through GitHub) allows multiple people to contribute to the same repository, and does all of the above for everyone at once
"Git" by Jeff Goldsmith https://speakerdeck.com/jeffgoldsmith/dsi-git-and-github
title: true
Git != GitHub
- Git lives on your computer
- GitHub is a web-based platform for storing repositories and facilitating collaboration
title: true
Kieran Healy on two revolutions in computing:
"On one side, the mobile, cloud-centered, touch-screen, phone-or-tablet model has brought powerful computing to more people than ever before."
On the other side, tools for coding, data analysis, and writing are also revolutionary but mostly work by gluing together separate, specialized widgets that do much less to hide the operating system layer, and require knowledge of things like the file system.
"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"
title: true
Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160
title: true
title: true
title: true
strive for reproducibility from the outset
title: true
describes how data will be collected, managed, and preserved
for example, NSF's generic guidelines:
- roles and responsibilities
- types of data produced
- data and metadata standards
- policies for access and sharing
- policies for reuse, redistribution
- plans for archiving and preservation
title: true
title: false
Research Data Management
Seminar: SOS 598 (24085)
When: Spring 2018
Day/time: Friday, 12:15-1:30 PM
1 credit hour
title: false