<style> .footer { color: #434343; background: #ffffffff; position: fixed; top: 90%; text-align: left; width: 100%; } .header { color: black; background: #E8E8E8; position: fixed; bottom: 90%; text-align:center; width:100%; } .small-code pre code { font-size: 0.9em; } .column-left{ float: left; width: 50%; text-align: left; } .column-right{ float: right; width: 50%; text-align: right; } .center{ left: 50%; text-align: center; } </style>

effective data management

author: date: 2017-11-17 autosize: true font-family: 'Arial' width: 1920 height: 1080

workflow

title: false

research data workflow: this is your world (or it will be soon)


modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/) who adapted it from Roger Peng (biostat.jhsph.edu/~rpeng/)

challenges

title: true

  • how do you keep that up-to-date?
  • what if something changes, what if something needs to be redone - how do you manage that?
  • why do the results in table 1 not seem to correspond to the results in figure 1?
  • why were those particular samples omitted?
  • where did I get these data?
  • how did I make that figure?

basically, if the thought of redoing your analyses is terrifying then you are doing it wrong (paraphrasing Jenny Bryan)

workflow with data

title: false

research data workflow: this is your world (or it will be soon)

now add to that publishing your data and code


modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/) who adapted it from Roger Peng (biostat.jhsph.edu/~rpeng/)

reproducible research

title: false


reproducible: the calculation of quantitative scientific results by independent researchers using the original data and methods (National Science Foundation Subcommittee on Replicability in Science)


not as stringent as replicable: can someone repeat the experiment and get the same result


Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016

what we will cover today

title: true

  • best practices
    • naming
    • organization
    • scripting
  • spreadsheets
  • literate programming
  • version control
  • getting started
  • data management plans & publishing your data

BP (Best Practice): naming - principles for file names

title: true

  • machine readable
  • human readable
  • plays well with default ordering

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - machine readable

title: true

  • regular expression and globbing friendly
    • avoid spaces, punctuation, accented characters, case sensitivity
  • easy to compute on with deliberate use of delimiters

example: "2017-11-17_berneilwash_oxygen_day_1.csv"

underscores allow us to delimit units of metadata and facilitate searching


  • easy to search for files later
  • easy to narrow file lists based on names
  • easy to extract information from the file names, e.g., by splitting
  • avoiding spaces, punctuation, accented characters, and case sensitivity will make your life much easier
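
A minimal Python sketch of what the delimiters buy you, using the example name above. Only the first file name comes from the slide; the other two names and the field labels (`date`, `site`, `measure`, `unit`, `number`) are assumptions made for illustration.

```python
from fnmatch import fnmatch
from pathlib import PurePath

# Hypothetical file listing; only the first name comes from the slide.
file_names = [
    "2017-11-17_berneilwash_oxygen_day_1.csv",
    "2017-11-17_berneilwash_nitrate_day_1.csv",
    "2017-11-18_berneilwash_oxygen_day_2.csv",
]

# Globbing: narrow the list to oxygen files with a simple pattern.
oxygen_files = [name for name in file_names if fnmatch(name, "*_oxygen_*.csv")]

# Splitting: recover the metadata encoded in one name.
stem = PurePath(file_names[0]).stem  # drop the ".csv" extension
date, site, measure, unit, number = stem.split("_")

print(oxygen_files)
print(date, site, measure, unit, number)
```

Neither step needs a lookup table or a manual check: the name itself carries the metadata.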

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - human readable

title: true

  • names contain info about the content
  • easy to figure out what something is based on the name

for example 2016_salmon_counts.csv actually conveys a lot of information about the object, and has a whole lot more meaning than fishData.csv


"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - names matter in times of stress

title: true

which set of file(name)s would you prefer at 3 a.m. before a deadline?


"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: naming - plays well with default ordering

title: true

  • put something numeric first
  • use the ISO 8601 standard for dates (YYYY-MM-DD) everywhere, always, without exception
  • left pad numbers with zeroes as needed
    • 1_file_name.csv
    • 11_file_name.csv
    • 2_file_name.csv

    • 01_file_name.csv
    • 02_file_name.csv
    • 11_file_name.csv
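
A small Python sketch of the padding rule, assuming the three unpadded names from the slide. The helper name `left_pad` and the default width of 2 are illustrative choices, not part of the original material.

```python
# Unpadded names from the slide above.
unpadded = ["1_file_name.csv", "11_file_name.csv", "2_file_name.csv"]

def left_pad(name, width=2):
    """Zero-pad the numeric prefix that precedes the first underscore."""
    prefix, rest = name.split("_", 1)
    return f"{prefix.zfill(width)}_{rest}"

# Default (character-by-character) ordering puts 11 ahead of 2.
print(sorted(unpadded))

# After padding, the default ordering matches the numeric ordering.
padded = sorted(left_pad(name) for name in unpadded)
print(padded)
```

Choose the width from the largest number you expect: two digits cover 99 files, three cover 999.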

"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: dates matter (a lot!)

title: true

comprehensive map of all countries in the world that use MMDDYYYY format

use the ISO 8601 standard for dates (YYYY-MM-DD) everywhere, always, without exception
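
One practical reason, sketched in Python: ISO 8601 strings sort chronologically with an ordinary text sort, while month-first strings do not. The three dates below are made up for illustration.

```python
from datetime import datetime

# Hypothetical sampling dates written month-first (MM/DD/YYYY).
us_style = ["11/17/2017", "01/05/2018", "03/02/2017"]

# Convert each one to ISO 8601 (YYYY-MM-DD).
iso = [datetime.strptime(d, "%m/%d/%Y").date().isoformat() for d in us_style]

# A plain string sort of month-first dates is not chronological...
print(sorted(us_style))

# ...but the same sort applied to ISO 8601 strings is.
print(sorted(iso))
```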


"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files

BP: employ sound project organization

title: true left: 80%

your closest collaborator is you six months ago, but you do not reply to emails (Karl Broman paraphrasing Mark Holder)


  • make the project understandable to others where others includes your future self
  • segregate all the materials for a project in one directory
  • separate raw from processed data; put code in a separate directory
  • include README files
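
One way to make this layout a habit is to script it. The Python sketch below builds a project skeleton in a temporary directory; the directory names (`data_raw`, `data_processed`, `code`, `output`), the project name, and the README text are assumptions for illustration, not a prescribed standard.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; the directory names are illustrative only.
SUBDIRS = ["data_raw", "data_processed", "code", "output"]

def create_project(root):
    """Create one directory per kind of material, plus a README stub."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# project title\n\nwhat is here, and where it came from\n")
    return root

# Build the skeleton in a throwaway directory and list what was made.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "2017_salmon_counts")
    created = sorted(p.name for p in project.iterdir())
    print(created)
```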

Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016

BP: research compendium

title: true

...a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research

general principles:

  • organize according to prevailing conventions (e.g., R package structure)
  • maintain a clear separation of data, method, and output, while unambiguously expressing the relationship between the three
  • specify the computational environment used for the original analysis
  • organize such that another person can know what to expect from the plain meaning of the file and directory names

Marwick B, Boettiger C, Mullen L. (2017) Packaging data analytical work reproducibly using R (and friends) PeerJ Preprints 5:e3192v1 https://doi.org/10.7287/peerj.preprints.3192v1

BP: keep the raw data raw

title: true

  • save the raw data

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

BP: thoughtful curation

title: true

  • curate your data in a way that you would like to receive it

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

BP: open file formats

title: true

  • use open file formats (e.g., csv not xlsx)

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

BP: analysis-friendly data

title: true

  • create analysis-friendly data:
    • each column a variable
    • each row an observation

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

BP: create analysis-friendly data

title: false

  • each column a variable
  • each row an observation

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

BP: create analysis-friendly data

title: false

  • each column a variable
  • each row an observation

| Species | metric | value |
|---|---|---|
| setosa | Sepal.Length | 5.1 |
| setosa | Sepal.Length | 4.9 |
| setosa | Sepal.Length | 4.7 |
| setosa | Sepal.Length | 4.6 |
| setosa | Sepal.Length | 5.0 |
| setosa | Sepal.Length | 5.4 |
| setosa | Sepal.Width | 3.5 |
| setosa | Sepal.Width | 3.0 |
| setosa | Sepal.Width | 3.2 |
| setosa | Sepal.Width | 3.1 |
| setosa | Sepal.Width | 3.6 |
| setosa | Sepal.Width | 3.9 |
| setosa | Petal.Length | 1.4 |
| setosa | Petal.Length | 1.4 |
| setosa | Petal.Length | 1.3 |
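
The wide table on the previous slide and the long table here hold the same measurements. A minimal Python sketch of the wide-to-long step, using the first three rows of the wide table; the function name `to_long` is hypothetical, and in practice a data-frame library would usually do this for you.

```python
# First three rows of the wide table from the previous slide.
wide = [
    {"Sepal.Length": 5.1, "Sepal.Width": 3.5, "Petal.Length": 1.4, "Petal.Width": 0.2, "Species": "setosa"},
    {"Sepal.Length": 4.9, "Sepal.Width": 3.0, "Petal.Length": 1.4, "Petal.Width": 0.2, "Species": "setosa"},
    {"Sepal.Length": 4.7, "Sepal.Width": 3.2, "Petal.Length": 1.3, "Petal.Width": 0.2, "Species": "setosa"},
]

def to_long(rows, id_column):
    """Turn each measurement column into (id, metric, value) rows."""
    metrics = [column for column in rows[0] if column != id_column]
    return [
        {id_column: row[id_column], "metric": metric, "value": row[metric]}
        for metric in metrics  # outer loop: one block of rows per metric
        for row in rows        # inner loop: one row per observation
    ]

long_rows = to_long(wide, id_column="Species")
for row in long_rows[:4]:
    print(row["Species"], row["metric"], row["value"])
```

Which shape counts as analysis-friendly depends on the question: the wide form treats each measurement type as a variable, while the long form treats the measurement type itself as a variable.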

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

BP: record everything

title: true

  • record all the steps of the data process

BP: everything in a script

title: true

The most basic principle for reproducible research is: do everything via code

  • downloading data from the web,
  • converting an Excel file to CSV,
  • renaming columns or variables,
  • omitting bad samples or data points
  • ...do all of these with scripts

You will be tempted to open up a data file and hand-edit. But if you get a revised version of that file, you will need to do it again, and it will be harder to figure out what it was that you did.

Some things are more cumbersome via code but you will save time in the long run.
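
A minimal sketch of two of those steps (renaming columns, omitting bad samples) done in code rather than by hand. It is written in Python with only the standard library; the column names, values, and excluded sample IDs are all made up for illustration, and an in-memory string stands in for a raw file on disk.

```python
import csv
import io

# Stand-in for a raw file on disk; the column names and values are made up.
RAW_CSV = """Sample ID,Oxygen (mg/L)
1,8.2
2,7.9
4,3.1
5,2.8
7,8.0
"""

# Every decision is written down once, in code.
RENAME = {"Sample ID": "sample_id", "Oxygen (mg/L)": "oxygen_mg_l"}
EXCLUDED_SAMPLES = {"4", "5"}  # hypothetical: flagged as compromised in field notes

def clean(raw_text):
    """Rename columns and drop excluded samples; the raw text is never modified."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        renamed = {RENAME[key]: value for key, value in row.items()}
        if renamed["sample_id"] not in EXCLUDED_SAMPLES:
            cleaned.append(renamed)
    return cleaned

cleaned_rows = clean(RAW_CSV)
print(cleaned_rows)
```

If a revised raw file arrives, rerunning the script repeats every step exactly, and the script itself is the record of what was done.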


Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016

BP: learn a language. any language, just do it

title: true

spreadsheets: the dark side

title: true

spreadsheets: the 2nd best tool for everything

title: true

the beauty and the travesty of spreadsheets is that they allow you to do just about anything

  • data in merged cells
  • data in formatting
  • small multiples
  • data in formulas

use open file formats (e.g., CSV, not XLSX)



"spreadsheets" by Jenny Bryan https://speakerdeck.com/jennybc/spreadsheets

spreadsheets versus scripting

title: false

consider: we want to remove samples that we feel may have been contaminated...

in an R script

    library(dplyr)

    # ...some work...

    # remove samples 4, 5, and 6, which may have been
    # compromised by wading upstream during sampling
    chemistry_data %>%
      filter(!sample_id %in% c(4, 5, 6))

    # ...more work...

in a spreadsheet


literate programming

title: true

a common approach


"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"

literate programming

title: true

  • the problem is that the gaps are particularly prone to errors
  • literate programming is essentially the integration of code and text
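
The idea can be shown in miniature with a Python sketch: the sentence that reports a result is generated from the same objects the analysis produced, so the text cannot drift from the numbers. This is only an illustration using an f-string, not a literate programming tool, and the measurements are hypothetical.

```python
# Hypothetical measurements; in a real report these would be read from data.
oxygen_mg_l = [8.2, 7.9, 8.0, 7.7]

n_samples = len(oxygen_mg_l)
mean_oxygen = sum(oxygen_mg_l) / n_samples

# The sentence is built from the analysis objects themselves,
# so there is no copy-and-paste gap in which text and results can diverge.
sentence = (
    f"We analyzed {n_samples} samples; "
    f"mean dissolved oxygen was {mean_oxygen:.2f} mg/L."
)
print(sentence)
```

Change the data and rerun, and the reported values update with it.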

Knuth, D. E. (1992), Literate programming, CSLI Lecture Notes, Stanford, CA: Center for the Study of Language and Information (CSLI), 1992


"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"

version control

title: true

version control: Git

title: true

  • Git watches repositories (like a directory) for changes
  • It asks that you describe changes when they are made
  • It remembers old versions if you need them
  • It also keeps an eye out for conflicts, and forces you to resolve them
  • It (through GitHub) allows multiple people to contribute to the same repository, and does all of the above for everyone at once

version control: Git and GitHub

title: true

Git != GitHub


  • Git lives on your computer
  • GitHub is a web-based platform for storing repositories and facilitating collaboration

I am not saying it is easy

title: true

Kieran Healy on two revolutions in computing:

"On one side, the mobile, cloud-centered, touch-screen, phone-or-tablet model has brought powerful computing to more people than ever before."

On the other side, tools for coding, data analysis, and writing are also revolutionary but mostly work by gluing together separate, specialized widgets that do much less to hide the operating system layer, and require knowledge of things like the file system.


"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"

OHI: nature ecology & evolution

title: true


Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160

OHI: framework

title: true


Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160

OHI: evolution of a workflow

title: true


Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160

BP: forethought

title: true

strive for reproducibility from the outset

data management plan

title: true

describes how data will be collected, managed, and preserved

for example, NSF's generic guidelines:

  • roles and responsibilities
  • types of data produced
  • data and metadata standards
  • policies for access and sharing
  • policies for reuse, redistribution
  • plans for archiving and preservation

publishing your data

title: true

RDM course slide

title: false

Research Data Management

Seminar: SOS 598 (24085)

When: Spring 2018

Day/time: Friday, 12:15-1:30 PM

1 credit hour

bookmarks

title: false