(Notes & Code Samples in the handouts)
http://github.com/sburns/advanced-redcap-interfaces
http://bit.ly/advanced-redcap
I'm an open book, this talk can be found on my github.
A rendered version can be found at the bit.ly URL.
Ratio of coders to non-coders?
The handouts have my talking points. Where applicable, they also contain code samples. I hope you find this useful.
I have no affiliation with REDCap other than that I submit a lot of bug reports, though fewer recently, which is a good thing.
We study reading disabilities in children using behavior and imaging measures.
- Lab members touched every piece of data.
- Issues and effort to join data across paradigms.
- Stored data in spreadsheets.
- Always behind in data analysis.
- No traceable analyses.
- Copy/Paste from Excel to REDCap.
- Moved as much as possible to REDCap projects.
- We analyze some data within (milli-) seconds of capture.
- Automate everything.
- Automate the automation.
- Start analyses from a single source.
If it's not in REDCap, it doesn't exist.
Store absolutely everything in REDCap. Capture experimental tasks in there.
Automate the automation. People run advanced analyses and they don't even know it.
Humans touch data as little as possible, especially after collection.
I want to talk about them because I think they're both important and extremely powerful.
These workflows can span the entire range of research and are limitless.
We have the tools to put our machines to better use, it's time to use them to improve our science/research.
If your head isn't swimming with ideas about how to improve your research, I haven't done my job.
Machines should perform all definable data analyses because they:
- Perform reproducible work.
- Are extremely cheap.
- Operate deterministically.
As opposed to humans who:
- Cannot be relied upon to reproduce their own work.
- Are extremely expensive.
- Some RAs might even be considered pseudorandom processes.
The keyword here is definable...machines can only follow a set of rules.
If you can't express a rule for the analysis, a human must do it.
Otherwise, a machine should do it.
Just to get everyone on the same page
- Not relational, can only track one "entity" per project
- Each Project is only a single table
REDCap does not support the traditional database features found in RDBMSs like Oracle, MySQL, etc.
Namely, it is not relational. A REDCap project can only track a single "entity".
Other databases let you add fancy relationships (one-to-one, one-to-many) to your models.
This decision removes a lot of complexity, and I applaud them for it.
- No administration
- Easy schema definition
- Client & Server architecture
- GUI is browser-based
- Advanced web features (why we're here today)
Web application is a front-end to a large server backend.
No installation required on the user's end.
IMO, REDCap should be judged not as a less capable database but as a super spreadsheet that insulates users from many of the pitfalls of file-based tabular databases.
We're here today to talk about these advanced web features
REDCap | SQL/Excel/etc |
---|---|
Project | Table |
Data Dictionary | Schema |
Record | Row |
Field | Column |
Form or Instrument | Set of columns |
Unique Identifier | Primary Key, Index, etc |
Generally speaking, we're working with these components.
What most of us think of REDCap is really the front-end website. It presents the GUI of REDCap.
It facilitates all data saves/download requests/etc to the REDCap back-end server application.
There's also the API which can be thought of more like a CLI. It provides programmatic means to access the server application.
I'll assume that a research group is composed of lab members (the humans) and one or more general-purpose (hopefully secure) machines to do the lab's bidding.
Programmatic access to push to/pull from REDCap
Notifications across the internet whenever data is saved
These features are the building blocks of extremely advanced data management techniques and they're ready to use right now.
These two features make REDCap the foundation for advanced data management workflows
Investing in these workflows will improve your work. The technical details of how these two features work are less important than the workflows they enable, which I'll talk about towards the end.
A way for software programs to ask for and push data to REDCap projects.
Instead of downloading/uploading data through the GUI, we can let scripts and applications do it for us.
https://redcap.vanderbilt.edu/api/
PyCap facilitates using the REDCap API within python applications.
Documentation @ http://sburns.github.io/PyCap
HTTP-based API for communicating with Projects
Aside: when you go to google.com, your browser submits a GET request. When you press the "search" button, your browser submits a POST request, a different way to submit information across the web.
All languages worth using have an http library.
Input/output formats: csv, xml, json
You'll find code snippets in the handouts, bash on top and python below.
To install PyCap: $ pip install PyCap
API Documentation from REDCap https://redcap.vanderbilt.edu/api/help/
A table about your table. Useful to determine whether fields exist before imports/etc.
Equivalent to downloading the data dictionary through the web application.
!bash
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-d token=XXX \
-d content=metadata \
-d format=json
!python
from redcap import Project
project = Project('https://redcap.vanderbilt.edu/api/', TOKEN)
md = project.export_metadata()
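For example, a minimal sketch (the wj3_score field is hypothetical) that checks an import payload against the metadata first:

!python
md = project.export_metadata()
existing = {field['field_name'] for field in md}
# Hypothetical payload -- verify every key is a real field before importing
new_data = {'participant_id': '1', 'wj3_score': '42'}
assert all(key in existing for key in new_data)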
Download the entire database
Can also request certain fields or records
Honors access rules (keeps PHI safe)
!bash
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-d token=XXX \
-d content=record \
-d format=csv
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-d token=XXX \
-d content=record \
-d format=csv \
-d records=1,2,3 \
-d fields=age,sex,gender
!python
data = project.export_records()
# data is a list of dictionary objects with one dictionary per record
csv = project.export_records(format='csv') # or 'xml'
df = project.export_records(format='df') # Pandas DataFrame
sliced = project.export_records(records=['1', '2', '3'],
fields=['age', 'sex', 'gender'])
Getting a DataFrame is really helpful because pandas automatically provides type coercion in the DF construction (a text field column that stores numbers -> floats)
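For instance, reusing the age field from the example above:

!python
df = project.export_records(format='df')
print(df.dtypes)         # a text field storing only numbers shows up as float64
print(df['age'].mean())  # so numeric operations work without manual casting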
Update fields for existing records
Add new records on the fly
!bash
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-d token=XXX \
-d content=record \
-d format=csv \
-d data=$'participant_id,new_data\n1,hooray'
!python
project.import_records([{'participant_id': '1', 'new_data': 'hooray'}])
# Or upload many records at once
from my_module import modify_records
data = project.export_records()
modified = modify_records(data)
response = project.import_records(modified)
assert response['count'] == len(modified) # Just to make sure
Download
!bash
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-d token=XXX \
-d returnFormat=json \
-d content=file \
-d action=export \
-d record=1 \
-d field=file > exported_file.txt
!python
content, headers = project.export_file(record='1', field='file')
# write out a new file using the filename as stored in REDCap
with open(headers['name'], 'wb') as f:  # binary mode handles any file type
    f.write(content)
Importing
!bash
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-F token=XXX \
-F returnFormat=json \
-F content=file \
-F action=import \
-F record=1 \
-F field=file \
-F file=@localfile.txt
!python
local_fname = 'to_upload.pdf'
with open(local_fname, 'rb') as fobj:
    response = project.import_file(record='1', field='file',
                                   fname=local_fname, fobj=fobj)
# Whatever is passed to the fname argument will appear in the REDCap UI
Deleting
!bash
$ curl -X POST https://redcap.vanderbilt.edu/api/ \
-d token=XXX \
-d returnFormat=json \
-d content=file \
-d action=delete \
-d record=1 \
-d field=file
!python
project.delete_file('1', 'file')
## Advanced & Automated Field Calculations
(not an exhaustive list!)
Your PI just came to you and said she wants to calculate a field based on the values of many others. There's probably some advanced logic/lookup tables required.
- Download the data as spreadsheet/SPSS/etc.
- Implement the field calculation and execute.
- Re-upload data.
Why this is bad
- Must re-do the process for each new record.
- Can you trust whoever does this to always do it perfectly? Or are we introducing bias?
- Delete the spreadsheet and a piece of the methods section potentially goes to the trash.
While REDCap does provide a calculated field feature, its implementation has issues:
- Implemented in JavaScript, so the calculation only executes when the field is viewed in the web app.
- No access to third-party code (statistics/conversion tables/etc)
- Must re-implement across Projects. Somewhat mitigated by using the same data dictionary.
What if you have hundreds/thousands of records? Some poor RA has to view and save each record.
- Write the (testable!) implementation once
- Easily use against many projects
- Fast, Error-Free IO of the calculated fields
- Upfront cost amortized across all automated calculations
Software testing in academia is for another talk, but do you really want to publish using untested methodologies?
Using the API, we can write and test advanced field calculations once and apply them across projects easily.
While the calculation itself is probably no faster than when a human does it in Excel/SPSS, we save time and energy and reduce mistakes in the download/upload process.
And we're free to implement whatever logic necessary.
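As a minimal sketch (the field names and scoring rule here are hypothetical), the whole download-calculate-upload loop is only a few lines:

!python
from redcap import Project

project = Project('https://redcap.vanderbilt.edu/api/', TOKEN)

def composite_score(record):
    # Any logic you like: lookup tables, third-party stats libraries, etc.
    return int(record['raw_a']) + int(record['raw_b'])

records = project.export_records(fields=['participant_id', 'raw_a', 'raw_b'])
for record in records:
    if record['raw_a'] and record['raw_b']:  # skip records missing raw data
        record['composite'] = str(composite_score(record))
response = project.import_records(records)
assert response['count'] == len(records)

Write composite_score once, put it under test, and point it at as many projects as you like.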
## Hooks to external databases
## Reproducible group/cohort determination
## Automated database cleanup
- A major thrust of our work is neuroimaging in children.
- We use REDCap as an interface for our image processing backend that runs at ACCRE.
- Imaging Data is record-aligned with behavioral data we store in REDCap.
This is difficult to explain/talk about generally, but REDCap did a lot of hard work in making a nice UI for humans to input generic data.
The API presents a stationary target that other systems can be written against to grab information and use it to perform "business logic".
Reduces Friction between researchers and their data.
Use the same analyses for pilot and production data.
Theoretically, your research software that uses the REDCap API is documented, tested and should help write the Methods sections.
REDCap file fields!
Some experiments necessarily produce intermediate files. We want to analyze these files and put results in REDCap.
We can use file fields to simplify the connection to our analysis infrastructure.
A general approach (sketched in code below)
- Lab member runs test, uploads intermediate file to specific field
- Automated program exports file to local filesystem
- Automated program analyses and uploads results to REDCap
No need to share a filesystem
- Disconnects an insecure environment (the lab member's) from an environment with potential PHI (where the analysis runs).
- The automated program can also organize files better/faster/cheaper than any human.
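A minimal sketch of that approach (analyze_task, task_file, and task_complete are hypothetical names):

!python
from my_module import analyze_task  # hypothetical: parses the file, returns a dict of scores

records = project.export_records(fields=['participant_id', 'task_complete'])
for record in records:
    if record['task_complete'] != '2':  # '2' means Complete in REDCap
        continue
    content, headers = project.export_file(record=record['participant_id'],
                                           field='task_file')
    results = analyze_task(content)
    results['participant_id'] = record['participant_id']
    project.import_records([results])  # a real pipeline would also mark the record as analyzed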
We use E-Prime, an application for making and running psychological experiments.
It spits out files with subject input, etc.
These files are raw data that need to be analyzed, and the results should go to REDCap. We then want to store the file for "safe keeping".
What about the other direction?
Given a record in the database:
- An automated program can produce a novel file based on the record
- Upload it to REDCap (safely move PHI)
- Alert lab members of the file creation
Done manually, these reports take a long time to do well (getting pronouns right, actual scores, etc.). At 2-3 hours per report on average, that's time a human could spend doing other things.
Using the data export and file upload methods, we can create these reports automatically and store the resulting file in REDCap so other lab members have access to them.
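A minimal sketch (make_report and the report_file field are hypothetical):

!python
from my_module import make_report  # hypothetical: renders a PDF for one record

record = project.export_records(records=['1'])[0]
report_path = make_report(record)  # fills in pronouns, scores, etc.
with open(report_path, 'rb') as fobj:
    project.import_file(record='1', field='report_file',
                        fname=report_path, fobj=fobj)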
I've only scratched the surface here. The possibilities of workflows enabled by the API are essentially limitless.
The API provides all of the low level pieces.
Poll? (yuck)
- Register a single URL to your Project
- ANY Form save --> HTTP POST request to URL.
- Deploy arbitrary data workflows on a webserver
- Workflows execute in real time as data is saved.
- Otherwise normal users can execute very advanced processing.
- "It just happens"
Fake the incoming payload, blast off many analyses.
Fields in the incoming payload:

project_id
: Project identifier

instrument
: What form was saved?

record
: Which record?

redcap_event_name
: For which event (longitudinal projects only)

redcap_data_access_group
: What "kind" of user saved it?

[instrument]_complete
: What is the status of the form?
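A minimal sketch of a receiving endpoint, assuming the Flask microframework (the form name and workflow function are hypothetical):

!python
from flask import Flask, request
from my_module import kick_off_workflow  # hypothetical workflow entry point

app = Flask(__name__)

@app.route('/det', methods=['POST'])
def handle_det():
    # REDCap POSTs the fields listed above on every form save
    if (request.form['instrument'] == 'demographics' and
            request.form.get('demographics_complete') == '2'):
        kick_off_workflow(request.form['record'])
    return 'OK'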
- ...can set up/maintain/secure a webserver.
- ...has the resources to write the web application.
Middleware is required to route incoming requests to the correct workflow.

switchboard:

- Parses incoming POST requests from REDCap (or whomever).
- Executes only the functions whose workflows match the request.

In production for our lab, but rough around the edges (github.com/sburns/switchboard, want to help?)
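Conceptually (this is an illustration, not switchboard's actual API), the routing boils down to matching each payload against registered workflows:

!python
from my_module import score_task, upload_results  # hypothetical workflows

# Map (project_id, instrument) to the functions that should run on save
WORKFLOWS = {
    ('42', 'eprime'): [score_task, upload_results],
}

def dispatch(payload):
    key = (payload['project_id'], payload['instrument'])
    for workflow in WORKFLOWS.get(key, []):
        workflow(payload['record'])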
A shared "switchboard" webserver:
- Just one webserver to maintain & protect.
- Shared infrastructure is good.
- Removes excuses for groups not to use these features.
- Optimize pieces and all groups benefit.
Implementing software against the API is traceable, testable, and applicable across many projects.
Remove all barriers to running analyses
This talk can be found at:
Please email me with questions & open issues on my code.