---
title: "R notebook"
output: rmarkdown::github_document
editor_options:
  markdown:
    wrap: 72
---
# Accessing Data from NHANES
## Introduction
In this notebook, we will review the NHANES dataset.
Specifically, we will:
- Review what Rmd is.
- Explain the study design of the NHANES cohort.
- Load accelerometer and demographics data.
- Run Quality Control (QC) exclusions.
- Explore the data.
## About this notebook
This notebook was written by Ben Maylor and Charilaos Zisou for use in
machine learning short courses delivered by the Oxford Wearables Group.
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax
for authoring documents combining text and code in a reproducible way.
More info here <http://rmarkdown.rstudio.com>.
For example, below is some code to print today's date so that we know
when we ran this .Rmd file. To run it, press the green `play` button at
the top right of the `chunk`. The output will appear underneath the
chunk (NOT in the console as you may be used to).
```{r date}
Sys.Date()
```
Tables and figures are also shown inside the .Rmd source:
```{r plot, fig.height=3.5, fig.width=3.5}
x <- 1:20
y <- x^2
plot(x, y)
rm(x, y)
```
We have added a few empty chunks for you to add your own code. If you
want to add another chunk, just type \`\`\`{r} and hit enter.
We will be using .Rmd for our two sessions working with NHANES data.
The next section describes what NHANES is so that we understand the data
better before a) exploring it and b) performing any epidemiological
analyses on it.
## NHANES introduction
The National Health and Nutrition Examination Survey (NHANES) is a
program of studies designed to assess the health and nutritional status
of adults and children in the United States. Since 1999, the survey has
examined a nationally representative sample of \~5,000 people each year
located in counties across the USA.
During these examinations, people provide demographic, socio-economic,
dietary and health-related information via computer-based questions. A
physical examination also produces medical, dental and physiological
measurements, and laboratory tests are conducted for biochemical
measurements. There have been several sub-studies during NHANES where
participants were also asked to wear an accelerometer for 7 days of
free-living. More on that later...
The majority of this data is made available online for public access.
See: <https://wwwn.cdc.gov/nchs/nhanes/default.aspx>
**The next 2 sections are just for your information to provide context
to acquiring the data. You do not need to do anything.**
### Downloading Demographic, Lab and Questionnaire data
Data from each survey year can be downloaded individually through the
website in .XPT format. For example, to download demographic information
such as ethnicity, age, education level and household income, I navigate
to
<https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2011>
and check the `Doc file` to see what is in the codebook for the
`Data File`. When I am satisfied that this is the correct data, I can
download the file directly to my PC. XPT files can be read directly into
R using the `haven` package.
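
As a minimal sketch of what reading an XPT file with `haven` looks like,
the snippet below writes a small toy table to a temporary `.xpt` file and
reads it back with `read_xpt()`. The toy values are made up for
illustration; `SEQN` and `RIDAGEYR` are used here only as example column
names, and for a real download you would pass the path of the file you
saved (for example a hypothetical `data/DEMO_G.XPT`) instead of the
temporary file.

```r
# Sketch only: assumes the haven package is installed.
library(haven)

# Toy table standing in for a downloaded NHANES file (values are invented)
toy <- data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(34, 51, 67))

# Write it out in XPT format, then read it back as you would a real file
xpt_path <- tempfile(fileext = ".xpt")
write_xpt(toy, xpt_path)
demo <- read_xpt(xpt_path)

head(demo)
```

`read_xpt()` returns a tibble, so the result can go straight into the
`dplyr` and `ggplot2` code used later in this notebook.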
### Downloading Accelerometer Data
NHANES has collected accelerometer data in the following survey cycles:
1. 2003-2004 (Hip-worn ActiGraph)
2. 2005-2006 (Hip-worn ActiGraph)
3. ***2011-2012 (Wrist-worn ActiGraph)***
4. ***2013-2014 (Wrist-worn ActiGraph)***
For this workshop we are only interested in the wrist-worn datasets
from 2011-2014 (in bold), which were made accessible \~2 years ago and
have not been investigated as extensively as the earlier hip-worn data.
The methods we have covered during this course for step counts, sleep
and physical activity are also directly applicable to the wrist data.
This data can be accessed through this link:
<https://wwwn.cdc.gov/nchs/nhanes/default.aspx>. For example, if you
navigate to `NHANES 2013-2014` \> Examination Data, you will see
`Physical Activity Monitor - Raw Data 80Hz` available to download. This
data currently comes in a compressed format for each participant, and
contains up to 194 files split by hour of recording.
Due to the time taken to download these and merge them, we have already
downloaded the data, merged the files and run our stepcount package on
them to derive stepping-based metrics which we will now load and use.
## Load NHANES data
### Load required packages
First we will load the libraries that we intend to use in the following
sections:
```{r Packages, warning=FALSE}
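pkgs <- c("dplyr", "ggplot2", "reshape2") # packages we need
# Install only the packages that are not already installed
pkgs_inst <- pkgs[!(pkgs %in% rownames(installed.packages()))]
if (length(pkgs_inst) > 0) {
  install.packages(pkgs_inst, repos = "https://www.stats.bris.ac.uk/R/")
}
# Attach the packages; invisible() hides the list that lapply() returns
invisible(lapply(pkgs, library, character.only = TRUE))
rm(pkgs, pkgs_inst)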
```
Now we load in the 3 separate files that have been generated for you:
```{r Load NHANES data}
data_steps <- read.csv("data/nhanes_stepcount.csv")
data_mortality <- read.csv("data/nhanes_mortality.csv")
data_covariates <- read.csv("data/nhanes_covariates.csv")
```
Then we merge them together so we have one file to work with going
forward:
```{r merge NHANES data}
# Merge on data_steps so we automatically drop any participants who do
# not have any data from stepcount
steps_mortality_data <- merge(data_mortality, data_steps, by = "eid")
# Now add demographic data. No `by` is given here, so merge() joins on
# all column names shared by the two data frames
NHANES_data <- merge(data_covariates, steps_mortality_data)
# Save this file in case we want to re-load it
write.csv(NHANES_data, "data/nhanes_prepped_data.csv", row.names = FALSE)
# Clean up the environment
rm(steps_mortality_data, data_covariates, data_mortality, data_steps)
```
Now we can just work with `NHANES_data` going forward. A data
dictionary in this repo, `nhanes_data_dictionary.xlsx`, describes what
each variable in the dataset is.
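
As an aside on the merge step above, the toy example below sketches why
`merge()` drops participants who appear in only one table. The two small
data frames and their values are invented for illustration; only the
`eid` key matches the real data.

```r
# Toy tables: eid 1 has no mortality record, eid 4 has no step data
steps <- data.frame(eid = c(1, 2, 3), steps = c(8000, 5000, 12000))
mortality <- data.frame(eid = c(2, 3, 4), died = c(0, 1, 0))

# Default merge keeps only eids present in BOTH tables (an inner join)
inner <- merge(mortality, steps, by = "eid")

# all.x = TRUE keeps every row of the first table, filling gaps with NA
left <- merge(mortality, steps, by = "eid", all.x = TRUE)

nrow(inner)
nrow(left)
```

The default behaviour is what the notebook relies on: anyone without
stepcount output never reaches `NHANES_data`.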
## Run quality-related exclusions on the data
Quality-related exclusions are really important in our field, as we will
often have participants with poor data quality, for example because of
poor wear compliance, device failure or sensor error, or premature
battery depletion.
Therefore, before we explore the data or run any epidemiological
analyses, we will first clean our dataset by excluding the following,
using checks that are common within our group at Oxford:
- Those with poor calibration (`CalibOK` variable)
- Unrealistically high acceleration values over 24h (ENMO of 100 mg or
  more)
- Insufficient wear time (\<3 days, or data that do not cover the full
  24h cycle, i.e. `Covers24hOK` is `FALSE`)
The review by Pulsford (2023), listed under Further reading at the
bottom of this notebook, discusses cleaning approaches such as these and
others in more detail.
```{r QC}
# N files prior to cleaning
nrow(NHANES_data)
# Data quality-related cleaning: each filter() keeps the rows that pass
NHANES_data <- NHANES_data %>%
  filter(CalibOK == 1) %>% # Keep well-calibrated devices
  filter(ENMO.mg. < 100) %>% # Drop unrealistically high ENMO
  filter(WearTime.days. >= 3) %>% # Keep at least 3 days of wear
  filter(Covers24hOK == TRUE) %>% # Keep data covering the full 24h cycle
  filter(!is.na(StepsDayMedAdjusted)) # Drop rows where median daily steps is NA
# N files after cleaning
nrow(NHANES_data)
# Save the cleaned file (this overwrites the merged file saved earlier).
# We will be using this
write.csv(NHANES_data, "data/nhanes_prepped_data.csv", row.names = FALSE)
```
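
It can also be useful to record how many participants remain after each
exclusion, not just the totals before and after. The sketch below does
this in base R on an invented six-row table; the column names match the
QC chunk above, but the values and the `checks` list are assumptions
made for illustration.

```r
# Invented data: one row per participant
toy <- data.frame(
  eid = 1:6,
  CalibOK = c(1, 1, 0, 1, 1, 1),
  ENMO.mg. = c(25, 30, 28, 150, 22, 35),
  WearTime.days. = c(6.5, 2.0, 7.0, 6.0, 5.5, 7.0),
  Covers24hOK = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# One logical vector per check: TRUE means the row passes
checks <- list(
  calibration = toy$CalibOK == 1,
  enmo = toy$ENMO.mg. < 100,
  wear_time = toy$WearTime.days. >= 3,
  covers_24h = toy$Covers24hOK
)

# Apply the checks in order and store the number of rows left each time
keep <- rep(TRUE, nrow(toy))
remaining <- integer(0)
for (name in names(checks)) {
  keep <- keep & checks[[name]]
  remaining[name] <- sum(keep)
}

remaining
toy_clean <- toy[keep, ]
```

Counts like these are what typically go into a participant flow diagram.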
**Exercise 1:**
- Are there any additional variables that you use to clean data in
  your group? Discuss this with those around you.
- Based on the numbers above, what was the compliance rate for monitor
  wear for our analysis? Hint: this is usually expressed as a
  percentage.
```{r Compliance calculation}
# your code here #
```
## NHANES data exploration
Now we can explore the data for the remainder of this session so we
become more familiar with it.
NHANES over-samples minority ethnicities and older adults, amongst other
sub-samples, so we look at that below along with other demographics.
Bear in mind that we have removed over 400 participants for poor
accelerometer data, so we should keep an eye out for potential
disparities created by this. (We could also generate an
`NHANES_excluded` data frame to explore, but we won't do so here for
time reasons.)
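
For reference, a data frame of excluded participants could be built
along the following lines. This is a sketch on invented data: `before`
and `after` are hypothetical stand-ins for the dataset before and after
QC (in this notebook `NHANES_data` is overwritten during QC, so a copy
would need to be kept first).

```r
# Invented data standing in for the pre-QC and post-QC datasets
before <- data.frame(eid = 1:5, age = c(34, 80, 51, 80, 67))
after <- before[before$eid %in% c(1, 3, 5), ]

# Excluded participants are those in `before` whose eid is not in `after`
excluded <- before[!(before$eid %in% after$eid), ]

# Compare a demographic between retained and excluded participants
mean(after$age)
mean(excluded$age)
```

With `dplyr` loaded, `anti_join(before, after, by = "eid")` gives the
same rows.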
We've used a few different ways of summarising and displaying the data,
but feel free to make your own based on how you would do this on your
own data.
```{r Inspect data, warning=F}
# Table summary for ethnicity and sex splits
table(NHANES_data$ethnicity, NHANES_data$sex)
# Histogram for bmi and age
ggplot(NHANES_data) +
  geom_histogram(aes(x = bmi), binwidth = 2.5) +
  theme_bw()
ggplot(NHANES_data) +
  geom_histogram(aes(x = age), binwidth = 5, fill = "grey80", colour = "black") +
  theme_bw()
# Note the large sample in the oldest age bin in the histogram, because
# NHANES over-represents this demographic compared with the population
# (NHANES also top-codes age at 80 years, which may add to this bin)
# Or we can label the categories (using crude categories for white Europeans)
NHANES_data %>%
  mutate(
    bmi_cat = case_when(
      bmi < 18.5 ~ "underweight",
      bmi >= 18.5 & bmi < 25 ~ "healthy",
      bmi >= 25 & bmi < 30 ~ "overweight",
      bmi >= 30 & bmi < 35 ~ "obese_1",
      bmi >= 35 & bmi < 40 ~ "obese_2",
      bmi >= 40 ~ "obese_3"
    ),
    # Order the categories or else R will show them in alphabetical order
    bmi_cat = factor(bmi_cat, levels = c(
      "underweight", "healthy", "overweight",
      "obese_1", "obese_2", "obese_3"
    ))
  ) %>%
  ggplot() +
  geom_bar(aes(x = bmi_cat)) +
  theme_bw()
```
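
The BMI banding in the chunk above could also be written with base R's
`cut()`, which bins the values and orders the factor levels in one step.
This is an alternative sketch on invented BMI values, not the approach
used in the notebook; `right = FALSE` makes each band include its lower
bound, matching the `>=` and `<` comparisons above.

```r
# Invented BMI values, including some that sit exactly on a boundary
bmi <- c(17, 18.5, 24.9, 25, 32, 37, 40)

bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
  labels = c(
    "underweight", "healthy", "overweight",
    "obese_1", "obese_2", "obese_3"
  ),
  right = FALSE # intervals are [lower, upper)
)

table(bmi_cat)
```

Because the labels are given in band order, the resulting factor is
already ordered for plotting.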
### Exercise 2: Summarise deaths
- How many deaths were there in our cleaned dataset?
- How many deaths for each labelled cause? `dth_cvd_f`, `dth_can_f`,
  `dth_oth_f`
```{r Death summaries}
# your code here #
```
We could also look at the peak cadences, which are calculated as the
mean cadence for the highest N minutes of stepping windows (N = 1 and
30 in the column names used below). We also include a 95th percentile
value for cadences to mirror work done by Prof Laurent Servais in
children with Duchenne muscular dystrophy (DMD):
**Servais et al. 2024:**
<https://www.nature.com/articles/s41598-024-80177-9>
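
To make the idea of a peak cadence concrete, here is a minimal sketch on
an invented vector of minute-level cadences. The `peak_cadence()` helper
is hypothetical and written only for this illustration; the exact
windows and adjustments used by stepcount are not shown in this notebook
and may differ.

```r
# Invented minute-level cadence values (steps/min) for one participant
cadence <- c(0, 20, 60, 100, 120, 110, 90, 0, 0, 30)

# Hypothetical helper: mean cadence over the n highest-cadence minutes
peak_cadence <- function(x, n) {
  mean(sort(x, decreasing = TRUE)[seq_len(n)])
}

peak1 <- peak_cadence(cadence, 1) # single highest minute
peak3 <- peak_cadence(cadence, 3) # mean of the three highest minutes
p95 <- unname(quantile(cadence, 0.95)) # 95th percentile of all minutes

c(peak1 = peak1, peak3 = peak3, p95 = p95)
```

A longer window gives a lower value than a shorter one, which is why
the 1-minute and 30-minute peaks are reported separately.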
Let's plot the three cadence metrics produced by stepcount together so
they are easier to compare side by side.
```{r Inspect cadence data, warning=F}
NHANES_cad <- NHANES_data %>%
  select(
    CadencePeak1Adjusted.steps.min.,
    CadencePeak30Adjusted.steps.min.,
    Cadence95thAdjusted.steps.min.
  ) %>%
  melt()
ggplot(NHANES_cad, aes(x = value)) +
  geom_histogram(binwidth = 5) +
  scale_x_continuous(expand = c(0, 0), breaks = seq(0, 180, 10)) +
  facet_wrap(~variable, ncol = 1, scales = "fixed") +
  theme_bw() +
  labs(x = "Cadence (steps/min)", y = "Frequency")
rm(NHANES_cad)
```
The above should now make a bit more sense if you choose to explore the
values in greater detail.
### Exercise 3: Self-exploration of the data
Now we want you to have a go at exploring the stepcount-derived metrics
yourself to see how active the group was. For this we will be looking at
the adjusted values. Refer to the data dictionary to make sure you
understand what each variable is.
Some good starting variables to explore:
- `WearTime.days.`
- `ENMOAdjusted.mg.`
- `StepsDayAvgAdjusted`
- `StepsDayMedAdjusted`
```{r Inspect other variables}
# Your code here #
```
### Exercise 4: Discussion points
Having had a look at the data, discuss with the group or the person next
to you:
1. How active is the dataset as a whole?
2. Are there any differences in stepping metrics between weekend and
   weekday steps?\
   *Hint: you can mostly copy the code from the previous chunk to
   view this.*
3. How active is this sample in the US compared with the UK? We used
   stepcount in UK Biobank here: **Small et al. 2024**
   <https://journals.lww.com/acsm-msse/fulltext/2024/10000/self_supervised_machine_learning_to_characterize.9.aspx>
4. Can you explain any differences in behaviour between the 2 groups?
5. Are there any variables in the accelerometer data summaries (we will
   go through demographics tomorrow) that you feel may need considering
   in tomorrow's epidemiology analysis?
```{r Additional exploration}
# Your code here #
```
## Further reading
**How calibration and non-wear time are calculated by stepcount:**
<https://biobankaccanalysis.readthedocs.io/en/latest/methods.html>
**Accelerometer data quality-related cleaning:**
- Doherty, 2017:
  <https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0169649>
- Pulsford, 2023:
  <https://ijbnpa.biomedcentral.com/articles/10.1186/s12966-022-01388-9>
**NHANES data documentation:**\
<https://wwwn.cdc.gov/Nchs/Nhanes/2011-2012/PAXMIN_G.htm>
**An example of NHANES PA and mortality analysis (using earlier hip-worn
accelerometer data):**
- Fishman, 2016:
  <https://journals.lww.com/acsm-msse/fulltext/2016/07000/association_between_objectively_measured_physical.11.aspx>
End of Notebook