---
title: "2. Basic Concepts"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
<iframe width="560" height="315" src="https://www.youtube.com/embed/gih6met-LSY?si=8D45xEsBKLwMulmc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
# Introduction
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = TRUE
)
library(tidyverse)
library(NHANES)
```
```{r pop2Samp2Pop, out.width='80%',fig.asp=.8, fig.align='center',echo=FALSE}
if ("pi" %in% ls()) rm("pi")
# kopvoeter (Dutch for "head-and-feet") draws a simple stick figure at (x, y)
kopvoeter <- function(x, y, angle = 0, l = .2, cex.dot = .5, pch = 19, col = "black") {
angle <- angle / 180 * pi
points(x, y, cex = cex.dot, pch = pch, col = col)
lines(c(x, x + l * cos(-pi / 2 + angle)), c(y, y + l * sin(-pi / 2 + angle)), col = col)
lines(c(x + l / 2 * cos(-pi / 2 + angle), x + l / 2 * cos(-pi / 2 + angle) + l / 4 * cos(angle)), c(y + l / 2 * sin(-pi / 2 + angle), y + l / 2 * sin(-pi / 2 + angle) + l / 4 * sin(angle)), col = col)
lines(c(x + l / 2 * cos(-pi / 2 + angle), x + l / 2 * cos(-pi / 2 + angle) + l / 4 * cos(pi + angle)), c(y + l / 2 * sin(-pi / 2 + angle), y + l / 2 * sin(-pi / 2 + angle) + l / 4 * sin(pi + angle)), col = col)
lines(c(x + l * cos(-pi / 2 + angle), x + l * cos(-pi / 2 + angle) + l / 2 * cos(-pi / 2 + pi / 4 + angle)), c(y + l * sin(-pi / 2 + angle), y + l * sin(-pi / 2 + angle) + l / 2 * sin(-pi / 2 + pi / 4 + angle)), col = col)
lines(c(x + l * cos(-pi / 2 + angle), x + l * cos(-pi / 2 + angle) + l / 2 * cos(-pi / 2 - pi / 4 + angle)), c(y + l * sin(-pi / 2 + angle), y + l * sin(-pi / 2 + angle) + l / 2 * sin(-pi / 2 - pi / 4 + angle)), col = col)
}
par(mar = c(0, 0, 0, 0), mai = c(0, 0, 0, 0))
plot(0, 0, xlab = "", ylab = "", xlim = c(0, 10), ylim = c(0, 10), col = 0, xaxt = "none", yaxt = "none", axes = FALSE)
rect(0, 6, 10, 10, border = "red", lwd = 2)
text(.5, 8, "population", srt = 90, col = "red", cex = 2)
symbols(3, 8, circles = 1.5, col = "red", add = TRUE, fg = "red", inches = FALSE, lwd = 2)
set.seed(330)
grid <- seq(0, 1.3, .01)
for (i in 1:50)
{
angle1 <- runif(n = 1, min = 0, max = 360)
angle2 <- runif(n = 1, min = 0, max = 360)
radius <- sample(grid, prob = grid^2 * pi / sum(grid^2 * pi), size = 1)
kopvoeter(3 + radius * cos(angle1 / 180 * pi), 8 + radius * sin(angle1 / 180 * pi), angle = angle2)
}
text(7.5, 8, "Cholesterol in population", col = "red", cex = 1.2)
rect(0, 0, 10, 4, border = "blue", lwd = 2)
text(.5, 2, "sample", srt = 90, col = "blue", cex = 2)
symbols(3, 2, circles = 1.5, col = "red", add = TRUE, fg = "blue", inches = FALSE, lwd = 2)
for (i in 0:2) {
for (j in 0:4)
{
kopvoeter(2.1 + j * (3.9 - 2.1) / 4, 1.1 + i)
}
}
text(7.5, 2, "Cholesterol in sample", col = "blue", cex = 1.2)
arrows(3, 5.9, 3, 4.1, col = "black", lwd = 3)
arrows(7, 4.1, 7, 5.9, col = "black", lwd = 3)
text(1.5, 5, "EXP. DESIGN (1)", col = "black", cex = 1.2)
text(8.5, 5, "ESTIMATION &\nINFERENCE (3)", col = "black", cex = 1.2)
text(7.5, .5, "DATA EXPLORATION &\nDESCRIPTIVE STATISTICS (2)", col = "black", cex = 1.2)
```
## Experimental Design (1)
- The researcher determines the **population** to which they want to generalize their conclusions.
- Financial and logistic limitations $\rightarrow$ **representative sample** from population
## Data analysis (2 & 3)
- **Data Exploration and Descriptive Statistics (2)**: explore, visualize, summarize, gain insight, check assumptions
- **Statistical Inference (3)**: generalize what we observe in the sample towards the population so that we can draw general conclusions on the biological process we study. We need statistical models to analyze the data and to quantify and report on variability and uncertainty.
---
# Example
- National Health and Nutrition Examination Survey (NHANES)
- An American demographic study
- Large number of physical, demographic, nutritional, lifestyle and health characteristics
```{r nhanes, tidy=FALSE,echo=FALSE}
library(NHANES)
knitr::kable(NHANES[c(1, 4, 5, 6, 7, 8), c(1, 3, 20, 23, 34, 72)],
  format = "markdown"
)
```
---
# Variables
- We measure *variables* on subjects in the sample
- A variable is a characteristic, e.g. direct cholesterol, age, gender, ...
- It varies from subject to subject in the population, and thus also within the sample as well as from sample to sample.
## *Types* of variables
1. *Qualitative variables*: a limited number of outcome categories, non-numeric.
- *nominal variables*: no natural ordering, e.g. gender, blood group, eye color, ...
- *ordinal variables*: ordering, e.g. BMI class, smoking status (1: never smoked, 2: stopped smoking, 3: smoker)
2. *Numeric variables*:
- *discrete variables*: counts, e.g. number of partners in a life span, ...
- *continuous variables*: can (in theory) take each possible value between certain limits e.g. Age, Weight, BMI, fluorescence measurement in ELISA assay
- Often dichotomised to turn them into nominal qualitative variables $\rightarrow$ information loss (see the sketch below)
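As a small sketch of that information loss, using the `BMI` column of the NHANES data introduced above (the class labels here are hypothetical choices for illustration): subjects with a BMI of 25.1 and of 40 end up in the same class.
```{r}
# Dichotomising BMI into a nominal variable: the exact values are lost
NHANES %>%
  mutate(bmiClass = ifelse(BMI < 25, "normal or below", "overweight")) %>%
  select(BMI, bmiClass) %>%
  head()
```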
---
# Population
- Aim of scientific study: make general statements on a process at the level of the population.
- E.g. assess if the cholesterol level is on average different between males and females who are older than 25.
$\rightarrow$ assess cholesterol level in population above 25.
- Population in statistics is a theoretical concept
- It is in continuous evolution/change
- Often there is interest in generalizing conclusions to future subjects $\rightarrow$ so the population cannot be entirely observed at present.
- Can typically be considered to be infinite.
- Population has to be clearly defined!
*Inclusion criteria* are characteristics a subject/experimental unit must have to belong to the population, e.g.
- age above 25
- normal BMI
- ...
*Exclusion criteria* are characteristics that the subject/experimental unit is not allowed to have in order to belong to the population, e.g.
- pregnancy in study on new type of drug
- diabetes, history of hard drugs, low health status when the aim is to delineate a range of normal values of blood pressure in a population of healthy individuals
- ...
---
# Random Variables
- Variables (e.g. direct cholesterol) vary in the population from subject to subject!
- Variables are thus *random* because their value changes in the population.
- **Crucial question**: How precise are the conclusions about the population when they are based on a group of subjects in a sample?
- We will thus observe differences from sample to sample.
- Variability of the data plays a crucial role!
---
## Convention
- We use capital letters for a study characteristic (e.g. direct cholesterol) to indicate that it varies in the population, without referring to the observed value for a particular subject.
- Variable $X$ is a *random variable* and is the result of *random sampling* of the characteristic from the population.
- Random variable $X$ is thus the unknown variable that represents a measurement that we plan to collect on a random subject, but that we have not collected yet.
- Typically a sequence of random variables $X_1, \ldots, X_n$ will be collected in the study (with $n$ subjects or experimental units).
- The concept of random variables is necessary to reason on how the results and conclusions change from sample to sample.
- Random variables can be qualitative, quantitative, discrete, continuous, ...
---
# Describing the population
- It is impossible to predict the value of a random variable.
- The realised value of $X$ is subject to random variability.
- Suppose that we are interested in the IQ of a subject. If we know how the data are distributed, we can use probability theory to calculate the probability that the IQ of a random subject of the population will be above 110.
---
## Intermezzo probability theory
### Discrete random variables
- Suppose that we measure a discrete random variable $X$
- All possible values for the random variable $X$ are called the sample space $\Omega$.
- For gender the sample space is $\Omega=(0,1)$ with 0 (male) or 1 (female).
- Suppose that we roll a die; then the sample space is $\Omega=(1,2,3,4,5,6)$.
- An event $A$ is a subset of the sample space
- Getting an even number when rolling a die: $A=(2,4,6)$.
- An event can also consist of a single outcome, e.g. $A=(1)$.
- Event space $\mathcal{A}$ is the class of all possible events associated with a given experiment.
- Two events ($A_1$ and $A_2$) are mutually exclusive if they cannot occur together
- e.g. the event of the odd numbers $A_1=(1,3,5)$ and the event of getting a six $A_2=(6)$
- so $A_1 \bigcap A_2=\emptyset$.
- Probability $P(A)$ is a function $P: \mathcal{A} \rightarrow [0,1]$ which satisfies
1. $P(A) \geq 0$ and $P(A) \leq 1$ for each $A \in \mathcal{A}$
2. $P(\Omega)=1$
3. For mutually exclusive events $A_1, A_2, \ldots, A_k$ the probability $P(A_1 \cup A_2 \cup \ldots \cup A_k)= P(A_1) + \ldots + P(A_k)$
- Dice example
- odd number $A=(1,3,5)$: this is the union of 3 mutually exclusive events $A_1=(1)$, $A_2=(3)$ and $A_3=(5)$, so
$P(A)=P(1)+P(3)+P(5)=1/6+1/6+1/6=0.5$
- $\Omega=(1,2,3,4,5,6)$: $P(\Omega)=1$
- If we draw two subjects ($j$ and $k$) independently from the population, then the joint probability factorises as
$P(X_j,X_k)= P(X_j)P(X_k)$, as illustrated by the simulation sketch below.
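A small simulation sketch (assuming $\pi = 0.508$, the proportion of females used later in this chapter) illustrates the factorisation:
```{r}
# Two independent Bernoulli draws: the empirical joint probability of
# observing (1, 1) should be close to the product 0.508 * 0.508.
set.seed(123)
xj <- rbinom(1e5, size = 1, prob = 0.508)
xk <- rbinom(1e5, size = 1, prob = 0.508)
mean(xj == 1 & xk == 1) # empirical joint probability
0.508^2                 # theoretical product
```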
---
#### Probability mass function
- The probability mass function for a random variable $X$ describes the probability of each possible value of the sample space.
- Example: Gender is a binary variable (0: male, 1: female) and binary variables are Bernoulli distributed. 50.8% of the subjects of the American population are female and 49.2% are male. Let $\pi$ be the probability of a female, $\pi=0.508$.
so
$$ X\sim \left \{
\begin{array}{lcl}
P(X=0) &=& 1-\pi\\
P(X=1) &=& \pi
\end{array} \right . $$
```{r}
data.frame(X = c(0, 1), prob = c(0.492, 0.508)) %>%
ggplot(aes(x = X, xend = X, y = 0, yend = prob)) +
geom_segment() +
ylab("Probability")
```
Random variable $X$ follows a Bernoulli distribution $B(\pi)$ with parameter $\pi=0.508$ and probability mass function
$$P(X=x)= \pi^x(1-\pi)^{(1-x)}$$
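In R this probability mass function can be evaluated with `dbinom` (a Bernoulli variable is a binomial with `size = 1`):
```{r}
# P(X = 0) and P(X = 1) for a Bernoulli variable with pi = 0.508
dbinom(c(0, 1), size = 1, prob = 0.508)
```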
---
#### Cumulative distribution function
- The cumulative distribution function $F(x)$ gives the probability to observe a value of the random variable $X$ that is at most $x$:
$$ F(x) = P(X \leq x) = \sum\limits_{x' \leq x} P(X=x')$$
- Gender example: $F(0)=1-\pi$ and $F(1)=P(X=0) + P(X=1)=1$
```{r}
data.frame(X = c(0, 1), cumprob = c(0.492, 1)) %>%
ggplot(aes(x = X, xend = X, y = 0, yend = cumprob)) +
geom_segment() +
ylab("F(x)")
```
- Dice:
```{r}
data.frame(X = 1:6, cumprob = cumsum(rep(1 / 6, 6))) %>%
ggplot(aes(x = X, xend = X, y = rep(0, 6), yend = cumprob)) +
geom_segment() +
ylab("F(x)")
```
---
#### Mean
The mean or the expected value $E[X]$ of a discrete random variable is given by
$$E[X]=\sum\limits_{x\in\Omega} x P(X=x)$$
- Gender example
- $E[X]= 0 \times (1-\pi) + 1 \times \pi = \pi$
- The mean equals $E[X]=0.508$.
- Dice example:
$$
E[X]= 1 \times 1/6 + 2 \times 1/6 + \ldots + 6 \times 1/6 = `r sum(1:6)/6`
$$
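The same calculation in R:
```{r}
# Expected value of a fair die: sum over x of x * P(X = x)
sum(1:6 * rep(1 / 6, 6))
```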
---
#### Variance
The variance is a measure for the variability of a random variable and is given by
$$E[(X-E[X])^2]=\sum\limits_{x\in\Omega} (x-E[X])^2 P(X=x)$$
- Gender example
\begin{eqnarray}
E[(X-E[X])^2]&=&(0-\pi)^2\times (1-\pi)+(1-\pi)^2 \times \pi\\
&=& \pi^2 (1-\pi) + (1-\pi)^2 \pi\\
&=&\pi (1-\pi)(\pi+1-\pi)\\
&=&\pi(1-\pi)
\end{eqnarray}
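As a numerical check of this derivation, assuming $\pi = 0.508$ as before:
```{r}
# Bernoulli variance via the definition and via the closed form pi * (1 - pi)
p <- 0.508
sum((c(0, 1) - p)^2 * c(1 - p, p)) # definition: sum of (x - E[X])^2 P(X = x)
p * (1 - p)                        # closed form
```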
---
### Continuous random variable
- The density function $f(x)$ describes how likely it is to observe a particular value of the random variable $X$ when we sample a random subject from the population.
- Many biological characteristics are approximately normally distributed (possibly upon transformation)
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
- This is denoted in shorthand as $X \sim N(\mu,\sigma^2)$
- The IQ in the population is known to follow a normal distribution with mean $\mu=100$ and standard deviation $\sigma=15$.
$$IQ \sim N(100,15^2)$$
- In R we can use the `dnorm` function to calculate the density at particular values $X=x$.
- The arguments of `dnorm` are `mean` ($\mu$) and `sd` (standard deviation $\sigma$).
```{r IQ, fig.align='center'}
par(mar = c(5, 4, 4, 2) + 0.1, mai = c(1.02, 0.82, 0.82, 0.42))
grid <- seq(40, 160, .1)
plot(grid, dnorm(grid, mean = 100, sd = 15),
xlab = "IQ",
col = 2, ylab = "Densiteit", type = "l", lwd = 2, cex.lab = 1.5, cex.axis = 1.5
)
```
- Within certain limits, continuous variables can take all possible values so the sample space $\Omega$ is infinitely large.
---
#### Cumulative distribution
- Again the cumulative distribution function is $F(x)=P(X\leq x)$.
- Because $X$ is continuous we calculate this probability using an integral
$$F(x)=\int \limits_{-\infty}^x f(t)\, dt$$
- Note that $f(x)=0$ if $x$ does not belong to the sample space.
- We can calculate $F(x)$ for a normally distributed random variable using the `pnorm` function again with arguments `mean` and `sd`.
```{r}
plot(grid, pnorm(grid, mean = 100, sd = 15), type = "l", xlab = "IQ", ylab = "Probability")
```
So the probability that the IQ of a random subject is below 80 can be obtained by
```{r}
pnorm(80, mean = 100, sd = 15)
```
```{r fig.align='center',echo=FALSE}
grid2 <- seq(40, 80, .01)
plot(grid, dnorm(grid, mean = 100, sd = 15),
xlab = "IQ",
col = 2, ylab = "Densiteit", type = "l", lwd = 2, cex.lab = 1.5, cex.axis = 1.5
)
polygon(x = c(grid2, 80, 40), y = c(dnorm(grid2, 100, 15), 0, 0), col = 2, border = 2)
text(80, dnorm(80, mean = 100, sd = 15), paste0("P(X < 80) = ", round(pnorm(80, 100, 15) * 100, 1), "%"), col = 2, cex = 1.5, pos = 4)
```
```{r fig.align='center',echo=FALSE}
plot(grid, pnorm(grid, mean = 100, sd = 15), type = "l", xlab = "IQ", ylab = "Probability", lwd = 2, col = 2)
lines(c(80, 80, 80, 0), c(0, rep(pnorm(80, 100, 15), 3)), lty = 2)
```
- For the largest possible value of $X$ we integrate over the entire sample space $\Omega$ so
$$\int \limits_{x \in \Omega} f(x) dx=1$$
- So the area under the density function equals 1!
---
#### Mean and Variance
- The mean or the expected value $E[X]$ of a continuous random variable is given by
$$\int \limits_{x \in \Omega} x f(x) dx$$
- For the normal distribution
$$\int \limits_{-\infty}^{+\infty} x f(x) dx = \mu$$
- The variance $E[(X-E[X])^2]$ is given by
$$\int \limits_{x \in \Omega} (x-E[X])^2 f(x) dx$$
- For the normal distribution we get
$$\int \limits_{-\infty}^{+\infty} (x-\mu)^2 f(x) dx = \sigma^2$$
- It is often difficult to interpret the variance because it is not in the same unit as the random variable and the mean.
We therefore often use the standard deviation
$$SD=\sqrt{E[(X-E[X])^2]}$$
The SD of a normal distribution, $\sigma$, has the nice interpretation that approximately 68% of the population has a value for the characteristic $X$ within the interval of one standard deviation ($\sigma$) around the mean:
$$P(\mu-\sigma < X < \mu + \sigma) \approx 0.68$$
```{r fig.align='center',echo=FALSE}
grid2 <- seq(85, 115, .01)
plot(grid, dnorm(grid, mean = 100, sd = 15),
xlab = "IQ",
col = 2, ylab = "Densiteit", type = "l", lwd = 2, cex.lab = 1.5, cex.axis = 1.5
)
polygon(x = c(grid2, 115, 85), y = c(dnorm(grid2, 100, 15), 0, 0), col = "grey", border = "grey")
text(115, dnorm(115, mean = 100, sd = 15), paste0("P(85<X<115) = ", round((pnorm(115, 100, 15) - pnorm(85, 100, 15)) * 100, 1), "%"), col = 2, cex = 1.3, pos = 4)
abline(v = 100)
text(100, 0, expression(mu), pos = 4)
lines(c(85, 115), rep(dnorm(115, 100, 15), 2))
text(107.5, dnorm(114.5, 100, 15), expression(sigma))
text(100 - 15 / 2, dnorm(114.5, 100, 15), expression(sigma))
```
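We can verify the 68% rule for the IQ example ($\mu = 100$, $\sigma = 15$) directly with `pnorm`:
```{r}
# P(85 < X < 115) for X ~ N(100, 15^2)
pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
```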
- For normally distributed random variables, approximately 95% of the subjects in the population have a value that lies within two standard deviations ($2\sigma$) of the mean
$$P[\mu - 2 \sigma < X < \mu + 2 \sigma]\approx 0.95$$
```{r fig.align='center',echo=FALSE}
grid2 <- seq(70, 130, .01)
plot(grid, dnorm(grid, mean = 100, sd = 15),
xlab = "IQ",
col = 2, ylab = "Densiteit", type = "l", lwd = 2, cex.lab = 1.5, cex.axis = 1.5
)
polygon(x = c(grid2, 130, 70), y = c(dnorm(grid2, 100, 15), 0, 0), col = "grey", border = "grey")
text(115, dnorm(115, mean = 100, sd = 15), paste0("P(70<X<130) = ", round((pnorm(130, 100, 15) - pnorm(70, 100, 15)) * 100, 1), "%"), col = 2, cex = 1.3, pos = 4)
abline(v = 100)
text(100, 0, expression(mu), pos = 4)
lines(c(70, 130), rep(dnorm(130, 100, 15), 2))
text(115, dnorm(128, 100, 15), expression(paste(2,sigma)))
text(85, dnorm(128, 100, 15), expression(paste(2,sigma)))
```
---
- In R the cumulative distribution can be calculated with the `pnorm` function. Its arguments are `q` (the quantile), `mean` and `sd` (the standard deviation).
- If you want help you can always type:
```{r, eval=FALSE}
?pnorm
```
- What is the probability that a random subject in the population will have an IQ below 90?
```{r}
pnorm(q = 90, mean = 100, sd = 15)
```
- What is the probability that a random subject in the population will have an IQ below 110?
- What is the probability that a random subject in the population will have an IQ between 90 and 110?
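One way to answer these two questions with `pnorm`:
```{r}
# P(X < 110)
pnorm(110, mean = 100, sd = 15)
# P(90 < X < 110) = F(110) - F(90)
pnorm(110, mean = 100, sd = 15) - pnorm(90, mean = 100, sd = 15)
```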
---
## Standardization
- Normal data are often standardized.
$$z=\frac{x-\mu}{\sigma}$$
- Upon standardization the data follow a standard normal distribution with mean $\mu=0$ and variance $\sigma^2=1$:
$$z \sim N(0,1)$$
We can use the `qnorm` function to calculate the quantiles $z_{2.5\%}$ and $z_{97.5\%}$ corresponding to $F(z_{2.5\%})=0.025$ and $F(z_{97.5\%})=0.975$, respectively.
```{r}
qnorm(0.025)
qnorm(0.975)
```
This indeed indicates that about $97.5\% - 2.5\% = 95\%$ of the values of a standard normal random variable fall within the interval $[-2,2]$, i.e. within about 2 standard deviations ($\sigma=1$) from the mean ($\mu=0$).
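As a quick check of this two-standard-deviation rule on the standard normal scale:
```{r}
# Fraction of a standard normal variable within [-2, 2]
pnorm(2) - pnorm(-2)
```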
---
# Sample
- In real studies we typically do not know the distribution in the population.
- Due to financial and logistic reasons we can almost never study the entire population.
- The population parameters (e.g. mean IQ, variance of IQ) can therefore not be obtained without error.
- Only a small subset of the population can be studied: the *sample*.
- Sample according to a structured design: select **subjects completely at random from the population** so that every subject has an equal probability to end up in the sample $\rightarrow$ **representative sample**.
- The sample $x_1, x_2, \ldots, x_n$ can be considered to be $n$ realisations of the same random variable $X$, for subjects $i = 1, 2, \ldots, n$.
- The distribution in the population is unknown and has to be estimated.
- If we can assume that the studied characteristic follows a particular distribution (e.g. a normal distribution $N(\mu,\sigma^2)$), then we only have to estimate the population parameters (e.g. $\mu$ and $\sigma^2$) based on the sample.
- We refer to them as estimates and denote them by $\hat \mu$ and $\hat \sigma^2$.
---
## NHANES example
- Gender in the population
- Select $n=10000$ subjects at random from the American population.
- Once the random individuals are sampled from the population we have observed $n$ realisations of the random variable $X$.
- *Convention*: Observed values $\rightarrow$ are denoted with a small letter $x$.
- $x$ is a particular value measured/observed in a conducted experiment and no longer an unknown variable.
## In summary
- Unknown values of the studied population characteristic for 1 to $n$ subjects in a sample are random variables: $X_1, \ldots, X_n$
- We have to reason on this in order to understand how the observations, estimates and conclusions of a study can change from sample to sample.
- In a sample we observe the realised outcomes $x_1, x_2, \dots, x_n$: e.g. the observed genders or the observed direct cholesterol levels of the subjects in the sample.
---
# Gender Example
```{r}
library(NHANES)
NHANES %>% ggplot(aes(x = Gender)) +
geom_bar()
```
- Gender is a binary variable.
- It thus follows a Bernoulli distribution.
- The parameter of a Bernoulli distribution is the mean $\pi$.
- We can estimate $\pi$ based on the sample using the sample mean $\bar x = \frac{1}{n}\sum\limits_{i=1}^n x_i$
- Note that the sample mean is also a random variable! It varies from sample to sample!
```{r}
NHANES$Gender %>% head()
```
- Note that Gender is a factor and that females form the reference class (first level).
- By default, R uses the level that comes first in the alphabet as the reference class.
- We can recode the Gender using a 0 and 1 coding.
- When we use the `as.numeric()` function, the factor Gender is transformed into a numeric value with two values: 1 for the first level (females) and 2 for the second level (males). If we subtract 1, we obtain a 0/1 encoding (0 for females and 1 for males).
```{r}
NHANES <- NHANES %>% mutate(gender = as.numeric(Gender) - 1)
mean(NHANES$gender)
```
- Note that, due to this encoding, the sample mean is an estimate of the fraction of males in the population.
- Always be careful with the encoding (see the sketch below)!
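As a sketch of a more defensive alternative, one can compare directly against the level's label, which does not depend on the factor's level order:
```{r}
# TRUE/FALSE is coerced to 1/0, so this is the fraction of males
mean(NHANES$Gender == "male")
```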
---
# Direct cholesterol example
## Empirical distribution
- We can estimate the direct cholesterol distribution of females using a histogram
```{r}
NHANES %>%
filter(Gender == "female") %>%
ggplot(aes(x = DirectChol)) +
geom_histogram()
```
- Note that the distribution is skewed with a tail to the right.
- We can estimate the cumulative distribution function using the empirical cumulative distribution function.
- Every observation in the sample is observed once.
- So the empirical distribution of the sample is a discrete distribution with probability $1/n$ on every observation.
- The empirical cumulative distribution function then becomes
$$ECDF(x) = \sum\limits_{x_i \leq x} \frac{1}{n} = \frac{\# (x_i \leq x)}{n}$$
```{r}
fem <- NHANES %>% filter(Gender == "female" & !is.na(DirectChol))
fem %>%
ggplot(aes(x = DirectChol)) +
stat_ecdf()
```
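By definition, the ECDF evaluated at a value $x$ is simply the fraction of observations at or below $x$; for instance (taking $x = 1.5$ as an arbitrary illustration):
```{r}
# ECDF at x = 1.5: fraction of women with DirectChol <= 1.5
mean(fem$DirectChol <= 1.5)
```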
- We also illustrate this for a small sample of size 10
```{r}
set.seed(1)
fem10 <- NHANES %>%
filter(Gender == "female" & !is.na(DirectChol)) %>%
sample_n(size = 10)
fem10 %>% ggplot(aes(x = DirectChol)) +
stat_ecdf()
```
---
## Normal approximation
- In the introduction we have seen that the log transformed direct cholesterol levels had a nice bell shape.
```{r}
fem %>% ggplot(aes(x = DirectChol %>% log2())) +
geom_histogram(aes(y = after_stat(density), fill = after_stat(count))) +
xlab("Direct Cholesterol (log2)") +
stat_function(fun = dnorm, color = "red", args = list(mean = mean(fem$DirectChol %>% log2()), sd = sd(fem$DirectChol %>% log2())))
```
- We can now approximate the distribution of log2 transformed direct cholesterol levels using a normal distribution.
- We only have to estimate two parameters: the mean and the variance. We can do this based on the sample mean ($\bar x$) and sample variance ($s^2$) or sample standard deviation ($s$).
```{r}
xBar <- mean(fem$DirectChol %>% log2())
sdBar <- sd(fem$DirectChol %>% log2())
xBar
sdBar
```
- We can do the same thing for the small sample with 10 women.
```{r}
fem10 %>% ggplot(aes(x = DirectChol %>% log2())) +
geom_histogram(aes(y = after_stat(density), fill = after_stat(count)), bins = 10) +
xlab("Direct Cholesterol (log2)") +
stat_function(fun = dnorm, color = "red", args = list(mean = mean(fem10$DirectChol %>% log2()), sd = sd(fem10$DirectChol %>% log2()))) +
xlim(-2, 2)
```
```{r}
xBar10 <- mean(fem10$DirectChol %>% log2())
sdBar10 <- sd(fem10$DirectChol %>% log2())
xBar10
sdBar10
```
---
## Reference intervals
- Normal values for the cholesterol levels in the population can be calculated using a reference interval. Typically a 95% reference interval is used, so that 95% of the subjects in the population are expected to have a value for the characteristic that falls into the reference interval.
- We can do this based on the empirical distribution using the quantile function: we calculate the quantiles $x_{2.5\%}$ and $x_{97.5\%}$ for which $\hat{F}(x_{2.5\%})=0.025$ and $\hat{F}(x_{97.5\%})=0.975$, so that 95% of the values are located in the interval [$x_{2.5\%}$, $x_{97.5\%}$].
- Large sample
```{r}
quantile(fem$DirectChol, prob = c(0.025, 0.975))
```
- So based on the large sample, we estimate that 95% of the females in the population have a direct cholesterol level in the interval [`r quantile(fem$DirectChol,prob=c(0.025,0.975))`].
- Small sample
```{r}
quantile(fem10$DirectChol, prob = c(0.025, 0.975))
```
- Note that this estimate is very crude: in the small sample we do not have enough observations to obtain a good approximation of the extreme quantiles.
### Normal approximation
- We can use the `qnorm` function to calculate quantiles of the normal distribution.
- We also know that a 95% reference interval is located roughly within two standard deviations of the mean.
- We use `2^` to back-transform the results to the original direct cholesterol scale.
- Large sample
```{r}
qnorm(0.025, mean = xBar, sd = sdBar) %>% 2^.
qnorm(0.975, mean = xBar, sd = sdBar) %>% 2^.
2^(xBar - 2 * sdBar)
2^(xBar + 2 * sdBar)
```
- Small sample
```{r}
qnorm(0.025, mean = xBar10, sd = sdBar10) %>% 2^.
qnorm(0.975, mean = xBar10, sd = sdBar10) %>% 2^.
2^(xBar10 - 2 * sdBar10)
2^(xBar10 + 2 * sdBar10)
```
---
## Conclusions
- For the large sample, the empirical distribution (quantile function) and the normal approximation give us approximately the same result.
- For the small sample, however, the normal approximation works much better than the one based on the empirical distribution.
- This is because we are looking at extreme quantiles 2.5% and 97.5%.
- Indeed, we have very few observations at our disposal in the small sample to estimate these quantiles directly from the observations.
- With the normal approximation we can use all the data to estimate the mean and the variance. So if the normal assumption holds, we get better estimates for these extreme quantiles.
---
# Statistics
- Formulas are used to estimate the parameters of the population distribution based on the sample. We call these statistics or estimators.
- The numeric results obtained by evaluating these formulas are also called statistics or estimates.
- Researchers want to know the unknown population parameters and will thus estimate them using the statistics calculated from the sample.
- Because we calculate the statistics from the observations in the sample, they also vary from sample to sample: they are random variables, and we denote them with capital letters (e.g. $\bar X$ for the sample mean and $S^2$ for the sample variance).
- So when we analyse data, we have to reason on how the statistics of interest will vary from sample to sample.
- When the statistics refer to a numeric value realised in a particular sample, we will use a small letter: $\bar x$ and $s^2$.
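A small simulation sketch (re-using the `fem` data from above) makes this concrete: repeated samples of size 10 give different sample means.
```{r}
# The sample mean is itself a random variable:
# five random samples of 10 women give five different means.
set.seed(42)
replicate(5, mean(sample(fem$DirectChol, size = 10)))
```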
---
# Convention
- **Population parameters** are fixed but unknown $\rightarrow$ we denote them with **Greek symbols**.
- **Statistics** that we use to estimate unknown parameters based on the sample are denoted with **Roman letters** or with a hat.
- e.g. for the normal distribution
| Population | Sample|
| :---: | :---: |
| $\mu$ | $\bar X$ or $\hat \mu$ |
| $\sigma^2$ | $S^2$ or $\hat \sigma^2$|