---
title: "Linear Mixed Models (LMMs) - Introduction"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html:
    toc: true
    number_sections: true
---
Download the raw `R` markdown code here
[https://jwiley.github.io/MonashHonoursStatistics/LMM_Intro.rmd](https://jwiley.github.io/MonashHonoursStatistics/LMM_Intro.rmd).
These are the `R` packages we will use.
```{r setup}
options(digits = 2)
## new packages are lme4, lmerTest, and multilevelTools
library(tufte)
library(haven)
library(data.table)
library(JWileymisc)
library(lme4)
library(lmerTest)
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
```
# Revision to Prepare for Linear Mixed Models
We can model a straight line (linear regression) as
$$ y_i = b_0 + b_1 * x_i + \varepsilon_i $$
where:
- $y_i$ is the outcome variable for the *i*th person
- $x_i$ is the predictor/explanatory variable value for the *i*th person
- $\varepsilon_i$ is the residual/error term, the difference between the
observed outcome for the *i*th person and their predicted value from the model.
- $b_0$ is the intercept, the expected (model predicted) value of $y$
when $x = 0$, written, $E(Y | x = 0)$
- $b_1$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x$.
In `R` we can write the linear regression model
$$ y_i = b_0 + b_1 * x_i + \varepsilon_i $$
as follows:
```
lm(y ~ 1 + x, data = your_dataset)
```
The `lm` stands for a **l**inear **m**odel. $y$ is the outcome.
The tilde (~) separates the outcome from the predictors.
The predictors are $1$ and $x$. One is a constant, which
captures the intercept, $b_0$.
$x$ is whatever our predictor/explanatory variable is.
We specify the dataset so `R` knows where to find the variables.
`r margin_note("The intercept is 324.08 (the 'Estimate' column) and is
the expected value of hp when mpg is 0. The slope is -8.83 indicating
that each one unit increase in mpg is associated with an 8.83 lower
expected hp value. P-values can be interpreted from the column
called 'Pr(>|t|)' and may use scientific notation, see:
https://en.wikipedia.org/wiki/Scientific_notation")`
Here is a specific example estimated in `R` using the built in
`mtcars` dataset which has data on the horsepower `hp` and
miles per gallon `mpg` of 32 different cars.
We predict `hp` from an intercept and `mpg` and
get a model summary.
```{r}
m <- lm(hp ~ 1 + mpg, data = mtcars)
summary(m)
```
Visually, we can represent the regression as a straight line
where the slope is the coefficient for `mpg`, $b_1$, and the
intercept, $b_0$ is the expected `hp` when `mpg` is 0.
We can see the largest residual here: it is the point furthest
from the regression line.
```{r, echo = FALSE, fig.cap = "Graph showing regression of hp on mpg with linear regression line and the maximum residual from the regression line."}
ggplot(mtcars, aes(mpg, hp)) +
geom_point(size = 3, colour = "grey") +
stat_smooth(method = "lm", formula = y ~ x, size = 1.5) +
theme_pubr() +
annotate("text", x = 16, y = max(mtcars$hp),
label = "max residual = 143.4", hjust = 0)
```
In linear regression, we can conduct statistical inference as
$$ \frac{b}{se} \sim \mathcal{t}(df = N - k) $$
where:
- $b$ = regression coefficient
- $se$ = standard error
- $N$ = total sample size
- $k$ = number of estimated parameters
- $\mathcal{t}$ is the t-distribution on $df$ degrees of freedom
This approach allows us to calculate p-values and confidence intervals.
To see what the $t$ distribution looks like for various degrees of
freedom and how these impact p-values, look at this demonstration:
https://rpsychologist.com/d3/tdist/
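As a quick check of this formula (a sketch, not part of the original analysis), we can reproduce the p-value for the `mpg` slope from the `mtcars` model `m` fit above using only the coefficient, its standard error, and the $t$ distribution.
```{r}
## Sketch: reproduce Pr(>|t|) for the mpg slope from b / se and the
## t distribution with df = N - k (N = 32 cars, k = 2 parameters)
b  <- coef(summary(m))["mpg", "Estimate"]
se <- coef(summary(m))["mpg", "Std. Error"]
tval <- b / se
2 * pt(abs(tval), df = nrow(mtcars) - 2, lower.tail = FALSE)
```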
Linear regression **assumes** that observations are independent of each
other. However, this is not always the case. **Linear mixed models** are
an approach to regression that allows us to relax the assumption that
our observations are independent.
# Linear Mixed Models (LMMs) - Introduction
Often, observations are not independent.
For example:
- Repeated measures data, such as in longitudinal studies;
repeated measures experiments, where observations are clustered
within people
- When individuals are clustered or grouped, such as people within
families, schools, companies, resulting in clustering within the
higher order unit
Clustered data versus repeated measures or longitudinal data may
seem quite different, but from a statistical perspective, they
pose many of the same challenges, which are solved in basically the same
ways. We will focus on repeated measures for now, but note the
statistical methods apply to both contexts.
Here is some hypothetical data on a few people where systolic blood
pressure (SBP) was measured at three time points:
| ID | SBP1 | SBP2 | SBP3 |
|----|------|------|------|
| 1 | 135 | 130 | 125 |
| 2 | 120 | 125 | 121 |
| 3 | 121 | 125 | . |
These data are stored in "wide" format. That is, each repeated
measure is stored as a separate variable. For LMMs, we typically
want data in a long format, like this:
| ID | Time | SBP |
|----|------|-----|
| 1 | 1 | 135 |
| 1 | 2 | 130 |
| 1 | 3 | 125 |
| 2 | 1 | 120 |
| 2 | 2 | 125 |
| 2 | 3 | 121 |
| 3 | 1 | 121 |
| 3 | 2 | 125 |
In a long format, data are stored with each measure in one variable
and an additional variable to indicate the time point or
which specific assessment is being examined.
In long format, multiple rows may belong to any single unit.
Not all units (here IDs) have to have the same number of rows.
For example, some people, like ID 3, may have missed a time point.
With clustered data, you could have a large or small school with
a different number of students (where student is the repeated measure
within school rather than time points within a person).
The `reshape()` function in `R` can be used to reshape data in
a wide format to a long format or from a long format to a wide format.
See the "Working with Data" topic for examples and details
[http://joshuawiley.com/MonashHonoursStatistics/WorkData.html#reshaping-data](http://joshuawiley.com/MonashHonoursStatistics/WorkData.html#reshaping-data).
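As a small illustration, here is a sketch of `reshape()` applied to the hypothetical SBP table above (the `sbp_wide` object is constructed here purely for this example; it is not part of the unit's data).
```{r}
## Hypothetical wide-format SBP data matching the table above
sbp_wide <- data.frame(ID = 1:3,
                       SBP1 = c(135, 120, 121),
                       SBP2 = c(130, 125, 125),
                       SBP3 = c(125, 121, NA))
## Reshape from wide to long: one row per ID and time point
sbp_long <- reshape(sbp_wide,
                    direction = "long",
                    varying = c("SBP1", "SBP2", "SBP3"),
                    v.names = "SBP",
                    timevar = "Time",
                    idvar = "ID")
sbp_long[order(sbp_long$ID), ]
```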
Regardless of wide or long format, these data are clearly not
independent. The different blood pressure readings within a
person are likely more related to each other than to blood pressure
readings from a different person.
## Considering Non-Independence
As part of PSY4210 there was a small data collection exercise where
you completed some questions every day. We load that data using the
`read_sav()` function and plot the daily energy ratings from a few
different people in the following figure.
```{r, fig.height = 4, fig.width = 5}
dd <- as.data.table(read_sav("DD 19032020.sav")) # daily
ggplot(dd[ID %in% c(2, 3, 6, 7)],
aes(factor(ID), energy)) +
geom_point(alpha = .2) +
stat_summary(fun = mean, colour = "blue", shape = 18) +
theme_pubr()
```
Not only do observations vary across days within a person, but
different people have different mean energy levels.
Our first thoughts on analysing data where observations are not
independent may be to consider methods we already know, such as
linear regression (or GLMs) or ANOVAs.
Of these, one option is a repeated measures ANOVA (RM ANOVA).
RM ANOVA can handle repeated measures if:
- measures are at discrete time points (e.g., T1, T2, T3)
- everyone has the same number of time points
- the outcome is continuous, normally distributed data
However, RM ANOVA has many limitations. It cannot handle...
- continuous time (e.g., if one person completes on days 0, 13, 22,
and another on 0, 1, 20)
- missing data: if someone is missing any time point, they are completely
excluded from the analysis (unless you do something like imputation)
- continuous predictors/explanatory variables (e.g., age in years) or
other types of outcomes
If ANOVAs are not ideal, what would happen if we use a linear
regression? The following figure shows the linear regression lines for
the association between energy and self esteem for each individual
participant (just four for example) and the single line that is the
result of an overall linear regression (in blue).
```{r, fig.height = 4, fig.width = 5}
ggplot(dd[ID %in% c(2, 3, 6, 7)],
aes(energy, selfesteem_LSE)) +
stat_smooth(method = "lm", se = FALSE, colour = "blue", size = 2) +
stat_smooth(aes(group = ID), method = "lm", se = FALSE, colour = "black") +
theme_pubr()
```
The overall linear regression has to be the same for everyone because
in linear regression we have one intercept and one slope. The fact
that different people may have different levels of energy or self
esteem and the fact that the association between energy and self esteem
may not be the same for all people cannot be captured by the linear
regression.
Statistically, the linear regression assumption of independence also
would be violated, making the standard errors, confidence intervals,
and p-values from a linear regression on non-independent data biased
and wrong.
The inability of the regressions / GLMs we have learned so far to
capture individual differences in the level of each line (different
intercepts by ID) or in the relationship between the predictor and
outcome (different slopes by ID), as well as the violation of
independence, all stem from the fact that in those models every
coefficient is a **fixed effect**, meaning that it is **fixed**
(assumed) to be identical for every participant.
When you have only one observation per participant, fixed effects are
necessary as it is impossible to estimate coefficients for any
individual participant. We can only analyze by aggregating across
individuals. With repeated measures, however, it is possible and
important to consider whether a coefficient differs between people.
If a regression coefficient differs for each participant, we say that
it varies or is a random variable. Because of this, we call
coefficients that differ by person **random effects**. They differ
from fixed effects in that they are not assumed to be identical for
everyone, but instead vary randomly by person.
## LMM Theory
Linear Mixed Effects Models (LMMs) extend the fixed-effects only linear
regression models covered in previous units to include both fixed and
random effects. That is, to be able to include regression coefficients
that are identical for everyone (fixed effects) and regression
coefficients that vary randomly for each participant (random
effects).
Because LMMs include both fixed and random effects, they are called
mixed models.
Let's see how this works starting with the simplest linear regression
model and then moving into the simplest linear mixed model.
The simplest linear regression model is an intercept only model.
$$ y_i = b_0 * 1 + \varepsilon_i $$
The intercept here, $b_0$, is the expected outcome value when all
other predictors in the model are zero. Since there are no other
predictors, this will be the mean of the outcome. Treating the
intercept as a fixed effect means the linear regression model assumes
that all participants have
the same mean. Shown graphically, here are a few participants' data
with a blue dot showing the mean for each person. The regression line
with an intercept only will be the blue line. It assumes that the mean
(intercept) for all people is identical, which is not true.
We need to make $b_0$ a random effect -- to allow it to vary randomly
across participants.
```{r, fig.height = 4, fig.width = 5}
ggplot(dd[ID %in% c(2, 3, 4, 5)],
aes(ID, energy)) +
geom_point(alpha = .2) +
stat_summary(fun = mean, colour = "blue", shape = 18) +
stat_smooth(method = "lm", se = FALSE, formula = y ~ 1) +
theme_pubr()
```
We can see the actual distribution of mean energy levels by ID easily
enough as follows.
```{r}
plot(testDistribution(dd[, .(
MeanEnergy = mean(energy, na.rm = TRUE)), by = ID]$MeanEnergy),
varlab = "Mean (average) Energy by ID")
```
The key point of the previous graph is to show that indeed, different
participants have different mean energy levels. The mean energy levels
vary and in this case appear to fairly closely follow a normal
distribution. The means are a random variable.
If we assume that a random variable follows a particular distribution
(e.g., a normal distribution), we can describe it with two parameters:
the mean and standard deviation.
Assuming a random variable comes from a distribution gives us another
way to think about fixed and random effects.
- **Fixed Effects** approximate the distribution by: $M$ = estimated
mean; $SD$ = 0 (i.e., the standard deviation is **fixed** at 0)
- **Random Effects** approximate the distribution by:
$M$ = estimated mean; $SD$ = estimated standard deviation
(the SD is free to vary).
In this way, you can think of fixed and random effects as both
using a distribution to approximate a random variable (the association
between say energy and self esteem for different people) but fixed
effects make the strong assumption that the distribution has $SD = 0$
whereas random effects *relax* that assumption and allow the
possibility that $SD > 0$.
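To make this distinction concrete, here is a minimal sketch with made-up values (the numbers for `gamma00` and `sigma_int` are purely illustrative, not estimated from the daily data).
```{r}
## Fixed effects: every person's intercept is identical (SD fixed at 0)
## Random effects: intercepts are draws from a normal distribution
set.seed(123)
gamma00   <- 3    # hypothetical average intercept
sigma_int <- 0.5  # hypothetical SD of person-specific intercepts
fixed_intercepts  <- rep(gamma00, 5)              # SD = 0
random_intercepts <- rnorm(5, gamma00, sigma_int) # SD > 0
fixed_intercepts
random_intercepts
```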
## Intraclass Correlation Coefficient (ICC)
In repeated measures / multilevel (two-level) data, we can decompose
variability into two sources: variability **between** individuals and
variability **within** individuals.
We call the ratio of between variance to total variance the Intraclass
Correlation Coefficient (ICC). The ICC varies between 0 and 1.
- 0 indicates that all individual means are identical so that all
variability occurs within individuals.
- 1 indicates that within individuals all values are the same with all
variability occurring between individuals.
We can define some equivalent notations:
$$ TotalVariance = \sigma^{2}_{between} + \sigma^{2}_{within} = \sigma^{2}_{randomintercept} + \sigma^{2}_{residual} $$
With that definition of the total variance, we can define the ICC as:
$$ ICC = \frac{\sigma^{2}_{between}}{\sigma^{2}_{between} + \sigma^{2}_{within}} = \frac{\sigma^{2}_{randomintercept}}{\sigma^{2}_{randomintercept} + \sigma^{2}_{residual}} $$
A basic understanding of these equations is helpful as it lets us
interpret the ICC. Suppose that every person's mean was
identical. There would be no variation at the between person level.
That is, $\sigma^{2}_{between} = 0$. That results in:
$$ ICC = \frac{0}{0 + \sigma^{2}_{within}} = 0 $$
When there are no differences between people, $ICC = 0$.
What happens if there is no variation within people? What if all the
variation is between people?
That is, $\sigma^{2}_{within} = 0$. That results in:
$$ ICC = \frac{\sigma^{2}_{between}}{\sigma^{2}_{between} + 0} = 1 $$
When there are no differences within people, $ICC = 1$.
The ICC is the ratio of the differences between people to the total
variance. A small ICC near 0 tells us that almost all the variation
exists within person, not between person. A high ICC near 1 tells us
that almost all the variation occurs between people with very little
variation within person. The ICC can be interpreted as the percent of
total variation that is between person.
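As a tiny worked example with made-up variances, suppose the between person variance is 2 and the within person variance is 3.
```{r}
## Hypothetical variances: 2 between, 3 within
2 / (2 + 3)  # ICC = 0.4, i.e., 40% of total variance is between people
```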
The following figure shows two examples. The High Between variance
graph would have a **high** ICC and the Low Between variance graph
would have a **low** ICC.
```{r, fig.height = 10, fig.width = 7, fig.cap = "Example showing the difference between high and low between person variance in data."}
set.seed(1234)
ex.data.1 <- data.table(
ID = factor(rep(1:4, each = 10)),
time = rep(1:10, times = 4),
y = rnorm(40, rep(1:4, each = 10), .2))
ex.data.2 <- data.table(
ID = factor(rep(1:4, each = 10)),
time = rep(1:10, times = 4),
y = rnorm(40, 2.5, 1))
ggarrange(
set_palette(ggplot(ex.data.1,
aes(time, y, colour = ID, shape = ID)) +
stat_smooth(method = "lm", formula = y ~ 1, se=FALSE) +
geom_point() +
theme_pubr(), "jco"),
set_palette(ggplot(ex.data.2,
aes(time, y, colour = ID, shape = ID)) +
stat_smooth(method = "lm", formula = y ~ 1, se=FALSE) +
geom_point() +
theme_pubr(), "jco"),
ncol = 1,
labels = c("High Between", "Low Between"))
```
`r margin_note("Beyond the scope of this unit, sometimes we have even
more levels of data, like repeated observations measured within
people and people measured within classrooms or cities, etc. We can
similarly calculate ICCs for each level, i.e., city, person,
etc. To do this using the iccMixed() function, you would just add
more ID variables.")`
We can calculate the ICC in `R` using the `iccMixed()` function.
It requires the name of the dependent variable, the ID variable
(indicating which rows in the dataset belong to which units), and the
dataset name. It returns the standard deviation (Sigma) attributable
to the units (ID; the between variance) and to the residual (the
within variance). The column labelled ICC is the proportion of total
variance attributable to each effect. The main ICC would be the ICC
for the ID, here 0.41 for stress.
```{r}
iccMixed("stress", id = "ID", data = dd)
```
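The same value can also be recovered by hand from the variance components of an intercept only LMM (fit with the `lmer()` function introduced in detail below); this sketch should give a result very close to the `iccMixed()` output above.
```{r}
## Fit an intercept-only mixed model for stress and extract the
## random intercept (between) and residual (within) variances
m_icc <- lmer(stress ~ 1 + (1 | ID), data = dd)
vc <- as.data.frame(VarCorr(m_icc))
vc
## ICC = between variance / (between variance + residual variance)
vc$vcov[1] / (vc$vcov[1] + vc$vcov[2])
```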
The ICC can differ across variables. For example, one variable might
be quite stable across days and another very unstable. When working
with repeated measures data, the ICC for each repeated measure
variable is a useful descriptive statistic to report. For example, the
following calculates the ICC for energy, 0.3, which is quite a bit lower
than that for stress, indicating that energy is less stable from day
to day than is stress, or equivalently that compared to energy, stress
has relatively more variation between people than within them.
```{r}
iccMixed("energy", id = "ID", data = dd)
```
## Between and Within Effects
The ICC illustrates an idea that comes up a lot in multilevel or mixed
effects models: the idea of between and within effects. A total effect
or the observed total score can be decomposed into one part
attributable to a between effect and another part attributable to a
within effect, as shown in the following diagram.
```{r, echo = FALSE, fig.cap = "Decomposing multilevel effects"}
DiagrammeR::grViz("
digraph 'Decomposing multilevel effects' {
graph [overlap = true, fontsize = 14]
node [fontname = Helvetica, shape = rectangle]
T [label = 'Total']
B [label = 'Between']
W [label = 'Within']
T -> {B W}
}
")
```
Between effects exist **between** units whereas within effects exist
**within** a unit. In psychology, often the unit is a person and the
between effects are differences between people whereas the within
effects are differences within an individual.
```{marginfigure}
When you subtract the mean from a variable, the new mean will be equal
to 0. For example, if you have three numbers:
$$ (1, 3, 5) $$
the mean is 3, if you subtract the mean you get:
$$ (-2, 0, +2) $$
and the mean of the deviations is 0.
If you have:
$$ (4, 5, 6) $$
the mean is 5 and if you subtract the mean you get:
$$ (-1, 0, +1) $$
as deviation scores and the mean of these deviations is 0.
It is a fact that the mean of deviations from a mean will always
be 0. When we calculate a within person variable from a total
variable, we separate out each person's mean as the between
portion, so the within portion contains deviations from individuals'
own means and the mean of the within portion will be 0.
```
For example, suppose that in one individual, Person A, over three
days, we observed the following scores on happiness: 1, 3, 5.
Those total observed scores could be decomposed into one part
attributable to a stable, between person effect (the mean) and another
part that purely captures daily fluctuations, the within person effect
(deviations from the individual's own mean). We could break down those
numbers: 1, 3, 5 as shown in the following figure.
```{r, echo = FALSE, fig.cap = "Decomposing multilevel effects - concrete example"}
DiagrammeR::grViz("
digraph 'Decomposing multilevel effects example' {
graph [overlap = true, fontsize = 12]
node [fontname = Helvetica, shape = rectangle]
subgraph Between {
node [fontname = Helvetica, shape = rectangle]
B1 [label = 'Day 1: 3']
B2 [label = 'Day 2: 3']
B3 [label = 'Day 3: 3']
B1 -> B2
B2 -> B3
}
subgraph Within {
node [fontname = Helvetica, shape = rectangle]
W1 [label = 'Day 1: -2']
W2 [label = 'Day 2: 0']
W3 [label = 'Day 3: +2']
W1 -> W2
W2 -> W3
}
Total
Total -> Between
Total -> Within
Between -> B1 [lhead = between]
Within -> W1 [lhead = within]
}
")
```
Notice that the between person effect does not vary across days. It is
constant for a single individual. The within person portion, however,
does vary across days.
In this case, the between person effects would be the mean for each
person, matching a different intercept for each person from our ICC
example. The variance in these means (intercepts) is the between
person variance. The within person effects are individual deviations
from individuals' own means and their variance would be the residual
variance. Together those variances add up to the total variance and
these portions are what the ICC captures.
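Here is a minimal sketch of this decomposition for Person A's happiness scores (1, 3, 5).
```{r}
## Total score = between part (person's mean) + within part (deviation)
happiness <- c(1, 3, 5)
between   <- mean(happiness)       # 3, constant across days
within    <- happiness - between   # -2, 0, +2, varies across days
all.equal(happiness, between + within)
```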
## Linear Mixed Models
```{marginfigure}
Linear mixed models (LMMs) or mixed effects models are also
sometimes called "multilevel models" or
"hierarchical linear models" (HLMs). The multilevel model or
hierarchical linear model names both come from the fact that
they are used for data with multiple, hierarchical levels,
like observations nested within people, or kids nested
within classrooms. However, all three of these are synonyms
for the same statistical model, that includes both fixed and random
effects, as random effects are the method used to address the
hierarchical or multilevel nature of the data.
One difference, however, is that HLMs imply a specific hierarchy,
which may not always be present in data. For example, if all siblings
from a family were recruited as well as students in the same
classroom, there would be crossed effects with any given person a
member of a higher level family unit and a higher level classroom
unit. HLMs generally are not set up for situations where there is not a
pure hierarchy. Mixed effects models address this easily as it just
requires different random effects (one for families, one for
classrooms, for example). Thus, all HLMs are mixed models, but not all
mixed models are HLMs. Another example would be if, in an experiment,
we randomly sampled difficulty levels and every participant completed
all combinations. Difficulty level could be a random effect and
participant could be a random effect, with observations nested within
both, but they do not form a clear hierarchy (i.e., difficulty level
is not above/below participant).
```
Unlike linear regression, which models all regression coefficients as
fixed effects, mixed effects regression models some coefficients as
fixed effects and others as random effects (hence the term, "mixed"
effects).
A random effect is a regression coefficient that is allowed to vary
as a random variable. Random effects provide a way of dealing with
non-independence in the data by explicitly modeling the systematic
differences between different units.
The simplest linear mixed model (LMM) is an intercept only model,
where the intercept is allowed to be a random variable.
$$ y_{ij} = b_{0j} * 1 + \varepsilon_{ij} $$
Here we have an observed outcome, $y$, with observations for a
specific person (unit), $j$, at a specific time point / lower level
unit, $i$. In addition, note that our intercept, which in linear
regression is $b_0$ is now $b_{0j}$ indicating that the intercept is
an estimated intercept for each unit, $j$. In psychology our units
often are people, but as noted before the "unit" may be something else,
such as students within classrooms, or kids within families, etc.
As before, we assume that:
$$ y_{ij} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_{\varepsilon_{ij}}) $$
by convention, we decompose random effects, here the random intercept,
as follows:
$$ b_{0j} = \gamma_{00} + u_{0j} $$
where
- $\gamma_{00}$ is the mean intercept across all people. There are
no variable subscripts under it, only numbers, as it does *not* vary
across different people.
- $u_{0j}$ is the individual unit deviations from the mean
intercept, $\gamma_{00}$.
Under this decomposition, $\gamma_{00}$ is a fixed effect, the
average intercept for all units, and is interpreted akin to what we
are used to in linear regression, $b_0$. $u_{0j}$ is how much each
individual unit's intercept differs from the average; they are
deviations.
We also make another, new distributional assumption. We assume that:
$$ u_{0j} \sim \mathcal{N}(0, \sigma_{int}) $$
In other words, we assume that individual units' deviations from the
fixed effect, average intercept follow a normal distribution with mean
0 and standard deviation equal to the standard deviation of the
deviations.
The actual parameters of the model that must be estimated are:
- $\gamma_{00}$, the fixed effect intercept
- $\sigma_{int}$, the standard deviation of the individual deviations
from the fixed effect intercept, the $u_{0j}$
- $\sigma_{\varepsilon}$, the residual error term
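To make the notation concrete, here is a minimal simulation sketch of this model (the parameter values and object names, such as `sim`, are purely illustrative, not taken from the daily dataset).
```{r}
## Simulate y_ij = gamma00 + u_0j + e_ij for J people, n days each
set.seed(42)
J <- 50                                   # number of people
n <- 5                                    # observations per person
gamma00   <- 3                            # fixed effect intercept
sigma_int <- 1                            # SD of person deviations u_0j
sigma_eps <- 0.5                          # residual SD
u0  <- rnorm(J, mean = 0, sd = sigma_int) # person deviations
sim <- data.table(ID = rep(1:J, each = n))
sim[, y := gamma00 + u0[ID] + rnorm(.N, mean = 0, sd = sigma_eps)]
head(sim)
```
Fitting `lmer(y ~ 1 + (1 | ID), data = sim)` to data simulated this way should recover estimates close to the three parameters above.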
A powerful feature of LMMs is that despite allowing/modelling
individual differences in a coefficient, here the intercept, we do not
have to estimate a separate intercept parameter for each unit /
person. In fact, it takes only one more parameter than a regular
linear regression: the standard deviation of the individual
intercepts. We also add one distributional assumption, that the
random intercept follows a normal distribution, but in exchange for
the one extra parameter and one new assumption, we are able to relax
the assumption of linear regression that each observation is
independent.
To fit a LMM in `R`, we use the `lmer()` function from the `lme4`
package. `lmer()` stands for **l**inear **m**ixed **e**ffects
**r**egression, and it follows a syntax very similar to `lm()`. The
main difference/addition is a new section in parentheses to indicate
which specific regression coefficients should be random effects and
what the ID variable of the unit is that should be used for the random
effects. Here we will fit an intercept only LMM on stress from the
daily data collection exercise data. We add a random intercept by ID
using the syntax: `(1 | ID)`.
```{marginfigure}
Summary of output:
- Type of model, how it was fit (Restricted Maximum Likelihood, REML),
and how statistical inference was performed (t-tests using
Satterthwaite's approximation method to calculate approximate degrees
of freedom).
- The formula used to define the model, echoed for reference.
- The REML criterion, which is akin to a log likelihood value. We'll
talk about this more in a later lecture.
- Scaled residuals, which are rescaled from the raw residuals.
Interpret these similarly to the usual residuals from linear regression.
- A new section, *Random Effects*, which shows the variance and
standard deviation (just the square root of the variance) for the
random intercept and the residual variance (our residual error term
from linear regression).
- The total number of observations included and the number of unique
units, here IDs, representing different people. These numbers are
what were included in the analysis.
- The Fixed Effects regression coefficient table, which is interpreted
comparably to the regression coefficient table for a linear
regression. One note is that the degrees of freedom, df, cannot
readily be calculated in LMMs. Satterthwaite's approximation method
was used here; because the degrees of freedom are approximated, you
can get decimal values, not just whole numbers. The decimals are not
an error.
```
The summary shows information about how the model was estimated and
its overall results.
```{r}
m <- lmer(stress ~ 1 + (1 | ID), data = dd)
summary(m)
```
We can interpret that in this model, there were
`r nobs(m)` observations included in the linear mixed model, and these
observations came from
`r ngrps(m)[["ID"]]` different people.
On average, these people had a stress score of
`r fixef(m)[["(Intercept)"]]`, the fixed effect intercept, and this
was significantly different from zero. By examining the random effects
standard deviations, `Std.Dev.`, we can see that the random intercept
standard deviation is almost as large as the residual standard
deviation, which is consistent with the earlier ICC we calculated for
stress: nearly half the variance in stress is between people, with
the other half occurring within people.
```{marginfigure}
In mixed models we do not technically estimate parameters for each
individual. That is, we do not directly estimate random effects like
$b_{0j}$. We estimate the average of the distribution, $\gamma_{00}$
and the standard deviation of the distribution, $\sigma_{int}$.
The `coef()` function then does not report estimated coefficients per
se, but what are called Best Linear Unbiased Predictors (BLUPs).
The BLUPs use the model estimated parameters from the LMM along with
the data, to basically predict what each person's own, conditional
mean is. This combines both their actual data and the model. People
with more data will have a BLUP closer to the observed mean of their
own data in an intercept only model. People with less data (say only
1-2 observations) will have a BLUP that is closer to the average mean
of all people, as it is assumed that the mean of their 1-2 data points is
likely very noisy/inaccurate due to small sample size.
*Side note: while many of the concepts from LMMs apply to GLMMs just as
many concepts from LMs apply to GLMs, the BLUPs or conditional means
do not quite generalize. In GLMMs we can get conditional modes, but
not conditional means. This only matters if you would be doing, for
example, logistic mixed effects regression models.*
```
We can see the fixed effects coefficients only from the model by using
the `fixef()` function. If we want to see the predictions of the
coefficients for each person, i.e., using the random effects, we can
use the `coef()` function.
```{r}
fixef(m)
coef(m)
```
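As a small sketch of how these pieces relate, each person's value from `coef()` is the fixed effect intercept plus that person's predicted deviation (BLUP) from `ranef()`.
```{r}
## coef() per person = fixef() + that person's BLUP from ranef()
head(fixef(m)[["(Intercept)"]] + ranef(m)$ID[, "(Intercept)"])
head(coef(m)$ID[, "(Intercept)"])
```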
The random intercept coefficient estimates are similar conceptually to
what we would get if we calculated the average stress score, per
person, across days. However, there are some differences. To see
these, we'll put both into the same dataset.
```{r}
## make a data table of the random intercepts and IDs
randomintercept <- data.table(
ID = as.numeric(rownames(coef(m)$ID)),
RI = coef(m)$ID[, "(Intercept)"])
## view the first few rows
head(randomintercept)
## calculate the means and number of observations, by ID
individualMeans <- dd[!is.na(stress), .(
Means = mean(stress),
N = .N), by = ID]
## merge the two datasets together by ID
rimeans <- merge(randomintercept, individualMeans, by = "ID", all=FALSE)
## order the dataset by mean
rimeans <- rimeans[order(Means)]
rimeans[, ID := factor(ID, levels = ID)]
## view the merged data
head(rimeans)
```
Now we can make a figure with an arrow for each person that starts at
the raw mean and ends at the random intercept estimate for that
individual. The fixed effect intercept is added as a line. The
individual arrows are coloured based on how many days of stress data
each person had available.
```{r, fig.width = 5, fig.height = 7, fig.cap = "Figure showing how, compared to raw means, random intercept estimates from LMMs tend to shrink individual estimates towards the overall fixed effect estimate, the solid line in the middle. The degree of shrinkage tends to be greater for individuals that are further away from the fixed effect estimate and is greater for individuals with fewer data points."}
ggplot(rimeans, aes(ID, xend = ID, y = Means, yend = RI, colour = N)) +
geom_hline(yintercept = fixef(m)[["(Intercept)"]]) +
geom_segment(arrow = arrow(length = unit(.15, "cm"))) +
coord_flip() +
theme_pubr() +
ggtitle("Shrinkage From Raw to LMM Means")
```
This figure highlights how shrinkage is a distinctive feature of LMMs
compared to simply fitting models (here a mean) separately to each
individual person. LMMs tend to shrink estimates towards the
overall fixed effect estimate. In other words, more extreme estimates
get pulled in closer to the mean, which has the effect of
stabilising relatively extreme values. How much an estimate is
shrunk depends both on how extreme it is (the more extreme, the more
shrinkage) and on how many observations are available for that
unit. If a unit has many data points, it will have less shrinkage.
For some more discussion and examples of this, see:
- https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixed-models/
- https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/
Finally, we can run model diagnostics on LMMs similarly to how we did
for linear regressions. If we run the `modelDiagnostics()` function on
our model, we get an error about an unknown class of type haven_labelled.
```{r, error=TRUE}
md <- modelDiagnostics(m, ev.perc = .001)
```
This is telling us that `modelDiagnostics()` is not sure how to work
with data of that type in `R`. Because we know stress is basically
numeric data, we can convert stress into a numeric variable in `R`
using `as.numeric()`. This is basically just a way of telling `R` that
it really is just numeric data. Then we refit our model and again ask
for diagnostics. Note that if we had lots of variables in our model,
we might have to check the classes of multiple variables to find out
which one was the problem. Since stress is the only variable in the
model, we know it is the culprit.
```{r}
## check the R class of stress
class(dd$stress)
## convert stress in R to a numeric class
dd[, stress := as.numeric(stress)]
```
Now, we can refit our model and calculate diagnostics. The plot of the
residuals and the residuals versus predicted values plots are familiar
and are used to assess the same two assumptions for LMMs as for linear
regression models. We also get a density plot for `ID : (Intercept)`
the random effect for intercepts. This is to assess the assumption
that the random effects, here the random intercept coefficients, are
approximately normally distributed. In this case, all the assumptions appear
reasonably well met, with only one relatively extreme stress residual
noted at 3.21.
```{r}
## refit model
m <- lmer(stress ~ 1 + (1 | ID), data = dd)
## calculate diagnostics
md <- modelDiagnostics(m, ev.perc = .001)
## plot diagnostics
plot(md, ask = FALSE, ncol = 2, nrow = 2)
```
In later lectures, we will dig further into estimating LMMs and adding
predictors with both fixed and random effects.
# Data Cleaning for LMMs
A common data cleaning task is to identify and address extreme values.
In multilevel / repeated measures data, extreme values may exist at
the between person level (i.e., a person who is relatively extreme
compared to other people) or at the within person level (i.e., an
assessment that is relatively extreme for that person).
In order to assess whether an individual is, on average, similar to or
relatively extreme compared to other individuals, or whether a specific data
point within an individual is similar to or different from that
individuals' own typical values, we often calculate the mean per
person and the deviations from the mean. This essentially decomposes a
repeated measures variable into part that is between person and part
that is within person. One way to make it clear which is which is to
add a prefix, like `B` for between and `W` for within.
To clean `stress`, we will start by calculating the average for each
person and the deviations from each person's own average, using the
`meanDeviations()` function. We will create two new variables in the
dataset: `Bstress` and `Wstress` and these will be created by ID.
If we look at the first few rows of the data, we can see that if we
were to add up the between and within stress variables, we would get
back the original stress variable.
```{r}
dd[, c("Bstress", "Wstress") := meanDeviations(stress), by = ID]
head(dd[, .(Bstress, Wstress, stress, ID)])
```
We are going to start by examining the data at the within person
level. The reason for this is that if there is a particular day that
is very extreme for a particular person, that extreme value will not
only be extreme at the within person level; because the
average stress for a person is calculated by averaging all of their
days of data, it will also impact their mean at the between
level. So we want to clean the within person level of data first.
Examining the data at the within level helps to identify whether, for
a particular person, any observation is extreme/an outlier compared to
what is normal for that person. We do this by examining the within
stress variable, which is how different each day's stress data are for
each person from their own mean. For this, we use the original daily
dataset because we need the repeated measures included.
```{r}
plot(testDistribution(dd$Wstress,
extremevalues = "theoretical", ev.perc = .005),
varlab = "Within stress (Wstress)")
```
```{marginfigure}
Note, we do not always use the top and bottom 0.5% of a theoretical
distribution. This is used here as it helps to identify some extreme
values and it is relatively conservative. However, for excluding
cases, it is also common to use a more extreme threshold, like the top
and bottom 0.1%, which would be obtained as ev.perc = .001.
```
In the figure, based on the specified definition of extreme values,
the top and bottom 0.5% of a theoretical normal distribution, we can
see there is one extremely low stress value and two extremely high
stress values.
We can either figure out approximately where those scores are and
manually subset the dataset to find out which IDs and which days, or
use the `testDistribution()` function again. Doing it manually