This repository has been archived by the owner on Sep 20, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathbootstrap.qmd
276 lines (206 loc) · 10.8 KB
/
bootstrap.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
---
title: "Parametric bootstrap for mixed-effects models"
jupyter: julia-1.8
---
The speed of MixedModels.jl relative to its predecessors makes the parametric bootstrap much more computationally tractable.
This is valuable because the parametric bootstrap can be used to produce more accurate confidence intervals than methods based on standard errors or profiling of the likelihood surface.
This page is adapted from the [MixedModels.jl docs](https://juliastats.org/MixedModels.jl/v4.7.1/bootstrap/)
## The parametric bootstrap
[Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) is a family of procedures
for generating sample values of a statistic, allowing for visualization of the distribution of the
statistic or for inference from this sample of values.
Bootstrapping also belongs to a larger family of procedures called [*resampling*](https://en.wikipedia.org/wiki/Resampling_(statistics)), which are based on creating new samples of data from an existing one, then computing statistics on the new samples, in order to examine the distribution of the relevant statistics.
A _parametric bootstrap_ is used with a parametric model, `m`, that has been fit to data.
The procedure is to simulate `n` response vectors from `m` using the estimated parameter values
and refit `m` to these responses in turn, accumulating the statistics of interest at each iteration.
The parameters of a `LinearMixedModel` object are the fixed-effects
parameters, `β`, the standard deviation, `σ`, of the per-observation noise, and the covariance
parameter, `θ`, that defines the variance-covariance matrices of the random effects.
A technical description of the covariance parameter can be found in the [MixedModels.jl docs](https://juliastats.org/MixedModels.jl/v4.7.1/optimization/).
Lisa Schwetlick and Daniel Backhaus have provided a more beginner-friendly description of the covariance parameter in the [documentation for MixedModelsSim.jl](https://repsychling.github.io/MixedModelsSim.jl/v0.2.6/simulation_tutorial/).
For today's purposes -- looking at the uncertainty in the estimates from a fitted model -- we can simply use values from the fitted model, but we will revisit the parametric bootstrap as a convenient way to simulate new data, potentially with different parameter values, for power analysis.
For example, a simple linear mixed-effects model for the `Dyestuff` data in the [`lme4`](http://github.com/lme4/lme4)
package for [`R`](https://www.r-project.org) is fit by
```{julia}
using AlgebraOfGraphics
using CairoMakie
using DataFrames
using MixedModels
using MixedModelsMakie
using ProgressMeter
using Random
using AlgebraOfGraphics: AlgebraOfGraphics as AoG
CairoMakie.activate!(; type="svg") # use SVG (other options include PNG)
ProgressMeter.ijulia_behavior(:clear);
```
Note that the precise stream of random numbers generated for a given seed can change between Julia versions.
For exact reproducibility, you either need to have the exact same Julia version or use the [StableRNGs](https://github.com/JuliaRandom/StableRNGs.jl) package.
## A model of moderate complexity
The `kb07` data [@Kronmueller2007] are one of the datasets provided by the `MixedModels` package.
```{julia}
kb07 = MixedModels.dataset(:kb07)
```
Convert the table to a DataFrame for summary.
```{julia}
kb07 = DataFrame(kb07)
describe(kb07)
```
The experimental factors; `spkr`, `prec`, and `load`, are two-level factors.
```{julia}
contrasts = Dict(:spkr => EffectsCoding(),
:prec => EffectsCoding(),
:load => EffectsCoding(),
:subj => Grouping(),
:item => Grouping())
```
The `EffectsCoding` contrast is used with these to create a ±1 encoding.
Furthermore, `Grouping` constrasts are assigned to the `subj` and `item` factors.
This is not a contrast per-se but an indication that these factors will be used as grouping factors for random effects and, therefore, there is no need to create a contrast matrix.
For large numbers of levels in a grouping factor, an attempt to create a contrast matrix may cause memory overflow.
It is not important in these cases but a good practice in any case.
We can look at an initial fit of moderate complexity:
```{julia}
form = @formula(rt_trunc ~ 1 + spkr * prec * load +
(1 + spkr + prec + load | subj) +
(1 + spkr + prec + load | item))
m0 = fit(MixedModel, form, kb07; contrasts)
```
The default display in Quarto uses the [pretty MIME show method](https://juliastats.org/MixedModels.jl/v4.7.1/mime/) for the model and omits the estimated correlations of the random effects.
The `VarCorr` extractor displays these.
```{julia}
VarCorr(m0)
```
None of the two-factor or three-factor interaction terms in the fixed-effects are significant.
In the random-effects terms only the scalar random effects and the `prec` random effect for `item` appear to be warranted, leading to the reduced formula
```{julia}
# formula f4 from https://doi.org/10.33016/nextjournal.100002
form = @formula(rt_trunc ~ 1 + spkr * prec * load + (1 | subj) + (1 + prec | item))
m1 = fit(MixedModel, form, kb07; contrasts)
```
```{julia}
VarCorr(m1)
```
These two models are nested and can be compared with a likelihood-ratio test.
```{julia}
MixedModels.likelihoodratiotest(m0, m1)
```
The p-value of approximately 14% leads us to prefer the simpler model, `m1`, to the more complex, `m0`.
## Bootstrap basics
To bootstrap the model parameters, first initialize a random number generator then create a bootstrap sample
```{julia}
const RNG = MersenneTwister(42)
samp = parametricbootstrap(RNG, 1_000, m1)
df = DataFrame(samp.allpars)
first(df, 10)
```
Especially for those with a background in [`R`](https://www.R-project.org/) or [`pandas`](https://pandas.pydata.org),
the simplest way of accessing the parameter estimates in the parametric bootstrap object is to create a `DataFrame` from the `allpars` property as shown above.
We can use `subset` to subset out relevant rows of a dataframe.
A density plot of the estimates of `σ`, the residual standard deviation, can be created as
```{julia}
σres = subset(df, :type => ByRow(==("σ")), :group => ByRow(==("residual")); skipmissing=true)
plt = data(σres) * mapping(:value) * AoG.density()
draw(plt; axis=(;title="Parametric bootstrap estimates of σ"))
```
A density plot of the estimates of the standard deviation of the random effects is obtained as
```{julia}
σsubjitem = subset(df, :type => ByRow(==("σ")), :group => ByRow(!=("residual")); skipmissing=true)
plt = data(σsubjitem) * mapping(:value; layout=:names, color=:group) * AoG.density()
draw(plt; figure=(;supertitle="Parametric bootstrap estimates of variance components"))
```
The bootstrap sample can be used to generate intervals that cover a certain percentage of the bootstrapped values.
We refer to these as "coverage intervals", similar to a confidence interval.
The shortest such intervals, obtained with the `shortestcovint` extractor, correspond to a highest posterior density interval in Bayesian inference.
We generate these for all random and fixed effects:
```{julia}
shortestcovint(samp)
```
and convert it to a dataframe:
```{julia}
DataFrame(shortestcovint(samp))
```
```{julia}
draw(
data(samp.β) * mapping(:β; color=:coefname) * AoG.density();
figure=(; resolution=(800, 450)),
)
```
For the fixed effects, MixedModelsMakie provides a convenience interface to plot the combined coverage intervals and density plots
```{julia}
ridgeplot(samp)
```
Often the intercept will be on a different scale and potentially less interesting, so we can stop it from being included in the plot:
```{julia}
ridgeplot(samp; show_intercept=false, xlabel="Bootstrap density and 95%CI")
```
## Singularity
Let's consider the classic dysetuff dataset:
```{julia}
dyestuff = MixedModels.dataset(:dyestuff)
mdye = fit(MixedModel, @formula(yield ~ 1 + (1 | batch)), dyestuff)
```
```{julia}
sampdye = parametricbootstrap(MersenneTwister(1234321), 10_000, mdye)
dfdye = DataFrame(sampdye.allpars)
first(dfdye, 10)
```
```{julia}
σbatch = subset(dfdye, :type => ByRow(==("σ")), :group => ByRow(==("batch")); skipmissing=true)
plt = data(σbatch) * mapping(:value) * AoG.density()
draw(plt; axis=(;title="Parametric bootstrap estimates of σ_batch"))
```
Notice that this density plot has a spike, or mode, at zero.
Although this mode appears to be diffuse, this is an artifact of the way that density plots are created.
In fact, it is a pulse, as can be seen from a histogram.
```{julia}
plt = data(σbatch) * mapping(:value) * AoG.histogram(;bins=100)
draw(plt; axis=(;title="Parametric bootstrap estimates of σ_batch"))
```
A value of zero for the standard deviation of the random effects is an example of a *singular* covariance.
It is easy to detect the singularity in the case of a scalar random-effects term.
However, it is not as straightforward to detect singularity in vector-valued random-effects terms.
For example, if we bootstrap a model fit to the `sleepstudy` data
```{julia}
sleepstudy = MixedModels.dataset(:sleepstudy)
msleep = fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)),
sleepstudy)
```
```{julia}
sampsleep = parametricbootstrap(MersenneTwister(666), 10_000, msleep);
dfsleep = DataFrame(sampsleep.allpars);
first(dfsleep, 10)
```
the singularity can be exhibited as a standard deviation of zero or as a correlation of ±1.
```{julia}
DataFrame(shortestcovint(sampsleep))
```
A histogram of the estimated correlations from the bootstrap sample has a spike at `+1`.
```{julia}
ρs = subset(dfsleep, :type => ByRow(==("ρ")), :group => ByRow(==("subj")); skipmissing=true)
plt = data(ρs) * mapping(:value) * AoG.histogram(;bins=100)
draw(plt; axis=(;title="Parametric bootstrap samples of correlation of random effects"))
```
or, as a count,
```{julia}
count(ρs.value .≈ 1)
```
Close examination of the histogram shows a few values of `-1`.
```{julia}
count(ρs.value .≈ -1)
```
Furthermore there are even a few cases where the estimate of the standard deviation of the random effect for the intercept is zero.
```{julia}
σs = subset(dfsleep, :type => ByRow(==("σ")), :group => ByRow(==("subj")), :names => ByRow(==("(Intercept)")); skipmissing=true)
count(σs.value .≈ 0)
```
There is a general condition to check for singularity of an estimated covariance matrix or matrices in a bootstrap sample.
The parameter optimized in the estimation is `θ`, the relative covariance parameter.
Some of the elements of this parameter vector must be non-negative and, when one of these components is approximately zero, one of the covariance matrices will be singular.
The `issingular` method for a `MixedModel` object that tests if a parameter vector `θ` corresponds to a boundary or singular fit.
This operation is encapsulated in a method for the `issingular` function that works on `MixedModelBootstrap` objects.
```{julia}
count(issingular(sampsleep))
```
# References
::: {#refs}
:::