power simulation lecture (#43)

* power simulation lecture
RePsychLing · Jun 27, 2024 · 1a731cd · 1a731cd
1 parent 6274493
commit 1a731cd
Show file tree

Hide file tree

Showing 3 changed files with 300 additions and 0 deletions.
diff --git a/Project.toml b/Project.toml
@@ -26,6 +26,7 @@ Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
 MixedModels = "ff71e718-51f3-5ec2-a782-8ffcbfa3c316"
 MixedModelsExtras = "781a26e1-49f4-409a-8f4c-c3159d78c17e"
 MixedModelsMakie = "b12ae82c-6730-437f-aff9-d2c38332a376"
+MixedModelsSim = "d5ae56c5-23ca-4a1f-b505-9fc4796fc1fe"
 ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
 RCall = "6f49c342-dc21-5d91-9882-a32aef131414"
 RegressionFormulae = "545c379f-4ec2-4339-9aea-38f2fb6a8ba2"
@@ -52,6 +53,7 @@ MacroTools = "0.5"
 MixedModels = "4.22"
 MixedModelsExtras = "1, 2"
 MixedModelsMakie = "0.3.26"
+MixedModelsSim = "0.2.7"
 RCall = "0.13"
 RegressionFormulae = "0.1.3"
 StandardizedPredictors = "1"

diff --git a/power_simulation.qmd b/power_simulation.qmd
@@ -0,0 +1,290 @@
+---
+title: "Using simulation to estimate uncertainty and power"
+subtitle: "Or how I learned how to stop worrying and love the bootstrap"
+jupyter: julia-1.9
+format:
+  html:
+    embed-resources: true
+    css: styles.css
+---
+
+```{julia}
+#| output: false
+using AlgebraOfGraphics
+using CairoMakie
+using DataFrames
+using Distributions
+using MixedModels
+using MixedModelsMakie
+using MixedModelsSim
+using ProgressMeter
+using StatsBase
+using Random
+
+using AlgebraOfGraphics: AlgebraOfGraphics as AOG
+using MixedModels: dataset
+ProgressMeter.ijulia_behavior(:clear)
+```
+
+# The Parametric Bootstrap
+
+Let us consider the `kb07` dataset.
+
+```{julia}
+kb07 = dataset(:kb07)
+contrasts = Dict(:spkr => EffectsCoding(),
+                 :prec => EffectsCoding(),
+                 :load => EffectsCoding())
+fm1 = fit(MixedModel,
+          @formula(rt_trunc ~ 1 * spkr * prec * load +
+                             (1 | subj) +
+                             (1 | item)),
+          kb07; contrasts)
+```
+
+We can perform a *parametric bootstrap* on the model to get estimates of our uncertainty.
+In the parametric bootstrap, we use the *parameters* we estimated to simulate new data.
+If we repeat this process many times, we are able to "pick ourselves up by our bootstraps"
+and examine the variability we would expect to see based purely on chance if the ground truth
+exactly matched our estimates.
+In this way, we are able to estimate our uncertainty -- we cannot be more certain than the 'natural'
+variability we would have for a given parameter value.
+
+```{julia}
+pb1 = parametricbootstrap(MersenneTwister(42), 1000, fm1;
+                          optsum_overrides=(;ftol_rel=1e-8))
+```
+
+
+:::{.callout-tip collapse=true, title="`optsum_overrides`"}
+The option `optsum_overrides` allows us to pass additional arguments for controlling the model fitting of each simulation replicate.
+`ftol_rel=1e-8` lowers the threshold for changes in the objective -- more directly, the optimizer considers the model converged if the change in the deviance is less than $10^{-8}$, which is a very small change, but larger than the default $10^{-12}$. Because the majority of the optimization time is spent in final fine-tuning, changing this threshold can greatly speed up the fitting time at the cost of a small loss of quality in fit. For a stochastic process like the bootstrap, that change in quality just adds to the general noise, but that's acceptable tradeoff in order to get many more replicates.
+:::
+
+Now, if we look at the docstring for `parametricbootstrap`, we see that there are keyword-arguments for the various model parameters:
+
+:::{.border id="docstring"}
+```{julia}
+#| code-fold: true
+@doc parametricbootstrap
+```
+:::
+
+These keyword arguments are forward on to the `simulate!` function, which simulates a new dataset based on model matrices and parameter values.
+The model matrices are simply taken from the model at hand.
+By default, the parameter values are the estimated parameter values from a fitted model.
+
+:::{.border id="docstring"}
+```{julia}
+#| code-fold: true
+@doc simulate!
+```
+:::
+
+So now we have a way to simulate new data with new parameter values once we have a model.
+We just need a way to create a model with our preferred design.
+We'll use the MixedModelsSim package for that.
+
+## Simulating data from scratch.
+
+The MixedModelsSim package provides a function `simdat_crossed` for simulating effects from a crossed design:
+
+:::{.border id="docstring"}
+```{julia}
+#| code-fold: true
+@doc simdat_crossed
+```
+:::
+
+Let's see what that looks like in practice.
+We'll look at a simple 2 x 2 design with 20 subjects and 20 items.
+Our first factor `age` will vary between subjects and have the levels `old` and `young`.
+Our second factor `frequency` will vary between items and have the levels `low` and `high`.
+Finally, we also need to specify a random number generator to use for seeding the data simulation.
+
+```{julia}
+subj_n = 20
+item_n = 20
+subj_btwn = Dict(:age => ["old", "young"])
+item_btwn = Dict(:frequency => ["low", "high"])
+const RNG = MersenneTwister(42)
+dat = simdat_crossed(RNG, subj_n, item_n;
+                     subj_btwn, item_btwn)
+Table(dat)
+```
+
+We have 400 rows -- 20 subjects x 20 items.
+Similarly, the experimental factors are expanded out to be fully crossed.
+Finally, we have a dependent variable `dv` initialized to be draws from the standard normal distribution $N(0,1)$.
+
+:::{.callout-note title="Latin squares, partial crossing, and continuous covariates"}
+`simdat_crossed` is designed to simulate a fully crossed factorial design.
+If you have a partially crossed or Latin squaresdesign, then you could delete the "extra" cells to reduce the fully crossed data here to be partially crossed.
+For continuous covariates, we need to separately construct the covariates and then use a tabular join to create the design.
+We'll examine an example of this later.
+:::
+
+```{julia}
+simmod = fit(MixedModel,
+             @formula(dv ~ 1 + age * frequency +
+                          (1 + frequency | subj) +
+                          (1 + age | item)), dat)
+println(simmod)
+```
+
+make sure to discuss contrasts here
+
+```{julia}
+β = [250.0, -25.0, 10, 0.0]
+simulate!(RNG, simmod; β)
+```
+
+
+```{julia}
+fit!(simmod)
+```
+
+```{julia}
+σ = 25.0
+fit!(simulate!(RNG, simmod; β, σ))
+```
+
+```{julia}
+# relative to σ!
+subj_re = create_re(2.0, 1.3)
+```
+
+```{julia}
+item_re = create_re(1.3, 2.0)
+```
+
+```{julia}
+θ = createθ(simmod; subj=subj_re, item=item_re)
+```
+
+```{julia}
+fit!(simulate!(RNG, simmod; β, σ, θ))
+```
+
+```{julia}
+samp = parametricbootstrap(RNG, 1000, simmod; β, σ, θ)
+```
+
+```{julia}
+ridgeplot(samp)
+```
+
+```{julia}
+let f = Figure()
+    ax = Axis(f[1, 1])
+    coefplot!(ax, samp;
+              conf_level=0.8,
+              vline_at_zero=true,
+              show_intercept=true)
+    ridgeplot!(ax, samp;
+               conf_level=0.8,
+               vline_at_zero=true,
+               show_intercept=true,
+               xlabel="Normalized density and 80% range")
+    scatter!(ax, β, length(β):-1:1;
+             marker=:x,
+             markersize=20,
+             color=:red)
+    f
+end
+```
+
+
+```{julia}
+coefpvalues = DataFrame()
+@showprogress for subj_n in [20, 40, 60, 80, 100, 120, 140],  item_n in [40, 60, 80, 100, 120, 140]
+    dat = simdat_crossed(RNG, subj_n, item_n;
+                         subj_btwn, item_btwn)
+    simmod = MixedModel(@formula(dv ~ 1 + age * frequency +
+                                     (1 + frequency | subj) +
+                                     (1 + age | item)),
+                        dat)
+
+    θ = createθ(simmod; subj=subj_re, item=item_re)
+    simboot = parametricbootstrap(RNG, 100, simmod;
+                                  β, σ, θ,
+                                  optsum_overrides=(;ftol_rel=1e-8),
+                                  progress=false)
+    df = DataFrame(simboot.coefpvalues)
+    df[!, :subj_n] .= subj_n
+    df[!, :item_n] .= item_n
+    append!(coefpvalues, df)
+end
+```
+
+```{julia}
+power = combine(groupby(coefpvalues, [:coefname, :subj_n, :item_n]),
+                :p => (p -> mean(p .< 0.05)) => :power)
+```
+
+```{julia}
+power = combine(groupby(coefpvalues,
+                        [:coefname, :subj_n, :item_n]),
+                :p => (p -> mean(p .< 0.05)) => :power,
+                :p => (p -> sem(p .< 0.05)) => :power_se)
+
+```
+
+```{julia}
+select!(power, :coefname, :subj_n, :item_n, :power,
+        [:power, :power_se] => ByRow((p, se) -> [p - 1.96*se, p + 1.96*se]) => [:lower, :upper])
+```
+
+
+```{julia}
+data(power) * mapping(:subj_n, :item_n, :power; layout=:coefname) * visual(Heatmap) |> draw
+```
+
+
+```{julia}
+dat = simdat_crossed(RNG, subj_n, item_n;
+                     subj_btwn, item_btwn)
+dat = DataFrame(dat)
+```
+
+```{julia}
+item_covariates = unique!(select(dat, :item))
+```
+
+
+```{julia}
+item_covariates[!, :chaos] = rand(RNG,
+                                  Normal(5, 2),
+                                  nrow(item_covariates))
+```
+
+```{julia}
+leftjoin!(dat, item_covariates; on=:item)
+```
+
+```{julia}
+simmod = fit(MixedModel,
+             @formula(dv ~ 1 + age * frequency * chaos +
+                          (1 + frequency | subj) +
+                          (1 + age | item)), dat; contrasts)
+```
+
+TODO: continuous covariate
+TODO: bernoulli response
+TODO: savereplicates
+
+```{julia}
+dat[!, :dv] = rand(RNG, Bernoulli(), nrow(dat))
+```
+
+```{julia}
+dat[!, :dv] = rand(RNG, Poisson(), nrow(dat))
+```
+
+```{julia}
+dat[!, :n] = rand(RNG, Poisson(), nrow(dat)) .+ 3
+```
+
+```{julia}
+dat[!, :dv] = rand.(RNG, Binomial.(dat[!, :n])) ./ dat[!, :n]
+```
diff --git a/styles.css b/styles.css
@@ -1 +1,9 @@
 /* css styles */
+
+#docstring{
+    border: solid black;
+    border-width: thick;
+    background-color: #f8f5f0;
+    margin-left: 5%;
+    margin-right: 5%;
+}