---
title: "Linear Mixed Models (LMMs) - Part 1"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html:
    toc: true
    number_sections: true
---
Download the raw `R` markdown code here
[https://jwiley.github.io/MonashHonoursStatistics/LMM1.rmd](https://jwiley.github.io/MonashHonoursStatistics/LMM1.rmd).
These are the `R` packages we will use.
```{r setup}
options(digits = 4)
## new packages are lme4, lmerTest, and multilevelTools
library(data.table)
library(JWileymisc)
library(lme4)
library(lmerTest)
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
library(haven)
## load data collection exercise data
## Merged.sav is a merged long dataset of the baseline and daily data
dm <- as.data.table(read_sav("Merged.sav"))
```
# Descriptive Statistics on Multilevel Data
With multilevel data, basic descriptive statistics can be calculated
in different ways. We will use the merged dataset from the daily
collection exercise. This dataset merges the baseline and daily
datasets in a single, long format file.
To start with, suppose we had such a merged file (generally, a
convenient way to store and analyze multilevel data) and we wanted to
calculate descriptive statistics on some between person variables,
such as age and sex. We might start off as we have before using
`egltable()` for some summary descriptive statistics.
```{r betweendesc1}
egltable(c("age", "female"), data = dm)
```
Note that because `female` is coded 0/1 and is not a factor, by default we are
given a mean and standard deviation. This is not so helpful. We could
convert it to a factor or set `strict = FALSE`, in which case `egltable()`
does not **strict**ly follow the class of each variable and instead, if
a variable only has a few unique values, treats it as
categorical/discrete, regardless of its official class in `R`.
```{r betweendesc2}
egltable(c("age", "female"), data = dm, strict=FALSE)
```
Now we can see the mean and standard deviation for age and the
frequencies and percentages of women and men in the study. However,
this reveals a problem: there were not 32 men and 151 women in
the study. What is actually shown is the average age, weighted by
the number of days of data available for each participant, and the
total number of daily surveys completed by men and women. That happens
because the between person variables are repeated in long format.
Let's look at the data just for the first two participants in the
following table. We can see that on all days, age and female scores
are identical for IDs 1 and 2.
```{r, results = 'asis'}
knitr::kable(dm[ID %in% c(1, 2), .(ID, day, age, female)])
```
This is how between person variables are typically stored when
combined with repeated measures data in a long format dataset. However,
it means descriptive statistics on them are probably not what we really
wanted. The means are essentially weighted means: participants who
complete more days of data get a higher weighting because their ages
are repeated more times. For categorical/discrete variables, what we
get is the number of assessments/observations for each level of the
categorical variable, in this case the number of observations belonging to
men and women. When using data tables, the solution is
easy: drop duplicated rows so that only one row of data per ID is kept, and
then remake the table. The following code shows how to do this.
The difference is in using `data = dm[!duplicated(ID)]`
instead of `data = dm`.
```{r betweendesc3}
egltable(c("age", "female"), data = dm[!duplicated(ID)], strict=FALSE)
```
Yielding this nice descriptive statistics table:
```{r, results = 'asis'}
knitr::kable(egltable(c("age", "female"), data = dm[!duplicated(ID)], strict=FALSE))
```
When working with multilevel data and a mix of between person variables
(especially sociodemographics that are often asked only once) and repeated
measures variables, watch out for which variables are which type, and select
only one row per ID when calculating descriptives for between
person variables. The same applies when making exploratory plots or
figures.
```{r, fig.cap = "This is wrong, has many duplicates for each age."}
plot(testDistribution(dm$age), varlab = "this is wrong")
```
```{r, fig.cap = "This is right, only has one age per person."}
plot(testDistribution(dm[!duplicated(ID)]$age), varlab = "this is right")
```
For repeated measures variables or any variable that can vary within a
unit, we have several options for how to calculate descriptive
statistics. The choices depend on whether the variable is continuous
or categorical/discrete.
## Continuous Variables
With multilevel data in long format, if we calculate the mean and variance
of a variable, that would average across units and observations within
units for the mean. The variance will incorporate both differences between
units and variance within units (how much the data points vary around
each unit's own mean). That is, it is essentially the total variance.
Conversely, if we first average observations within a unit (e.g.,
person), then the mean will be the average of the individual averages,
and the variance will only reflect the variability between individual
units' means. That is, we could first create a between person variable
by averaging scores for each unit and then calculate descriptives as
usual for that variable.
```{r}
dm[, c("Bstress", "Wstress") := meanDeviations(stress), by = ID]
egltable(c("Bstress"), data = dm[!duplicated(ID)])
```
We can interpret the descriptive statistics for `Bstress` as,
"The mean and standard deviation of the average level of stress across
days was 3.77 (SD = 1.08)."
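To see what `meanDeviations()` did, we can inspect the new columns for a
single participant, using the same data table pattern as earlier:
`Bstress` is that person's mean stress repeated on every row and
`Wstress` is each day's deviation from that mean.
```{r}
## inspect the between/within decomposition for one participant
dm[ID == 1, .(ID, day, stress, Bstress, Wstress)]
```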
For continuous variables, we also could calculate descriptives
on all data points directly rather than on the average by person.
Using all data points *can* unequally weight different participants
(e.g., imagine one participant contributed 100 data points and 100
other participants each contributed 1 data point; the average will be
weighted 50% toward the 100 participants with 1 data point each and
50% toward the 1 participant with 100 data points). The issue of
weighting tends to matter less to the extent that clusters are all about
the same size, and makes no difference if all clusters are identical (e.g.,
everyone has exactly 10 observations). If there are no systematic
differences such that, for example, participants with the highest or
lowest levels of stress tend to have more or fewer data points, then the
mean is likely to be quite similar regardless of the approach.
The standard deviation from all data points is more like the total
variance, combining variance from between and within
participants. Thus, we expect this standard deviation to be at least
the same as, and most likely larger than, the standard deviation of the
individual means, since those individual means "smooth" over, or remove,
variation within person.
```{r}
egltable(c("stress"), data = dm)
```
In this example we can see that while the mean changes by .05, the
standard deviation changes by quite a bit more, reflecting the added
within person variance.
A final approach that can be taken for continuous variables is to only
pick a single observation from each unit to use in calculations. This
tends to work best with longitudinal data where, for example, you
could pick either the first day or the last day and calculate
descriptives just for that one day as an example. To do this, we use
the overall variable, but we subset the dataset to only include one
day.
```{r}
egltable(c("stress"), data = dm[day == 1])
```
We can interpret this like any other "usual" descriptive statistic: it
is the average level of stress and standard deviation across
participants, on the first day. If there are few time points (e.g., in
a longitudinal study with just baseline, post, and follow-up), it
might make sense to simply report the means and standard deviations of
continuous variables at each time point. With a long dataset, this can
easily be done by specifying the time point as the **g**rouping
variable. **Note:** `egltable()` does not calculate correct tests,
effect sizes, or p-values when the grouping variable is a repeated
measure rather than independent groups, so if you do it this way, ignore
the tests and p-values, which assume independence.
```{r}
egltable(c("stress"), g = "day", data = dm)
```
If the data are already in wide format, one would simply calculate
descriptive statistics on each of the separate variables representing
each time point.
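For example, a minimal sketch (not evaluated here), assuming a
hypothetical wide dataset `dw` with one stress column per time point;
these variable names are made up for illustration:
```{r, eval = FALSE}
## hypothetical wide data: one descriptives column per time point
egltable(c("stress_t1", "stress_t2", "stress_t3"), data = dw)
```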
### Summary
With multilevel, continuous variables, three approaches we
discussed for calculating descriptive statistics are:
1. average by person and report descriptives on the individual
averages;
2. report descriptives on the overall variable which captures the
total variance but possibly unequally weights participants;
3. report descriptives on individual timepoints/assessments.
## Categorical Variables
Compared to continuous variables, there are fewer options for
presenting descriptive statistics with multilevel categorical data.
For example, suppose people reported each day whether they: walked,
rode a bike, or went to the gym for exercise.
It does not make sense to average these data per person, although it
could be possible to average, for example, the proportion of days each
participant engaged in any one of those activities.
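For instance, a minimal sketch of that proportion approach (not
evaluated here), assuming a hypothetical 0/1 daily variable `walked`
indicating whether a person walked that day; `propWalk` is our own name:
```{r, eval = FALSE}
## per-person proportion of days walked, then descriptives
## on those person-level proportions
dm[, propWalk := mean(walked, na.rm = TRUE), by = ID]
egltable("propWalk", data = dm[!duplicated(ID)])
```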
However, categorical/discrete data are typically presented as
frequencies and percentages, and averages of frequencies, especially with
skewed data, are not all that easy to interpret. For example, suppose
that people either ride a bike or walk each day; using averages would
appear to show that, on average, people walk half the days and ride a
bike half the days.
The general choices for multilevel categorical data descriptives,
then, are either to report descriptives for a variable overall or to
report descriptives for a specific time point (e.g., day 1, 2, etc.)
with longitudinal data.
```{r}
## overall
egltable("int_fr", data = dm, strict = FALSE)
## one day
egltable("int_fr", data = dm[day==1], strict = FALSE)
```
These results show the total frequency and percentage of days that are
0 / 1 overall (first table) and the frequency and percentage of people
that reported 0 / 1 on day 1 (second table).
## Putting it All Together
Calculating and reporting descriptive statistics from multilevel data
stored in long format can require putting together everything we have
talked about. In this example, we want a descriptive statistics table for
the following variables:
- female (measured baseline only; categorical)
- age (measured baseline only; continuous)
- selfesteem (measured baseline only; continuous)
- stress (measured daily; continuous, wanted average levels)
- int_fr (measured daily; categorical)
First, for any continuous variables where we want to report average
levels only, here `stress`, we create a between person version.
Then we subset the dataset to one row per person, calculate
descriptives, and store these in `desc1`. Next we calculate any
descriptives we want on the daily data, here for `int_fr`, and store
these in `desc2`. Then we give both tables the same column names using
`setnames()`, and finally we combine the two tables by binding them
row-wise with `rbind()` and make a nice table with `kable()` from the
`knitr` package.
```{r, results = 'asis'}
dm[, c("Bstress", "Wstress") := meanDeviations(stress), by = ID]
desc1 <- egltable(c("female", "age", "selfesteem", "Bstress"),
data = dm[!duplicated(ID)], strict = FALSE)
desc2 <- egltable(c("int_fr"), data = dm, strict = FALSE)
setnames(desc1, c("", "M (SD)/N (%)"))
setnames(desc2, c("", "M (SD)/N (%)"))
knitr::kable(rbind(desc1, desc2))
```
From these results we can see that 84% of the participants were women,
the average age was 22.8y, the average of participants' mean stress
across days was 3.77, and 61.7% of the completed days had `int_fr = 1`.
# Linear Mixed Models
## ML and REML
Two common estimators used with linear mixed models are
**M**aximum **L**ikelihood (ML) and
**Re**stricted **M**aximum **L**ikelihood (REML). You can think of
this somewhat the same as the formula for calculating a population's
variance versus estimating the population variance from a
sample^[Also see: https://stats.stackexchange.com/questions/16008/what-does-unbiasedness-mean/16009#16009]:
$$
\sigma^2_{pop} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^{2}}{n}
$$
$$
\sigma^2_{sample} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^{2}}{n - 1}
$$
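As a quick numeric illustration of the difference between the two
denominators (a small sketch; note that `R`'s built-in `var()` uses the
$n - 1$ formula):
```{r}
x <- c(2, 4, 6, 8)
sum((x - mean(x))^2) / length(x)       ## population formula: 5
sum((x - mean(x))^2) / (length(x) - 1) ## sample formula: about 6.67
var(x)                                 ## matches the sample formula
```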
Generally speaking, REML is less biased, so we prefer it as an
estimator and it is the usual default in LMMs. However, for some
comparisons between models to be valid and some statistical tests to
be valid, we need true maximum likelihood (ML) estimates, so you will
sometimes see the default REML option "turned off" to get pure ML
estimates, or a model refit using ML instead of REML. In `R`, this is
done by setting `REML = FALSE`.
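For example, a minimal sketch (not evaluated here; the models below are
introduced first with the REML default):
```{r, eval = FALSE}
## refit with full maximum likelihood instead of REML
m.ml <- lmer(stress ~ 1 + (1 | ID), data = dm, REML = FALSE)
```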
## Random Intercept Model
There are two main uses of intercept only models:
- To calculate the intraclass correlation coefficient (ICC)
- As a comparison to see how much better a more complex model
fits.
To calculate the ICC, we use this equation:
$$ICC = \frac{\sigma^{2}_{intercept}}{\sigma^{2}_{intercept} +
\sigma^{2}_{residual}}$$
Following is an example of an intercept only model, where there is
both a fixed effects intercept and a random intercept.
The outcome variable is `stress`. All predictors come after the
tilde, `~`. In this case, the only "predictors" are the fixed and
random intercept, represented by `1`. The random intercept is random
by `ID`. The function to fit linear mixed models is `lmer()` and
comes from the `lme4` package. It also requires a dataset be
specified, here `dm`. We can get a summary using `summary()`.
```{r}
## get rid of the haven_labelled class type for stress
dm[, stress := as.numeric(stress)]
ri.m <- lmer(stress ~ 1 + (1 | ID),
data = dm,
REML = TRUE)
summary(ri.m)
iccMixed("stress", id = "ID", data = dm)
```
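As a check, we can also compute the ICC by hand from the model's
variance components using the equation above; this small sketch relies
on `VarCorr()`, which the write up later in this document also uses:
```{r}
## intercept variance / (intercept variance + residual variance)
vc <- as.data.frame(VarCorr(ri.m))
vc[1, "vcov"] / sum(vc[, "vcov"])
```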
There are four main "blocks" of output from the summary.
1. A repetition of the model options, formula we used, and dataset
used. This is for records so you know exactly what the model was.
In *this* model, it shows us that we fit a LMM using restricted
maximum likelihood (REML) and that the degrees of freedom were
approximated using Satterthwaite's method. The outcome variable is
stress (`stress`) and there are only intercept predictors,
`1`. The REML criterion at convergence is kind of like the log
likelihood (LL), but unfortunately cannot be readily used to
compare across models as easily as the actual LL (e.g., in AIC or
BIC, which we'll talk about more later).
2. Scaled Pearson residuals. These are raw residuals divided by the
estimated standard deviation, so that they can be roughly
interpreted as z-scores. The minimum and maximum are useful for
identifying whether there are outliers present in the model
residuals (see the short demo after this list for how to extract
them yourself).
In *this* model, we can see that the lowest residual is
`r min(residuals(ri.m, type = "pearson", scaled=TRUE))`
and the maximum residual is
`r max(residuals(ri.m, type = "pearson", scaled=TRUE))`
which are not too large. Absolute residuals of 10, for example, would be
large enough that they are extremely unlikely by chance alone
and likely represent outliers.
3. Random effects. These show a summary of the random effects in the
model. Random effects are basically always also fixed effects, so
the random effects section only shows the standard deviation and variance
of the random effects, plus, if applicable, their correlations. The
means are shown in the fixed effects section. In the case of a
random intercept only model like this one, there are only two
random effects: (1) the random intercept and (2) the random
residual. We have both the standard deviation and variance of
both. We will use the variances to calculate ICCs.
In *this* model, the standard deviation of the random intercept
tells us that the average or typical difference between an
individual's average stress and the population average
stress is
`r as.data.frame(VarCorr(ri.m))[1, "sdcor"]`.
The standard deviation of the residuals
tells us that the average or typical difference between an
individual stress score and the predicted stress
score is
`r as.data.frame(VarCorr(ri.m))[2, "sdcor"]`.
The random effects section also tells us how many observations and
unique people/groups went into the analysis.
In *this* model we can see that we had `r as.integer(ngrps(ri.m))`
people providing `r nobs(ri.m)` unique observations.
4. Fixed effects. This section shows the fixed effects. It is a
table, where each row is for a different effect / predictor and
each column gives a different piece of information.
The "Estimate" is the actual parameter estimate (i.e., THE fixed
effect, the regression coefficient, etc.). The "Std. Error" is the
standard error of the estimate, which captures uncertainty in the
coefficient due to sampling variation. The "df" is the
Satterthwaite estimated degrees of freedom. As an estimate, it may
have decimals. The "t value" is the ratio of the coefficient to
its standard error, that is: $t = \frac{Estimate}{StdError}$.
The "Pr(>|t|)" is the p-value, the probability that by chance
alone one would obtain as or a larger absolute t-value. The
vertical bars indicate absolute values and the "Pr" stands for
probability value. Note that `R` uses
[scientific E notation](https://en.wikipedia.org/wiki/Scientific_notation).
The number following the "e" indicates how many places to the
right (if positive) or left (if negative) the decimal point should
be moved. For example, 0.001 could be written 1e-3. 0.00052 could
be written 5.2e-4. These often are used for p-values which may be
numbers very close to zero.
In *this* model, we can see that the fixed effect for the
intercept is `r fixef(ri.m)[["(Intercept)"]]`, which is like
the mean of the random intercept and tells us the average
level of stress, since in this instance there are no
other predictors in the model.
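As noted in point 2 of the list above, the scaled Pearson residuals can
be extracted directly; here is a short sketch using the same call that
generated the inline values:
```{r}
## extract scaled Pearson residuals and check their extremes
resid.z <- residuals(ri.m, type = "pearson", scaled = TRUE)
range(resid.z)
```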
Profile likelihood confidence intervals can be obtained using the
`confint()` function. These confidence intervals capture the
uncertainty in parameter estimates for both the fixed and random
effects due to sampling variation. They do not capture individual
differences directly. Note that you only get confidence intervals for
the random effects when using the profile method, not when
`method = "Wald"`, although the Wald method is much faster.
Profile likelihood confidence intervals are the default and are
generally more accurate than Wald based confidence intervals, but
can also be slower to calculate. Here both ways are shown.
```{r}
## Profile Confidence Intervals
ri.ci <- confint(ri.m, method = "profile", oldNames = FALSE)
ri.ci
## Wald Confidence Intervals
confint(ri.m, method = "Wald", oldNames = FALSE)
```
### Diagnostics
Typical diagnostics and checks include checking for outliers,
assessing whether the distributional assumptions are met, checking for
homogeneity of variance and checking whether there is a linear
association between predictors and outcome. With only an intercept,
there is no need for checking whether a linear association is
appropriate.
As for linear regressions, we can use `modelDiagnostics()` to
calculate diagnostics and `plot()` to get a plot of the diagnostics.
The top left plot of the residuals helps check for outliers on the
residuals.
These plots show one relatively extreme value on the residuals. In
this case, using the scaled Pearson residuals, which are roughly like
z-scores, the residual outliers are not large enough to
likely be an issue (3.21 is not that extreme for a z-score).
The bottom left plot shows the distribution of the random
intercept. No outliers are observed.
The density plots (and QQ deviates plot for residuals) indicate very
mild non-normality, but it is not too extreme and close enough for
inference.
Finally, on the top right, we check the homogeneity of variance.
There is no clear trend in the residuals indicating that the
homogeneity of variance assumption is reasonably met.
The residuals show a characteristic banding when dealing with a
"continuous" variable that has only a handful of possible values.
This is not a problem per se, although if too extreme it may indicate
that treating the data as continuous is not a great choice.
```{r}
plot(modelDiagnostics(ri.m), ncol = 2, nrow = 2, ask = FALSE)
```
With diagnostics reasonably met, we proceed with a write up.
### Sample Write Up
An intercept only linear mixed model was fit to
`r nobs(ri.m)` stress scores from
`r as.integer(ngrps(ri.m))` people. The intraclass correlation
coefficient was
`r as.data.frame(VarCorr(ri.m))[1, "vcov"] / sum(as.data.frame(VarCorr(ri.m))[, "vcov"])`
indicating that about 40% of the total variance in stress
was between people and the other 60% was within person, due to
fluctuations across days. The fixed effect intercept revealed that the
average [95% CI] stress was
`r fixef(ri.m)[["(Intercept)"]]`
`r sprintf("[%0.2f, %0.2f]", ri.ci[3, 1], ri.ci[3, 2])`.
However, there were individual differences, with the standard
deviation for the random intercept being
`r as.data.frame(VarCorr(ri.m))[1, "sdcor"]`
indicating that there are individual differences in the mean
stress. Assuming the random intercepts follow a normal distribution,
we expect most people (about 68%) to fall within one standard
deviation of the mean, which in these data would be somewhere between:
`r fixef(ri.m)[["(Intercept)"]] + c(-1, 1) * as.data.frame(VarCorr(ri.m))[1, "sdcor"]`.
## Random Intercept and Fixed Effects Models
Following is an example of a LMM with fixed effects and a random
intercept (no random slopes). Although we did not explicitly add a
fixed effects intercept by adding `1` to the equation, it is there by
default. We still have a random intercept.
```{r}
fp.m <- lmer(stress ~ energy + (1 | ID),
data = dm,
REML = TRUE)
summary(fp.m)
```
There are four main "blocks" of output from the summary.
1. A repetition of the model options, formula we used, and dataset
used. This is for records so you know exactly what the model was.
In *this* model, it shows us that we fit a LMM using restricted
maximum likelihood (REML) and that the degrees of freedom were
approximated using Satterthwaite's method. The outcome variable is
stress (`stress`) and energy is a predictor.
The REML criterion at convergence is kind of like the log
likelihood (LL), but unfortunately cannot be readily used to
compare across models as easily as the actual LL (e.g., in AIC or
BIC).
2. Scaled Pearson residuals. These are raw residuals divided by the
estimated standard deviation, so that they can be roughly
interpreted as z-scores. The minimum and maximum are useful for
identifying whether there are outliers present in the model
residuals.
In *this* model, we can see that the lowest residual is
`r min(residuals(fp.m, type = "pearson", scaled=TRUE))`
and the maximum residual is
`r max(residuals(fp.m, type = "pearson", scaled=TRUE))`
which, while a bit large,
are not so extreme when interpreted as z-scores as to be
concerning. Absolute residuals of 10, for example, would be large enough
that they are extremely unlikely by chance alone and likely
represent outliers. We can see there are some more extreme positive
than negative residuals. That means that predictions are sometimes
too (extremely) low rather than too (extremely) high.
3. Random effects. These show a summary of the random effects in the
model. Random effects are basically always also fixed effects, so
the random effects section only shows the standard deviation and variance
of the random effects, plus, if applicable, their correlations. The
means are shown in the fixed effects section. In the case of a
model where the only random effect is the intercept, the
random effects show: (1) the random intercept and (2) the random
residual. We have both the standard deviation and variance of
both.
In *this* model, the standard deviation of the random intercept
tells us that the average or typical difference between an
individual's estimated stress when energy is 0
and the population average estimated stress when energy is
0 is
`r as.data.frame(VarCorr(fp.m))[1, "sdcor"]`.
The standard deviation of the residuals
tells us that the average or typical difference between an
individual stress score and the predicted stress
score is
`r as.data.frame(VarCorr(fp.m))[2, "sdcor"]`.
The random effects section also tells us how many observations and
unique people/groups went into the analysis.
In *this* model we can see that we had `r as.integer(ngrps(fp.m))`
people providing `r nobs(fp.m)` unique observations.
4. Fixed effects. This section shows the fixed effects. It is a
table, where each row is for a different effect / predictor and
each column gives a different piece of information.
The "Estimate" is the actual parameter estimate (i.e., THE fixed
effect, the regression coefficient, etc.). The "Std. Error" is the
standard error of the estimate, which captures uncertainty in the
coefficient due to sampling variation. The "df" is the
Satterthwaite estimated degrees of freedom. As an estimate, it may
have decimals. The "t value" is the ratio of the coefficient to
its standard error, that is: $t = \frac{Estimate}{StdError}$.
The "Pr(>|t|)" is the p-value, the probability that by chance
alone one would obtain as or a larger absolute t-value. The
vertical bars indicate absolute values and the "Pr" stands for
probability value. Note that `R` uses
[scientific E notation](https://en.wikipedia.org/wiki/Scientific_notation).
The number following the "e" indicates how many places to the
right (if positive) or left (if negative) the decimal point should
be moved. For example, 0.001 could be written 1e-3. 0.00052 could
be written 5.2e-4. These often are used for p-values which may be
numbers very close to zero.
In *this* model, we can see that the fixed effect for the
intercept is `r fixef(fp.m)[["(Intercept)"]]` which is like
the mean of the random intercept and tells us the average
estimated stress score when energy = 0.
The fixed effect (regression coefficient) for energy is
`r fixef(fp.m)[["energy"]]` which tells us how much on average
(fixed effect) lower stress is expected to be when energy
is one unit higher.
Profile likelihood confidence intervals can be obtained using the
`confint()` function. These confidence intervals capture the
uncertainty in parameter estimates for both the fixed and random
effects due to sampling variation. They do not capture individual
differences directly. Note that you only get confidence intervals for
random effects when using the profile method, not when
`method = "Wald"` although the Wald method is much faster.
```{r}
fp.ci <- confint(fp.m, method = "profile", oldNames = FALSE)
fp.ci
```
### Diagnostics and Checks
Typical diagnostics and checks include checking for outliers,
assessing whether the distributional assumptions are met, checking for
homogeneity of variance and checking whether there is a linear
association between predictors and outcome.
Results look about the same as for the intercept only model, although
two relatively extreme residuals are now identified. Still,
interpreted as z-scores, they are not bad enough to worry much
about.
```{r}
plot(modelDiagnostics(fp.m),
ncol = 2, nrow = 2, ask = FALSE)
```
### Sample Write Up
To examine the association of stress and energy, a linear
mixed model was fit. The final model included `r nobs(fp.m)` stress
scores from `r as.integer(ngrps(fp.m))` people.
The fixed effect intercept revealed that the
average [95% CI] stress when energy is 0 was
`r fixef(fp.m)[["(Intercept)"]]`
`r sprintf("[%0.2f, %0.2f]", fp.ci[3, 1], fp.ci[3, 2])`.
However, there were individual differences, with the standard
deviation for the random intercept being
`r as.data.frame(VarCorr(fp.m))[1, "sdcor"]`
indicating that there are individual differences in the mean
stress. Assuming the random intercepts follow a normal distribution,
we expect most people (about 68%) to fall within one standard
deviation of the mean, which in these data would be somewhere between:
`r fixef(fp.m)[["(Intercept)"]] + c(-1, 1) * as.data.frame(VarCorr(fp.m))[1, "sdcor"]`.
Using Satterthwaite's approximation for degrees of freedom revealed
that energy was statistically significantly associated with stress
(p = .001). On average across people, a one unit higher energy score
was associated with `r fixef(fp.m)[["energy"]]` lower stress scores.
## Between and Within Effects
When we have a time-varying predictor variable, we can separate it
into a between and within portion and both of these new variables can
be included in a LMM as predictors. After creating a between and
within version of energy, `Benergy` and `Wenergy` we just enter both
as fixed effects predictors.
```{r}
dm[, c("Benergy", "Wenergy") := meanDeviations(energy), by = ID]
fp.m2 <- lmer(stress ~ Benergy + Wenergy + (1 | ID),
data = dm)
summary(fp.m2)
```
There are four main "blocks" of output from the summary.
1. A repetition of the model options, formula we used, and dataset
used. This is for records so you know exactly what the model was.
In *this* model, it shows us that we fit a LMM using restricted
maximum likelihood (REML) and that the degrees of freedom were
approximated using Satterthwaite's method. The outcome variable is
stress (`stress`), and `Benergy` and `Wenergy` are fixed effects
predictors.
The REML criterion at convergence is kind of like the log
likelihood (LL), but unfortunately cannot be readily used to
compare across models as easily as the actual LL (e.g., in AIC or
BIC).
2. Scaled Pearson residuals. These are raw residuals divided by the
estimated standard deviation, so that they can be roughly
interpreted as z-scores. The minimum and maximum are useful for
identifying whether there are outliers present in the model
residuals.
In *this* model, we can see that the lowest residual is
`r min(residuals(fp.m2, type = "pearson", scaled=TRUE))`
and the maximum residual is
`r max(residuals(fp.m2, type = "pearson", scaled=TRUE))`
which, while a bit large,
are not so extreme when interpreted as z-scores as to be
concerning. Absolute residuals of 10, for example, would be large enough
that they are extremely unlikely by chance alone and likely
represent outliers. We can see there are some more extreme positive
than negative residuals. That means that predictions are sometimes
too (extremely) low rather than too (extremely) high.
3. Random effects. These show a summary of the random effects in the
model. Random effects are basically always also fixed effects, so
the random effects section only shows the standard deviation and variance
of the random effects, plus, if applicable, their correlations. The
means are shown in the fixed effects section. In the case of a
model where the only random effect is the intercept, the
random effects show: (1) the random intercept and (2) the random
residual. We have both the standard deviation and variance of
both.
In *this* model, the standard deviation of the random intercept
tells us that the average or typical difference between an
individual's estimated stress when `Benergy` and `Wenergy` are 0
and the population average estimated stress when the predictors are
0 is
`r as.data.frame(VarCorr(fp.m2))[1, "sdcor"]`.
The standard deviation of the residuals
tells us that the average or typical difference between an
individual stress score and the predicted stress
score is
`r as.data.frame(VarCorr(fp.m2))[2, "sdcor"]`.
The random effects section also tells us how many observations and
unique people/groups went into the analysis.
In *this* model we can see that we had `r as.integer(ngrps(fp.m2))`
people providing `r nobs(fp.m2)` unique observations.
4. Fixed effects. This section shows the fixed effects. It is a
table, where each row is for a different effect / predictor and
each column gives a different piece of information.
The "Estimate" is the actual parameter estimate (i.e., THE fixed
effect, the regression coefficient, etc.). The "Std. Error" is the
standard error of the estimate, which captures uncertainty in the
coefficient due to sampling variation. The "df" is the
Satterthwaite estimated degrees of freedom. As an estimate, it may
have decimals. The "t value" is the ratio of the coefficient to
its standard error, that is: $t = \frac{Estimate}{StdError}$.
The "Pr(>|t|)" is the p-value, the probability that by chance
alone one would obtain as or a larger absolute t-value. The
vertical bars indicate absolute values and the "Pr" stands for
probability value. Note that `R` uses
[scientific E notation](https://en.wikipedia.org/wiki/Scientific_notation).
The number following the "e" indicates how many places to the
right (if positive) or left (if negative) the decimal point should
be moved. For example, 0.001 could be written 1e-3. 0.00052 could
be written 5.2e-4. These often are used for p-values which may be
numbers very close to zero.
In *this* model, we can see that the fixed effect for the
intercept is `r fixef(fp.m2)[["(Intercept)"]]` which is like
the mean of the random intercept and tells us the average
estimated stress score when both the between and within energy
= 0. The fixed effect (regression coefficient) for `Benergy` is
`r fixef(fp.m2)[["Benergy"]]` which tells us how much on average
(fixed effect) lower stress is expected to be when average energy
is one unit higher. That is, for people who, in general (on
average) have higher energy, how much lower stress do they have on
average across days?
The fixed effect (regression coefficient) for `Wenergy` is
`r fixef(fp.m2)[["Wenergy"]]` which tells us how much on average
(fixed effect) lower stress is expected to be when energy
is one unit higher than an individual's own average.
That is, on days when someone has one unit higher energy than
usual (than their own mean), how much lower stress do they have on
that same day?
The approximate p-values indicate that both the average and daily
energy scores are statistically significant: people who have
more energy in general have significantly lower
stress in general, and beyond these averages, within people,
higher energy days are significantly associated with lower stress
days.
Profile likelihood confidence intervals can be obtained using the
`confint()` function. These confidence intervals capture the
uncertainty in parameter estimates for both the fixed and random
effects due to sampling variation. They do not capture individual
differences directly. Note that you only get confidence intervals for
random effects when using the profile method, not when
`method = "Wald"` although the Wald method is much faster.
```{r}
fp.ci2 <- confint(fp.m2, method = "profile", oldNames = FALSE)
fp.ci2
```
### Diagnostics and Checks
Typical diagnostics and checks include checking for outliers,
assessing whether the distributional assumptions are met, checking for
homogeneity of variance and checking whether there is a linear
association between predictors and outcome.
Results look about the same as for the intercept only model, although
two relatively extreme residuals are now identified and the random
intercept appears closer to a normal distribution now. Still,
interpreted as z-scores, the residuals are not bad enough to worry
much about. Homogeneity of variance still appears approximately met.
```{r}
plot(modelDiagnostics(fp.m2),
ncol = 2, nrow = 2, ask = FALSE)
```
### Sample Write Up
To examine the association of stress and energy, a linear
mixed model was fit. To understand the association of energy and
stress at both the between person and within person level, daily
energy ratings were separated into average energy across the five days
of the study (between energy) and daily deviations of energy from
individuals' own mean energy level (within energy).
The final model included `r nobs(fp.m2)` stress
scores from `r as.integer(ngrps(fp.m2))` people.
The fixed effect intercept revealed that the
average [95% CI] stress when between and within energy are 0 was
`r fixef(fp.m2)[["(Intercept)"]]`
`r sprintf("[%0.2f, %0.2f]", fp.ci2[3, 1], fp.ci2[3, 2])`.
However, there were individual differences, with the standard
deviation for the random intercept being
`r as.data.frame(VarCorr(fp.m2))[1, "sdcor"]`
indicating that there are individual differences in the mean
stress. Assuming the random intercepts follow a normal distribution,
we expect most people (about 68%) to fall within one standard
deviation of the mean, which in these data would be somewhere between:
`r fixef(fp.m2)[["(Intercept)"]] + c(-1, 1) * as.data.frame(VarCorr(fp.m2))[1, "sdcor"]`.
Using Satterthwaite's approximation for degrees of freedom revealed
that both between and within person energy were statistically
significantly associated with stress (both p < .05).
On average across people, at the between person level,
a one unit higher average energy score
was associated with `r fixef(fp.m2)[["Benergy"]]` lower stress
scores.
On average across people, at the within person level,
a one unit higher than average daily energy
score was associated with `r fixef(fp.m2)[["Wenergy"]]` lower stress
scores that day. These estimates suggest that a one unit higher average
energy level has about twice as strong an association with average
stress as does a one unit higher daily energy level.
## Statistical Inference
There is ambiguity in terms of how best to calculate degrees of
freedom (df) for LMMs. By default `R` does not calculate the df and so
does not provide p-values for the regression coefficients (fixed
effects) from LMMs.
One easy, albeit imperfect, solution is to use the `lmerTest`
package. `lmerTest` uses Satterthwaite's method to calculate
approximate degrees of freedom and uses these for the t-tests and
p-values for each regression coefficient. To use `lmerTest`, simply
make sure that **both** the `lme4` and `lmerTest` packages are installed
and that you load the `lmerTest` package after `lme4`, using
`library(lmerTest)`. This is shown in the examples above.
Once that is done, all regular calls to `lmer()` function used to fit
LMMs will automatically have df estimated and p-values.
Although this is a relatively common approach, it is not without
limitations or debate. Other approaches include:
- assuming the sample size (and hence the degrees of freedom) is large
enough that we can treat the parameters as following a normal
distribution instead of a t-distribution (a short sketch of this
appears after this list)
- Relying on things like profile-based confidence intervals
- Bootstrapping, which is time intensive but probably the most robust
of these options.
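Here is a small sketch of the first, large-sample approach: treat the
t values from the model summary as z scores and compute p-values from
the normal distribution (this is our own illustration, not a function
from the packages above):
```{r}
## large-sample (normal) approximation: ignore the df and use the
## normal distribution to convert t values into p-values
tvals <- coef(summary(fp.m))[, "t value"]
2 * pnorm(-abs(tvals))
```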
Next is a short example of bootstrapping. We bootstrap the 95%
confidence intervals, and if an interval does not include 0, we can
conclude that $p < .05$, that is, it is statistically significant.
Although bootstrapping is robust, we often do not use it because it can
be quite time intensive for more complex models, especially those with
more random effects.
```{r}
set.seed(1234) ## make the bootstrap reproducible
confint(fp.m, method = "boot")
```
# You Try It
Fit an intercept only linear mixed model by completing the code below.
Use the variable `energy` as your outcome variable.
Calculate a summary of the model, model diagnostics, and find the
intraclass correlation coefficient for `energy`.
Then try fitting a linear model to energy (i.e., not a LMM, ignore the
repeated measures). Look at the confidence intervals for the intercept
from the LMM and the LM. Which one is wider / smaller?
```{r, error = TRUE}
## store the model results in an object called "m2lmm"
m2lmm <- lmer()
## now make a summary of the model results
## what is the intraclass correlation coefficient for this variable?
## look at model diagnostics here
## try fitting an intercept only linear regression (not LMM)
## to energy
m2lr <- lm( )
## calculate the confidence intervals from the LMM and the lm
```
# Summary
## Conceptual
Key points to take away conceptually are:
- How to calculate and interpret descriptive statistics for continuous
and categorical variables from multilevel data, including
different options presented in this topic
- How to interpret the random intercept in LMMs
- How to interpret fixed effects in LMMs
- How to create and interpret between and within effect predictors in
LMMs
- Approaches to statistical inference in LMMs
- Basic assumptions and diagnostics for LMMs
- What types of confidence intervals can be calculated for fixed
and/or random effects from LMMs
- What ML and REML mean and a simple understanding of when to use each
## Code
| Function | What it does |
|----------------|----------------------------------------------|
| `lmer()` | Fit a linear mixed model |
| `confint()` | Calculate confidence intervals from a LMM, defaults to profile likelihood confidence intervals |
| `summary()` | Get a summary table of the residuals, random effects, and fixed effects from a LMM |
| `modelDiagnostics()` | Calculate model diagnostics (outliers, normality, homogeneity) on a LMM |
| `iccMixed()` | Calculate the ICC for a variable from a random intercept only LMM |
------------------------------------------------------------------------------------------
# Extra (Optional) - Likelihoods
**Note: if you just want to know what you need to know, you do not
need to read this section.** If you want some more background that is
beyond what you're expected to know but may help provide some basis
for a deeper understanding of ML and some later concepts, read on.
To understand ML, we need to understand what a likelihood is. A
likelihood is not a probability. One way to think about this is that