GLM2.rmd

---
title: "Generalized Linear Models 2"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: 
    toc: true
    number_sections: true
---

```{r, include=FALSE, echo=FALSE}
library(tufte)
```

# Setup

You can download the `R`markdown file for this content here
[https://jwiley.github.io/MonashHonoursStatistics/GLM2.rmd](https://jwiley.github.io/MonashHonoursStatistics/GLM2.rmd).

We will be using a few packages for this content and also the baseline
data only from the data collection exercise.

```{r setup}

options(digits = 3)

library(haven)
library(data.table)
library(JWileymisc)
library(ggplot2)
library(ggpubr)
library(visreg)
library(survey)

## read in data
db <- as.data.table(read_sav("B 19032020.sav")) # baseline

## create some binary variables
db[, StressHigh := as.integer(stress >= 31)]
db[, SelfesteemHigh := as.integer(selfesteem >= 15)]

## read in some example data from the internet
## (requires internet connection to run)
dcount <- fread("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")

``` 

# Generalized Linear Models

We previously learned about linear regression and how it fit into the
Generalized Linear Models (GLM) framework (if you are not clear on
this, see: 
[http://joshuawiley.com/MonashHonoursStatistics/GLM1.html](http://joshuawiley.com/MonashHonoursStatistics/GLM1.html)).

Now we are going to learn about two other special cases of GLMs:
binary logistic regression and poisson regression.

# Poisson Regression

Poisson regression is used when the outcome is a count variable. Count
variables are discrete and must be whole, 0 or positive numbers (e.g.,
0, 1, 2, 3, etc.). The poisson distribution is used most often when
the counts are relatively rare. Some examples of research use cases
where a poisson may be appropriate are:

- examining risk factors for the number of accidents someone gets into
  over a 12 month period
- analysing the number of children people have
- predicting how many friends people have
- evaluating whether an intervention reduced the number of times
  someone missed their medication in the last month
- testing whether the total number of health care appointments over
  six months can be lowered by treating mental health

For count variables, it is common and not necessarily a problem if
most people score a 0, with relatively few people having counts of 1,
2, 3, etc.

In the simulated `dcount` data we downloaded from the internet, there
is a variable, `num_awards` which is the number of awards a student
got. We can use `testDistribution()` to examine this and get a default
comparison against a normal distribution as shown in the following
figure. The density plot is "lumpy" showing the discrete nature of the
data. No transformation would make this data approximately normally
distributed.


```{r, fig.width = 6, fig.height = 4}

plot(testDistribution(dcount$num_awards))

```

Rather than transform the data, we can change the distribution. We
again use `testDistribution()` but this time specify the distribution
as "poisson". The poinsson distribution is a discrete distribution so
a continuous density curve is not plotted. Instead, we get discrete
bars where the solid dark grey bars show the observed density of
different values and the blue bars show the density under a poisson
distribution. The deviaties plot at the bottom again shows deviations
from the assumed distribution, in this case, the poisson. Although the
densities are not a perfect match, they are relatively close and we
can see from the log likelihood values (labelled `LL`) printed in the graphs that
the poisson has a **much** higher log likelihood for the data than
does the normal distribution we tried earlier. 

```{r, fig.width = 6, fig.height = 4}

plot(testDistribution(dcount$num_awards, distr = "poisson"))

``` 

Like the normal distribution, the poisson distribution is a family of
distributions and the specific one you will get is controlled by a
single parameter, $\lambda$, which is both the mean and the variance
of the poisson distribution. You can find out more about the poisson
distribution
[here](https://en.wikipedia.org/wiki/Poisson_distribution) 
including the relevant equations for it. The figure that follows shows
the density for different counts under a poisson distribution with
varying $\lambda$ values.


```{r, echo=FALSE, fig.width = 6, fig.height = 4}

plotdat <- data.table(x = 0:20)
plotdat[, Lambda1  := dpois(x, lambda = 1)]
plotdat[, Lambda3  := dpois(x, lambda = 3)]
plotdat[, Lambda5  := dpois(x, lambda = 5)]
plotdat[, Lambda10 := dpois(x, lambda = 10)]

ggplot(melt(plotdat, id.vars = "x"), aes(x, value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") + 
  theme_pubr() +
  xlab("x") +
  ylab("Density of the poisson distribution") +
  scale_fill_discrete("")

``` 

Linear regression rarely works well for count outcomes for two reasons:

- First, when an outcome can only be 0 or positive whole numbers, a
  straight line often is a bad fit and certainly at extremes (e.g.,
  negative values) is impossible.
- Second, unless the average number of counts is fairly high, a count
  variable and its residuals will not follow anything approximating a
  normal distribution: the normality assumption would be violated.

The GLM solves each of these problems separately

- Link functions transform the linear predicted value $\eta$ so that
  it never goes below 0
- Rather than assume a normal distribution, the GLM can assume a
  poisson distribution
    - Whereas the Normal distribution had two parameters, mean and
      standard deviation, the poisson distribution only has one: the
      average which is the same as the variance, called $\lambda$


For poisson regression, the link function is defined as:

$$ \eta = g(\lambda) = ln(\lambda) $$

where $ln()$ is the **n**atural **l**ogarithm (or log with base e,
euler's number).

`r margin_note("Although lambda can be exactly 0, 
Here $\lambda$ is the average number of counts / events. Although
counts must be at least 0 and strictly whole numbers, the average number of
counts, $\lambda$ can be any real, positive number number > 0.

Graphing $\lambda$ against $\lambda$ we get

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1a. Average counts against average counts"}

tmp <- data.frame(lambda = seq(from = .05, to = 10, by = .05))
ggplot(tmp, aes(lambda, lambda)) +
  geom_point() +
  theme_pubr()

```
The link function, the natural logarithm, unbounds $\lambda$ on the
left side because the log of 0 is negative infinity

The followinggraph shows the raw $\lambda$ values against the
natural log transformed values.

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1b. Average counts against log transformed average counts"}

ggplot(tmp, aes(lambda, log(lambda))) +
  geom_point() +
  theme_pubr()

```

In this way, the link function for poisson regression takes the
average number of counts, which can only fall > 0 and transforms it to
fall between negative infinity and positive infinity

Now we have a continuous and unbounded outcome we can apply
a linear model to!

To go from predictions on the linear model back to the original
count scale, we use the inverse link function:

- $\lambda = g^{-1}(\eta) = e^{\eta}$

That is, suppose we had the linear predicted values, $\eta$ shown
below graphically

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 2a. Linear predictor against linear predictor"}

tmp2 <- data.frame(eta = seq(from = -5, to = 5, by = .1))
ggplot(tmp2, aes(eta, eta)) +
  geom_point() +
  theme_pubr()

```

Using the inverse link function, we transform them to again fall
> 0.

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 2b. Linear predictor (log scale) against average counts"}

ggplot(tmp2, aes(eta, exp(eta))) +
  geom_point() +
  theme_pubr()

```

## Poisson Regression Assumptions

Much like a normal distribution was assumed for normal linear
regression, for poisson regression a poisson distribution is
assumed.

$$ y \sim Pois(\lambda) $$

In poisson regression although for relatively less common outcomes,
outliers on the left hand side may not be a concern, large positive
outliers (e.g., someone with an extreme number of awards, or who has
multiple health appointments daily, etc.) may be a concern.

Outliers on predictors may still be a concern.

Poisson regression still requires that the errors be independent
(i.e., individuals are independent, not dependent or repeated
measures).

Poisson regression also assumes that on the link scale (natural log scale)
there is a linear relationship.

Finally, poisson regression generally requires a large sample size.
There are no degrees of freedom so it requires a large enough sample
that **parameters** are distributed about normally relying on the
central limit theorem. 

## Poisson Regression in `R`

To estimate a poisson regression in `R`, we can use the `glm()`
function which allows us to fit a variety of different GLMs, including
poisson regression. `glm()` works almost identically to `lm()`, with a
few caveats. The first caveat is that for anything other than a linear
model, we need to specify the `family = ` argument, which controls
which specific type of GLM is estimated. In this case we want a
poisson regression, so we use `family = poisson()`.

The second caveat is that although some functions we are used to from
working with `lm()` will carry over to working with `glm()` models,
many functions will not work or will not work the same way. Because
linear regression is the most common, there tend to be the most tools
and convenience features implemented for it. In addition, some things,
like $R^2$ variance explained that makes sense with linear regression
does not really make sense for poisson regression.

The following code fits a poisson regression to the number of awards
`num_awards` predicted by math scores, `math`. There is not a single
agreed upon way to evaluate the residuals of these models for whether
they meet the assumptions, so we may just evaluate the distribution of
the outcome to see if it closely fits a poisson distribution.
The figure suggests its fairly close, so we might proceed with a
summary of the model from `summary()`.

```{r, fig.width = 6, fig.height = 4}

m <- glm(num_awards ~ math, data = dcount, family = poisson())

plot(testDistribution(dcount$num_awards, distr = "poisson"))

summary(m)

``` 

The output from `summary(m)` is fairly similar to what we would see
for a linear regression in `R`.

- Deviance Residuals: these still represent model residuals, but they
  are no longer raw residuals defined as $y - \hat{y}$. Deviance
  residuals are how much each data point contributed to the residual
  deviance, a sort of measure of overall model fit. They are analogous
  to residuals in linear regression, in that when you sum and square
  these residuals, you get the sums of squares used to assess model
  fit. They differ in that they are not calculated as the raw
  difference between each data point and the predicted value.
- Coefficients: the coefficients table is interpretted nearly
  identically as in linear regression. The estimates are the
  regression coefficients, the standard errors are given next. Instead
  of t values, z values are presented because in general for GLMs we
  cannot calculate degrees of freedom so we rely on the central limit
  theory, and so instead of a t-test, we use a z-test (recall from
  early stats that with an infinite sample size, the t-distribution
  becomes identical to the normal distribution and thus the t-test and
  z-test are the same, with an infinite sample and they are
  functionally the same with a relatively large sample). P values are
  in the final column, now based of the z-test instead of the t-tests
  used in linear regression.
- There is no model $R^2$ as that only works clearly for linear
  regression. Instead, we get null and residual deviance values and an
  AIC, which stands for the Akaike Information Criterion, a relative
  measure of model fit taking into account model complexity. We'll
  learn more about this much later in the semester, for now you don't
  need to worry about them.
- The number of Fisher scoring iterations. This is included because
  unlike linear regression where we can directly calculate the best
  parameter estimates. For other GLMs, the computer tries different
  values and finds the best values, in a process known as
  optimization. At each step of the optimization it evaluates how well
  the specific parameter estimates work and based on some criteria
  whether to keep trying for better estimates or stop trying. Each one
  of these steps is called an "iteration" and the number of iterations
  can give a rough sense of how hard it was for the computer to find
  the values. Again you can mostly ignore these for now, although it
  will become more important as we talk about mixed effects models.


### Poisson Regression Interpretation

At one level, we can interpret the coefficients for poisson regression
nearly identically to those for linear regression, as long as we use
the log scale. An example follows.

A poisson regression predicting the number of awards each student
received from their math scores showed that students who had a 0 math
score were expected to have 
`r sprintf("%0.2f", coef(m)[["(Intercept)"]])` log awards,
`r sprintf("%s", formatPval(coef(summary(m))["(Intercept)", "Pr(>|z|)"], includeP = TRUE))`.
Each one unit higher math score was associated with
`r sprintf("%0.2f", coef(m)[["math"]])` higher log awards,
`r sprintf("%s", formatPval(coef(summary(m))["math", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.

```{r, fig.width = 6, fig.height = 4}

visreg(m, xvar = "math", partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted log number of awards") +
  theme_pubr()

``` 

By specifying that we are working on a log scale, we can interpret the
results as we did before and graph the results almost the same as
before, other than needing to mention the log. Although these
interpretations are technically accurate, for most people, it is hard
to understand results on a log scale.

If we exponentiate the regression coefficients, specifically the
slopes, we get what are called **Incident Rate Ratios** or IRRs.
IRRs indicate how many more times the outcome will be for a one unit
change in the predictor. For example, if the $IRR = 2$ and the base
rate was 1, then one unit higher would be $1 * 2 = 2$. If the base
rate was 2, then a one unit higher would be $2 * 2 = 4$.
IRRs are interpretted as for each one unit higher predictor score,
there are IRR times as many events of the outcome. Because IRRs are
multiplicative, an IRR of 1 means that a one unit change in the
predictor is associated with 1 times as many events on the outcome,
that is no change. Thus on the IRR scale, a coefficient of 1 is
equivalent to a coefficient of 0 on the link (log) scale.

Note that although we can exponentiate coefficients, confidence
intervals, or predictions, we should *not* exponentiate standard
errors, z values or p values. To get the IRRs in `R`, we can extract
the coefficients from the model using the `coef()` function and then
exponentiate using the `exp()` function and we follow a nearly
identical process for the confidence intervals.

```{r}

exp(coef(m))

exp(confint(m))

``` 

We can interpret poisson regression using the IRRs as follows.


A poisson regression predicting the number of awards each student
received from their math scores showed that students who had a 0 math
score were expected to have 
`r sprintf("%0.2f", exp(coef(m)[["(Intercept)"]]))` awards,
[95% CI `r paste(sprintf("%0.2f", exp(confint(m))["(Intercept)", ]),
       collapse = " - ")`].
Each one unit higher math score was associated with having 
`r sprintf("%0.2f", exp(coef(m)[["math"]]))` times the number of
awards,
[95% CI `r paste(sprintf("%0.2f", exp(confint(m))["math", ]),
       collapse = " - ")`], 
`r sprintf("%s", formatPval(coef(summary(m))["math", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.

```{r, fig.width = 6, fig.height = 4}

visreg(m, xvar = "math", scale = "response",
       partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted number of awards") +
  theme_pubr()

``` 

Note that when calling `visreg()`, we added the new argument, `scale =
"response"`. This tells `R` that we do not want the results on the
linear link scale, we want them on the original response scale. This
argument for `visreg()` is helpful any time you are working with non
linear link functions. The figure also shows how although on the link
scale there was a perfectly linear model prediction, the linear model
on the original response scale is clearly not linear.

Poisson regression with multiple predictors follows the same
interpretation, but just as with linear regression you need to specify
that what is being evaluated is the unique / independent association
of each predictor with the outcome, independent of other variables in
the model.

### You Try It - Poisson

`r margin_note("There is no association between prog and num_awards so
do not be surprised if the graphs etc are not that interesting.")`

Fit a poisson regression predicting the number of awards, `num_awards`
from `prog` in the `dcount` data. Try to interpret the results using
IRRs and make a graph. 

```{r tryitpoisson}

## fit poisson regression

## calculate IRRs and confidence intervals on the IRR scale

## graph and visualize the results


``` 

# Binary Logistic Regression

Binary logistic regression is regression when the outcome only takes
on two values: 0 or 1. Logistic regression is useful for many
questions, such as:

- What predicts whether someone will have major depression or not?
- Does one treatment have a higher probability of patients *remitting*
  from major depression than another treatment?
- What is the probability that a patient will be re-admitted to the
  hospital within 30 days of discharge?
- What predicts whether an individual will live or die before age 60?
- If a bank gives a loan to someone, what is their probability of not
  being able to pay it back?
- Do older adults have a higher probability of using CAM than younger
  adults?

Linear regression will not work for binary outcomes for two reasons:

- First, when an outcome can only be 0 or 1, a straight line is a very
  bad fit
    - straight lines could predict a value of 1.4 or 2 or -0.3 when
      those values are not possible
- Second, there is no way that a binary variable or its residuals will
  follow a normal distribution: the normality assumption would be
  violated

The GLM solves each of these problems separately

- Link functions transform the linear predicted value $\eta$ so that
  it never goes below 0 and never goes about 1
- Rather than assume a normal distribution, the GLM uses a new
  distribution, the Bernoulli distribution
    - Whereas the Normal distribution had two parameters, mean and
      standard deviation, the Bernoulli distribution only has one: the
      average probability that an event will occur, sometimes $p$ or $\mu$

For logistic regression, the link function is defined as:

$$ \eta = g(\mu) = ln\left(\frac{\mu}{1 - \mu}\right) $$

That is called the *logit* function.

Here $\mu$ is the probability that the outcome will
be 1. Probabilities range from 0 to 1: an outcome cannot happen less
than never (probability of 0) or more than always (probability of 1).

Graphing $\mu$ against $\mu$ we get

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1a. Probabilities against Probabilities"}

tmp <- data.frame(mu = seq(from = .01, to = .99, by = .01))
ggplot(tmp, aes(mu, mu)) +
  geom_point() +
  theme_pubr()

```

The first part of the *logit* function:

- $\frac{\mu}{1 - \mu}$

unbounds $\mu$ on the right side, so that as it goes to 1,
transformed, it goes to infinity, shown graphically below

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1b. Probabilities against right unbounded probabilities"}
ggplot(tmp, aes(mu, mu/(1 - mu))) +
  geom_point() +
  theme_pubr()
```

The next part, the natural logarithm, unbounds it on the left side
because the log of 0 is negative infinity

- $ln\left(\frac{\mu}{1 - \mu}\right)$

Combined, the graph below shows the raw $\mu$ values against the
transformed values.

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1c. Probabilities against unbounded probabilities (logit scale)"}

ggplot(tmp, aes(mu, log(mu/(1 - mu)))) +
  geom_point() +
  theme_pubr()

```

In this way, the link function for logistic regression takes the
probability of the event occuring, which can only fall between 0 and
1, and transforms it to fall between negative infinity and positive
infinity

Now we have a continuous and unbounded outcome we can apply
a linear model to!

To go from predictions on the linear model back to the probability
scale, we use the inverse link function:

$$ \mu = g^{-1}(\eta) = \frac{1}{1 + e^{-\eta}} $$

That is, suppose we had the linear predicted values, $\eta$ shown
below graphically

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 2a. Linear predictor against linear predictor"}

tmp2 <- data.frame(eta = seq(from = -5, to = 5, by = .1))
ggplot(tmp2, aes(eta, eta)) +
  geom_point() +
  theme_pubr()

```

Using the inverse link function, we transform them to again fall
between 0 and 1, like probabilities.

```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Figure 2b. Linear predictor (logit scale) against probabilities"}

ggplot(tmp2, aes(eta, 1/(1 + exp(-eta)))) +
  geom_point() +
  theme_pubr()

```

## Logistic Regression Assumptions

Much like a normal distribution was assumed for normal linear
regression, for logistic regression a Bernoulli distribution is
assumed.

$$ y \sim Bern(\mu) $$

The Bernoulli only has the one parameter, the probability of the event
occuring.

In logistic regression the outcome is either 0 or 1, so outliers on
the outcome are not a concern

Outliers on predictors may still be a concern

One problem that sometimes arises is known as separation, this occurs
when some predictor perfectly predicts the outcome or separates the
outcome, most often when the outcome is rare or the sample size is
small.

- In a sample of 60 people where only 10 have major depression, if 80%
  of the sample also have PTSD, it may happen that all cases of major
  depression occurred in people with PTSD, so the percentage of
  depression in those without PTSD is exactly 0%, which is
  separation.
- The usual solution is to remove predictors or collapse groups in
  these cases

Logistic regression still requires that the errors be independent
(i.e., individuals are independent, not dependent or repeated
measures).

Logistic regression also assumes that on the linke scale (logit scale)
there is a linear relationship.

Finally, logistic regression generally requires a large sample size.
There are no degrees of freedom, requires large enough sample that
parameters are distributed about normally (central limit theorem).

## Logistic Regression in `R` 1

For logistic regression in `R`, we will use the `StressHigh`
categorical variable we created based off stress scores in the
baseline data collection exercise data.

First, let's take a look at some basic descriptives on a few variables
we will work with. The `egltable()` function calculates basic
descriptive estimates optionally by group levels in a table.
The first argument is a list of variable names and then we give it a
dataset. The argument `strict=FALSE` is used to indicate that if a
variable has few levels, like `StressHigh` which is only 0 and 1, even
though `R` stores it as a continuous variable, it should be treated as
categorical for the table, otherwise `egltable()` will report means
instead of frequencies unless the variable is a string or factor
variable. 

```{r}

egltable(
  c("StressHigh", "stress", "SelfesteemHigh", "selfesteem"),
  data = db, strict=FALSE)

```

Now let's look at a cross tabulation of high stress and high self
esteem, which we do by specifying a grouping variable as
`StressHigh`. The columns of the table are now levels of self
esteem. We get frequencies and percentages for `SelfesteemHigh` and means
and standard deviations for `stress` and `selfesteem` along with a
chi-squared test and t-test and effect sizes. The chi-square test is
significant, p < .001 indicating that high/low self esteem and
high/low stress are not independent. We can also see that there is
some data in all cells of the high/low self esteem and stress, although
only three people reported both high self esteem and high stress.

```{r}

egltable(
  c("stress", "SelfesteemHigh", "selfesteem"),
  g = "StressHigh",
  data = db, strict=FALSE)

```

As with poisson regression, we can fit a logistic regression using the
`glm()` function. We use the usual regression formula approach,
specify the dataset and set `family = binomial()`, which gives us a
logistic regression.

```{r}

mlog <- glm(StressHigh ~ SelfesteemHigh, data = db, family = binomial())

summary(mlog)

```

The output from `summary(mlog)` is fairly similar to what we would see
for a linear regression in `R`.

- Deviance Residuals: these still represent model residuals, but they
  are no longer raw residuals defined as $y - \hat{y}$. Deviance
  residuals are how much each data point contributed to the residual
  deviance, a sort of measure of overall model fit. They are analogous
  to residuals in linear regression, in that when you sum and square
  these residuals, you get the sums of squares used to assess model
  fit. They differ in that they are not calculated as the raw
  difference between each data point and the predicted value.
- Coefficients: the coefficients table is interpretted nearly
  identically as in linear regression. The estimates are the
  regression coefficients, the standard errors are given next. Instead
  of t values, z values are presented because in general for GLMs we
  cannot calculate degrees of freedom so we rely on the central limit
  theory, and so instead of a t-test, we use a z-test (recall from
  early stats that with an infinite sample size, the t-distribution
  becomes identical to the normal distribution and thus the t-test and
  z-test are the same, with an infinite sample and they are
  functionally the same with a relatively large sample). P values are
  in the final column, now based of the z-test instead of the t-tests
  used in linear regression.
- There is no model $R^2$ as that only works clearly for linear
  regression. Instead, we get null and residual deviance values and an
  AIC, which stands for the Akaike Information Criterion, a relative
  measure of model fit taking into account model complexity. We'll
  learn more about this much later in the semester, for now you don't
  need to worry about them.
- The number of Fisher scoring iterations. This is included because
  unlike linear regression where we can directly calculate the best
  parameter estimates. For other GLMs, the computer tries different
  values and finds the best values, in a process known as
  optimization. At each step of the optimization it evaluates how well
  the specific parameter estimates work and based on some criteria
  whether to keep trying for better estimates or stop trying. Each one
  of these steps is called an "iteration" and the number of iterations
  can give a rough sense of how hard it was for the computer to find
  the values. Again you can mostly ignore these for now, although it
  will become more important as we talk about mixed effects models.

### Logistic Regression Interpretation

At one level, we can interpret the coefficients for logistic
regression nearly identically to those for linear regression, as long
as we use the logit scale. An example follows.

A logistic regression predicting high stress from selfesteem (high /
low) showed that students who had low selfesteem were expected to have 
`r sprintf("%0.2f", coef(mlog)[["(Intercept)"]])` log odds of being
high stress,
`r sprintf("%s", formatPval(coef(summary(mlog))["(Intercept)", "Pr(>|z|)"], includeP = TRUE))`.
Being in the high versus low selfesteem group was associated with
`r sprintf("%0.2f", coef(mlog)[["SelfesteemHigh"]])` lower log odds
of being high stress,
`r sprintf("%s", formatPval(coef(summary(mlog))["SelfesteemHigh", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.

```{r, fig.width = 6, fig.height = 4}

visreg(mlog, xvar = "SelfesteemHigh", partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted log odds of high stress") +
  theme_pubr()

``` 

By specifying that we are working on the log odds (logit) scale, we
can interpret the results as we did before and graph the results
almost the same as before, other than needing to mention the log
odds. Although these interpretations are technically accurate, for
most people, it is hard to understand results on a log odds scale.

If we exponentiate the regression coefficients, specifically the
slopes, we get what are called **Odds Ratios** or ORs.
ORs indicate how many more times the odds of occuring the outcome will
be for a one unit change in the predictor. For example, if the $OR = 2$ and the base
odds was 0.5, then one unit higher would be $0.5 * 2 = 1$ odds. If the base
odds was 2, then a one unit higher would be $2 * 2 = 4$ odds.
ORs are interpretted as for each one unit higher predictor score,
there are OR times the odds of the outcome. Because ORs are
multiplicative, an OR of 1 means that a one unit change in the
predictor is associated with 1 times the odds of the outcome,
that is no change. Thus on the OR scale, a coefficient of 1 is
equivalent to a coefficient of 0 on the link (log odds) scale.

Note that although we can exponentiate coefficients, confidence
intervals, or predictions, we should *not* exponentiate standard
errors, z values or p values. To get the ORs in `R`, we can extract
the coefficients from the model using the `coef()` function and then
exponentiate using the `exp()` function and we follow a nearly
identical process for the confidence intervals.

```{r}

exp(coef(mlog))

exp(confint(mlog))

``` 

We can interpret logistic regression using the ORs as follows.


A logistic regression predicting high stress from selfesteem (high /
low) showed that students who had low selfesteem were expected to have 
`r sprintf("%0.2f", exp(coef(mlog)[["(Intercept)"]]))` odds of being
high stress,
[95% CI `r paste(sprintf("%0.2f", exp(confint(mlog))["(Intercept)", ]),
       collapse = " - ")`].
Being in the high vs low self esteem group was associated with 
`r sprintf("%0.2f", exp(coef(mlog)[["SelfesteemHigh"]]))` times the
odds of being high stress,
[95% CI `r paste(sprintf("%0.2f", exp(confint(mlog))["SelfesteemHigh", ]),
       collapse = " - ")`], 
`r sprintf("%s", formatPval(coef(summary(mlog))["SelfesteemHigh", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.

```{r, fig.width = 6, fig.height = 4}

visreg(mlog, xvar = "SelfesteemHigh", scale = "response",
       partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted probability of being high stress") +
  theme_pubr()

``` 

Note that when calling `visreg()`, we added the new argument, `scale =
"response"`. This tells `R` that we do not want the results on the
linear link scale, we want them on the original response scale. This
argument for `visreg()` is helpful any time you are working with non
linear link functions. The figure also shows how although on the link
scale there was a perfectly linear model prediction, the linear model
on the original response scale is clearly not linear.

Logistic regression with multiple predictors follows the same
interpretation, but just as with linear regression you need to specify
that what is being evaluated is the unique / independent association
of each predictor with the outcome, independent of other variables in
the model.

In addition to the odds ratio (OR) scale, with a simple model, we can
easily convert from the logit scale to the probability scale, which
may be even easier to interpret.

People with lower self esteem are the reference group, captured by the intercept

$$ \frac{1}{1 + e^{- (`r coef(mlog)[1]`)}} = `r as.vector(plogis(coef(mlog)[1]))` $$

People with lower self esteem have a 
`r as.vector(plogis(coef(mlog)[1]))` probability of being high stress

The coefficient for self esteem, indicates that people with high self
esteem have `r coef(mlog)[2]` higher log odds (the logit scale) of
being high stress than do people with low self esteem.

To calculate the probability that people with high self esteem will be
high stress:

$$ \frac{1}{1 + e^{- (`r coef(mlog)[1]` + `r coef(mlog)[2]`)}} = `r as.vector(plogis(sum(coef(mlog))))` $$

People with high self esteem have a 
`r as.vector(plogis(sum(coef(mlog))))` probability of being high
stress 

Stated on the the odds scale.

- People with high self esteem have 
  $e^{`r coef(mlog)[2]`} = `r exp(coef(mlog)[2])`$ 
  times the odds of being high stress as do people with low self
  esteem 
- This is called the odds ratio (OR)

An example summary of this analysis is:
"People with high self esteem were less likely to be high stress than
people with low self esteem, (OR = `r exp(coef(mlog)[2])`, p
< .001)."

Another way of saying the same thing:
"People with high self esteem had `r exp(coef(mlog)[2])` times the
odds of being high stress as people with low self eseteem (p < .001)."

## Logistic Regression in `R` 2

Next we examine a continuous predictor, continuous self esteem.
The analysis in `R` is basically the same as the previous example:

```{r}

mlog2 <- glm(StressHigh ~ selfesteem, data = db, family = binomial())

summary(mlog2)

```

- The coefficient for self esteem indicates that for each one unit higher
  self esteem, people have `r coef(mlog2)[2]` lower log
  odds of being high stress
- In odds ratios:
    - For each additional unit of self esteem, participants have
	`r exp(coef(mlog2)[2])` times the odds of being high stress (p <
	.001). 

- One challenge with the odds ratio metric is that it is multiplicative
    - If the baseline odds are 2, then 1.2 times the odds would be
      2.4 (an absolute difference of 0.4)
	- If the baseline odds are 4, then 1.2 times the odds would be 4.8
      (an absolute difference of 0.8)

- Important to keep this interpretation in mind, odds ratios do not
  directly indicate the probability of the event occuring, just how
  many times higher the odds

We can calculate the specific probability of being high stress as we
did for categorical self esteem

Probability of being high stress for participants with 0 self esteem

$$\frac{1}{1 + e^{- (`r coef(mlog2)[1]`)}} = `r as.vector(plogis(coef(mlog2)[1]))`$$

Participants with 0 self esteem have a
`r as.vector(plogis(coef(mlog2)[1]))` probability of being high stress.

To calculate the probability of being high stress for participants with a 20 score on
self esteem, we first calculate the log odds:

$$`r coef(mlog2)[1]` + `r coef(mlog2)[2]` * 20 = `r coef(mlog2)[1] + 20 * coef(mlog2)[2]`$$
   
Then we convert the log odds to probabilities:

$$\frac{1}{1 + e^{- (`r coef(mlog2)[1] + 20 * coef(mlog2)[2]`)}} = `r as.vector(plogis(coef(mlog2)[1] + 20 * coef(mlog2)[2]))`$$

Participants with a score of 20 on self esteem have a
`r as.vector(plogis(coef(mlog2)[1] + 20 * coef(mlog2)[2]))`
probability of being high stress. This matches what we can see
visually when we use `visreg()` to plot on the probability scale.


```{r, fig.width = 6, fig.height = 4, fig.cap = "Probability of being high stress by self esteem level with 95% confidence intervals"}

visreg(mlog2, xvar = "selfesteem", scale = "response",
       partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted probability of being high stress") +
  theme_pubr()

```

Although logistic regression models do not have $R^2$ values, there
are some fit indices. For example, we could calculate the predicted
probabilities for each person and if they are >= .50, classify them as
the model predicting high stress and if < .50 classify them as the
model predicting low stress. Then we can compare the model predictions
to the observed data to calculate the model accuracy.

```{r}

accdata <- data.table(
  Observed = db$StressHigh,
  Predicted = as.integer(predict(mlog2, type = "response") >= .50))

## quick view of the data
head(accdata)

## simple frequencies table
xtabs(~ Observed + Predicted, data = accdata)

## accuracy is the average number of times predictions match observed
mean(accdata$Observed == accdata$Predicted)

``` 

### Average Marginal Effects

Because the probability scale is easier to interpret than is even the
odds ratio scale, people often desire to convert results from logistic
regression into probabilities. A challenge on the probability scale is
that the results are not linear. For example if looking at the graph
of continuous self esteem predicting the probability of being high
stress, the probability changes much less when moving from a self
esteem score of 4 to 5 then from 10 to 11.

We can see this in practice by using the `predict()` function to
generate some predicted probabilities. We make a small new dataset
with some particular self esteem values (4, 5, 10, 11) and specify the
argument `type = "response"` to get probabilities rather than
predicted log odds.

```{r, results = "asis"}

newdata <- data.table(
  selfesteem = c(4, 5, 10, 11))

newdata$Prob <- predict(mlog2,
                        newdata = newdata,
                        type = "response")

knitr::kable(newdata)

```

The probabilities show that the predicted probability of being high
stress changes very little when moving from 4 to 5 self esteem but
quite a bit from 10 to 11. This illustrates a challenge: if you want
to talk about how much the probability of being high stress changes
for a one unit change in self esteem, it matters where the self esteem
scores started.

One principled approach to solving this problem is called the **average
marginal effect**. There are two parts to this. A **marginal** effect
is the instantaneous effect of change at a particular
point. Technically, it is the derivative of the slope for those that
helps. Because each participant has their own actual self esteem, each
participant also will have their own marginal effect, their own
instaneous rate of change in the probability of being high stress for
a change in self esteem. The **average** part of average marginal
effect is that we can average the marginal effects across all
participants in the sample to get an average marginal effect, and this
number represents the average change in probability that you would
expect if you took a population that looked like our sample and could
increase everyone's self esteem by one unit.

Calculating the average marginal effect is fairly easy. 

1. Using the original dataset, generate predicted probabilities for
   everyone
2. Add a small quantity, $h$ to every self esteem score in the dataset
   and generate new predicted probabilities
3. Take the difference between the original and new predicted
   probabilities divided by $h$ so that the differences are per unit
   change
4. Average all the differences in probabilities

```{r}

## pick h value for difference, store that in variable h
h <- .01

## make the original dataset
originaldata <- db[, .(selfesteem)]

## make the incrased self esteem dataset (selfesteem + h)
increaseddata <- db[, .(selfesteem = selfesteem + h)]

## calculate original predicted probabilities
originaldata$Prob <- predict(mlog2, newdata = originaldata,
                             type = "response")

## calculate increased predicted probabilities
increaseddata$Prob <- predict(mlog2, newdata = increaseddata,
                              type = "response")

## calculate the difference in probabilities per unit
diffprob <- (increaseddata$Prob - originaldata$Prob) / h

## calculate the average marginal effect
mean(diffprob)

``` 

The average marginal effect shows that on average in our sample, a one
unit increase in self esteem predicts a 
`r sprintf("%0.2f", mean(diffprob))` decrease in the probability of
being high stress.

## GLMs with Multiple Predictors

An important consequence of how we have to calculate the entire log
odds of being high on stress before we can calculate predicted
probabilities for logistic regression or raw counts for poisson
regression is that when we move into multiple predictors in GLMs, if
we want to calculate predicted values on the original response scale,
we must pick values for every variable. That makes it more difficult
to create plots of one variable at a time because we must hold other
variables at specific values as well.

To better understand this phenomenon, we'll continue with logistic
regression examples. Here we have both continuous self esteem and
neuroticism predicting high vs low stress. On the log odds scale, we
can see that there is a linear association between self esteem and log
odds of being high stress and the lines are parallel at different
levels of neuroticism, reflecting our choice that self esteem and
neuroticism are two separate predictors and do not interact (no
moderation). 

```{r, fig.width = 6, fig.height = 4}

mlog3 <- glm(StressHigh ~ selfesteem + neuroticism,
             data = db, family = binomial())

summary(mlog3)

visreg(mlog3, xvar = "selfesteem",
       by = "neuroticism", breaks = c(4, 9),
       scale = "linear", overlay = TRUE,
       partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted log odds of being high stress") +
  theme_pubr()

```

However, look at what happens when we create the same graph but back
convert the predicted log odds into predicted probabilities. In
addition to the main effect of neuroticism, the lines no longer appear
parallel. On the probability scale, the baseline probability matters
and thus neuroticism also influences the "curve" or slope on the
probability scale of the association between self esteem and being
high stress. For this reason, and the same logic applies to poisson
regression, when you have multiple predictors and a non-linear
transformation in the link function, all the predictor values
influence the final predicted outcome on the original scale.

```{r, fig.width = 6, fig.height = 4}

visreg(mlog3, xvar = "selfesteem",
       by = "neuroticism", breaks = c(4, 9),
       scale = "response", overlay = TRUE,
       partial = FALSE, rug = FALSE, gg = TRUE) +
  ylab("predicted probability of being high stress") +
  theme_pubr()

```

When you have multiple predictors it also is possible to calculate
average marginal effects for each predictor. The steps are virtually
identical to the steps for a single predictor, except that the
datasets must include all predictors but you only increase the values
of one predictor at a time.


```{r}

## pick h value for difference, store that in variable h
h <- .01

## make the original dataset
originaldata2 <- db[, .(selfesteem, neuroticism)]

## make the incrased self esteem dataset (selfesteem + h)
increasedSelfEsteem <- db[, .(selfesteem = selfesteem + h, neuroticism)]

## make the incrased neuroticism dataset (neuroticism + h)
increasedNeuro <- db[, .(selfesteem, neuroticism = neuroticism + h)]

## calculate original predicted probabilities
originaldata2$Prob <- predict(mlog3, newdata = originaldata2,
                             type = "response")

## calculate increased predicted probabilities for self esteem
increasedSelfEsteem$Prob <- predict(mlog3, newdata = increasedSelfEsteem,
                              type = "response")

## calculate increased predicted probabilities for neuroticism
increasedNeuro$Prob <- predict(mlog3, newdata = increasedNeuro,
                              type = "response")

## calculate the difference in probabilities per unit
se.diffprob <- (increasedSelfEsteem$Prob - originaldata2$Prob) / h
neuro.diffprob <- (increasedNeuro$Prob - originaldata2$Prob) / h

## calculate the average marginal effect s
mean(se.diffprob) # self esteem
mean(neuro.diffprob) # neuroticism

``` 

The results show that in the sample, holding neuroticism constant,
increasing self esteem by one unit on average is predicted to result
in a `r mean(se.diffprob)` decline in the probability of being high
stress. Conversely, holding self esteem constant, increasing
neuroticism by one unit on average is predicted to result in a 
`r mean(neuro.diffprob)` increase in the probability of being high
stress.

# Summary Table

Here is a little summary of some of the functions used in this
topic. 

| Function       | What it does                                 |
|----------------|----------------------------------------------|
| `glm()`  | Fits a generalized linear regression model, with `family=poisson()` or `family=binomial()` for poisson and logistic regression | 
| `coef()`  | Extracts coefficients from a regression model | 
| `confint()`  | Calculates confidence intervals (95% by default) for a regression model | 
| `predict()` | Function to get predicted values from a model, including on the original response scale |
| `testDistribution()` | allows testing a variable against any specific distribution | 
| `visreg()` | can be used to make plots of regression models on the linear or original scales |


# Bonus Content

Note: you do not need to read or follow this, it is extra content for
those who may use or be interseted.


For binary outcomes, sometimes in place of odds ratios, people prefer
to report relative risks. Odds ratios are how many higher times the
odds. The relative risk is how many times higher the risk or
probability of the event is. To get relative risks, the link function
needs to be the natural logarithm. However, this tends to be unstable
when used with a bernoulli distribution. The current standard in the
literature for binary outcomes to get relative risks is to use a
poisson regression model with a log link on the binary
outcome. However, it is known that this results in biased standard
errors and so it is recommended to use a robust standard errors rather
than default standard errors if using a poisson regression to estimate
relative risks for a binary outcome. For more examples, see:
[here](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-018-0519-5).

Robust standard errors can be
calculated by using the `survey` package and specifying the IDs for
each participant in the model design, which will then have robust
standard errors calculated (note, the `survey` package can also handle
many other complex survey designs including when people are not
randomly selected but were selected based off particular
probabilities, but we are only using it here to get the robust
standard errors). Then we use the `svyglm()` function which works
almost the same as regular `glm()` only it can take the additional
`design` argument and can calculate robust standard errors.

```{r}

## first specify the design, indicating what the ID column is
## the warning about no weights or probabilities is ignorable
design <- svydesign(ids = ~ ID, data = db)

## fit a poisson regression which uses a log link to the binary data
mrr <- svyglm(StressHigh ~ SelfesteemHigh, data = db,
              family = poisson(), design = design)

## get summary of coefficients, standard errors, p-values
## assuming infinite degrees of freedom (central limit theorem)
summary(mrr, df.resid=Inf)

## calculate the relative risks and confidence intervals
exp(coef(mrr))

exp(confint(mrr))

```

From these results we can say that people with high self esteem have a
relative risk of 0.167 of being high stress as do people with low self
esteem. In a simple case like this model, that directly means that
people with high self esteem have 0.167 times the probability of being
high stress as people with low self esteem. Although it is still on
the multiplicative scale, it relates to probabilities directly rather
than being multiplicative on the odds.