gee.Rmd

# Generalized Estimating Equations

A <span class="emph">generalized estimating equation</span> is an *estimation procedure*[^notamodel] for dealing with clustered data, and is seemingly very popular in disciplines trained with a biostatistics perspective, but perhaps not too commonly used elsewhere.  Models using this approach are sometimes called <span class="emph">marginal models</span>,  and can be seen as follows, where the target $y$ is multivariate normal with mean vector $X\beta$ and covariance matrix $\Sigma$, which is typically block diagonal as we created at the beginning[^geevsre].

$$ y \sim \mathcal{N}(X\beta,  \Sigma) $$


Here the focus is on the 'population average' effects, akin to 'fixed' effects in the RE model, and in some circumstances they are identical.  In addition, one specifies the type of covariance structure thought to underlie the data. Among the more common are:

- *independence*: no correlation
- *autoregressive*: as described previously
- *exchangeable*: same correlation everywhere (aka 'compound symmetry' or 'spherical')
- *unstructured*: all possible correlations are estimated as parameters

And there are many others. The GEE approach is identical to RE intercept-only model approach if one conducts a linear Gaussian model, as in this case. In addition, these correlation structures are often available to mixed models tools, so this extension alone should not be a reason[^lme4] to use the GEE approach.

We'll use the <span class="pack">geepack</span> package in order to conduct the gee approach to the model.  We'll specify an autoregressive correlation structure as we did with the mixed model, and as the data was in fact designed with.

```{r geeMod, echo=-(4:6)}
library(geepack)

gee_mod = geeglm(y ~ time + treatment, data=d, corstr='ar1', id=id, waves=time)
summary(gee_mod)$coefficients %>% pander(digits=3, round=3)
data_frame(`Residual Variance` = summary(gee_mod)$dispersion$Estimate) %>% round(3) %>%  pander
summary(gee_mod)$corr %>% pander(round=3)
```

We should be getting used to the coefficient estimates for the population-average, a.k.a. fixed effects, by now.  We also obtain an estimate for the residual correlation, here noted as *alpha*.

Note the similarities here compared with the mixed model where there is only a random intercept. A couple of changes are made to keep things as similar as possible.

```{r mixedgeeComparison, echo=1}
mixed_mod_ri_only = lme(y ~ time + treatment, data=d, random = ~1|id, cor=corAR1(form=~time), 
                        control=lmeControl(opt='optim'), method='ML')  # changed opt bc nlminb had issues
summary(mixed_mod_ri_only) %>% pander(digits=3, round=3)
VarCorr(mixed_mod_ri_only, rdig=3)[,1:2] %>%             # note: you have removed the correlation from display
  as_data_frame() %>% 
  mutate(Variance=as.numeric(Variance), StdDev=as.numeric(StdDev)) %>% 
  mutate_all(round, digits=3) %>% 
  pander(round=3, nsmall=3)

coef(mixed_mod_ri_only$modelStruct$corStruct, unconstrained=F) %>% 
  pander(round=3, nsmall=3)  
```

The population average effects are identical (though the <span class="func">geeglm</span> function automatically does cluster robust standard errors). The estimated correlations for both are similar, and a bit high.  There is essentially no cluster variance in the mixed model, and both estimated residual variances are similar, and similar to the standard linear model we started with.  This makes sense as the variance is equal to the residual variance + the intercept variance + slope variance.


```{r geelmelmvar}
summary(lm_mod)$sigma^2                   # similar to gee and random intercept only
sum(as.numeric(VarCorr(mixed_mod)[,1]))   # model that incorporates all sources of variance
intsd^2 + timesd^2 + sigmasq              # 'truth'
```


GEE models are generally robust to misspecification of the covariance structure.  They are rarely implemented for more than simple clustering but some tools allow for it[^kerby].


Where GEE differ from random effects models in interpretation regards non-Gaussian models.  Consider a gender effect for a binary outcome where individuals are nested within families.  In this case we could use a standard logistic regression, but would want to take the clustering into account.  The gender effect for GEE is comparing hypothetical males in many different families to hypothetical females in many different families (controlling for other covariates).  In the mixed model with a random effect for family, you are comparing males in one family to females in the same family (again controlling for other factors).  The latter odds ratio will typically be larger because there is less fluctuation due to family-to-family differences[^kerby2].  However, this isn't a *larger* effect, just a different one.

## Pros

- Easy modeling of different correlation structures
- Focus on population level effects
- Extends the cluster robust approach to GLM setting and other correlation structures[^clusrob_recall]
- Robust to misspecification of correlation structure
- May be more feasible with larger data situations than mixed models

## Cons

- Cluster-specific effects are not estimated
- Not easily extendable to other clustering situations, e.g. nested
- Missing data is assumed Missing Completely at Random (MCAR) (RE and FE assume MAR)

Gist: If your goal is to focus on population average effects and ignore subject specific effects, without the drawbacks of the FE model, the GEE approach might be considered.


[^notamodel]: Not a model!  We're just doing a GLM here.

[^lme4]: <span class="pack">lme4</span> is a very widely used mixed model package that does not allow the specification of a correlation structure for the residuals.

[^geevsre]: The corresponding matrix formulation for the *conditional* model of the random effects approach is:
$$ y \sim \mathcal{N}(X\beta + Z\Gamma,  \Sigma) $$
where $Z$ is some subset of $X$, and $\Gamma$ contains the random effects.

[^kerby]: CSCAR director and University of Michigan statistics faculty [Kerby Shedden](https://github.com/kshedden) has worked on a Python implementation for nested structures as part of the [statsmodels module](http://statsmodels.sourceforge.net/devel/gee.html).

[^clusrob_recall]: Recall the note at the end of the [cluster robust SE chapter][Cluster robust variances].

[^kerby2]: This is a paraphrase of a response by Kerby Shedden to a question on the CSCAR help list that I thought was succinct and easy to understand.  He continues: "I don't have a strong opinion about this.  It is almost always good to control for observed variables to reduce heterogeneity and get a sharper analysis.  But it's less clear to me that you should control for things that are unattributable (e.g. family effects of unknown cause), even if it is technically possible to do so.  It comes down to this: do you want to measure my risk (for being male) by comparing me to my sister (or an imaginary sister who is identical to me in all observables except gender)...or do you want to compare me to an unrelated female, also identical to me in terms of all observables?"