---
title: "Lecture 4"
author: "DJM"
date: "16 October 2018"
output:
  pdf_document: default
  slidy_presentation:
    css: http://mypage.iu.edu/~dajmcdon/teaching/djmRslidy.css
    font_adjustment: 0
bibliography: booth-refs.bib
---
\newcommand{\cdist}{\rightsquigarrow}
\newcommand{\cprob}{\xrightarrow{P}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\renewcommand{\P}{\mathbb{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\argmin}{\arg\min}
\newcommand{\argmax}{\arg\max}
\newcommand{\F}{\mathcal{F}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\indicator}{\mathbf{1}}
\renewcommand{\bar}{\overline}
\renewcommand{\hat}{\widehat}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}_{r,t}}
\newcommand{\brl}{\widehat{\beta}_{r,\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ls}}
\newcommand{\blt}{\widehat{\beta}_{l,t}}
\newcommand{\bll}{\widehat{\beta}_{l,\lambda}}
\newcommand{\X}{\mathbb{X}}
```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE,
fig.align='center',fig.width=10,
fig.height=6, cache=TRUE, autodep = TRUE)
library(tidyverse)
theme_set(theme_minimal(base_family="Times"))
green = '#00AF64'
blue = '#0B61A4'
red = '#FF4900'
orange = '#FF9200'
colvec = c(green,blue,red,orange)
```
## An Overview of Classification
Some examples:
- A person arrives at an emergency room with a set of symptoms that
could be 1 of 3 possible conditions. Which one is it?
- An online banking service must be able to determine whether each
transaction is fraudulent or not, using a customer's location, past
transaction history, etc.
- Given a set of individuals' sequenced DNA, can we determine whether
various mutations are associated with different phenotypes?
All of these problems are ~~not~~ regression
problems. They are ~~classification~~ problems.
## The Set-up
It begins just like regression: suppose we have observations
$$\mathcal{D}= \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$$
Again, we want to estimate a function that maps $X$ into $Y$ and helps
us predict as-yet-unobserved data.
(This function is known as a __classifier__)
The same constraints apply:
- We want a classifier that predicts test data, not just the training
data.
- Often, this comes with the introduction of some bias to get lower
variance and better predictions.
## How do we measure quality?
In regression, we have $Y_i \in \mathbb{R}$ and use squared error loss
Instead, let $Y \in \mathcal{K} = \{1,\ldots, K\}$
(This is arbitrary, sometimes other numbers, such as $\{-1,1\}$ will be
used)
We again make predictions $\hat{Y}$ based on $\mathcal{D}$
Our loss function is now a $K\times K$ matrix $L$ with
- zeros on the diagonal
- $\ell(k,k')$ on the off-diagonal ($k\neq k'$)
## How do we measure quality?
Again, we appeal to risk
$$R(g) = \mathbb{E}_{(X,Y)} \ell(Y,g(X))$$ If we use the law of
total probability, this can be written
$$R(g) = \mathbb{E}_X \sum_{y=1}^K \ell(y,\; g(X)) \mathbb{P}(Y = y | X)$$
This can be minimized pointwise over $X$, to produce
$$g_*(X) = \argmin_{g \in \mathcal{G}} \sum_{y=1}^K \ell(y,g(X)) \mathbb{P}(Y = y | X)$$
(This is the ~~Bayes' classifier~~. Also,
$R(g_*)$ is the ~~Bayes' limit~~)
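As a small illustration (all numbers below are assumed, not from the lecture), here is the pointwise minimization in R for a single $x$: given a $K\times K$ loss matrix and the vector of posterior probabilities $\mathbb{P}(Y = y | X = x)$, compute the expected loss of each candidate label and pick the smallest.
```{r, eval=FALSE}
# Toy sketch of the pointwise Bayes' rule (assumed loss matrix and posterior)
L <- matrix(c(0, 1, 2,
              1, 0, 1,
              5, 1, 0), nrow = 3, byrow = TRUE)   # rows: true y, columns: predicted label
post <- c(0.2, 0.5, 0.3)                           # assumed P(Y = y | X = x)
risk.by.label <- colSums(L * post)                 # sum_y l(y, g) P(Y = y | X = x) for each g
which.min(risk.by.label)                           # the Bayes' prediction at this x
```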
## Best classifier
If we make specific choices for $\ell$, we can find $g_*$ exactly
Define (for convenience)
$$\ell_g(Z) = \ell(Y,g(X))$$
As $Y$ takes only a few values, ~~zero-one~~
prediction risk is natural
$$\ell_g(Z) = \mathbf{1}_{Y\neq g(X)}(Z) \Rightarrow R(g) = \mathbb{E}[\ell_g(Z)] = \mathbb{P}(g(X) \neq Y),$$
(This means we want to __label__ or
__classify__ a new observation $(X,Y)$ such that
$g(X) = Y$ as often as possible)
\vspace{.15in}
Under this loss, we have
$$g_*(X) = \argmin_{g \in \mathcal{G}} \left[ 1 - \mathbb{P}(Y = g | X)\right] = \argmax_{g \in \mathcal{G}} \mathbb{P}(Y = g | X )$$
## Best classifier
Suppose we encode a two-class response as $Y \in \{0,1\}$
Let's continue to use [squared error loss]{style="color: orangemain"}:
$\ell_f(Z) = (Y - f(X))^2$
Then, the Bayes' rule is
$$f_*(X) = \mathbb{E}[ Y | X] = \mathbb{P}(Y = 1 | X)$$
(note that $f_* \in [0,1]$)
Hence, we achieve the same Bayes' rule/limit with squared error
classification by discretizing the probability:
$$g_*(X) = \mathbf{1}(f_*(X) > 1/2)$$
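In code this discretization is one line; here is a minimal sketch in which `f.hat` is an assumed, purely illustrative estimate of $\mathbb{P}(Y = 1 | X = x)$.
```{r, eval=FALSE}
# Sketch: turn an estimated probability into a class label
f.hat <- function(x) plogis(-1 + 2 * x)         # any estimate of P(Y = 1 | X = x)
g.hat <- function(x) as.numeric(f.hat(x) > 1/2)
g.hat(c(0.2, 0.9))                               # classify two example x values
```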
## Classification is easier than regression
Let $\hat{f}$ be any estimate of $f_*$
Let $\hat{g}(X) = \mathbf{1}(\hat{f}(X) > 1/2)$
It can be shown that $$\begin{aligned}
&\mathbb{P}(Y \neq \hat{g}(X) | X) - \mathbb{P}(Y \neq g_*(X) | X) \\
& = \cdots \\
& = (2f_*(X) - 1)(\mathbf{1}(g_*(X) = 1) - \mathbf{1}(\hat{g}(X) = 1)) \\
& = |2f_*(X) - 1|\mathbf{1}(g_*(X)\neq \hat{g}(X)) \\
& = 2\left|f_*(X) - \frac{1}{2}\right|\mathbf{1}(g_*(X)\neq \hat{g}(X)) \end{aligned}$$
~~Can you show this?~~
## Classification is easier than regression
Now, whenever $g_*(X)\neq \hat{g}(X)$, $\hat{f}(X)$ and $f_*(X)$ lie on opposite sides of $1/2$, so
$$g_*(X)\neq \hat{g}(X) \Rightarrow |\hat{f}(X) - f_*(X)| \geq |f_*(X) - 1/2|$$
Therefore $$\begin{aligned}
&\mathbb{P}(Y \neq \hat{g}(X)) - \mathbb{P}(Y \neq g_*(X)) \\
& = \int(\mathbb{P}(Y \neq \hat{g}(X) | X) - \mathbb{P}(Y \neq g_*(X) | X))d\mathbb{P}_X \\
& = \int 2\left|f_*(X) - \frac{1}{2}\right|\mathbf{1}(g_*(X)\neq \hat{g}(X))d\mathbb{P}_X \\
& \leq 2\int |\hat{f}(X) - f_*(X)| \mathbf{1}(g_*(X)\neq \hat{g}(X))d\mathbb{P}_X \\
& \leq 2\int |\hat{f}(X) - f_*(X)|d\mathbb{P}_X \end{aligned}$$
## Bayes' rule and class densities
Using Bayes' theorem $$\begin{aligned}
f_*(X) & = \mathbb{P}(Y = 1 | X) \\
& =
\frac{p(X|Y = 1) \mathbb{P}(Y = 1)}{\sum_{g \in \{0,1\}} p(X|Y = g) \mathbb{P}(Y = g)} \\
& =
\frac{f_1(X) \pi}{ f_1(X)\pi + f_0(X)(1-\pi)}\end{aligned}$$ We call
$f_g(X)$ the [class densities]{style="color: greenmain"}
The Bayes' rule can be rewritten $$g_*(X) =
\begin{cases}
1 & \textrm{ if } \frac{f_1(X)}{f_0(X)} > \frac{1-\pi}{\pi} \\
0 & \textrm{ otherwise}
\end{cases}$$
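A small sketch of this likelihood-ratio form of the rule, with assumed univariate Gaussian class densities and an assumed prior $\pi$:
```{r, eval=FALSE}
# Sketch: Bayes' rule via class densities (all values assumed for illustration)
pi1 <- 0.3                                       # pi = P(Y = 1)
f1  <- function(x) dnorm(x, mean = 2, sd = 1)    # class-1 density f_1
f0  <- function(x) dnorm(x, mean = 0, sd = 1)    # class-0 density f_0
g.star <- function(x) as.numeric(f1(x) / f0(x) > (1 - pi1) / pi1)
g.star(c(0, 1.5, 3))                             # classify a few x values
```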
## How to find a classifier
All of these prior expressions for $g_*$ give rise to classifiers
- __Empirical risk minimization:__ Choose a set
of classifiers $\Gamma$ and find $\hat{g} \in \Gamma$ that minimizes
some estimate of $R(g)$
(This can be quite challenging as, unlike in regression, the
training error is nonconvex)
- __Regression:__ Find an
estimate $\hat{f}$ and plug it in to the Bayes' rule
- __Density estimation:__
Estimate $\hat{\pi}_g$ from the class proportions and $\hat{f}_g$ from the observations in $\mathcal{D}$ with $Y = g$, then plug these into the Bayes' rule
# Linear classifiers
## Linear classifier
As our classifier $\hat{g}$ takes a discrete number of values, it is
equivalent to partitioning the covariate space into
__regions__
The boundaries between these regions are known as __decision
boundaries__
These decision boundaries are sets of points at which $\hat{g}$ is
indifferent between two (or more) classes
A __linear classifier__ is a $\hat{g}$ that
produces linear decision boundaries (don't confuse this with "linear smoothers", regression functions which give $\hat{Y}=HY$ for some $H$)
## Linear classifier: Example
Suppose $\mathcal{G} = \{ 0,1\}$ and we fit the logistic regression GLM
The posterior probabilities are $$\begin{aligned}
\mathbb{P}(Y = 1 | X) & = \frac{\exp\{\beta_0 + \beta^{\top}X\}}{1 + \exp\{\beta_0 + \beta^{\top}X\}} \\
\mathbb{P}(Y = 0 | X) & = \frac{1}{1 + \exp\{\beta_0 + \beta^{\top}X\}}\end{aligned}$$
The _logit_ (i.e.: log odds) transformation
forms a linear decision boundary
$$\log\left( \frac{\mathbb{P}(Y = 1 | X)}{\mathbb{P}(Y = 0 | X) } \right) = \beta_0 + \beta^{\top} X$$
The decision boundary is the hyperplane
$\{X : \beta_0 + \beta^{\top} X = 0\}$
(If the log-odds are below 0, classify as 0; if above 0, classify as 1)
## Logit example
```{r, echo=FALSE}
set.seed(2018-03-30)
logit <- function(z) log(z)-log(1-z)
ilogit <- function(z) exp(z)/(1+exp(z))
sim.logistic <- function(X, beta0, beta) {
linear.parts = beta0 + X%*%beta
y = as.factor(rbinom(nrow(X), size=1, prob=ilogit(linear.parts)))
data.frame(y,X)
}
X <- matrix(runif(100*2, min=-1,max=1),ncol=2)
df = sim.logistic(X, -2.5, c(-5,5))
```
```{r}
g <- ggplot(df, aes(X1,X2,color=y)) + geom_point() +
scale_color_manual(values=c(blue,red))
g
log.lm = glm(y~.,data=df, family='binomial')
summary(log.lm)
```
## What is the line?
* Suppose we decide "Predict `1` if `predict(log.lm, type='response') > 0.5`".
* This means "For which combinations of `x1` and `x2` is
\[
\frac{\exp\left(\widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2\right)}
{1+\exp\left(\widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2\right)} > 0.5 ?
\]
* Solving this gives
\[
\widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2 > \log(.5)-\log(1-.5)
\Rightarrow x_2 > -\frac{\widehat{\beta}_0 + \widehat{\beta}_1 x_1}{\widehat{\beta}_2}.
\]
* That's just a line. Let's plot it:
```{r}
cc = coefficients(log.lm)
g + geom_abline(intercept = -cc[1]/cc[3], slope = -cc[2]/cc[3], color=green)
```
## Lots of different boundaries
```{r, echo=FALSE}
decision.boundary <- function(ddd){
cc = coefficients(glm(y~X1+X2,data=ddd,family='binomial'))
return(data.frame(intercept=-cc[1]/cc[3],slope=-cc[2]/cc[3]))
}
newdf = list()
for(i in 1:4) newdf[[i]] = sim.logistic(X, rnorm(1), rnorm(2,sd=3))
names(newdf) = letters[1:4]
newdf = data.table::rbindlist(newdf, idcol="index") %>% group_by(index)
dbs = newdf %>% do(decision.boundary(.))
ggplot(newdf, aes(X1,X2,color=y)) + geom_point() +
scale_color_manual(values=c(blue,red)) + facet_wrap(~index) +
geom_abline(mapping=aes(intercept=intercept, slope=slope),data=dbs,color=green)
dbs
```
# Using Bayes Rule
## Bayes' rule-ian approach
The decision theory for classification indicates we need to know the
posterior probabilities: $\mathbb{P}(Y = g | X)$ for doing optimal
classification
\vspace{.15in}
Suppose that
- $p_g(X) = \mathbb{P}(X | Y = g)$ is the
[likelihood]{style="color: orangemain"} of the covariates given the
class labels
- $\pi_g = \mathbb{P}(Y=g)$ is the prior
Then
$$\mathbb{P}(Y = g | X) = \frac{p_g(X) \pi_g}{\sum_{g' \in \mathcal{G}}p_{g'}(X) \pi_{g'}} \propto p_g(X) \pi_g$$
[[Conclusion:]{style="color: greenmain"}]{.smallcaps} Having the class
densities almost gives us the Bayes' rule as the training proportions
can usually be used to estimate $\pi_g$
(Though, sometimes estimating $\pi_g$ can be nontrivial/impossible)
## Bayes' rule-ian approach: Summary
There are many techniques based on this idea
- Linear/quadratic discriminant analysis
(Estimates $p_g$ assuming multivariate Gaussianity)
- General nonparametric density estimators
- Naive Bayes (Factors $p_g$ assuming conditional independence)
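As a minimal sketch of the last idea, here is a hand-written Gaussian naive Bayes classifier for a single new point (illustrative code under the conditional-independence assumption, not from the lecture):
```{r, eval=FALSE}
# Sketch: Gaussian naive Bayes by hand (X numeric matrix, y class labels, x one new point)
naive.bayes.classify <- function(X, y, x) {
  classes <- unique(as.character(y))
  score <- sapply(classes, function(g) {
    Xg <- X[y == g, , drop = FALSE]
    mu <- colMeans(Xg)
    s  <- apply(Xg, 2, sd)
    # conditional independence: log p_g(x) is a sum of univariate log-densities
    sum(dnorm(x, mean = mu, sd = s, log = TRUE)) + log(mean(y == g))   # plus log prior
  })
  classes[which.max(score)]
}
# e.g. naive.bayes.classify(as.matrix(iris[, 1:4]), iris$Species, c(6.0, 3.0, 4.5, 1.5))
```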
## Discriminant analysis
Suppose that
$$p_g(X) \propto |\Sigma_g|^{-1/2} \exp\left\{-(X - \mu_g)^{\top}\Sigma_g^{-1}(X - \mu_g)/2\right\}$$
Let's assume that [$\Sigma_g \equiv \Sigma$]{style="color: orangemain"}.
Then the log-odds between two classes $g,g'$ is: $$\begin{aligned}
\log\left( \frac{\mathbb{P}(Y = g | X)}{\mathbb{P}(Y = g' | X) } \right)
& =
\log\frac{p_g(X)}{p_{g'}(X)} + \log \frac{\pi_g}{\pi_{g'}}\\
& =
\log \frac{\pi_g}{\pi_{g'}} - (\mu_{g} + \mu_{g'})^{\top} \Sigma^{-1} (\mu_g - \mu_{g'})/2 \\
& \qquad+ X^{\top} \Sigma^{-1}(\mu_g - \mu_{g'})\end{aligned}$$
This is linear in $X$, and hence has a linear decision boundary
## Types of discriminant analysis
The __linear discriminant function__ is
(proportional to) the log posterior:
$$\delta_g(X) = \log \pi_g + X^{\top} \Sigma^{-1}\mu_g - \mu_{g}^{\top} \Sigma^{-1} \mu_g /2$$
and we assign $g(X) = \argmax_g \delta_g(X)$
(This is just minimum Euclidean distance, weighted by the covariance
matrix and prior probabilities)
## LDA parameter estimation
Now, we must estimate $\mu_g$ and $\Sigma$. If we\...
- use the intuitive estimators $\hat{\mu}_g = \overline{X}_g$ and
$$\hat\Sigma = \frac{1}{n-G} \sum_{g \in \mathcal{G}} \sum_{i : Y_i = g} (X_i - \hat{\mu}_g) (X_i - \hat{\mu}_g)^{\top}$$
then we have produced ~~linear discriminant
analysis~~ (LDA)
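Here is a sketch of these plug-in estimators written out by hand on simulated data (illustrative only; `MASS::lda` does all of this for you):
```{r, eval=FALSE}
# Sketch: LDA plug-in estimates and discriminant functions (simulated data)
set.seed(1)
n <- 100; G <- 2
y <- sample(1:2, n, replace = TRUE)
X <- matrix(rnorm(2 * n), ncol = 2) + 2 * (y == 2)          # shift class 2
pi.hat <- table(y) / n
mu.hat <- rbind(colMeans(X[y == 1, ]), colMeans(X[y == 2, ]))
S.hat  <- (crossprod(sweep(X[y == 1, ], 2, mu.hat[1, ])) +
           crossprod(sweep(X[y == 2, ], 2, mu.hat[2, ]))) / (n - G)  # pooled, divisor n - G
Sinv <- solve(S.hat)
delta <- function(x, g)                                      # linear discriminant function
  log(pi.hat[g]) + x %*% Sinv %*% mu.hat[g, ] - drop(t(mu.hat[g, ]) %*% Sinv %*% mu.hat[g, ]) / 2
g.hat <- function(x) which.max(c(delta(x, 1), delta(x, 2)))
g.hat(c(0, 0)); g.hat(c(2, 2))
```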
## LDA intuition
How would you classify a point with this data?
```{r, echo=FALSE}
Sigma = matrix(c(2,.7,.7,1),2)
xy = matrix(rnorm(300*2,sd=.2),ncol=2) %*% Sigma + rep(c(1,1,-3,2,-5,0),each=100)
normals = tibble(g = rep(letters[1:3],each=100), x=xy[,1], y=xy[,2])
ggplot(normals, aes(x=x,y=y,color=g)) + geom_point()
```
We can just classify an observation to the
__closest__ mean $(\overline{X}_g)$
What do we mean by close? (Need to define distance)
## LDA intuition
Intuitively, assigning observations to the nearest $\overline{X}_g$ (but
ignoring the covariance) would amount to
\[
\begin{aligned}
\tilde{g}(X)
& =
\argmin_g \norm{X - \overline{X}_g}_2^2 \\
& =
\argmin_g X^{\top}X - 2X^{\top}\overline{X}_g + \overline{X}_g^{\top}\overline{X}_g \\
& =
\argmax_g X^{\top}\overline{X}_g - \frac{1}{2}\overline{X}_g^{\top}\overline{X}_g \\
& \textrm{compare this to:} \\
\hat{g}(X)
& = \argmax_g \underbrace{ X^{\top}\hat\Sigma^{-1}\overline{X}_g - \frac{1}{2}\overline{X}_g^{\top}\hat{\Sigma}^{-1} \overline{X}_g }_{\textrm{log-likelihood}} + \underbrace{\log(\hat\pi_g)}_{\textrm{log-prior}} \end{aligned}
\]
The difference is we weight the distance by $\hat\Sigma^{-1}$
and weight the class assignment by fraction of observations in each
class.
(Note: this generalization of Euclidean distance is called
__Mahalanobis__ distance)
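A quick sketch of that weighting, using base R's `mahalanobis()` with assumed values:
```{r, eval=FALSE}
# Sketch: Euclidean vs. Mahalanobis distance to a class mean (assumed values)
x      <- c(1, 1)
xbar.g <- c(0, 0)
S.hat  <- matrix(c(2, 0.7, 0.7, 1), 2)
sum((x - xbar.g)^2)                            # squared Euclidean distance
mahalanobis(x, center = xbar.g, cov = S.hat)   # (x - xbar)' S^{-1} (x - xbar)
```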
## Intuition
```{r, echo=FALSE}
library(mvtnorm)
n = 100
pi1 = 0.5
n1 = floor(n*pi1); n0 = n-n1
mu1 = c(1,2); mu0 = c(-1,-1)
Sigma = 2*diag(2)
X1 = rmvnorm(n1, mu1, Sigma)
X2 = rmvnorm(n0, mu0, Sigma)
X = rbind(X1,X2)
Y = factor(c(rep(1,n1),rep(0,n0)))
df = data.frame(Y,X)
```
```{r,echo=FALSE}
g <- ggplot(df, aes(X1,X2,color=Y)) + geom_point() + scale_color_manual(values=c(blue,red))
Sinv = solve(Sigma)
slope.vec = t(mu1-mu0) %*% Sinv
intercept = 0.5*(t(mu0) %*% Sinv %*% mu0 - t(mu1) %*% Sinv %*% mu1)
g + stat_ellipse(type='norm') + # these are estimated, not the truth
geom_abline(intercept = -intercept/slope.vec[2],
slope = -slope.vec[1]/slope.vec[2], color=green)
```
* Note: here the two classes share a single true $\Sigma$, but `stat_ellipse` draws ellipses estimated from the data rather than from the truth, so the two ellipses differ slightly.
## Intuition
```{r,echo=FALSE}
mu1 = c(1,2); mu0 = c(1,-1)
Sigma = 2*matrix(c(1,-.5,-.5,1),2)
X1 = rmvnorm(n1, mu1, Sigma)
X2 = rmvnorm(n0, mu0, Sigma)
X = rbind(X1,X2)
Y = factor(c(rep(1,n1),rep(0,n0)))
df = data.frame(Y,X)
Sinv = solve(Sigma)
slope.vec = t(mu1-mu0) %*% Sinv
intercept = 0.5*(t(mu0) %*% Sinv %*% mu0 - t(mu1) %*% Sinv %*% mu1)
ggplot(df, aes(X1,X2,color=Y)) + geom_point() + scale_color_manual(values=c(blue,red)) +
stat_ellipse(type='norm') +
geom_abline(intercept = -intercept/slope.vec[2],
slope = -slope.vec[1]/slope.vec[2], color=green)
```
* Different $\mu_g$, $\Sigma$.
## Same one, but make n big
```{r,echo=FALSE}
n1=500; n0=500
X1 = rmvnorm(n1, mu1, Sigma)
X2 = rmvnorm(n0, mu0, Sigma)
X = rbind(X1,X2)
Y = factor(c(rep(1,n1),rep(0,n0)))
df = data.frame(Y,X)
Sinv = solve(Sigma)
slope.vec = t(mu1-mu0) %*% Sinv
intercept = 0.5*(t(mu0) %*% Sinv %*% mu0 - t(mu1) %*% Sinv %*% mu1)
ggplot(df, aes(X1,X2,color=Y)) + geom_point() + scale_color_manual(values=c(blue,red)) +
stat_ellipse(type='norm') +
geom_abline(intercept = -intercept/slope.vec[2],
slope = -slope.vec[1]/slope.vec[2], color=green)
```
## Same one, but change P(Y=1)
```{r,echo=FALSE}
n1=250; n0=50
X1 = rmvnorm(n1, mu1, Sigma)
X2 = rmvnorm(n0, mu0, Sigma)
X = rbind(X1,X2)
y = factor(c(rep(1,n1),rep(0,n0)))
df = data.frame(y,X)
Sinv = solve(Sigma)
slope.vec = t(mu1-mu0) %*% Sinv
intercept = 0.5*(t(mu0) %*% Sinv %*% mu0 - t(mu1) %*% Sinv %*% mu1) + log(n1/n0) # prior log-odds, P(Y=1) = n1/(n1+n0)
ggplot(df, aes(X1,X2,color=y)) + geom_point() + scale_color_manual(values=c(blue,red)) +
stat_ellipse(type='norm') +
geom_abline(intercept = -intercept/slope.vec[2],
slope = -slope.vec[1]/slope.vec[2], color=green)
```
## Performance of LDA
The quality of the classifier produced by LDA depends on two things:
- The sample size $n$
(This determines how accurate the $\hat \pi_g$, $\hat \mu_g$, and
$\hat\Sigma$ are)
- How wrong the LDA assumptions are
(That is: $X| Y= g$ is a Gaussian with mean $\mu_g$ and variance
$\Sigma$)
__Recall:__ The _decision
boundary_ of a classifier is the set of values of
$X$ at which the classifier is [indifferent]{style="color: orangemain"}
between two (or more) levels of $Y$
A _linear_ decision boundary is one where this set
of values looks like a line
## Comparing LDA and Logistic regression
* Both are linear in $x$:
- LDA$\longrightarrow \alpha_0 + \alpha_1^\top x$
- Logit$\longrightarrow \beta_0 + \beta_1^\top x$.
* But the parameters are estimated differently.
* Examine the joint distribution of $(X,y)$:
- LDA: $\prod_i f(x_i,y_i) = \underbrace{\prod_i f(X_i | y_i)}_{\textrm{Gaussian}}\underbrace{\prod_i f(y_i)}_{\textrm{Bernoulli}}$
- Logistic: $\prod_i f(x_i,y_i) = \underbrace{\prod_i f(y_i | X_i)}_{\textrm{Logistic}}\underbrace{\prod_i f(X_i)}_{\textrm{Ignored}}$
* LDA estimates the joint distribution, while logistic regression estimates only the conditional distribution, which is really all we need for classification.
* So logistic requires fewer assumptions.
* But if the two classes are perfectly separable, logistic regression breaks down (the MLE is undefined); see the short demo after this list
* LDA works even if the conditional isn't normal, but works poorly if any X is qualitative
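Here is the short demo of the separability problem promised above (simulated data; the exact warning text may vary by R version):
```{r, eval=FALSE}
# Sketch: logistic regression on perfectly separable data
x <- c(-2, -1.5, -1, 1, 1.5, 2)
y <- c( 0,    0,  0, 1,   1, 1)      # the classes are separated exactly at x = 0
fit <- glm(y ~ x, family = binomial)
# glm typically warns that fitted probabilities of 0 or 1 occurred;
# the MLE does not exist and the coefficients are driven toward +/- infinity
coef(fit)
```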
## Comparison
```{r, echo=FALSE}
# Recover the LDA decision boundary (intercept and slope) from an lda() fit
lda.disc <- function(fit,df){
pi0 = fit$prior[1]
pi1 = fit$prior[2]
mu0 = fit$means[1,]
mu1 = fit$means[2,]
S = pi0*cov(filter(df,y==0)[,-1]) + pi1*cov(filter(df,y==1)[,-1])
Sinv = solve(S)
slope.vec = t(mu1-mu0) %*% Sinv
intercept = 0.5*(t(mu0) %*% Sinv %*% mu0 - t(mu1) %*% Sinv %*% mu1) + log(pi1/pi0) # use the fitted priors
int = -intercept/slope.vec[2]
sl = -slope.vec[1]/slope.vec[2]
return(data.frame(intercept=int,slope=sl))
}
```
```{r, echo=FALSE}
library(MASS)
lda.fit = lda(y~X1+X2, data=df)
sl.int = lda.disc(lda.fit,df)
log.bd = decision.boundary(df)
truth = data.frame(intercept=-intercept/slope.vec[2], slope=-slope.vec[1]/slope.vec[2])
dfa = rbind(sl.int,log.bd,truth)
dfa$discriminant = c('lda','logistic','truth')
ggplot(df, aes(X1,X2,color=y)) + geom_point() + scale_color_brewer(palette = 'Set1')+
stat_ellipse(type='norm') +
geom_abline(mapping=aes(intercept=intercept, slope=slope,color=discriminant),data=dfa)
```
# LDA family
## The LDA variance assumption
Returning to the assumption: $\Sigma_g = \Sigma$
The assumption provides two benefits:
- Allows for estimation when $n$ __isn't__
large compared with $Gp(p+1)/2$
- Lowers the variance of the procedure (but produces bias)
(Both benefits come from estimating fewer parameters)
## Regularizing LDA
- Penalize the 'plug-in' estimates [@Friedman1989] (sketched after this list):
Let $\lambda \in [0,1]$,
$$\hat{\Sigma}_{\lambda} = \lambda \hat{\Sigma} + (1-\lambda) \hat\sigma^2 I$$
- Nearest Shrunken Centroids:
Take $\hat\Sigma=\hat\sigma^2 I$.
- Regularized Optimal Affine Discriminant (ROAD) [@FanFeng2012]:
Solve $$\argmin_v v^{\top}\hat{\Sigma} v + \lambda\norm{v}_1,\quad\quad \textrm{s.t. } v^{\top}(\hat{\mu}_1-\hat{\mu}_2)=1.$$ The intuition is that the discriminant direction
$$\hat\Sigma^{-1}(\hat{\mu}_1-\hat{\mu}_2)$$ now depends only on a few features. This one works really well in high dimensions.
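A minimal sketch of the first idea, forming $\hat{\Sigma}_{\lambda}$ from an assumed pooled covariance (the choice of $\hat\sigma^2$ below is one of several reasonable ones):
```{r, eval=FALSE}
# Sketch: shrink the pooled covariance toward a diagonal target (assumed values)
S.hat      <- matrix(c(2, 0.7, 0.7, 1), 2)    # a pooled covariance estimate
sigma2.hat <- mean(diag(S.hat))               # one choice of scalar variance
lambda     <- 0.5
S.lambda   <- lambda * S.hat + (1 - lambda) * sigma2.hat * diag(2)
S.lambda
```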
# QDA
## QDA
* Start like LDA, but let $\Sigma_1 \neq \Sigma_0$.
* This gives a "quadratic" decision boundary (it's a curve).
* If we have many columns in $X$ ($p$)
- Logistic estimates $p+1$ parameters
- LDA estimates $2p + p(p+1)/2 + 1$
- QDA estimates $2p + p(p+1) + 1$
* If $p=50$,
- Logistic: 51
- LDA: 1376 (though the discriminant function itself involves only 51)
- QDA: 2651
* QDA doesn't get used much: there are better nonlinear versions with way "fewer" parameters (SVMs)
## LDA/QDA in `R`
```{r qda-pred, echo=TRUE}
# Generate data
n1=50; n0=50
Sigma1 = matrix(c(2,.8,.8,1),2)
Sigma0 = matrix(c(1,-.5,-.5,2),2)
X1 = rmvnorm(n1, mu1, Sigma1)
X2 = rmvnorm(n0, mu0, Sigma0)
X = rbind(X1,X2)
y = factor(c(rep(1,n1),rep(0,n0)))
df = data.frame(y,X)
library(MASS)
qda.fit = qda(y~X1+X2, data=df)
lda.fit = lda(y~X1+X2, data=df)
```
## LDA/QDA decision boundaries
```{r, echo=FALSE}
pred.grid = expand.grid(X1=seq(min(df$X1),max(df$X1),len=100),
X2=seq(min(df$X2),max(df$X2),len=100))
pred.grid$qda = predict(qda.fit, newdata=pred.grid)$class
pred.grid$lda = predict(lda.fit, newdata=pred.grid)$class
pg = gather(pred.grid,key='key',value='predictions',-c(X1,X2))
ggplot(pg, aes(X1,X2)) + geom_raster(aes(fill=predictions)) +
facet_wrap(~key) + scale_fill_brewer()+
geom_point(data=df,mapping=aes(X1,X2,color=y)) +
scale_color_brewer(palette = 'Set1')
```
## Reduced rank LDA
Part of the popularity of LDA is that it provides dimension
reduction as well
The $G$ class centroids $\mu_g$ must all lie in an affine subspace of
dimension $G-1$ (presuming $G < p$)
(Let $\mathcal{H}_{G-1}$ be this subspace)
If $G$ is much less than $p$, this will be a substantial drop in
dimension
## Reduced rank LDA
In practice, we can compute LDA from spectral information:
$$\begin{aligned}
\delta_g(X)
& =
\log \pi_g + X^{\top} \Sigma^{-1}\mu_g - \mu_{g}^{\top} \Sigma^{-1} \mu_g /2 \\
&\propto
\log \pi_g - (X - \mu_g)^{\top} \Sigma^{-1}(X - \mu_g)/2 \end{aligned}$$
So,
1. ~~Spectrum:~~ Form
$\hat{\Sigma}_{\lambda} = U D U^{\top}$
2. ~~Sphere:~~ Rewrite your data
as $\tilde{X} \leftarrow D^{-1/2} U^{\top} X$
3. ~~Assign:~~ Classify to the
closest mean in transformed space
(adjusting by the estimated log prior probabilities)
## Reduced rank LDA
We can ignore any information orthogonal to $\mathcal{H}_{G-1}$, as it
contributes to each class equally (in the sphered space)
So, project $\tilde{X}$ onto $\mathcal{H}_{G-1}$ and make distance
computations there
When $G = 2,3$, this means we can plot the projection onto
$\mathcal{H}_{G-1}$ with no loss of information about the LDA solution
If $G > 3$, then we may wish to project onto a
further reduced space
$\mathcal{H}_{L} \subset \mathcal{H}_{G-1}$
We'd like $\mathcal{H}_L$ to maintain the most amount of information
possible for assigning to classes
## Reduced rank LDA
This can be done via the following procedure
1. ~~Centroids:~~ Compute
$G \times p$ matrix $M$ of class centroids
2. ~~Covariance:~~ Form
$\hat\Sigma$ as the common covariance matrix
3. ~~Sphere:~~
$\tilde{M} = M \hat\Sigma^{-1/2}$
4. ~~Between Covariance:~~ Find
covariance matrix for $\tilde{M}$, call it $B$
5. ~~Spectrum:~~ Compute
$B = V S V^{\top}$
Now, span$(V_L) = \mathcal{H}_L$
Also, the coordinates of the data in this space are
$Z_k = v_k^{\top} \hat\Sigma^{-1/2}X$
These derived variables are commonly called "canonical
coordinates"
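A rough sketch of steps 1 to 5 on simulated data (illustrative code; `MASS::lda` computes essentially these quantities and stores the directions in its `scaling` component):
```{r, eval=FALSE}
# Sketch: canonical coordinates by hand (simulated data)
set.seed(1)
G <- 3; p <- 4; n <- 300
y <- rep(1:G, each = n / G)
centroids <- matrix(rnorm(G * p, sd = 2), G, p)
X <- centroids[y, ] + matrix(rnorm(n * p), n, p)
M <- t(sapply(1:G, function(g) colMeans(X[y == g, , drop = FALSE])))   # G x p centroids
Sig.hat <- Reduce(`+`, lapply(1:G, function(g)
  crossprod(scale(X[y == g, ], scale = FALSE)))) / (n - G)             # pooled covariance
e <- eigen(Sig.hat)
Sig.inv.sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
M.tilde <- M %*% Sig.inv.sqrt                  # sphere the centroids
B <- cov(M.tilde)                              # between-class covariance
V <- eigen(B)$vectors                          # eigenvectors spanning H_{G-1}
Z <- X %*% Sig.inv.sqrt %*% V[, 1:2]           # canonical coordinates Z_k = v_k' Sigma^{-1/2} X
```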
## Reduced rank LDA: Summary
- Gaussian likelihoods with identical covariances leads to linear
decision boundaries (LDA)
- We can actually do all relevant computations/graphics on the reduced
space $\mathcal{H}_{G-1}$
- If this isn't small enough, we can do 'optimal' dimension reduction
to $\mathcal{H}_L$
As an aside, this procedure is identical to __Fisher's discriminant
analysis__ (when $L=1$)
# Logit family
## Logistic regression
Logistic regression for two classes has a simple log-likelihood:
(Using $\pi_i(\beta) = \mathbb{P}(Y = 1 | X = X_i,\beta)$)
$$\begin{aligned}
\ell(\beta)
& =
\sum_{i=1}^n \left( Y_i\log(\pi_i(\beta)) + (1-Y_i)\log(1-\pi_i(\beta))\right) \\
& =
\sum_{i=1}^n \left( Y_i\log(e^{\beta^{\top}X_i}/(1+e^{\beta^{\top}X_i})) - (1-Y_i)\log(1+e^{\beta^{\top}X_i})\right) \\
& =
\sum_{i=1}^n \left( Y_i\beta^{\top}X_i -\log(1 + e^{\beta^{\top} X_i})\right)\end{aligned}$$
This gets maximized via Newton-Raphson updates, also known as iteratively reweighted
least squares
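To see the last line of the display in code, here is a sketch that maximizes the log-likelihood with a generic optimizer on simulated data (illustrative; `glm` itself uses the IRLS updates mentioned above):
```{r, eval=FALSE}
# Sketch: maximize the logistic log-likelihood directly (simulated data)
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                       # first column is the intercept
beta.true <- c(-0.5, 2)
y <- rbinom(n, 1, plogis(X %*% beta.true))
negloglik <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log(1 + exp(eta)))           # minus the displayed log-likelihood
}
optim(c(0, 0), negloglik, method = "BFGS")$par
coef(glm(y ~ X[, 2], family = binomial))      # should be nearly identical
```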
## Sparse logistic regression
This procedure suffers from all the same problems as least squares
\vspace{.15in}
We can use penalized likelihood techniques in the same way as we did
before
\vspace{.15in}
This means minimizing (over $\beta_0,\beta$):
$$\sum_{i=1}^n \left( - Y_i(\beta_0 + \beta^{\top}X_i) +\log\left(1 + \exp\left\{\beta_0 + \beta^{\top} X_i\right\}\right)\right)
+ \lambda (\alpha\norm{\beta}_1+ (1-\alpha) \norm{\beta}_2^2)$$ (Don't
penalize the intercept and do standardize the covariates)
\vspace{.15in}
This is the __logistic elastic net__
## Sparse logistic regression: Software
`glmnet` works essentially as before:
```{r, eval=FALSE}
library(glmnet)
cv.glmnet(x, y, family='binomial', alpha=1)  # alpha in [0,1] mixes the l1 and l2 penalties
```
Unfortunately, the computations are more difficult for path algorithms
(such as the `lars` package) due to the coefficient profiles being only piecewise smooth (rather than piecewise linear)
`glmpath` does quadratic approximations to the profiles, while still
computing the exact points at which the active set changes
If $G>2$, use multinomial logit
```{r, eval=FALSE}
cv.glmnet(x,y, family='multinomial')
```
or
```{r, eval=FALSE}
library(nnet)
multinom(y~., data=df)
```
## Final words: Logistic versus LDA
- Fitting __logistic__ regression requires fewer
assumptions
- The MLEs under __logistic__ will be
undefined if the classes are perfectly separable
- If some entries in $X$ are qualitative, then the modeling
assumptions behind __LDA__ are suspect
- In practice, the two methods tend to give very similar results
## Selected references