---
title: "Lecture 2"
author: "DJM"
date: "2 October 2018"
output:
  pdf_document: default
  slidy_presentation:
    css: http://mypage.iu.edu/~dajmcdon/teaching/djmRslidy.css
    font_adjustment: 0
  html_document:
    df_print: paged
bibliography: booth-refs.bib
---
\newcommand{\cdist}{\rightsquigarrow}
\newcommand{\cprob}{\xrightarrow{P}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\renewcommand{\P}{\mathbb{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\argmin}[1]{\underset{#1}{\textrm{argmin}}}
\newcommand{\F}{\mathcal{F}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\indicator}{\mathbbm{1}}
\renewcommand{\bar}{\overline}
\renewcommand{\hat}{\widehat}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}_{r,t}}
\newcommand{\brl}{\widehat{\beta}_{r,\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ls}}
\newcommand{\blt}{\widehat{\beta}_{l,t}}
\newcommand{\bll}{\widehat{\beta}_{l,\lambda}}
\newcommand{\X}{\mathbb{X}}
## Statistics vs. ML
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, autodep=TRUE, message=FALSE, warning = FALSE,
fig.align='center')
library(tidyverse)
theme_set(theme_minimal(base_family="Times"))
green = '#00AF64'
blue = '#0B61A4'
red = '#FF4900'
orange = '#FF9200'
colvec = c(green,blue,red,orange)
```
* Lots of overlap, both try to "extract information from data"
Venn diagram
## Probability
1. $X_n$ converges _in probability_ to $X$, $X_n\cprob X$, if for every $\epsilon>0$,
$\P\left(|X_n-X| < \epsilon\right) \rightarrow 1$.
2. $X_n$ converges _in distribution_ to $X$, $X_n\cdist X$, if
$F_n(t)\rightarrow F(t)$ at all continuity points $t$.
3. (Weak law) If $X_1, X_2,\ldots$ are iid random variables with common mean $m$, then
$\bar{X}_n \cprob m$.
4. (CLT) If $X_1, X_2,\ldots$ are iid random variables with common mean $m$ and variance $s^2<\infty$, then $\sqrt{n}(\bar{X}_n-m)/s \cdist \mbox{N}(0,1)$.
## Big-Oh and Little-Oh
Deterministic:
1. $a_n = o(1)$ means $a_n\rightarrow 0$ as $n\rightarrow\infty$
2. $a_n = o(b_n)$ means $\frac{a_n}{b_n} = o(1)$.
Examples:
- If $a_n = \frac{1}{n}$, then $a_n = o(1)$
- If $b_n = \frac{1}{\sqrt{n}}$, then $a_n = o(b_n)$
3. $a_n = O(1)$ means $a_n$ is eventually bounded: $|a_n|<c$ for some $c>0$ and all $n$ large
enough. Note that $a_n =o(1)$ implies $a_n = O(1)$
4. $a_n = O(b_n)$ means $\frac{a_n}{b_n} = O(1)$. Likewise,
$a_n = o(b_n)$ implies $a_n = O(b_n)$.
Examples:
- If $a_n = \frac{n}{2}$, then $a_n = O(n)$
Stochastic analogues:
1. $Y_n = o_p(1)$ if for all $\epsilon > 0$, then
$P(|Y_n|>\epsilon)\rightarrow0$
2. We say $Y_n = o_p(a_n)$ if $\frac{Y_n}{a_n} = o_p(1)$
3. $Y_n = O_p(1)$ if for all $\epsilon > 0$, there exists a $c > 0$
such that $P(|Y_n|>c)<\epsilon$
4. We say $Y_n = O_p(a_n)$ if $\frac{Y_n}{a_n} = O_p(1)$
Examples:
- $\overline{X}_n - \mu = o_p(1)$ and $S_n - \sigma^2 = o_p(1)$, by the Law of Large Numbers.
- $\sqrt{n}(\overline{X}_n-\mu) = O_p(1)$ and $\overline{X}_n-\mu = O_p(\frac{1}{\sqrt{n}})$, by the Central Limit Theorem.
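A quick simulation (my own illustration, not from the slides) makes the last example concrete: the spread of $\sqrt{n}(\overline{X}_n - \mu)$ stays roughly constant as $n$ grows, which is exactly the $O_p$ statement above.
```{r op-demo, eval=FALSE}
# The sd of sqrt(n)*(Xbar - mu) hovers near sigma = 1 for every n, so
# Xbar - mu = O_p(1/sqrt(n)).
set.seed(42)
ns = c(100, 1000, 10000)
sapply(ns, function(n){
  xbars = replicate(1000, mean(rnorm(n, mean=2)))
  sd(sqrt(n)*(xbars - 2))
})
```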
## Statistical models
A statistical model $\mathcal{P}$ is a collection of probability
distributions or densities. A parametric model has the form
$$\mathcal{P}= \{p(x;\theta):\theta\in\Theta\}$$ where
$\Theta\subset\mathbb{R}^d$ is a finite-dimensional parameter space.
Examples of nonparametric statistical models:
- $\mathcal{P}= \{$ all continuous CDF's $\}$
- $\mathcal{P}= \{f:\int(f''(x))^2dx<\infty\}$
## Evaluating estimators
An *estimator* is a function of data that
does not depend on $\theta$.
Suppose $X\sim N(\mu,1)$.
- $\mu$ is not an estimator.
- Things that are estimators: $X$, any function of $X$, 3, $\sqrt{X}$, etc.
1. Bias and Variance
2. Mean Squared Error
3. Minimaxity and Decision Theory
4. Large Sample Evaluations
## MSE
__Mean Squared Error (MSE)__. For a parameter $\theta$ and an estimator $\widehat\theta$, define
$$\begin{aligned}
\mathbb{E}\!\left[ \left(\theta - \widehat\theta \right)^2 \right] = \int \cdots \int \left( \widehat\theta(x_1, \ldots, x_n) - \theta\right)^2 f(x_1;\theta) \cdots f(x_n;\theta) \, dx_1 \cdots dx_n.\end{aligned}$$
__Bias and Variance__ The bias is $$\begin{aligned}
B = \mathbb{E}\!\left[\widehat\theta\right] - \theta,\end{aligned}$$ and
variance is $$\begin{aligned}
V = \Var{\widehat\theta}.\end{aligned}$$
Bias-Variance Decomposition $$\begin{aligned}
\mathit{MSE} = B^2 + V\end{aligned}$$
$$\begin{aligned}
\mathit{MSE} &= \mathbb{E}\!\left[ ( \widehat\theta - \theta )^2\right]\\
&= \mathbb{E}\!\left[ \left(\widehat\theta - \mathbb{E}\!\left[\widehat\theta\right] +
\mathbb{E}\!\left[\widehat\theta\right] - \theta\right)^2 \right]\\
&= \mathbb{E}\!\left[ \left(\widehat\theta - \mathbb{E}\!\left[\widehat\theta\right]\right)^2 \right] + \left(
\mathbb{E}\!\left[\widehat\theta\right] - \theta \right)^2 +
2\underbrace{\mathbb{E}\!\left[ \widehat\theta - \mathbb{E}\!\left[\widehat\theta\right] \right]}_{=0}
\left(\mathbb{E}\!\left[\widehat\theta\right] - \theta \right)\\
&= V + B^2\end{aligned}$$
An estimator is unbiased if $B=0$. Then $\mathit{MSE} = V$, the variance.
Let $x_1, \ldots, x_n \overset{iid}\sim N(\mu,\sigma^2)$.
$$\begin{aligned}
\mathbb{E}\!\left[\overline{x}\right] &= \mu, & \mathbb{E}\!\left[s^2\right] &= \sigma^2\\
\mathbb{E}\!\left[(\overline{x} - \mu)^2\right] &= \frac{\sigma^2}n
=O\left(\frac1n\right) &
\mathbb{E}\!\left[(s^2 - \sigma^2)^2 \right] &= \frac{2\sigma^4}{n-1} =
O\left(\frac1n\right).\end{aligned}$$
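Both rates are easy to verify by Monte Carlo; here is a minimal sketch with made-up values of $n$, $\mu$, and $\sigma$ (not part of the notes).
```{r mse-demo, eval=FALSE}
# Monte Carlo check of E[(xbar - mu)^2] = sigma^2/n and
# E[(s^2 - sigma^2)^2] = 2 sigma^4/(n-1).
set.seed(42)
n = 50; mu = 1; sigma = 2
sims = replicate(5000, {
  x = rnorm(n, mu, sigma)
  c((mean(x) - mu)^2, (var(x) - sigma^2)^2)
})
rowMeans(sims)                       # simulated values
c(sigma^2/n, 2*sigma^4/(n-1))        # theoretical values
```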
## Minimaxity
Let $\mathcal{P}$ be a set of distributions. Let $\theta$ be a parameter and let $L(\theta, \theta')$ be a loss function.
The __minimax risk__ is
\[
R_n(\mathcal{P}) = \inf_{\hat\theta} \sup_{P\in\mathcal{P}} \mathbb{E}_P[L(\theta,\hat\theta)]
\]
If $\sup_{P\in\mathcal{P}} \E_P [L(\theta, \hat\theta)] = R_n$ then $\hat\theta$ is a minimax estimator.
* $X_1, X_2, \dots, X_n \overset{iid}{\sim} N(\theta, 1)$. Then $\bar X$
is minimax for many loss functions. Its risk is $R_n = \frac{1}{n}$,
which is the "parametric rate".
* $X_1, X_2,\dots, X_n \sim f$, where $f \in \F$ is some density. Let
$\F$ be the class of smooth densities: $\F = \left\{ f ; \int (f'')^2 < c_0\right\}$
Then $R_n \leq C n^{-4/5}$ for
$L(\hat{f}, f) = \int(f-\hat{f})^2 dx.$
# Linear model, introduction
## The Setup
Suppose we have data
$$\mathcal{D}= \{ (X_1 , Y_1), (X_2 , Y_2), \ldots, (X_n , Y_n) \},$$
where
- $X_i \in \mathbb{R}^p$ are the _features_
(or explanatory variables or
predictors or
covariates. NOT INDEPENDENT VARIABLES!)
- $Y_i \in \mathbb{R}$ are the response
variables.
(NOT DEPENDENT VARIABLE!)
Our goal for this class is to find a way to explain (at least
approximately) the relationship between $X$
and $Y$.
## Prediction risk for regression
Given the _training data_ $\mathcal{D}$, we
want to predict some independent _test data_
$Z = (X,Y)$
This means forming a $\hat f$, which is a function of both the range of
$X$ and the training data $\mathcal{D}$, which provides predictions
$\hat Y = \hat f(X)$.
The quality of this prediction is measured via the prediction risk
$$R(\hat{f}) = \Expect{(Y - \hat{f}(X))^2}.$$
We know that the _regression function_,
$f_*(X) = \mathbb{E}[Y | X]$, is the best possible predictor.
Note that $f_*$ is *unknown*.
## A linear model: Multiple regression
If we assume:
$f_*(X) = X^{\top}\beta = \sum_{j=1}^p x_j \beta_j$
$$\Rightarrow Y_i = X_i^{\top}\beta + \epsilon_i$$.
There's generally no reason to make this assumption.
We'll examine a few cases:
1. $f_*$ is linear, low dimensions.
2. $f_*$ is ~~not~~ linear, but we use a linear model anyway
3. $f_*$ is linear, high dimensions.
4. $f_*$ isn't linear, high dimensions.
Important:
Calling $f_*$ "linear" means that $f_*(x) = x'\beta$
## Kernelization
We'll come back to this more rigorously later in the course.
Suppose $x\in[0,2\pi]$ and $f_*(x) = \sin(x)$.
$f_*$ isn't linear in $x$.
But
\[
\sin(x) = \beta_1 x + \beta_2 x^2 +\beta_3 x^3 + \cdots = \sum_{j=1}^\infty \beta_j x^j
\]
by Taylor's theorem (the same trick works for any sufficiently smooth function).
If I have some feature map $\Phi: x \mapsto [\phi_1(x), \ldots, \phi_K(x)]$, then I can estimate a linear model using the new features $\phi_1,\ldots,\phi_K$.
I can even take $K=\infty$.
This is still a "linear" model in the sense we're using today, though it isn't "linear" in the original $x$.
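Here is a small illustrative sketch of the idea (my own example, not from the notes): expand $x$ into polynomial features and fit by ordinary least squares, which is linear in the transformed features.
```{r kernel-demo, eval=FALSE}
# Fit sin(x) with a model that is linear in phi_1(x), ..., phi_K(x).
set.seed(42)
n = 100
x = runif(n, 0, 2*pi)
y = sin(x) + rnorm(n, sd=.1)
fit = lm(y ~ poly(x, 7))                      # K = 7 polynomial features
grid = seq(0, 2*pi, length.out=200)
plot(x, y, col='grey50')
lines(grid, predict(fit, data.frame(x=grid)), col=blue, lwd=2)
lines(grid, sin(grid), lty=2)                 # the true f_*
```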
## Low-dimension, high-assumptions
Let $x_i \in \R^p$, $p<n$.
If $f_*$ is linear and $\epsilon_i \sim N(0,\sigma^2)$ (independent),
then all the good things happen:
1. $R(\hat f) = \sigma^2\left[1 + \frac{p}{n}\right]$
2. $\norm{\beta_*-\hat\beta}_2^2 = O_p(p/n)$
3. Coefficient estimates are normally distributed.
4. etc.
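A hedged simulation of point 2 (the iid $N(0,I_p)$ design and the choice $\beta=(1,\ldots,1)$ are my own assumptions, not from the notes): the squared estimation error of least squares shrinks roughly like $p/n$.
```{r lowdim-risk-demo, eval=FALSE}
# Average squared estimation error of OLS under an assumed N(0, I_p) design.
set.seed(42)
sim_err = function(n, p, reps=200){
  beta = rep(1, p)
  mean(replicate(reps, {
    X = matrix(rnorm(n*p), n, p)
    y = X %*% beta + rnorm(n)
    sum((coef(lm(y ~ X - 1)) - beta)^2)
  }))
}
c(sim_err(100, 5), 5/100)   # compare to p/n
c(sim_err(400, 5), 5/400)
```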
## Low-dimension, low-assumptions
Let $\beta_* = \argmin_\beta \Expect{(Y-X\beta)^2}$ be the best linear predictor for the feature $X$.
Note that this is well defined: $\beta_* = \E[XX']^{-1}\E[XY] =: \Sigma_{XX}^{-1}\sigma_{XY}$.
Call $R(\beta_*) = \Expect{(Y-X\beta_*)^2}$.
We call $R(\beta)-R(\beta_*)$ the _excess risk_ of using $\beta$ relative to the best linear predictor $\beta_*$.
Note that
\[
R(\beta)-R(\beta_*) = (\beta-\beta_*)'\Sigma (\beta-\beta_*).
\]
Then (simplified result; see Theorem 11.3 of @GyorfiKohler),
\[
R(\hat\beta_{ls}) \leq C \left[R(\beta_*) + \frac{p \log n}{n}\right]
\]
Note that if the model were linear, $R(\beta_*) = \sigma^2$
We also have a CLT [see @BerkBrown2014]:
\[
\begin{aligned}
\sqrt{n}(\hat\beta_{ls}-\beta_*) & \cdist N(0,\Gamma)\\
\Gamma &= \Sigma^{-1} \Expect{(Y-X\beta_*)^2 XX'} \Sigma^{-1}
\end{aligned}
\]
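As an illustrative sketch (my own example, with a deliberately nonlinear truth), $\Gamma$ can be estimated by plugging OLS residuals into the sandwich formula; the result should agree with `sandwich::vcovHC(fit, type = "HC0")` if that package is installed.
```{r sandwich-demo, eval=FALSE}
# Plug-in estimate of Gamma = Sigma^{-1} E[(Y - X beta_*)^2 XX'] Sigma^{-1}.
set.seed(42)
n = 500
x = cbind(1, rnorm(n))
y = x[,2] + .3*x[,2]^2 + rnorm(n)   # the truth is not linear
fit = lm(y ~ x - 1)
e = residuals(fit)
Sigma_hat = crossprod(x)/n
meat = crossprod(x*e)/n             # (1/n) sum_i e_i^2 x_i x_i'
Gamma_hat = solve(Sigma_hat) %*% meat %*% solve(Sigma_hat)
Gamma_hat/n                         # approximate Var(hat beta_ls)
```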
# Bias and variance
## Prediction risk for regression
Note that $R(\hat{f})$ can be written as
$$R(\hat{f}) = \int \textrm{bias}^2(x) d\mathbb{P}_X + \int \textrm{var}(x) d\mathbb{P}_X + \sigma^2$$
where
$$\begin{aligned}
\textrm{bias}(x) & = \Expect{\hat{f}(x)} - f_*(x)\\
\textrm{var}(x) & = \Var{\hat{f}(x)} \\
\sigma^2 & = \Expect{(Y - f_*(X))^2}
\end{aligned}$$
This decomposition applies to much more general loss functions [@James2003]
```{r,fig.align='center',fig.height=6, fig.width=6, echo=FALSE, message=FALSE}
cols = c(blue, red, green, orange)
par(mfrow=c(2,2),bty='n',ann=FALSE,xaxt='n',yaxt='n',family='serif',mar=c(0,0,0,0),oma=c(0,2,2,0))
require(mvtnorm)
mv = matrix(c(0,0,0,0,-.5,-.5,-.5,-.5),4,byrow=T)
va = matrix(c(.01,.01,.5,.5,.05,.05,.5,.5),4,byrow=T)
for(i in 1:4){
plot(0,0,ylim=c(-2,2),xlim=c(-2,2),pch=19,cex=50,col=blue,ann=FALSE,pty='s')
points(0,0,pch=19,cex=30,col='white')
points(0,0,pch=19,cex=20,col=green)
points(0,0,pch=19,cex=10,col=orange)
points(rmvnorm(20,mean=mv[i,],sigma=diag(va[i,])), cex=1, pch=19)
switch(i,
'1'= {
mtext('low variance',3,cex=2)
mtext('low bias',2,cex=2)
},
'2'= mtext('high variance',3,cex=2),
'3' = mtext('high bias',2,cex=2)
)
}
```
## Bias-variance tradeoff
This can be heuristically thought of as
$$\textrm{Prediction risk} = \textrm{Bias}^2 + \textrm{Variance}.$$
There is a natural tradeoff between these quantities:
Low bias $\rightarrow$ complex model $\rightarrow$ many parameters
$\rightarrow$ high variance
The opposite also holds
(Think: $\hat f \equiv 0$.)
We'd like to 'balance' these quantities to get the best possible
predictions
## Classical regime
The Gauss-Markov theorem assures us that OLS is the best linear
_unbiased_ estimator of $\beta$
Also, it is the maximum likelihood estimator under a homoskedastic,
independent Gaussian model, and it has the other nice properties listed above.
Does that necessarily mean it is any good?
Write $\X= U D V'$ for the SVD of the design matrix $\X$.
Then
$\Var{\hat\beta_{LS}} \propto (\X^{\top}\X)^{-1}
=
VD^{-1}\underbrace{U^{\top}U}_{=I} D^{-1} V^{\top}
=
VD^{-2} V^{\top}$
Thus,
$$\mathbb{E}|| \hat\beta_{LS} - \beta||_2^2 = \textrm{trace}(\mathbb{V}\hat\beta) \propto \sum_{j=1}^p \frac{1}{d_j^2}$$
_Important:_ Even in the
classical regime, we can do arbitrarily badly if $d_p \approx 0$!
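A minimal numerical check (the nearly collinear design below is my own construction, not from the slides): when $d_p \approx 0$, the sum $\sum_j 1/d_j^2$ explodes.
```{r svd-cond-demo, eval=FALSE}
# Two nearly collinear columns => d_p near 0 => huge sum(1/d_j^2).
set.seed(42)
n = 100
z = rnorm(n)
X = cbind(z, z + rnorm(n, sd=1e-3), rnorm(n))
d = svd(X)$d
d              # the last singular value is tiny
sum(1/d^2)     # proportional to E||hat beta - beta||_2^2
```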
## An example
```{r, echo=FALSE}
set.seed(20181002)
n = 60
x = runif(n)
y = sin(2*pi*x) + rnorm(n)
bigx = 1:100/100
df = data.frame(x = bigx, y = sin(2*pi*bigx), # true function on the grid
p01 = predict(lm(y~x),newdata = data.frame(x=bigx)),
p03 = predict(lm(y~poly(x,3)),newdata = data.frame(x=bigx)),
p10 = predict(lm(y~poly(x,10)),newdata = data.frame(x=bigx)),
p19 = predict(lm(y~poly(x,19)),newdata = data.frame(x=bigx)))
```
```{r, echo=FALSE,fig.height=4,fig.width=6}
gathered = df %>% gather(key='model',value='value',-x,-y)
ggplot(data.frame(x,y), aes(x=x,y=y)) + geom_point() + coord_cartesian(ylim=c(-3,3)) +
geom_line(data=gathered, aes(x=x,y=value,color=model), linetype='dotted') +
scale_color_brewer(palette='Set1') + stat_function(fun=function(x) sin(2*pi*x))
```
Using a Taylor series, for all $x$
$$\sin(x) = \sum_{q = 0}^\infty \frac{(-1)^q x^{2q+1}}{(2q + 1)!} = \Phi(X)^{\top}\beta$$
Additional polynomial terms will __reduce__ the bias but the variance can get nasty.
## Returning to polynomial example: Variance
The least squares solution is given by solving
$\min\norm{\X\beta - Y}_2^2$
$$\X=
\begin{bmatrix}
1 & X_1 & \ldots & X_1^{p-1} \\
& \vdots && \\
1 & X_n & \ldots & X_n^{p-1} \\
\end{bmatrix},$$
is the associated Vandermonde matrix.
This matrix is well known for being numerically unstable
(Letting $\X = UDV^{\top}$, this means that $d_1/d_p \rightarrow \infty$)
Hence $$\norm{(\mathbb{X}^{\top}\mathbb{X})^{-1}}_2 = \frac{1}{d_p^2}$$
grows larger, where here $\norm{\cdot}_2$ is the spectral (operator)
norm.
In the example, I used the _orthogonal_ polynomials, so $d_j=1$.
So, $\tr{\Var{\hat\beta}} = \sigma^2 p$.
__Conclusion__:
Fitting the full
least squares model, even in low dimensions, can lead to poor
prediction/estimation performance.
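A quick illustration of the conditioning point (my own example): compare the condition number of the raw power basis with the orthogonal polynomials used in the figure above.
```{r vandermonde-demo, eval=FALSE}
# Raw powers x, x^2, ..., x^10 vs orthogonal polynomials of the same degree.
set.seed(42)
x = runif(60)
Xraw  = poly(x, 10, raw=TRUE)
Xorth = poly(x, 10)
c(kappa(Xraw), kappa(Xorth))   # roughly d_1/d_p; the raw design is enormous
```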
## Big data regime
_Big data:_ The computational complexity
scales extremely quickly. This means that procedures that are feasible
classically are not for large data sets
_Example:_ Fit $\hat\beta_{LS}$
with $\mathbb{X}\in \mathbb{R}^{ n \times p}$. Next fit $\hat\beta_{LS}$
with $\mathbb{X}\in \mathbb{R}^{ 3n \times 4p}$
The second case will take $\approx$ $(3*4^2) = 48$ times longer to
compute, as well as $\approx 12$ times as much memory!
(Actually, for software such as `R` it might take
36 times as much memory, though there are data structures specifically
engineered for this purpose that update objects 'in place')
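A rough sanity check of the memory claim (the dimensions are arbitrary; a numeric entry takes 8 bytes in `R`):
```{r memory-demo, eval=FALSE}
format(object.size(matrix(0, 1e4, 100)), units='MB')    # n x p
format(object.size(matrix(0, 3e4, 400)), units='MB')    # 3n x 4p: about 12 times larger
```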
## High dimensional regime
_High dimensional:_ These problems tend to have
many of the same computational problems as "Big
data", as well as a "rank
problem"
Suppose $\mathbb{X}\in \mathbb{R}^{n \times p}$ and $p > n$
Then $\textrm{rank}(\mathbb{X}) = n$ and the equation
$\mathbb{X}\hat{\beta} = Y$:
- can be solved *exactly* (that is; the training error is 0)
- has an infinite number of solutions
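A tiny demonstration of both bullets (simulated data, with sizes of my own choosing):
```{r pgtn-demo, eval=FALSE}
# With p > n, least squares interpolates the training data exactly and cannot
# identify all coefficients.
set.seed(42)
n = 20; p = 30
X = matrix(rnorm(n*p), n, p)
y = rnorm(n)
fit = lm(y ~ X - 1)
max(abs(residuals(fit)))   # numerically zero: training error is 0
sum(is.na(coef(fit)))      # p - n = 10 coefficients left undetermined
```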
```{r, echo=FALSE, fig.width=12,fig.height=6}
x = -200:200/100
df = data.frame(x = x, low = x^2,
high = x^2*(abs(x)>=1) + (abs(x)<1))
df %>% gather(key='dim',value='value',-x) %>%
ggplot(aes(x=x,y=value)) + geom_line(aes(color=dim),size=2) +
facet_wrap(~dim, strip.position = 'bottom') + scale_color_manual(values=colvec) +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.position = 'none',
strip.text = element_text(size=18))
```
# High dimensional linear methods
## Theoretical meanings
1. Low dimensional
- finite sample $p<n$
- asymptotics $p/n \rightarrow 0$
2. High dimensional
- finite sample $p>n$
- asymptotics $p/n \rightarrow \infty$
3. "Big data"
- finite sample $n$ or $p$ or both are huge
- no real asymptotic interpretation
## Approaches for big data
1. Reduce the dimension. Try PCA on the features, cluster features, screen the features.
2. Use all the covariates, but shrink the coefficients.
3. Select some useful covariates, throw away the rest.
## Shrinkage
One way to do this for regression is to solve (say):
\[
\begin{aligned}
\min_\beta &\frac{1}{n}\sum_i (y_i-x'_i \beta)^2 \\
\mbox{s.t.} & \sum_j \beta^2_j \leq t
\end{aligned}
\]
for some $t>0$.
* This is called "ridge regression".
* The ~~minimizer~~ of this problem is called $\brt$
Compare this to least squares:
\[
\begin{aligned}
\min_\beta &\frac{1}{n}\sum_i (y_i-x'_i \beta)^2 \\
\mbox{s.t.} & \beta \in \R^p
\end{aligned}
\]
## Geometry of ridge regression (2 dimensions)
```{r plotting-functions, echo=FALSE}
library(mvtnorm)
normBall <- function(q=1, len=1000){
tg = seq(0,2*pi, length=len)
out = data.frame(x = cos(tg)) %>%
mutate(b=(1-abs(x)^q)^(1/q), bm=-b) %>%
gather(key='lab', value='y',-x)
out$lab = paste0('"||" * beta * "||"', '[',signif(q,2),']')
return(out)
}
ellipseData <- function(n=100,xlim=c(-2,3),ylim=c(-2,3),
mean=c(1,1), Sigma=matrix(c(1,0,0,.5),2)){
df = expand.grid(x=seq(xlim[1],xlim[2],length.out = n),
y=seq(ylim[1],ylim[2],length.out = n))
df$z = dmvnorm(df, mean, Sigma)
df
}
lballmax <- function(ed,q=1,tol=1e-6){
ed = filter(ed, x>0,y>0)
for(i in 1:20){
ff = abs((ed$x^q+ed$y^q)^(1/q)-1)<tol
if(sum(ff)>0) break
tol = 2*tol
}
best = ed[ff,]
best[which.max(best$z),]
}
```
```{r,echo=FALSE}
nb = normBall(2)
ed = ellipseData()
bols = data.frame(x=1,y=1)
bhat = lballmax(ed, 2)
ggplot(nb,aes(x,y)) + xlim(-2,2) + ylim(-2,2) + geom_path(color=red) +
geom_contour(mapping=aes(z=z), color=blue, data=ed, bins=7) +
geom_vline(xintercept = 0) + geom_hline(yintercept = 0) +
geom_point(data=bols) + coord_equal() +
geom_label(data=bols, mapping=aes(label=bquote('hat(beta)[ls]')), parse=TRUE,
nudge_x = .3, nudge_y = .3) +
geom_point(data=bhat) + xlab(expression(beta[1])) + ylab(expression(beta[2])) +
geom_label(data=bhat, mapping=aes(label=bquote('hat(beta)[rt]')), parse=TRUE,
nudge_x = -.4, nudge_y = -.4)
```
## Ridge regression
An equivalent way to write
\[
\brt = \arg \min_{ || \beta ||_2^2 \leq t} \frac{1}{n}\sum_i (y_i-x'_i \beta)^2
\]
is in the ~~Lagrangian~~ form
\[
\brl = \arg \min_{ \beta} \frac{1}{n}\sum_i (y_i-x'_i \beta)^2 + \lambda || \beta ||_2^2.
\]
For every $\lambda$ there is a unique $t$ (and vice versa) that makes
\[
\brt = \brl
\]
Observe:
* $\lambda = 0$ (or $t = \infty$) makes $\brl = \bls$
* Any $\lambda > 0$ (or $t <\infty$) penalizes larger values of $\beta$, effectively shrinking them.
Note: $\lambda$ and $t$ are known as ~~tuning parameters~~
## Ridge regression path
```{r load-data, echo=FALSE}
prostate = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data", header = TRUE)
Y = prostate$lpsa
X = as.matrix(prostate[, !(names(prostate) %in% c('lpsa','train'))])
n = length(Y)
p = ncol(X)
library(glmnet)
ridge = glmnet(x=X,y=Y,alpha=0)
df = data.frame(as.matrix(t(ridge$beta)))
df$lambda = ridge$lambda
gather(df, key='predictor',value='coefficient',-lambda) %>%
ggplot(aes(x=lambda,y=coefficient,color=predictor)) + geom_path() +
scale_x_log10() + scale_color_brewer(palette = 'Set1')
```
## Least squares is invariant to rescaling, regularized methods aren't
Let's multiply our design matrix by a factor of 10 to get $\widetilde{\X} = 10\X$.
Then:
\[
\widetilde{\beta}_{\textrm{ls}} = (\widetilde{\X}^{\top} \widetilde{\X})^{-1} \widetilde{\X}^{\top} Y = \frac{1}{10}(\X^{\top} \X)^{-1} \X^{\top} Y =
\frac{\widehat\beta_{\textrm{ls}}}{10}
\]
So, multiplying our data by ten just results in our estimates being divided by ten.
Hence, any prediction is left unchanged:
\[
\widetilde{\X}\widetilde\beta_{\textrm{ls}} = \X \bls
\]
This means, for instance, if we have a covariate measured in miles, then we will get the "same" answer if we change it to kilometers
* `lm.ridge` automatically scales every column of $\X$ to have mean zero and Euclidean norm 1.
* It also centers $Y$.
* Together, this means there is no intercept. (We don't penalize the intercept)
* In `R`: `scale(X)` defaults to mean 0, SD 1. But you can change either.
* Another version is in the package `glmnet`. More on this in a bit.
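A hedged check of this point (simulated data and an arbitrary $\lambda$; `standardize = FALSE` is set so the lack of invariance is visible):
```{r scaling-demo, eval=FALSE}
# OLS predictions are unchanged when a column changes units; unstandardized
# ridge predictions are not. (Xs, ys are simulated, not the prostate data.)
library(glmnet)
set.seed(42)
n = 100; p = 5
Xs = matrix(rnorm(n*p), n, p)
ys = drop(Xs %*% rep(1,p) + rnorm(n))
Xs2 = Xs; Xs2[,1] = 100*Xs2[,1]     # change the units of one feature
max(abs(fitted(lm(ys ~ Xs)) - fitted(lm(ys ~ Xs2))))   # ~ 0: least squares is invariant
r1 = predict(glmnet(Xs,  ys, alpha=0, lambda=1, standardize=FALSE), Xs)
r2 = predict(glmnet(Xs2, ys, alpha=0, lambda=1, standardize=FALSE), Xs2)
max(abs(r1 - r2))                                       # not 0: ridge is not
```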
## Solving the minimization
* One nice thing about ridge regression is that it has a closed-form solution (like OLS)
\[
\brl = (\X'\X + \lambda I)^{-1}\X'Y
\]
* This is easy to calculate in `R` for any $\lambda$. But you need to recalculate for each $\lambda$.
* Computations and interpretation are simplified if we examine the Singular Value Decomposition of $\X = UDV'$.
* Then,
\[
\brl = (\X'\X + \lambda I)^{-1}\X'Y = (VD^2V' + \lambda I)^{-1}VDU'Y
= V(D^2+\lambda I)^{-1} DU'Y.
\]
* For computations, now we only need to invert a diagonal matrix.
* For interpretations, we can compare this to OLS:
\[
\bls = (\X'\X)^{-1}\X'Y = (VD^2V')^{-1}VDU'Y = VD^{-2}DU'Y = VD^{-1}U'Y
\]
* Notice that $\bls$ depends on $d_j/d_j^2$ while $\brl$ depends on $d_j/(d_j^2 + \lambda)$.
* Ridge regression makes the coefficients smaller relative to OLS.
* But if $\X$ has small singular values, ridge regression compensates with $\lambda$ in the denominator.
_Finally,_
* If $p>n$, then $(\X'\X+\lambda I_p)^{-1}$ requires $O(p^3)$ computations and $O(p^2)$ storage
* But $\X'(\X\X' + \lambda I_n)^{-1}$ requires $O(n^3)$ computations and $O(n^2)$ storage
Searle's matrix identity shows that these are equal.
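A numerical sanity check of that identity, with arbitrary dimensions of my own choosing:
```{r searle-demo, eval=FALSE}
# (X'X + lambda I_p)^{-1} X'Y  equals  X'(XX' + lambda I_n)^{-1} Y (p > n case).
set.seed(42)
nn = 30; pp = 100; lambda = 2
Xw = matrix(rnorm(nn*pp), nn, pp)
Yw = rnorm(nn)
b1 = solve(crossprod(Xw) + lambda*diag(pp), crossprod(Xw, Yw))
b2 = crossprod(Xw, solve(tcrossprod(Xw) + lambda*diag(nn), Yw))
max(abs(b1 - b2))   # ~ 1e-15
```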
## Ridge regression and multicollinearity
Multicollinearity is a phenomenon in which a combination of predictor variables is extremely similar to another predictor variable. Some comments:
* A better phrase that is sometimes used is "$\X$ is ill-conditioned"
* It means that one of its columns is nearly (or exactly) a linear combination of other columns. This is sometimes known as "(numerically) rank-deficient".
* If $\X = U D V'$ is ill-conditioned, then some elements of $D$ are nearly zero
* If we form $\bls= V D^{-1} U' Y$, then we see that the small
entries of $D$ are now huge (due to the inverse). This in turn creates a huge variance.
* Recall: $\mathbb{V} \bls = \sigma^2(\X'\X)^{-1} = \sigma^2 V D^{-2} V'$
Ridge Regression fixes this problem by preventing the division by a near zero number
Conclusion: $(\X^{\top}\X)^{-1}$ can be really unstable, while $(\X^{\top}\X + \lambda I)^{-1}$ is not.
## Ridge theory
Recall that $\beta'_*x$ is the best linear approximation to $f_*(x)$.
If $\norm{x}_\infty< r$, then [@HsuKakade2014]
\[
R(\brl) - R(\beta_*) \leq \left(1+ O\left(\frac{1+r^2/\lambda}{n}\right)\right)
\frac{\lambda\norm{\beta_*}_2^2}{2} + \frac{\sigma^2\tr{\Sigma}}{2n\lambda}
\]
Optimizing over $\lambda$, and setting $B=\norm{\beta_*}$ gives
\[
R(\brl) - R(\beta_*) \leq \sqrt{\frac{\sigma^2r^2B^2}{n}\left(1+O(1/n)\right)} +
O\left(\frac{r^2B^2}{n}\right)
\]
Lower bound
\[
\inf_{\hat\beta}\sup_{\beta_*} R(\hat\beta) - R(\beta_*) \geq C\sqrt{\frac{\sigma^2r^2B^2}{n}}
\]
We call this behavior _rate minimax_; essentially, it means
\[
R(\hat\beta) - R(\beta_*) = O\left(\inf_{\hat\beta}\sup_{\beta_*} R(\hat\beta) - R(\beta_*)\right)
\]
In this setting, Ridge regression does as well as we could hope, up to constants.
## Bayes interpretation
If
1. $Y=X'\beta + \epsilon$,
2. $\epsilon\sim N(0,\sigma^2)$
3. $\beta\sim N(0,\tau^2 I_p)$,
Then, the posterior mean (median, mode) is the ridge estimator with $\lambda=\sigma^2/\tau^2$.
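A quick numerical check of this equivalence (my own simulated example, using the unscaled objective $\norm{Y-\X\beta}_2^2 + \lambda\norm{\beta}_2^2$):
```{r bayes-ridge-demo, eval=FALSE}
# Posterior mean under the Gaussian prior vs the ridge formula with
# lambda = sigma^2/tau^2. (Simulated data; my own sizes.)
set.seed(42)
nn = 50; pp = 3; sigma = 1; tau = 2
Xb = matrix(rnorm(nn*pp), nn, pp)
Yb = Xb %*% rnorm(pp, sd=tau) + rnorm(nn, sd=sigma)
post = solve(crossprod(Xb)/sigma^2 + diag(pp)/tau^2, crossprod(Xb, Yb)/sigma^2)
ridge = solve(crossprod(Xb) + (sigma^2/tau^2)*diag(pp), crossprod(Xb, Yb))
max(abs(post - ridge))   # ~ 1e-15
```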
# Subset selection methods
## Best subsets
If we imagine that only a few predictors are relevant, we could solve
\[
\min_{\beta\in\R^p} \norm{Y-\X\beta}_2^2 + \lambda\norm{\beta}_0
\]
The $\ell_0$-norm counts the number of non-zero coefficients.
This may or may not be a good thing to do.
It is computationally infeasible if $p$ is more than about 20.
It is technically NP-hard (a naive search must fit each of the $2^p$ possible models).
Though see [@BertsimasKing2016] for a method of solving reasonably large cases via mixed integer programming.
## Packages
Because this is an NP-hard problem, we fall back on greedy algorithms.
1. Forward stepwise
2. Backward stepwise
3. Add/remove
All are implemented by the `regsubsets` function in the `leaps` package.
It also will do exhaustive search.
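A short usage sketch, assuming the prostate `X` and `Y` loaded earlier in these notes:
```{r regsubsets-demo, eval=FALSE}
library(leaps)
fwd = regsubsets(x=X, y=Y, method='forward', nvmax=ncol(X))
summary(fwd)$which   # which predictors enter at each model size
summary(fwd)$bic     # BIC for each model size
```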
## Theory
This result is due to @FosterGeorge1994. Assume that
1. the truth is linear,
2. $\lambda = C\sigma^2\log p$, and
3. $\beta_*$ has $s$ nonzero entries ($\norm{\beta_*}_0 = s$).
Then
\[
\frac{\Expect{\norm{\X\beta_*-\X\hat\beta}_2^2}/n}{s\sigma^2/n} \leq 4\log p + 2 + o(1).
\]
They also prove a lower bound
\[
\inf_{\hat\beta}\sup_{\X,\beta_*} \frac{\Expect{\norm{\X\beta_*-\X\hat\beta}_2^2}/n}{s\sigma^2/n} \geq 2\log p - o(\log p).
\]
~~Important:~~
- even if we could compute the subset selection estimator at scale, it’s not clear that we would want to
- (Many people assume that we would.)
- theory provides an understanding of the performance of various estimators under typically idealized conditions
# Lasso and family
## Can we get the best of both worlds?
To recap:
* Deciding which predictors to include, adding quadratic terms, or interactions is ~~model selection~~.
* Ridge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity.
Ridge regression: $\min ||Y-\X\beta||_2^2 \textrm{ subject to } ||\beta||_2^2 \leq t$
Best linear regression model: $\min ||Y-\X\beta||_2^2 \textrm{ subject to } ||\beta||_0 \leq t$
$||\beta||_0$ is the number of nonzero elements in $\beta$
Finding the best linear model is a nonconvex optimization problem (In fact, it is NP-hard)
Ridge regression is convex (easy to solve), but doesn't do model selection
Can we somehow "interpolate" to get both?
## Geometry of convexity
```{r,echo=FALSE}
nbs = list()
nbs[[1]] = normBall(0,1)
qs = c(.5,.75,1,1.5,2)
for(ii in 2:6) nbs[[ii]] = normBall(qs[ii-1])
nbs = bind_rows(nbs)
nbs$lab = factor(nbs$lab, levels = unique(nbs$lab))
seg = data.frame(lab=levels(nbs$lab)[1],
x0=c(-1,0),x1=c(1,0),y0=c(0,-1),y1=c(0,1))
levels(seg$lab) = levels(nbs$lab)
ggplot(nbs, aes(x,y)) + geom_path(size=1.2) + facet_wrap(~lab,labeller = label_parsed) +
geom_segment(data=seg,aes(x=x0,xend=x1,y=y0,yend=y1),size=1.2) +
coord_equal() + geom_vline(xintercept = 0,size=.5) +
geom_hline(yintercept = 0,size=.5) +
theme(strip.text.x = element_text(size = 12))
```
## The best of both worlds
```{r, echo=FALSE}
nb = normBall(1)
ed = ellipseData()
bols = data.frame(x=1,y=1)
bhat = lballmax(ed, 1)
ggplot(nb,aes(x,y)) + xlim(-2,2) + ylim(-2,2) + geom_path(color=red) +
geom_contour(mapping=aes(z=z), color=blue, data=ed, bins=7) +
geom_vline(xintercept = 0) + geom_hline(yintercept = 0) +
geom_point(data=bols) + coord_equal() +
geom_label(data=bols, mapping=aes(label=bquote('hat(beta)[ls]')), parse=TRUE,
nudge_x = .3, nudge_y = .3) +
geom_point(data=bhat) + xlab(expression(beta[1])) + ylab(expression(beta[2])) +
geom_label(data=bhat, mapping=aes(label=bquote('hat(beta)[lt]')), parse=TRUE,
nudge_x = -.4, nudge_y = -.4)
```
This regularization set...
* ... is convex (computationally efficient)
* ... has corners (performs model selection)
## $\ell_1$-regularized regression
Known as
* "lasso"
* "basis pursuit"
The estimator satisfies
\[
\blt = \arg\min_{ ||\beta||_1 \leq t} ||Y-\X\beta||_2^2
\]
In its corresponding Lagrangian dual form:
\[
\bll = \arg\min_{\beta} ||Y-\X\beta||_2^2 + \lambda ||\beta||_1
\]
## Lasso
While the ridge solution can be easily computed
\[
\brl = \arg\min_{\beta} ||Y-\X\beta||_2^2 + \lambda ||\beta||_2^2 = (\X^{\top}\X + \lambda I)^{-1} \X^{\top}Y
\]
the lasso solution
\[
\bll = \arg\min_{\beta} ||Y-\X\beta||_2^2 + \lambda ||\beta||_1 = \; ??
\]
doesn't have a closed form solution.
However, because the optimization problem is convex, there exist efficient algorithms for computing it.
We'll talk algorithms next week.
## Coefficient path: ridge vs lasso
```{r,echo=FALSE}
library(lars)
lasso = glmnet(x=X,y=Y)
ridge = glmnet(x=X,y=Y,alpha=0)
df = data.frame(as.matrix(t(ridge$beta)))
df$norm = sqrt(rowSums(df^2))
df1 = data.frame(as.matrix(t(lasso$beta)))
df1$norm = rowSums(abs(df1))
df$method = 'ridge'
df1$method = 'lasso'
out = bind_rows(df[-1,],df1)
gather(out, key='predictor',value='coefficient',-norm,-method) %>%
ggplot(aes(x=norm,y=coefficient,color=predictor)) + geom_path() +
facet_wrap(~method,scales = 'free_x') +
#scale_x_log10() +
scale_color_brewer(palette = 'Set1')
```
## Packages
There are two main `R` implementations for finding lasso
* Using `glmnet`: `lasso.out = glmnet(X, Y, alpha=1)`.
* Setting `alpha=0` gives ridge regression (as does `lm.ridge` in the `MASS` package)
* Setting `alpha` $\in (0,1)$ gives a method called the "elastic net" which combines ridge regression and lasso.
* Alternatively, there is `lars`: `lars.out = lars(X, Y)`
* `lars` also does other things called "Least angle", "forward stepwise", and "forward stagewise" regression
## Two packages
1. `lars` (this is the first one)
2. `glmnet` (this one is faster)
These use different algorithms, but both compute the ~~path~~ for a range of $\lambda$.
`lars`
- starts with an empty model and adds coefficients until saturated
- Uses the entire sequence of $\lambda$ based on the nature of the optimization problem.
- Doesn't support other likelihoods
- Slower
- The path returned by `lars` is more useful than that returned by `glmnet`.
`glmnet`
- starts with an empty model and examines each value of $\lambda$ using previous values as "warm starts".
- $\lambda \in \left[\epsilon\norm{\X'Y}_\infty,\ \norm{\X'Y}_\infty\right)$ for some small $\epsilon$
- It is generally much faster than `lars` and uses lots of other tricks (as well as compiled code) for extra speed.
- Easier to cross validate sensibly
- Can occasionally give boundary solutions, or too many non-zero coefficients
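For example, a typical cross-validated lasso workflow with `glmnet` (again using the prostate `X` and `Y`) might look like:
```{r cv-lasso-demo, eval=FALSE}
cvfit = cv.glmnet(X, Y, alpha=1)   # 10-fold CV over the lambda path
cvfit$lambda.min                   # lambda minimizing CV error
coef(cvfit, s='lambda.1se')        # a sparser fit within one SE of the minimum
```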
## Lasso paths
```{r}
lasso = glmnet(X,Y)