First we will define Regularized Loss Minimization and see how the stability of learning algorithms and overfitting are connected. Then we will prove some general stability bounds for Tikhonov regularization. To get useful bounds, we have to add further assumptions, such as a Lipschitz or a $\beta$-smooth loss function; however, only Lipschitz loss functions are treated in detail here. We will prove that learning problems with a convex-Lipschitz-bounded loss function and Tikhonov regularization are APAC learnable. We will also see (without proof) a similar result for Ridge Regression, whose loss function is not Lipschitz.
§ 1 RLM Rule
Definition 1: Regularized Loss Minimization (RLM)
Regularized Loss Minimization is a learning rule in the form of
$\underset{w}{\text{argmin}} (L_S(w) + R(w))$,
with a regularization function $R: \mathbb{R}^d \to \mathbb{R}$.
The case with $R(w) = \lambda \lVert w \rVert^2_2$ for $\lambda>0$ is called Tikhonov regularization.
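As an illustration (not part of the original notes), the following Python sketch implements the RLM rule with Tikhonov regularization for a generic convex loss by numerical minimization; the absolute loss used in the demo is just one possible convex, Lipschitz choice.

```python
# Hypothetical sketch of the RLM rule with Tikhonov regularization.
# The absolute loss below is an illustrative choice, not prescribed by the notes.
import numpy as np
from scipy.optimize import minimize

def rlm(X, y, loss, lam):
    """argmin_w  L_S(w) + lam * ||w||^2  (Tikhonov regularization)."""
    m, d = X.shape

    def objective(w):
        empirical_risk = np.mean([loss(w, X[i], y[i]) for i in range(m)])
        return empirical_risk + lam * np.dot(w, w)

    return minimize(objective, x0=np.zeros(d), method="Nelder-Mead").x

# Demo with the absolute loss, which is convex and rho-Lipschitz in w
# (rho = max_i ||x_i||).
abs_loss = lambda w, x, y: abs(np.dot(w, x) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(rlm(X, y, abs_loss, lam=0.1))
```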
§ 2 Stable Rules and Overfitting
Notations
| Symbol | Meaning |
| --- | --- |
| $U(m)$ | uniform distribution over $[m]=\{1,\ldots,m\}$ |
| $A$ | a learning algorithm |
| $S = (z_1, \ldots,z_m)$ | training set |
| $A(S)$ | output of $A$ after processing $S$ |
| $z'$ | another training example |
| $S^{(i)}$ | $(z_1,\ldots,z_{i-1},z',z_{i+1}, \ldots, z_m)$ |
Definition 2: Stability
Let $\epsilon: \mathbb{N} \to \mathbb{R}$ be monotonically decreasing. $A$ is on-average-replace-one-stable with rate $\epsilon(m)$ if for every distribution $\mathcal{D}$
$$\underset{(S,z')\sim\mathcal{D}^{m+1},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S^{(i)}),z_i) - l(A(S),z_i)\right] \leq \epsilon(m).$$
For a reasonable learner, the term $l(A(S^{(i)}), z_i) - l(A(S),z_i)$ will typically be greater than zero, because $z_i$ was used during the learning of $A(S)$, but not while learning $A(S^{(i)})$.
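To make the definition concrete, here is a small, hypothetical Monte Carlo estimator of the replace-one term above; the "predict the label mean, score with squared loss" learner is only a stand-in for a real algorithm $A$.

```python
# Hypothetical Monte Carlo estimate of the on-average-replace-one stability
# term  E[ l(A(S^(i)), z_i) - l(A(S), z_i) ].
# The mean-label learner and squared loss are placeholder choices.
import numpy as np

rng = np.random.default_rng(1)

def sample_z(n):
    """Draw n examples (labels only) from a fixed distribution D."""
    return rng.normal(loc=2.0, scale=1.0, size=n)

def learn(S):
    """A(S): predict the mean label of the training set."""
    return np.mean(S)

def loss(prediction, z):
    return (prediction - z) ** 2

def replace_one_gap(m, trials=5000):
    total = 0.0
    for _ in range(trials):
        S = sample_z(m)                      # S = (z_1, ..., z_m)
        z_prime = sample_z(1)[0]             # an additional example z'
        i = rng.integers(m)                  # i ~ U(m)
        S_i = S.copy()
        S_i[i] = z_prime                     # S^(i): replace z_i by z'
        total += loss(learn(S_i), S[i]) - loss(learn(S), S[i])
    return total / trials

print(replace_one_gap(m=20))   # small and positive for a stable learner
```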
Theorem 1
Let $\mathcal{D}$ be a distribution and let $S = (z_1, \ldots,z_m)$, $z'$ be examples, independent and identically distributed according to $\mathcal{D}$. Then for any learning algorithm $A$:
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right] = \underset{(S,z')\sim\mathcal{D}^{m+1},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S^{(i)}),z_i) - l(A(S),z_i)\right].$$
For the true error we have
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] = \underset{(S,z')\sim\mathcal{D}^{m+1}}{\mathbb{E}}\left[l(A(S),z')\right] = \underset{(S,z')\sim\mathcal{D}^{m+1},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S^{(i)}),z_i)\right],$$
where the first equality uses the definition of the true error $L_{\mathcal{D}}$ and the second swaps the names of the i.i.d. variables $z'$ and $z_i$. Writing the empirical error $L_S$ as a uniform average of terms of the form $l(\cdot,z_i)$, i.e. as an expectation over $i \sim U(m)$, yields
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_S(A(S))\right] = \underset{S\sim\mathcal{D}^{m},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S),z_i)\right].$$
Combining both equations finishes the proof. $\square$
Remark
As $\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right]$
is a measurement of overfitting, Theorem 1 tells us, simply put, that "stable rules do not overfit".
§ 3 Strong Convexity
Definition 3
A function $f$ is $\lambda$-strongly convex if for all $w, u$ and all $\alpha\in(0,1)$ we have
$$f(\alpha w + (1-\alpha)u) \leq \alpha f(w) + (1-\alpha)f(u) - \frac{\lambda}{2}\alpha(1-\alpha)\lVert w - u\rVert^2.$$
Lemma 1
1. The function $f(w) = \lambda\lVert w \rVert^2$ is $2\lambda$-strongly convex.
2. If $f$ is $\lambda$-strongly convex and $g$ is convex, then $f+g$ is $\lambda$-strongly convex.
3. If $f$ is $\lambda$-strongly convex and $u$ is a minimizer of $f$, then for any $w$: $f(w) - f(u) \geq \frac{\lambda}{2}\lVert w - u \rVert^2$.

Claims 1 and 2 are easy to check, so only a proof of 3 is provided here:
First we divide the definition of strong convexity by $\alpha$ and rearrange to get
$$\frac{f(u + \alpha(w-u)) - f(u)}{\alpha} \leq f(w) - f(u) - \frac{\lambda}{2}(1-\alpha)\lVert w - u\rVert^2.$$
Since $u$ is a minimizer of $f$, the left-hand side is nonnegative for every $\alpha\in(0,1)$. Taking the limit $\alpha \to 0$ gives $0 \leq f(w) - f(u) - \frac{\lambda}{2}\lVert w - u\rVert^2$, which is claim 3. $\square$
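As a quick numerical sanity check (not from the notes), the snippet below evaluates the strong-convexity inequality for $f(w) = \lambda\lVert w\rVert^2$ with parameter $2\lambda$ at random points; for this particular $f$ the inequality in fact holds with equality.

```python
# Numeric check (illustrative only) that f(w) = lam * ||w||^2 satisfies the
# 2*lam-strong-convexity inequality
#   f(a*w + (1-a)*u) <= a*f(w) + (1-a)*f(u) - lam * a*(1-a) * ||w - u||^2.
import numpy as np

rng = np.random.default_rng(2)
lam = 0.7
f = lambda w: lam * np.dot(w, w)

for _ in range(1000):
    w, u = rng.normal(size=5), rng.normal(size=5)
    a = rng.uniform(1e-6, 1 - 1e-6)
    lhs = f(a * w + (1 - a) * u)
    rhs = a * f(w) + (1 - a) * f(u) - lam * a * (1 - a) * np.dot(w - u, w - u)
    assert lhs <= rhs + 1e-9   # holds (with equality up to rounding)
```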
§ 4 Tikhonov Regularization as a Stabilizer
By claims 1 and 2, the RLM objective $f_S(w) = L_S(w) + \lambda\lVert w \rVert^2$ is $2\lambda$-strongly convex whenever the loss is convex. Applying claim 3 to the minimizer $A(S)$ of $f_S$, together with the fact that $A(S^{(i)})$ minimizes $f_{S^{(i)}}$, yields
$$\lambda\lVert A(S^{(i)}) - A(S)\rVert^2 \leq \frac{l(A(S^{(i)}),z_i) - l(A(S),z_i)}{m} + \frac{l(A(S),z') - l(A(S^{(i)}),z')}{m}.$$
This is a general bound for Tikhonov regularization. To bound it further, we will now assume our loss function to be Lipschitz.
Theorem 2
Assume a convex, $\rho$-Lipschitz loss function. Then the RLM rule with $\lambda\lVert w \rVert ^2$ regularization is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$. By Theorem 1, this also implies that
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right] \leq \frac{2\rho^2}{\lambda m}.$$
Since the loss is $\rho$-Lipschitz, both differences on the right-hand side of the general bound above are at most $\rho\lVert A(S^{(i)}) - A(S)\rVert$, so $\lambda\lVert A(S^{(i)}) - A(S)\rVert^2 \leq \frac{2\rho}{m}\lVert A(S^{(i)}) - A(S)\rVert$, hence $\lVert A(S^{(i)}) - A(S)\rVert \leq \frac{2\rho}{\lambda m}$ and therefore
$$l(A(S^{(i)}),z_i) - l(A(S),z_i) \leq \rho\lVert A(S^{(i)}) - A(S)\rVert \leq \frac{2\rho^2}{\lambda m}.$$
As this holds for any $S, z', i$, taking expectations concludes the proof. $\square$
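As an illustration (not in the original notes), the following sketch checks the pointwise bound from the proof for regularized logistic regression; the logistic loss is convex and $\rho$-Lipschitz in $w$ with $\rho$ the largest instance norm, and the exact RLM minimizers are approximated numerically with scipy.

```python
# Hypothetical empirical check of the replace-one bound
#   l(A(S^(i)), z_i) - l(A(S), z_i) <= 2*rho^2 / (lam * m)
# for RLM with Tikhonov regularization and the logistic loss
# l(w, (x, y)) = log(1 + exp(-y * <w, x>)).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
m, d, lam = 40, 3, 0.1

def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * np.dot(w, x)))

def rlm(X, y):
    def obj(w):
        emp = np.mean([logistic_loss(w, X[j], y[j]) for j in range(len(y))])
        return emp + lam * np.dot(w, w)
    return minimize(obj, np.zeros(d), method="BFGS").x

X = rng.normal(size=(m, d))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))
x_new, y_new = rng.normal(size=d), 1.0          # the extra example z'
i = rng.integers(m)

w_S = rlm(X, y)
X_i, y_i = X.copy(), y.copy()
X_i[i], y_i[i] = x_new, y_new                   # training set S^(i)
w_Si = rlm(X_i, y_i)

gap = logistic_loss(w_Si, X[i], y[i]) - logistic_loss(w_S, X[i], y[i])
rho = max(np.max(np.linalg.norm(X, axis=1)), np.linalg.norm(x_new))
print(gap, "<=", 2 * rho**2 / (lam * m))
```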
§ 5 Controlling the Fitting Stability Tradeoff
The theorem above shows that the stability term decreases when $\lambda$ increases. Since the empirical risk increases with $\lambda$, we face a tradeoff between fitting and overfitting. In this section we choose a value of $\lambda$ that balances the two terms and derive a new bound on the true risk.
Theorem 3
Assume a convex, $\rho$-Lipschitz loss function. Then the RLM rule with Tikhonov regularization satisfies, for every $w^{*}$,
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq L_{\mathcal{D}}(w^{*}) + \lambda\lVert w^{*}\rVert^2 + \frac{2\rho^2}{\lambda m}.$$
This bound is also called an oracle inequality. We may think of $w^{*}$ as a hypothesis with low risk; $A(S)$ will then be only slightly worse than $w^{*}$.
We have $L_{S}(A(S)) \leq L_{S}(A(S)) + \lambda\lVert A(S)\rVert^2 \leq L_{S}(w^{* }) + \lambda\lVert w^{* }\rVert^2$,
as $A(S)$ is a minimizer of the regularized objective $L_S(w) + \lambda\lVert w \rVert^2$.
Writing $L_{\mathcal{D}}(A(S)) = L_S(A(S)) + \left(L_{\mathcal{D}}(A(S)) - L_S(A(S))\right)$, taking expectations, and using $\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{S}(w^{* }) \right] =L_{\mathcal{D}}(w^* )$ together with the bound from Theorem 2 yields
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq L_{\mathcal{D}}(w^{*}) + \lambda\lVert w^{*}\rVert^2 + \frac{2\rho^2}{\lambda m}. \qquad \square$$
Corollary 1
For a convex-Lipschitz-bounded learning problem $(\mathcal{H},Z,l)$ with parameters $\rho, B$, Tikhonov regularization, and $\lambda := \sqrt{\frac{2\rho^2}{B^2m}}$, we get
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{w \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(w) + \rho B\sqrt{\frac{8}{m}}.$$
(Note that bounded means: $\lVert w \rVert \leq B$ for all $w \in \mathcal{H}$)
Proof
The corollary follows directly by setting $w^{* }$ to $\underset{w \in \mathcal{H}}{\text{argmin }} L_{\mathcal{D}}(w)$, inserting $\lambda$ in Theorem 3, and using $\lVert w^{* } \rVert \leq B$.
$\square$
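For completeness, the short calculation behind the constant $\rho B\sqrt{8/m}$ (not spelled out in the original notes): with $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$,
$$\lambda\lVert w^{*}\rVert^2 + \frac{2\rho^2}{\lambda m} \leq \lambda B^2 + \frac{2\rho^2}{\lambda m} = \rho B\sqrt{\frac{2}{m}} + \rho B\sqrt{\frac{2}{m}} = \rho B\sqrt{\frac{8}{m}}.$$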
§ 6 APAC Learnability
Convex-Lipschitz-bounded problems are APAC learnable, as Lemma 2 (combined with Corollary 1) will show:
Lemma 2
If an algorithm $A$ guarantees that for every $\epsilon > 0$, every distribution $\mathcal{D}$, and every $m \geq m_{\mathcal{H}}(\epsilon)$ it holds that
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{h \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(h) + \epsilon,$$
then the learning problem is agnostically PAC (APAC) learnable by $A$.
Let $\delta \in (0,1)$, $m\geq m_{\mathcal{H}}(\epsilon \delta)$.
Define $X=L_{\mathcal{D}}(A(S)) - \underset{h \in \mathcal{H}}{\min} L_{\mathcal{D}}(h)$.
Then $X\geq0$ and by our assumption $\mathbb{E}[X] \leq \epsilon \delta$. Using Markov's inequality, we obtain
$$\mathbb{P}\left[X \geq \epsilon\right] \leq \frac{\mathbb{E}[X]}{\epsilon} \leq \delta,$$
i.e. with probability at least $1-\delta$ we have $L_{\mathcal{D}}(A(S)) \leq \underset{h \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(h) + \epsilon$. $\square$
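A tiny illustrative simulation (hypothetical, not part of the notes) of the Markov step: an expectation bound of $\epsilon\delta$ turns into an $\epsilon$-accuracy guarantee with probability at least $1-\delta$.

```python
# Illustrative check of the Markov argument in Lemma 2: if a nonnegative
# random variable X satisfies E[X] <= eps*delta, then P[X >= eps] <= delta.
# The exponential distribution below is just a placeholder for X.
import numpy as np

rng = np.random.default_rng(4)
eps, delta = 0.1, 0.05

X = rng.exponential(scale=eps * delta, size=200_000)   # E[X] = eps*delta
print("empirical P[X >= eps]:", np.mean(X >= eps), "<= delta =", delta)
```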
§ 7 Ridge Regression
Ridge Regression is the learning rule
$$A(S) = \underset{w}{\text{argmin}} \left(\lambda\lVert w \rVert^2 + \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left(\left<w,x_i\right> - y_i\right)^2\right).$$
This is just the RLM rule with Tikhonov regularization and loss function $l(w,z') = {\frac{1}{2}(\left<w,x'\right> -y')^2}$. But this loss function is not Lipschitz, so we cannot apply the theorems for convex-Lipschitz-bounded functions!
However, the gradient of the loss with respect to $w$,
$$\nabla l(w,z') = \left(\left<w,x'\right> - y'\right)x' = x'x'^{T}w - y'x',$$
is $\beta$-Lipschitz (for some value of $\beta$; here $\beta = \lVert x'\rVert^2$). Functions with this property, i.e. $\beta$-Lipschitz gradients, are called $\beta$-smooth.
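A small numerical illustration (not from the notes) of the $\beta$-smoothness of the squared loss, with $\beta = \lVert x'\rVert^2$:

```python
# Check that the gradient w -> x'x'^T w - y'x' of the squared loss is
# beta-Lipschitz with beta = ||x'||^2 at random pairs of points.
import numpy as np

rng = np.random.default_rng(5)
x_p, y_p = rng.normal(size=4), rng.normal()          # a fixed example z' = (x', y')
grad = lambda w: np.outer(x_p, x_p) @ w - y_p * x_p  # gradient of (1/2)(<w,x'> - y')^2
beta = np.dot(x_p, x_p)

for _ in range(1000):
    w, v = rng.normal(size=4), rng.normal(size=4)
    assert np.linalg.norm(grad(w) - grad(v)) <= beta * np.linalg.norm(w - v) + 1e-9
```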
Remark
For $\beta$-smooth loss functions, results very similar to the theorems and corollaries stated above for Lipschitz loss functions hold. In particular, we get:
Theorem 4 (without proof)
For a distribution $\mathcal{D}$ over $\chi \times [0,1]$ with $\chi = \{ x\in \mathbb{R}^d : \lVert x \rVert \leq 1\}$, $\mathcal{H}=\{w\in \mathbb{R}^d: \lVert w \rVert \leq B\}$, $\epsilon\in(0,1)$, $m \geq \frac{150 B^2}{\epsilon^2}$, and $\lambda = \frac{\epsilon}{3B^2}$,
the Ridge Regression algorithm satisfies
$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{w \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(w) + \epsilon.$$
Together with Lemma 2, this implies that Ridge Regression is an APAC learner for this problem.
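As a practical illustration (not part of the notes), the RLM objective above has a closed-form minimizer, obtained by setting its gradient to zero; the sketch below uses the $\lambda$ and sample size from Theorem 4 as one possible choice of parameters.

```python
# Hypothetical sketch of Ridge Regression as the RLM rule above:
#   A(S) = argmin_w  lam * ||w||^2 + (1/m) * sum_i (1/2) * (<w, x_i> - y_i)^2.
# Setting the gradient 2*lam*w + (1/m) * X^T (X w - y) to zero gives the
# closed form  w = (2*lam*m*I + X^T X)^{-1} X^T y.
import numpy as np

def ridge_regression(X, y, lam):
    m, d = X.shape
    return np.linalg.solve(2 * lam * m * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(6)
B, eps = 1.0, 0.1
m = int(np.ceil(150 * B**2 / eps**2))       # sample size from Theorem 4
lam = eps / (3 * B**2)                      # lambda from Theorem 4

X = rng.uniform(-1, 1, size=(m, 3))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ensure ||x|| <= 1
y = np.clip(X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=m), 0.0, 1.0)

print(ridge_regression(X, y, lam))
```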
§ 8 Questions
Q1: What does Theorem 3 tell us?
A1: As $m \rightarrow \infty$, the stability term vanishes and the expected true risk of $A(S)$ is bounded by the regularized risk $L_{\mathcal{D}}(w^{*}) + \lambda\lVert w^{*}\rVert^2$ of any $w^{*}$.
Q2: Why is $\lambda \lVert \cdot \rVert ^2$ $2 \lambda$-strongly convex?
A2: This can be proven using the polarization identities.
Q3: In Corollary 1, why is $w$ bounded in a ball centered at $0$?
A3: We can change the regularizer and obtain the same theorem with $w$ bounded in a ball with a different center.
Q4: Are there practical applications to convergence theorems on Tikhonov regularization?
A4: Yes, Ridge Regression.