Stochastic Gradient Descent

§1. Classical Gradient Descent

In Classical Gradient Descent we want to minimize $f \in \mathrm{C}^1(S, \mathbb{R})$ where $S \subseteq \mathbb{R}^n$ is open. The idea is to iteratively take small steps in the direction of the negative gradient of $f$.

Definition (Gradient Descent):

Input: $\eta > 0$, $T \in \mathbb{N}$, $f:S \rightarrow \mathbb{R}$

  • Initialize $w^1 \in S$
  • For $t = 1...T$
    • $w^{t+1} = w^t - \eta \nabla f(w^t)$
  • Return $\overline{w} = \frac{1}{T}\sum_{t\leq T} w^t$.

Other options: Return $w^T$, or return the best-performing iterate $w^{t_0}$, i.e. one with $f(w^{t_0}) \leq f(w^t)$ for all $t \leq T$.
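
The update rule translates directly into code. Below is a minimal NumPy sketch of the loop above; the quadratic objective at the end is only a made-up placeholder to illustrate the call, not part of the notes.

```python
import numpy as np

def gradient_descent(grad_f, w1, eta, T):
    """Run w^{t+1} = w^t - eta * grad_f(w^t) and return the average of w^1, ..., w^T."""
    w = np.asarray(w1, dtype=float)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())      # store w^t before the update
        w = w - eta * grad_f(w)        # w^{t+1} = w^t - eta * grad f(w^t)
    return np.mean(iterates, axis=0)   # averaged iterate

# Placeholder objective: f(w) = ||w||^2 / 2, so grad f(w) = w (chosen only for illustration).
w_bar = gradient_descent(grad_f=lambda w: w, w1=np.array([1.0, -2.0]), eta=0.1, T=100)
```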

§2. Subgradients

In this chapter we generalize Gradient Descent to non-differentiable functions.

Recall: If $f:S\rightarrow \mathbb{R}$ is convex and differentiable at a point $w \in S$, then

$$ \forall u \in S: \quad f(u) \geq f(w) + (u-w, \nabla f(w)) $$

where $$ H_{f,w}= \{ f(w) + (u-w, \nabla f(w)) \mid u \in S \} $$ is the tangent hyperplane of $f$ at $w$.

Lemma: Let $S\subseteq \mathbb{R}^n$ be convex and open. Then a function $f:S \rightarrow \mathbb{R}$ is convex if and only if

$$\forall w \in S \exists v \in \mathbb{R}^n \forall u \in S: \quad f(u) \geq f(w) + (u-w, v). $$

Definition: Let $f:S \rightarrow \mathbb{R}$ be a function. A vector $v \in \mathbb{R}^n$ satisfying

$$\forall u \in S: \quad f(u) \geq f(w) + (u-w, v)$$

at a given point $w \in S$ is called a subgradient of $f$ at $w$. The set of all subgradients at $w$ is denoted by $\partial f(w)$.

Facts:

  • If $f$ is convex then $\partial f(w) \neq \emptyset.$
  • If $f$ is convex and differentiable at $w$ then $$\partial f(w) = \{ \nabla f(w) \}.$$

Example:

$$\partial |\cdot|(x) = \begin{cases} \{+1\} & x > 0\\ [-1,1] & x = 0 \\ \{-1\} & x < 0. \end{cases} $$
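
As a small illustration, a subgradient oracle for $|\cdot|$ can be written as the sketch below; returning $0$ at $x = 0$ is one arbitrary choice from $[-1,1]$.

```python
def subgradient_abs(x):
    """Return one element of the subdifferential of |.| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # at x = 0 any value in [-1, 1] is a valid subgradient; 0 is an arbitrary pick
```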

§3. Lemma 14.1

For later convergence theorems we need the following lemma.

Lemma 14.1: Let $v_1, ..., v_T \in \mathbb{R}^n$. Then any algorithm with initialization $w^1=0$ and update rule $w^{t+1} = w^t - \eta v_t$ for $\eta > 0$ satisfies for all $w^* \in \mathbb{R}^n$

$$\sum_{t=1}^T (w^t - w^*, v_t) \leq \frac{\lVert w^* \rVert ^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \lVert v_t \rVert ^2.$$

In particular, for every $B, \rho > 0$, if $\lVert v_1 \rVert, ..., \lVert v_T \rVert \leq \rho$ and $w^* \in \overline{\mathbb{B}}_B(0)$, then with $\eta = \frac{B}{\rho} \frac{1}{\sqrt{T}}$ we have

$$\frac{1}{T} \sum_{t=1}^T(w^t-w^*, v_t) \leq \frac{B\rho}{\sqrt{T}}.$$

Proof: Recall the polarization identity

$$(u,v) = \frac{1}{2}(||u||^2 + ||v||^2 - ||u-v||^2).$$

Hence

$$(w^t - w^*, v_t) = \frac{1}{\eta}(w^t-w^*, \eta v_t)$$

$$=\frac{1}{2 \eta}(-||w^t-w^* - \eta v_t||^2 + ||w^t -w^*||^2 + \eta^2||v_t||^2)$$

$$= \frac{1}{2\eta}(||w^t - w^*||^2 - ||w^{t+1} - w^*||^2) + \frac{\eta}{2}||v_t||^2.$$

Summing over $t$ we get

$$\sum_{t=1}^T (w^t - w^*, v_t) = \frac{1}{2\eta}\sum_{t=1}^T(||w^t-w^*||^2 - ||w^{t+1}-w^*||^2) + \frac{\eta}{2} \sum_{t=1}^T||v_t||^2$$

$$= \frac{1}{2\eta} (||w^1-w^*||^2 - ||w^{T+1} - w^*||^2) + \frac{\eta}{2} \sum_{t=1}^T ||v_t||^2 \leq \frac{||w^*||^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T ||v_t||^2$$

since the sum $\sum_{t=1}^T(||w^t-w^*||^2 - ||w^{t+1}-w^*||^2)$ telescopes, $w^1=0$ by assumption, and $-||w^{T+1}-w^*||^2 \leq 0$. $\square$
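
Since the bound is deterministic, it can be sanity-checked numerically. The sketch below runs the update on arbitrary random vectors $v_t$ and an arbitrary $w^*$ (all values are made up for illustration) and verifies the inequality of Lemma 14.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 5, 200, 0.1
v = rng.normal(size=(T, n))      # arbitrary vectors v_1, ..., v_T
w_star = rng.normal(size=n)      # arbitrary comparison point w*

w = np.zeros(n)                  # w^1 = 0
lhs = 0.0
for t in range(T):
    lhs += np.dot(w - w_star, v[t])
    w = w - eta * v[t]           # w^{t+1} = w^t - eta * v_t

rhs = np.dot(w_star, w_star) / (2 * eta) + (eta / 2) * np.sum(v ** 2)
assert lhs <= rhs + 1e-9         # the bound of Lemma 14.1 holds
```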

§4. Stochastic Gradient Descent

Definition (Stochastic Gradient Descent):

Input: $\eta > 0$, $T \in \mathbb{N}$, $f:S \rightarrow \mathbb{R}$

  • Initialize $w^1=0$
  • For $t=1...T$
    • Draw $v_t$ according to some probability distribution such that $\mathbb{E}[v_t \mid w^t] \in \partial f(w^t)$
    • $w^{t+1} = w^t - \eta v_t$
  • Return $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t.$
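
A minimal sketch of this loop, assuming some oracle `stochastic_subgradient` with $\mathbb{E}[v_t \mid w^t] \in \partial f(w^t)$; the oracle name, its signature, and the toy objective in the usage line are placeholders, not part of the notes.

```python
import numpy as np

def sgd(stochastic_subgradient, dim, eta, T, rng):
    """SGD with w^1 = 0: draw v_t with E[v_t | w^t] in the subdifferential, step, average."""
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        v = stochastic_subgradient(w, rng)   # random direction with E[v_t | w^t] in df(w^t)
        w = w - eta * v
    return np.mean(iterates, axis=0)         # average of w^1, ..., w^T

# Illustrative oracle for f(w) = ||w||^2 / 2: v_t = w^t + noise is an unbiased gradient estimate.
rng = np.random.default_rng(0)
w_bar = sgd(lambda w, r: w + r.normal(size=w.shape), dim=3, eta=0.05, T=1000, rng=rng)
```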

Theorem: Let $B, \rho > 0$ and let $f:S \rightarrow \mathbb{R}$ be convex with $S \subseteq \mathbb{R}^n$. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B} f(w)$. Assume that SGD runs for $T$ iterations with stepsize $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ and that $\lVert v_1 \rVert, ..., \lVert v_T \rVert \leq \rho$ almost surely. Then

$$\mathbb{E}f(\overline{w})- f(w^*) \leq \frac{B\rho}{\sqrt{T}}.$$

Therefore, for given $\epsilon > 0$, to achieve $\mathbb{E}f(\overline{w}) - f(w^*) \leq \epsilon$ one needs to run $T \geq (B\rho / \epsilon)^2$ iterations of SGD.
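
As a small numeric illustration of this bound (the values of $B$, $\rho$, $\epsilon$ below are made up):

```python
import math

B, rho, eps = 1.0, 2.0, 0.01             # illustrative values only
T = math.ceil((B * rho / eps) ** 2)      # T >= (B*rho/eps)^2 = 40000 iterations
eta = (B / rho) / math.sqrt(T)           # stepsize eta = (B/rho)/sqrt(T) = 0.0025
```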

Proof: We use the notation $v_{1:T} = v_1, ..., v_T$. Since $f$ is convex we can apply Jensen's inequality to obtain

$$ f(\overline{w}) - f(w^*) \leq \frac{1}{T}\sum_{t=1}^T(f(w^t) - f(w^*))$$

for the iterates $w^1, ..., w^T$ of SGD. Taking expectations preserves this inequality and we obtain

$$ \underset{v_{1:T}}{\mathbb{E}}[f(\overline{w}) - f(w^*)] \leq \underset{v_{1:T}}{\mathbb{E}}\Big[\frac{1}{T}\sum_{t=1}^T(f(w^t) - f(w^*))\Big]. $$

Using Lemma 14.1 we have

$$ \underset{v_{1:T}}{\mathbb{E}} \Big[ \frac{1}{T} \sum_{t=1}^T(w^t - w^*, v_t) \Big] \leq \frac{B\rho}{\sqrt{T}}. $$

It is therefore enough to show that

$$ \underset{v_{1:T}}{\mathbb{E}}\Big[\frac{1}{T}\sum_{t=1}^T(f(w^t) - f(w^*))\Big]\leq \underset{v_{1:T}}{\mathbb{E}} \Big[ \frac{1}{T} \sum_{t=1}^T(w^t - w^*, v_t) \Big]. $$

Using linearity of expectation we get

$$\underset{v_{1:T}}{\mathbb{E}} \Big[ \frac{1}{T} \sum_{t=1}^T(w^t - w^*, v_t) \Big] = \frac{1}{T} \sum_{t=1}^T \underset{v_{1:T}}{\mathbb{E}}[(w^t-w^*, v_t)]. $$

Next recall the law of total expectation: Let $\alpha$, $\beta$ be random variables and $g$ some function, then $\mathbb{E}_{\alpha}[g(\alpha)]=\mathbb{E}_{\beta}[\mathbb{E}_{\alpha}[g(\alpha) \mid \beta]]$. Put $\alpha=v_{1:t}$ and $\beta=v_{1:t-1}$, then

$$ \underset{v_{1:T}}{\mathbb{E}}[(w^t-w^*,v_t)] = \underset{v_{1:t}}{\mathbb{E}}[(w^t-w^*, v_t)] $$

$$ = \underset{v_{1:t-1}}{\mathbb{E}}[\underset{v_{1:t}}{\mathbb{E}}[(w^t-w^*, v_t)|v_{1:t-1}] ] = \underset{v_{1:t-1}}{\mathbb{E}}[(w^t-w^*, \underset{v_{t}}{\mathbb{E}}[v_t|v_{1:t-1}])]. $$

But now $w^t$ is completely determined by $v_{1:t-1}$ since $w^t = -\eta(v_1 + ... + v_{t-1})$. It follows that $\underset{v_{t}}{\mathbb{E}} [v_t \mid v_{1:t-1}] = \underset{v_{t}}{\mathbb{E}} [v_t \mid w^t] \in \partial f(w^t)$. Therefore

$$ \underset{v_{1:t-1}}{\mathbb{E}}[(w^t-w^*, \underset{v_{t}}{\mathbb{E}}[v_t|v_{1:t-1}])] \geq \underset{v_{1:t-1}}{\mathbb{E}}[f(w^t)-f(w^*)]$$

which also implies

$$ \underset{v_{1:T}}{\mathbb{E}}[(w^t-w^*, v_t)] \geq \underset{v_{1:T}}{\mathbb{E}}[f(w^t)-f(w^*)].$$

Now summing over $t \leq T$ and dividing by $T$ gives the desired inequality. To achieve

$$ \mathbb{E}f(\overline{w}) - f(w^*) \leq \frac{B \rho}{\sqrt{T}} \leq \epsilon$$

we need $T \geq (B \rho / \epsilon)^2$ iterations. $\square$

§5. Stochastic Gradient Descent for Risk Minimization

In Learning Theory we want to minimize

$$ \mathcal{L}_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}}[l(w,z)].$$

SGD allows us to directly minimize $\mathcal{L}_{\mathcal{D}}$. For simplicity we assume that $l(\cdot, z)$ is differentiable for all $z \in Z$. We construct the random direction $v_t$ as follows: Sample $z \sim \mathcal{D}$ and put

$$ v_t = \nabla l(w^t, z)$$

where the gradient is taken w.r.t. $w$. Interchanging expectation and gradient we get

$$ \mathbb{E}[v_t \mid w^t] = \mathbb{E}_z[\nabla l(w^t, z)] = \nabla \mathbb{E}_z[l(w^t, z)] = \nabla \mathcal{L}_{\mathcal{D}}(w^t) \in \partial\mathcal{L}_{\mathcal{D}}(w^t).$$
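
For a case where $\mathcal{L}_{\mathcal{D}}$ is known in closed form, this unbiasedness can be checked by Monte Carlo. The squared loss and the Gaussian $\mathcal{D}$ below are assumptions chosen only for illustration: with $l(w,z) = \frac{1}{2}(w-z)^2$ and $z \sim \mathcal{N}(\mu, 1)$ we have $\nabla \mathcal{L}_{\mathcal{D}}(w) = w - \mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, w = 3.0, 1.5                                   # illustrative values
z = rng.normal(loc=mu, scale=1.0, size=100_000)    # samples z ~ D = N(mu, 1)

mc_grad = np.mean(w - z)    # Monte Carlo estimate of E_z[grad_w l(w, z)] with grad_w l = w - z
true_grad = w - mu          # gradient of the risk L_D(w) = ((w - mu)^2 + 1) / 2
print(mc_grad, true_grad)   # agree up to sampling error
```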

The same argument can be applied to the subgradient case. Let $v_t \in \partial l(w^t, z)$ for a sample $z \sim \mathcal{D}$. Then by definition for all $u$

$$ l(u,z) - l(w^t,z) \geq (u-w^t, v_t)$$

By applying the expectation on both sides of the inequality we get

$$ \mathcal{L}_{\mathcal{D}}(u) - \mathcal{L}_{\mathcal{D}}(w^t)\geq \mathbb{E}_z[(u-w^t, v_t) \mid w^t] = (u-w^t, \mathbb{E}_z[v_t|w^t])$$

and therefore $\mathbb{E}[v_t \mid w^t] \in \partial \mathcal{L}_{\mathcal{D}}(w^t)$. This yields the following special form of SGD.

Definition (Stochastic Gradient Descent for Risk Minimization):

Input: $\eta > 0$, $T \in \mathbb{N}$

  • Initialize $w^1=0$
  • For $t=1...T$
    • Sample $z \sim \mathcal{D}$
    • Pick $v_t \in \partial l(w^t, z)$
    • $w^{t+1} = w^t - \eta v_t$
  • Return $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t.$
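
A sketch of this variant; the one-dimensional absolute loss $l(w,z) = |w-z|$, the Gaussian $\mathcal{D}$, and all numeric values are illustrative assumptions, not from the notes. The subgradient used is $\mathrm{sign}(w-z)$, matching the example in §2.

```python
import numpy as np

def sgd_risk(sample_z, subgrad_l, dim, eta, T, rng):
    """SGD on L_D(w) = E_z[l(w, z)]: sample z ~ D, pick v_t in the subdifferential of l(., z) at w^t."""
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        z = sample_z(rng)          # sample z ~ D
        v = subgrad_l(w, z)        # v_t in the subdifferential of l(., z) at w^t
        w = w - eta * v            # w^{t+1} = w^t - eta * v_t
    return np.mean(iterates, axis=0)

# Illustrative instance: l(w, z) = |w - z| in one dimension with z ~ N(2, 1).
rng = np.random.default_rng(0)
w_bar = sgd_risk(sample_z=lambda r: r.normal(2.0, 1.0, size=1),
                 subgrad_l=lambda w, z: np.sign(w - z),
                 dim=1, eta=0.01, T=10_000, rng=rng)
```

With these choices the averaged iterate approaches the minimizer of $\mathbb{E}|w - z|$, which is the median of $\mathcal{D}$.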

Using Theorem 14.8 we get the following corollary.

Corollary: Let $B, \rho > 0$ and let $\mathcal{L}_{\mathcal{D}}$ be convex such that the subgradients chosen by the algorithm satisfy

$$ \lVert v_1 \rVert, ..., \lVert v_T \rVert \leq \rho $$

almost surely, where $v_t \in \partial l(w^t, z)$. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B}\mathcal{L}_{\mathcal{D}}(w)$. Then to achieve

$$\mathbb{E}\mathcal{L}_{\mathcal{D}}(\overline{w}) - \mathcal{L}_{\mathcal{D}}(w^*) \leq \epsilon$$

for given $\epsilon > 0$ we need to run SGD with stepsize $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ and $T \geq (B \rho / \epsilon)^2$ iterations.

Questions

Is the learning rate of SGD in Theorem 14.8 decaying?

Is $w^*$ the minimum? If not, and we are given an arbitrary $w^*$, would the algorithm converge to it?

Where is the convexity requirement critical?

How do we sample from the distribution $\mathcal{D}$?

If one knows the distribution, can they calculate the gradient directly?

Are there cases where it is better to use SGD over GD and vice-versa?