In Classical Gradient Descent we want to minimize $f \in \mathrm{C}^1(S, \mathbb{R})$ where $S \subseteq \mathbb{R}^n$ is open. The idea is to iteratively descend in the direction of the negative gradient of $f$ with a small stepsize.
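To make the update concrete, here is a minimal sketch of the iteration $w^{t+1} = w^t - \eta \nabla f(w^t)$ (not part of the original notes); the quadratic test function, the stepsize `eta` and the iteration count are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, num_steps=100):
    """Iteratively step in the direction of the negative gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - eta * grad(w)  # w^{t+1} = w^t - eta * grad f(w^t)
    return w

# Toy example: f(w) = ||w||^2 / 2, so grad f(w) = w and the minimizer is 0.
print(gradient_descent(grad=lambda w: w, w0=[3.0, -2.0]))  # approx. [0, 0]
```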
A vector $v \in \mathbb{R}^n$ satisfying $$f(u) - f(w) \geq (u - w, v) \quad \text{for all } u \in S$$ at a given point $w \in S$ is called a subgradient of $f$ at $w$. The set of all subgradients at $w$ is denoted by $\partial f(w)$.
Facts:
If $f$ is convex then $\partial f(w) \neq \emptyset.$
If $f$ is convex and differentiable at $w$ then $$\partial f(w) = \{ \nabla f(w) \}.$$
Example:
$$\partial |.|(x) = \begin{cases}
\{+1\} &: x > 0\\
[-1,1] &: x = 0 \\
\{-1\} &: x < 0.
\end{cases}
$$
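As a quick sanity check of this example (my own sketch, not from the notes), one can pick an element $v \in \partial|.|(x)$ and verify the subgradient inequality $|u| - |x| \geq v\,(u - x)$ on a grid of points; the helper name `subgradient_abs` and the test values are arbitrary.

```python
import numpy as np

def subgradient_abs(x):
    """Return one element of the subdifferential of |.| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # at x = 0 any value in [-1, 1] is a valid subgradient

# Verify |u| - |x| >= v * (u - x) on a grid of points u.
for x in [-2.0, 0.0, 1.5]:
    v = subgradient_abs(x)
    u = np.linspace(-5, 5, 101)
    assert np.all(np.abs(u) - abs(x) >= v * (u - x) - 1e-12)
```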
§3. Lemma 14.1
For later convergence theorems we need the following lemma.
Lemma 14.1: Let $v_1, ..., v_T \in \mathbb{R}^n$. Then any algorithm with initialization $w^1 = 0$ and update rule $w^{t+1} = w^t - \eta v_t$ for $\eta > 0$ satisfies for all $w^* \in \mathbb{R}^n$
$$\sum_{t=1}^{T} (w^t - w^*, v_t) \leq \frac{\lVert w^* \rVert^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \lVert v_t \rVert^2.$$
In particular, for every $B, \rho > 0$, if $\lVert v_1 \rVert, ..., \lVert v_T \rVert \leq \rho$ and $w^* \in \overline{\mathbb{B}}_B(0)$, then with $\eta = \frac{B}{\rho} \frac{1}{\sqrt{T}}$ we have
$$\frac{1}{T} \sum_{t=1}^{T} (w^t - w^*, v_t) \leq \frac{B \rho}{\sqrt{T}}.$$
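The averaged bound can be checked numerically; the following sketch (my own, with arbitrary choices of $n$, $T$, $B$, $\rho$ and random directions $v_t$) runs the update rule of the lemma and compares both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, B, rho = 5, 1000, 2.0, 3.0          # arbitrary test values
eta = B / rho / np.sqrt(T)

V = rng.normal(size=(T, n))
V = rho * V / np.linalg.norm(V, axis=1, keepdims=True)   # ||v_t|| = rho
w_star = rng.normal(size=n)
w_star = B * w_star / np.linalg.norm(w_star)             # ||w*|| = B

w, lhs = np.zeros(n), 0.0                 # w^1 = 0
for v in V:
    lhs += np.dot(w - w_star, v)
    w = w - eta * v                       # update rule of the lemma
print(lhs / T, "<=", B * rho / np.sqrt(T))
```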
Theorem: Let $B, \rho > 0$ and $f: S \rightarrow \mathbb{R}$ convex. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B} f(w)$ for $S \subseteq \mathbb{R}^n$. Assume that SGD runs with $T$ iterations and stepsize $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ and assume that $\lVert v_1 \rVert, ..., \lVert v_T \rVert \leq \rho$ almost surely. Then, with $\overline{w} = \frac{1}{T} \sum_{t=1}^{T} w^t$,
$$\mathbb{E}[f(\overline{w})] - f(w^*) \leq \frac{B \rho}{\sqrt{T}}.$$
Therefore, for a given $\epsilon > 0$, to achieve $\mathbb{E}[f(\overline{w})] - f(w^*) \leq \epsilon$ it suffices to run $T \geq (B\rho / \epsilon)^2$ iterations of SGD.
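Below is a minimal sketch of SGD with iterate averaging under this stepsize, on a toy objective $\mathcal{L}_{\mathcal{D}}(w) = \mathbb{E}_z[\lVert w - z \rVert^2 / 2]$ whose minimizer is the mean of $z$; the helper name `averaged_sgd` and the values of $B$ and $\rho$ are my own (heuristic) choices, not part of the notes.

```python
import numpy as np

def averaged_sgd(sample_grad, n, T, B, rho, seed=0):
    """Run T SGD steps with eta = B/(rho*sqrt(T)) and return the averaged iterate."""
    rng = np.random.default_rng(seed)
    eta = B / rho / np.sqrt(T)
    w, w_sum = np.zeros(n), np.zeros(n)
    for _ in range(T):
        w_sum += w                        # accumulate w^1, ..., w^T
        w = w - eta * sample_grad(w, rng)
    return w_sum / T

# Toy problem: minimize E_z[||w - z||^2 / 2] with z ~ N(mu, I); the minimizer is mu.
# Note: rho is only a heuristic bound here, Gaussian noise is not almost surely bounded.
mu = np.array([1.0, -2.0])
grad = lambda w, rng: w - (mu + rng.normal(size=2))   # gradient of the per-sample loss
print(averaged_sgd(grad, n=2, T=10_000, B=5.0, rho=10.0))  # approx. mu
```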
Proof: We use the notation $v_{1:T} = v_1, ..., v_T$. Since $f$ is convex we can apply Jensen's inequality to $\overline{w} = \frac{1}{T} \sum_{t=1}^{T} w^t$ and obtain
$$f(\overline{w}) \leq \frac{1}{T} \sum_{t=1}^{T} f(w^t), \quad \text{hence} \quad \mathbb{E}[f(\overline{w})] - f(w^*) \leq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[f(w^t) - f(w^*)\big].$$
Next recall the law of total expectation: Let $\alpha$, $\beta$ be random variables and $g$ some function, then $$\mathbb{E}_{\alpha}[g(\alpha)] = \mathbb{E}_{\beta}\big[\mathbb{E}_{\alpha}[g(\alpha) \mid \beta]\big].$$ Put $\alpha = v_{1:t}$ and $\beta = v_{1:t-1}$, then
$$\mathbb{E}_{v_{1:t}}[(w^t - w^*, v_t)] = \mathbb{E}_{v_{1:t-1}}\big[\mathbb{E}_{v_{1:t}}[(w^t - w^*, v_t) \mid v_{1:t-1}]\big].$$
SGD allows us to directly minimize $\mathcal{L}_{\mathcal{D}}$. For simplicity we assume that $l(-, z)$ is differentiable for all $z \in Z$. We construct the random direction $v_t$ as follows: Sample $z \sim \mathcal{D}$ and put
$$ v_t = \nabla l(w^t, z)$$
where the gradient is taken w.r.t. $w$. Interchanging integration and gradient we get
$$\mathbb{E}_{z \sim \mathcal{D}}[v_t] = \mathbb{E}_{z \sim \mathcal{D}}[\nabla l(w^t, z)] = \nabla \mathbb{E}_{z \sim \mathcal{D}}[l(w^t, z)] = \nabla \mathcal{L}_{\mathcal{D}}(w^t),$$
so $v_t$ is an unbiased estimate of the gradient of $\mathcal{L}_{\mathcal{D}}$ at $w^t$.
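This unbiasedness can be illustrated with a Monte Carlo estimate (my own sketch, assuming the toy loss $l(w,z) = (w-z)^2/2$ with $z \sim \mathcal{N}(\mu, 1)$, so that $\nabla \mathcal{L}_{\mathcal{D}}(w) = w - \mu$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, w = 1.5, 0.3

# Per-sample gradient of l(w, z) = (w - z)^2 / 2 is w - z; its mean over
# z ~ N(mu, 1) should match the gradient of the expected loss, w - mu.
z = rng.normal(loc=mu, size=100_000)
print((w - z).mean(), "approx.", w - mu)
```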
The same argument can be applied to the subgradient case. Let $v_t \in \partial l(w^t, z)$ for a sample $z \sim \mathcal{D}$. Then by definition for all $u$
$$ l(u,z) - l(w^t,z) \geq (u-w^t, v_t)$$
By applying the expectation over $z \sim \mathcal{D}$ on both sides of the inequality we get
$$\mathcal{L}_{\mathcal{D}}(u) - \mathcal{L}_{\mathcal{D}}(w^t) \geq (u - w^t, \mathbb{E}_{z \sim \mathcal{D}}[v_t]) \quad \text{for all } u,$$
i.e. $\mathbb{E}_{z \sim \mathcal{D}}[v_t] \in \partial \mathcal{L}_{\mathcal{D}}(w^t)$.
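As an illustration of the subgradient case (my own sketch, not from the notes), SGD with $v_t = \mathrm{sign}(w^t - z) \in \partial l(\cdot, z)(w^t)$ for the absolute loss $l(w,z) = |w - z|$ drives the averaged iterate towards a minimizer of $\mathcal{L}_{\mathcal{D}}$, i.e. a median of $\mathcal{D}$; the distribution, $T$, $B$ and $\rho$ are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, B, rho = 20_000, 5.0, 1.0
eta = B / rho / np.sqrt(T)

# Non-differentiable per-sample loss l(w, z) = |w - z| with z ~ Exp(1);
# sign(w - z) is a subgradient, and L_D(w) = E|w - z| is minimized at the
# median of D, which is log 2 for Exp(1).
w, w_sum = 0.0, 0.0
for _ in range(T):
    w_sum += w
    z = rng.exponential()
    v = np.sign(w - z)        # v_t in the subdifferential of l(., z) at w^t
    w -= eta * v
print(w_sum / T, "approx.", np.log(2))
```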