Variational Autoencoders (VAEs) are "deep" but conceptually simple generative models. To sample a data point:
- First, sample latent variables $\mbz_n$,
  \begin{align*}
  \mbz_n &\sim \mathrm{N}(\mbzero, \mbI).
  \end{align*}
- Then sample the data point $\mbx_n$ from a conditional distribution with mean,
  \begin{align*}
  \E[\mbx_n \mid \mbz_n] &= g(\mbz_n; \mbtheta),
  \end{align*}
  where $g: \reals^H \to \reals^D$ is a nonlinear mapping parameterized by $\mbtheta$.
We will assume the conditional distribution $p(\mbx_n \mid \mbz_n; \mbtheta)$ is a simple distribution (e.g., a Gaussian) with mean $g(\mbz_n; \mbtheta)$.
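To make this concrete, here is a minimal PyTorch sketch of the generative process, assuming a Gaussian likelihood with mean $g(\mbz_n; \mbtheta)$ given by a small feedforward network. The layer sizes and the fixed noise scale are illustrative choices, not part of the model specification above.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Nonlinear mean function g(z; theta): R^H -> R^D."""
    def __init__(self, latent_dim=2, data_dim=10, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, data_dim),
        )

    def forward(self, z):
        return self.net(z)

# Sample a data point: z_n ~ N(0, I), then x_n | z_n ~ N(g(z_n; theta), sigma^2 I).
decoder = Decoder()
z_n = torch.randn(2)                       # latent variables
mean = decoder(z_n)                        # g(z_n; theta)
x_n = mean + 0.1 * torch.randn_like(mean)  # illustrative noise scale sigma = 0.1
```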
We have two goals. The learning goal is to find the parameters that maximize the marginal likelihood of the data,
\begin{align*}
\mbtheta^\star &= \arg \max_{\mbtheta} p(\mbX; \mbtheta) \\
&= \arg \max_{\mbtheta} \prod_{n=1}^N \int p(\mbx_n \mid \mbz_n; \mbtheta) \, p(\mbz_n; \mbtheta) \dif \mbz_n.
\end{align*}
The inference goal is to find the posterior distribution of latent variables,
\begin{align*}
p(\mbz_n \mid \mbx_n; \mbtheta) &= \frac{p(\mbx_n \mid \mbz_n; \mbtheta) \, p(\mbz_n; \mbtheta)}{\int p(\mbx_n \mid \mbz_n'; \mbtheta) \, p(\mbz_n'; \mbtheta) \dif \mbz_n'}.
\end{align*}
Both goals require an integral over the latent variables $\mbz_n$, which is generally intractable when $g$ is a nonlinear mapping.
Idea: Use the evidence lower bound (ELBO) to get a lower bound on the log marginal likelihood and maximize that instead.
\begin{align*}
\log p(\mbX ; \mbtheta)
&= \sum_{n=1}^N \log p(\mbx_n; \mbtheta) \\
&\geq \sum_{n=1}^N \log p(\mbx_n; \mbtheta) - \KL{q_n(\mbz_n; \mblambda_n)}{p(\mbz_n \mid \mbx_n; \mbtheta)} \\
&= \sum_{n=1}^N \underbrace{\E_{q_n(\mbz_n)}\left[ \log p(\mbx_n, \mbz_n; \mbtheta) - \log q_n(\mbz_n; \mblambda_n) \right]}_{\text{"local ELBO"}} \\
&\triangleq \sum_{n=1}^N \cL_n(\mblambda_n, \mbtheta) \\
&= \cL(\mblambda, \mbtheta)
\end{align*}
where $\mblambda = \{\mblambda_n\}_{n=1}^N$ denotes the collection of variational parameters.
Here, I've written the ELBO as a sum of local ELBOs, one for each data point.
The ELBO is still maximized (and the bound is tight) when each $q_n(\mbz_n; \mblambda_n)$ equals the true posterior $p(\mbz_n \mid \mbx_n; \mbtheta)$.
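As a sanity check on the definition, here is a hedged sketch of a naive Monte Carlo estimate of the local ELBO for a diagonal-Gaussian $q_n$, reusing the hypothetical `Decoder` above; the Gaussian likelihood and the fixed noise scale are again illustrative assumptions.

```python
import torch
from torch.distributions import Normal, Independent

def local_elbo(x_n, lambda_n, decoder, num_samples=100, sigma=0.1):
    """Monte Carlo estimate of L_n = E_q[log p(x_n, z_n) - log q(z_n)]."""
    mu_n, log_sigma_n = lambda_n
    q_n = Independent(Normal(mu_n, log_sigma_n.exp()), 1)
    z = q_n.sample((num_samples,))                      # z_n^(m) ~ q_n
    prior = Independent(Normal(torch.zeros_like(mu_n), torch.ones_like(mu_n)), 1)
    lik = Independent(Normal(decoder(z), sigma), 1)     # illustrative Gaussian likelihood
    log_joint = lik.log_prob(x_n) + prior.log_prob(z)   # log p(x_n, z_n; theta)
    return (log_joint - q_n.log_prob(z)).mean()
```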
:::{admonition} Question
Suppose
:::
Nevertheless, we can still constrain $q_n$ to lie in a simpler variational family, $\{q(\mbz_n; \mblambda_n) : \mblambda_n \in \mbLambda\}$.
Then, for fixed parameters $\mbtheta$, the best we can do is choose the member of that family that maximizes the local ELBO (equivalently, minimizes the KL divergence to the true posterior).
Now we can introduce a new algorithm.
:::{prf:algorithm} Variational EM (vEM)
Repeat until either the ELBO or the parameters converge:

- M-step: Set $\mbtheta \leftarrow \arg \max_{\mbtheta} \cL(\mblambda, \mbtheta)$.
- E-step: Set $\mblambda_n \leftarrow \arg \max_{\mblambda_n \in \mbLambda} \cL_n(\mblambda_n, \mbtheta)$ for $n=1,\ldots,N$.
- Compute (an estimate of) the ELBO $\cL(\mblambda, \mbtheta)$.
:::
In general, none of these steps will have closed form solutions, so we'll have to use approximations.
For exponential family mixture models, the M-step had a closed form solution. For deep generative models, we need a more general approach.
If the parameters are unconstrained and the ELBO is differentiable wrt $\mbtheta$, we can perform the M-step (at least approximately) with stochastic gradient ascent.
Note that the expected gradient wrt $\mbtheta$, $\E_{q(\mbz_n; \mblambda_n)}[\nabla_{\mbtheta} \log p(\mbx_n, \mbz_n; \mbtheta)]$, equals $\nabla_{\mbtheta} \cL_n(\mblambda_n, \mbtheta)$, since the distribution $q(\mbz_n; \mblambda_n)$ does not depend on $\mbtheta$; a Monte Carlo average over samples $\mbz_n \sim q(\mbz_n; \mblambda_n)$ therefore gives an unbiased gradient estimate for the M-step.
Assume we want to take the same approach for the E-step and optimize $\mblambda_n$ by stochastic gradient ascent as well.
To perform SGD, we need an unbiased estimate of the gradient of the local ELBO, but
\begin{align*}
\nabla_{\mblambda_n} \cL_n(\mblambda_n, \mbtheta)
&= \nabla_{\mblambda_n} \E_{q(\mbz_n; \mblambda_n)} \left[ \log p(\mbx_n, \mbz_n; \mbtheta) - \log q(\mbz_n; \mblambda_n) \right] \\
&\textcolor{red}{\neq} \; \E_{q(\mbz_n; \mblambda_n)} \left[ \nabla_{\mblambda_n} \left(\log p(\mbx_n, \mbz_n; \mbtheta) - \log q(\mbz_n; \mblambda_n)\right) \right],
\end{align*}
because the distribution under the expectation itself depends on $\mblambda_n$.
One way around this problem is to use the reparameterization trick, aka the pathwise gradient estimator. Note that,
\begin{align*}
\mbz_n \sim q(\mbz_n; \mblambda_n)
\quad \iff \quad
\mbz_n = r(\mblambda_n, \mbepsilon), \quad \mbepsilon \sim \cN(\mbzero, \mbI),
\end{align*}
where $r$ is a deterministic and differentiable function that maps the noise $\mbepsilon$ to a sample of $q(\mbz_n; \mblambda_n)$. For example, if $q(\mbz_n; \mblambda_n) = \cN(\mbmu_n, \mathrm{diag}(\mbsigma_n^2))$ with $\mblambda_n = (\mbmu_n, \log \mbsigma_n)$, then $r(\mblambda_n, \mbepsilon) = \mbmu_n + \mbsigma_n \odot \mbepsilon$.
We can use the law of the unconscious statistician to rewrite the expectations as,
\begin{align*}
\E_{q(\mbz_n; \mblambda_n)} \left[h(\mbx_n, \mbz_n, \mbtheta, \mblambda_n) \right]
&= \E_{\mbepsilon \sim \cN(\mbzero, \mbI)} \left[h(\mbx_n, r(\mblambda_n, \mbepsilon), \mbtheta, \mblambda_n) \right]
\end{align*}
where
\begin{align*}
h(\mbx_n, \mbz_n, \mbtheta, \mblambda_n) = \log p(\mbx_n, \mbz_n; \mbtheta) - \log q(\mbz_n; \mblambda_n).
\end{align*}
Now the distribution that the expectation is taken under no longer depends on the variational parameters $\mblambda_n$, so we can exchange the gradient and the expectation and obtain an unbiased Monte Carlo estimate of $\nabla_{\mblambda_n} \cL_n(\mblambda_n, \mbtheta)$.
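Here is a minimal sketch of the resulting pathwise gradient estimator for a diagonal-Gaussian $q(\mbz_n; \mblambda_n)$ with $r(\mblambda_n, \mbepsilon) = \mbmu_n + \mbsigma_n \odot \mbepsilon$; the `log_joint` argument stands in for $\log p(\mbx_n, \mbz_n; \mbtheta)$ and is assumed to evaluate a batch of samples.

```python
import torch
from torch.distributions import Normal, Independent

def pathwise_elbo_grad(x_n, mu_n, log_sigma_n, log_joint, num_samples=10):
    """Unbiased reparameterization (pathwise) estimate of the gradient of the
    local ELBO wrt the variational parameters lambda_n = (mu_n, log_sigma_n).
    Assumes mu_n and log_sigma_n are leaf tensors with requires_grad=True and
    that log_joint(x_n, z) returns log p(x_n, z; theta) for each row of z."""
    eps = torch.randn(num_samples, mu_n.shape[0])    # eps^(m) ~ N(0, I)
    z = mu_n + log_sigma_n.exp() * eps               # z^(m) = r(lambda_n, eps^(m))
    q_n = Independent(Normal(mu_n, log_sigma_n.exp()), 1)
    elbo_hat = (log_joint(x_n, z) - q_n.log_prob(z)).mean()
    elbo_hat.backward()                              # gradients flow through r
    return mu_n.grad, log_sigma_n.grad
```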
We can view the ELBO as an expectation over data indices,
\begin{align*}
\cL(\mblambda, \mbtheta)
&= \sum_{n=1}^N \cL_n(\mblambda_n, \mbtheta) \\
&= N \, \E_{n \sim \mathrm{Unif}([N])}[\cL_n(\mblambda_n, \mbtheta)].
\end{align*}
We can use Monte Carlo to approximate the expectation (and its gradient) by drawing mini-batches of data points at random.
In practice, we often cycle through mini-batches of data points deterministically. Each pass over the whole dataset is called an epoch.
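For example, a mini-batch estimate of the full ELBO just rescales the average of the sampled local ELBOs. A minimal sketch, assuming `local_elbos` holds Monte Carlo estimates of $\cL_n$ for the sampled indices:

```python
import torch

def full_elbo_estimate(local_elbos, N):
    """Unbiased estimate of the full ELBO: N * E_n[L_n] is approximated by
    N times the average of the local ELBOs in the sampled mini-batch."""
    return N * local_elbos.mean()
```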
Now we can add some detail to our variational expectation maximization algorithm.
:::{prf:algorithm} Variational EM (with the reparameterization trick)
For epoch $i = 1, 2, \ldots$ until either the ELBO or the parameters converge:

For each data point $n = 1, \ldots, N$:

- Sample $\mbepsilon_n^{(m)} \overset{\text{iid}}{\sim} \cN(\mbzero, \mbI)$ for $m=1,\ldots,M$.
- M-step:
  a. Estimate the gradient
     \begin{align*}
     \hat{\nabla}_{\mbtheta} \cL_n(\mblambda_n, \mbtheta)
     &= \frac{1}{M} \sum_{m=1}^M \nabla_{\mbtheta} \log p(\mbx_n, r(\mblambda_n, \mbepsilon_n^{(m)}); \mbtheta)
     \end{align*}
  b. Set $\mbtheta \leftarrow \mbtheta + \alpha_i N \hat{\nabla}_{\mbtheta} \cL_n(\mblambda_n, \mbtheta)$.
- E-step:
  a. Estimate the gradient
     \begin{align*}
     \hat{\nabla}_{\mblambda_n} \cL_n(\mblambda_n, \mbtheta)
     &= \frac{1}{M} \sum_{m=1}^M \nabla_{\mblambda_n} \left[\log p(\mbx_n, r(\mblambda_n, \mbepsilon_n^{(m)}); \mbtheta) - \log q(r(\mblambda_n, \mbepsilon_n^{(m)}); \mblambda_n) \right]
     \end{align*}
  b. Set $\mblambda_n \leftarrow \mblambda_n + \alpha_i \hat{\nabla}_{\mblambda_n} \cL_n(\mblambda_n, \mbtheta)$.
- Estimate the ELBO,
  \begin{align*}
  \hat{\cL}(\mblambda, \mbtheta)
  &= \frac{N}{M} \sum_{m=1}^M \left[ \log p(\mbx_n, r(\mblambda_n, \mbepsilon_n^{(m)}); \mbtheta) - \log q(r(\mblambda_n, \mbepsilon_n^{(m)}); \mblambda_n) \right]
  \end{align*}

Decay the step size $\alpha_i$ according to a schedule.
:::
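Putting these pieces together, here is a hedged sketch of the loop above for a diagonal-Gaussian variational family and a Gaussian likelihood. For brevity, the factor of $N$ in the M-step and the step-size decay are folded into a fixed learning rate, which is a simplification of the algorithm as stated.

```python
import torch
from torch.distributions import Normal, Independent

def variational_em(X, decoder, num_epochs=100, latent_dim=2, M=1, lr=1e-3, sigma=0.1):
    """Sketch of stochastic variational EM with per-data-point variational
    parameters, assuming a Gaussian likelihood x_n ~ N(decoder(z_n), sigma^2 I)."""
    N, D = X.shape
    mu = torch.zeros(N, latent_dim, requires_grad=True)           # lambda_n means
    log_sigma_q = torch.zeros(N, latent_dim, requires_grad=True)  # lambda_n log-scales
    optimizer = torch.optim.SGD(list(decoder.parameters()) + [mu, log_sigma_q], lr=lr)

    for epoch in range(num_epochs):
        for n in torch.randperm(N):                   # one pass over the data = one epoch
            optimizer.zero_grad()
            eps = torch.randn(M, latent_dim)          # eps^(m) ~ N(0, I)
            z = mu[n] + log_sigma_q[n].exp() * eps    # z^(m) = r(lambda_n, eps^(m))
            q_n = Independent(Normal(mu[n], log_sigma_q[n].exp()), 1)
            prior = Independent(Normal(torch.zeros(latent_dim), torch.ones(latent_dim)), 1)
            lik = Independent(Normal(decoder(z), sigma), 1)
            local_elbo = (lik.log_prob(X[n]) + prior.log_prob(z) - q_n.log_prob(z)).mean()
            (-local_elbo).backward()   # ascend L_n; N scaling and alpha_i decay folded into lr
            optimizer.step()
    return mu, log_sigma_q
```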
Note that vEM involves optimizing separate variational parameters $\mblambda_n$ for each data point $\mbx_n$, which becomes expensive for large datasets.
Note that the optimal variational parameters are just a function of the data point and the model parameters,
\begin{align*}
\mblambda_n^\star &= \arg \min_{\mblambda_n} \KL{q(\mbz_n; \mblambda_n)}{p(\mbz_n \mid \mbx_n; \mbtheta)}
\triangleq f^\star(\mbx_n, \mbtheta).
\end{align*}
for some implicit and generally nonlinear function $f^\star$.
VAEs learn an approximation to this mapping, $f(\mbx_n; \mbphi) \approx f^\star(\mbx_n, \mbtheta)$, called an inference network (aka a recognition network or encoder).
The inference network is (yet another) neural network, with parameters $\mbphi$, that takes in a data point $\mbx_n$ and outputs variational parameters $\mblambda_n = f(\mbx_n; \mbphi)$.
The advantage is that the inference network shares information across data points — it amortizes the cost of inference, hence the name. The disadvantage is the output will not minimize the KL divergence. However, in practice we might tolerate a worse variational posterior and a weaker lower bound if it leads to faster optimization of the ELBO overall.
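For example, an inference network for a diagonal-Gaussian variational posterior might map $\mbx_n$ to $(\mbmu_n, \log \mbsigma_n)$; the architecture below is only an illustrative sketch.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Inference network f(x; phi) -> variational parameters lambda = (mu, log_sigma)."""
    def __init__(self, data_dim=10, latent_dim=2, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.log_sigma_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu_head(h), self.log_sigma_head(h)
```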
Logically, I find it helpful to distinguish between the E and M steps, but with recognition networks and stochastic gradient ascent, the line is blurred.
The final algorithm looks like this.
:::{prf:algorithm} Variational EM (with amortized inference)
Repeat until either the ELBO or the parameters converge:

- Sample a data point $n \sim \mathrm{Unif}(1, \ldots, N)$. [Or a mini-batch of data points.]
- Estimate the local ELBO $\cL_n(\mbphi, \mbtheta)$ with Monte Carlo. [Note: it is a function of $\mbphi$ instead of $\mblambda_n$.]
- Compute unbiased Monte Carlo estimates of the gradients $\widehat{\nabla}_{\mbtheta} \cL_n(\mbphi, \mbtheta)$ and $\widehat{\nabla}_{\mbphi} \cL_n(\mbphi, \mbtheta)$. [The latter requires the reparameterization trick.]
- Set
  \begin{align*}
  \mbtheta &\leftarrow \mbtheta + \alpha_i \widehat{\nabla}_{\mbtheta} \cL_n(\mbphi, \mbtheta) \\
  \mbphi &\leftarrow \mbphi + \alpha_i \widehat{\nabla}_{\mbphi} \cL_n(\mbphi, \mbtheta)
  \end{align*}
  with step size $\alpha_i$ decreasing over iterations $i$ according to a valid schedule.
:::
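Here is a hedged sketch of one such update, reusing the hypothetical `Encoder` and `Decoder` classes sketched above, with a Gaussian likelihood, a single reparameterized sample, and a plain SGD step standing in for choices the algorithm leaves open.

```python
import torch
from torch.distributions import Normal, Independent

def amortized_step(x_n, encoder, decoder, optimizer, N, sigma=0.1):
    """One stochastic step of variational EM with amortized inference."""
    optimizer.zero_grad()
    mu_n, log_sigma_n = encoder(x_n)                  # lambda_n = f(x_n; phi)
    q_n = Independent(Normal(mu_n, log_sigma_n.exp()), 1)
    z = q_n.rsample()                                 # reparameterized sample
    prior = Independent(Normal(torch.zeros_like(mu_n), torch.ones_like(mu_n)), 1)
    lik = Independent(Normal(decoder(z), sigma), 1)   # illustrative Gaussian likelihood
    local_elbo = lik.log_prob(x_n) + prior.log_prob(z) - q_n.log_prob(z)
    (-N * local_elbo).backward()                      # ascend an estimate of the full ELBO
    optimizer.step()
    return local_elbo.detach()

# Usage (illustrative):
# optimizer = torch.optim.SGD(
#     list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
```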