This post describes two approaches for deriving the Expected Lower Bound (ELBO) used in variational inference. Let us begin with a little bit of motivation.

Consider a probabilistic model where we are interested in maximizing the marginal likelihood $p(X)$ for which direct optimization is difficult, but optimizing complete-data likelihood $p(X, Z)$ is significantly easier.

In a bayesian setting, we condition on the data $X$ and compute the posterior distribution $p(Z | X)$ over the latent variables given our observed data. This may however require approximate inference. There are two general approaches, sampling using MCMC, and optimization using variational inference.

The main idea behind variational inference is to consider a family of densities $\mathcal(Q)$ over the latent variables, and use optimization to find $q(Z)$ that approximates our target posterior $p(Z | X)$. We measure this using the Kullback-Leiber divergence, that is

$$ q^*(Z) = {\arg\min}_{q(Z) \in \mathcal{Q}} KL(q(Z)\ ||\ p(Z | X)). $$

However, optimizing the KL divergence directly is not tractable, because it requires us to compute the log posterior $p(Z | X)$, specifically

$$ KL(q(Z)\ ||\ p(Z | X)) = -\mathrm{E}_q \left[\log \frac{p(Z | X)}{q(Z)} \right]. $$

We can however do a bit of equation shuffling (note we omit the explicit density in the expectation since all of them are taken w.r.t $q$)

$$
\begin{aligned}
KL(q(Z)\ ||\ p(Z | X)) &= -\mathrm{E} \left[\log \frac{p(Z | X)}{q(Z)} \right] \\\\

&= \mathrm{E} \left[\log \frac{q(Z)}{p(Z | X)} \right] \\\\

&= \mathrm{E} \left[\log q(Z) \right] - \mathrm{E} \left[\log p(Z | X) \right] \\\\

&= \mathrm{E} \left[\log q(Z) \right] - \mathrm{E} \left[\log p(Z, X) \right] + \mathrm{E} \left[ \log p(X) \right] \\\\

&= \mathrm{E} \left[\log \frac{q(Z)}{p(Z, X)} \right] + \log p(X) \\\\

&= -\mathrm{E} \left[\log \frac{p(Z, X)}{q(Z)} \right] + \log p(X) \\\\

\end{aligned}
$$

where the last equations is a consequence of $\log p(X)$ being independent of $q(Z)$. Re-writing the equation and moving everything except for $\log p(X)$ to the right we get

$$ \log p(X) = \mathrm{E} \left[\log \frac{p(Z, X)}{q(Z)} \right] + KL(q(Z)\ ||\ p(Z | X)). $$

The first term on the right is usually called the **expected lower bound** (ELBO, or **variational lower bound**). Let us denote it as

$$ \mathcal{L}(q) = \mathrm{E} \left[\log \frac{p(Z, X)}{q(Z)} \right] $$

giving us the final equation

$$ \log p(X) = \mathcal{L}(g) + KL(q(Z)\ ||\ p(Z | X)). $$

Now comes the interesting part. Because we are interested in optimizing by changing $q$, the $\log p(X)$ does not change when $q$ changes. And because the KL divergence between $q(Z)$ and $p(Z | X)$ is always positive, then $\mathcal{L}(g)$ must be a lower bound on $\log p(X)$. As a result, because changing the ELBO by manipulating $q$ does not change $\log p(X)$, the expression on the right must be equal to a constant, which means that increasing $\mathcal{L}(g)$ must decrease $KL(q(Z) || p(Z|X))$. But this is what we wanted all along!

If we find a way to maximize the ELBO, we are effectively minimizing the KL divergence between our approximate distribution $q(Z)$, and our target posterior distribution $p(Z | X)$. If we were to choose $q(Z) = p(Z | X)$, the KL divergence would be zero, and $\mathcal{L}(g) = \log p(X)$. This justifies maximizing the ELBO as an objective in variational inference.

## ELBO using Jensen's inequality

The Jensen's inequality will give us a bit of motivation behind the ELBO.

In simple terms, Jensen's inequality states that for a convex function $f(x)$ and a random variable $X$ we get

$$ E[g(X)] \geq g(E[X]). $$

Recall that we're interested in

$$ \log p(X) = \log \left( \sum_Z p(X, Z) \right). $$

Introducing a new density $q(Z)$ on the latent variable $Z$ we can re-write the last equation as

$$ \log \left( \sum_Z p(X, Z) \frac{q(Z)}{q(Z)} \right) = \log \left( \sum_Z q(Z) \frac{p(X, Z)}{q(Z)} \right) = \log \mathrm{E}_q \left[ \frac{p(X, Z)}{q(Z)} \right]. $$

We can now simply apply the Jensen's inequality and immediately arrive at the ELBO as a lower bound, since

$$ \log p(X) = \log \mathrm{E}_q \left[ \frac{p(X, Z)}{q(Z)} \right] \geq \mathrm{E}_q \left[ \log \frac{p(X, Z)}{q(Z)} \right] = \mathcal{L}(q). $$

Note that we got the same exact equation as above, showing that $\mathcal{L}$ is indeed a lower bound on $\log p(X)$.

### References

- Understanding the Variational Lower Bound, Xitong Yang
- Machine Learning: a Probabilistic Perspective, Kevin Patrick Murphy
- Variational Inference: A Review for Statisticians, David M. Blei, Alp Kucukelbir, Jon D. McAuliffe
- Pattern Recognition and Machine Learning, Christopher Bishop

#### See also

#### Liked this blog post? Share it!

comments powered by Disqus