Variational Inference - Deriving ELBO

4 min read • Published: November 24, 2018

This post describes two approaches for deriving the evidence lower bound (ELBO) used in variational inference. Let us begin with a little bit of motivation.

Consider a probabilistic model where we are interested in maximizing the marginal likelihood p(X), for which direct optimization is difficult, but optimizing the complete-data likelihood p(X, Z) is significantly easier.

In a Bayesian setting, we condition on the data X and compute the posterior distribution p(Z | X) over the latent variables given our observed data. This, however, often requires approximate inference. There are two general approaches: sampling using MCMC, and optimization using variational inference.

The main idea behind variational inference is to consider a family of densities \mathcal{Q} over the latent variables, and use optimization to find the q(Z) that best approximates our target posterior p(Z | X). We measure the quality of the approximation using the Kullback-Leibler (KL) divergence, that is

q^*(Z) = {\arg\min}_{q(Z) \in \mathcal{Q}} KL(q(Z)\ ||\ p(Z | X)).

However, optimizing the KL divergence directly is not tractable, because it requires us to compute the log posterior \log p(Z | X), specifically

KL(q(Z)\ ||\ p(Z | X)) = -\mathrm{E}_q \left[\log \frac{p(Z | X)}{q(Z)} \right].

We can however do a bit of equation shuffling (note we omit the explicit density subscript on the expectations, since all of them are taken w.r.t. q)

\begin{aligned} KL(q(Z)\ ||\ p(Z | X)) &= -\mathrm{E} \left[\log \frac{p(Z | X)}{q(Z)} \right] \\ &= \mathrm{E} \left[\log \frac{q(Z)}{p(Z | X)} \right] \\ &= \mathrm{E} \left[\log q(Z) \right] - \mathrm{E} \left[\log p(Z | X) \right] \\ &= \mathrm{E} \left[\log q(Z) \right] - \mathrm{E} \left[\log p(Z, X) \right] + \mathrm{E} \left[ \log p(X) \right] \\ &= \mathrm{E} \left[\log \frac{q(Z)}{p(Z, X)} \right] + \log p(X) \\ &= -\mathrm{E} \left[\log \frac{p(Z, X)}{q(Z)} \right] + \log p(X) \\ \end{aligned}

where replacing \mathrm{E}[\log p(X)] with \log p(X) is justified because p(X) does not depend on Z, and so is a constant under the expectation w.r.t. q(Z). Rearranging, moving everything except \log p(X) to the right, we get

\log p(X) = \mathrm{E} \left[\log \frac{p(Z, X)}{q(Z)} \right] + KL(q(Z)\ ||\ p(Z | X)).
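As a sanity check, this decomposition can be verified numerically on a tiny discrete model. The sketch below uses a binary latent variable Z with a made-up joint p(X, Z) (the numbers are purely illustrative, not from any real model):

```python
import math

# Hypothetical joint p(X, Z) for one observed X and a binary latent Z.
p_joint = [0.3, 0.1]                     # p(X, Z=0), p(X, Z=1)
p_x = sum(p_joint)                       # marginal p(X) = 0.4
posterior = [p / p_x for p in p_joint]   # p(Z | X) = [0.75, 0.25]

# An arbitrary variational distribution q(Z).
q = [0.6, 0.4]

# ELBO = E_q[log p(X, Z) / q(Z)]
elbo = sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))
# KL(q || p(Z | X)) = E_q[log q(Z) / p(Z | X)]
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

print(elbo + kl)         # equals log p(X)
print(math.log(p_x))
```

No matter which q we pick, the ELBO and the KL term always sum to the same constant \log p(X).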

The first term on the right is usually called the evidence lower bound (ELBO, or variational lower bound). Let us denote it as

\mathcal{L}(q) = \mathrm{E} \left[\log \frac{p(Z, X)}{q(Z)} \right]

giving us the final equation

\log p(X) = \mathcal{L}(q) + KL(q(Z)\ ||\ p(Z | X)).

Now comes the interesting part. Because we optimize only over q, \log p(X) does not change when q changes. And because the KL divergence between q(Z) and p(Z | X) is always non-negative, \mathcal{L}(q) must be a lower bound on \log p(X). As a result, since the sum on the right must always equal the constant \log p(X), increasing \mathcal{L}(q) must decrease KL(q(Z) || p(Z | X)). But this is what we wanted all along!

If we find a way to maximize the ELBO, we are effectively minimizing the KL divergence between our approximate distribution q(Z) and our target posterior distribution p(Z | X). If we were to choose q(Z) = p(Z | X), the KL divergence would be zero, and \mathcal{L}(q) = \log p(X). This justifies maximizing the ELBO as an objective in variational inference.
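To illustrate, we can continue the toy binary-latent example (again with made-up numbers) and sweep q over a grid: the ELBO peaks exactly where q(Z) matches the true posterior, at which point it equals \log p(X).

```python
import math

p_joint = [0.3, 0.1]             # hypothetical joint p(X, Z), binary Z
p_x = sum(p_joint)               # p(X) = 0.4
posterior = p_joint[0] / p_x     # p(Z=0 | X) = 0.75

def elbo(q0):
    """ELBO for the variational distribution q = [q0, 1 - q0]."""
    q = [q0, 1.0 - q0]
    return sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))

# Grid search over q0 in (0, 1).
best = max((i / 1000 for i in range(1, 1000)), key=elbo)
print(best)                      # 0.75 — the posterior probability
print(elbo(best))                # log 0.4 — the log marginal likelihood
```

The maximizing q recovers the posterior, and the ELBO attains \log p(X) there, exactly as the decomposition predicts.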

ELBO using Jensen’s inequality

Jensen’s inequality gives us an alternative, shorter route to the ELBO.

In simple terms, Jensen’s inequality states that for a concave function f(x) and a random variable X we get

f(\mathrm{E}[X]) \geq \mathrm{E}[f(X)].

Recall that we’re interested in

\log p(X) = \log \left( \sum_Z p(X, Z) \right).

Introducing a new density q(Z) on the latent variable Z, we can re-write the last equation as

\log \left( \sum_Z p(X, Z) \frac{q(Z)}{q(Z)} \right) = \log \left( \sum_Z q(Z) \frac{p(X, Z)}{q(Z)} \right) = \log \mathrm{E}_q \left[ \frac{p(X, Z)}{q(Z)} \right].

We can now simply apply Jensen’s inequality (\log is concave) and immediately arrive at the ELBO as a lower bound, since

\log p(X) = \log \mathrm{E}_q \left[ \frac{p(X, Z)}{q(Z)} \right] \geq \mathrm{E}_q \left[ \log \frac{p(X, Z)}{q(Z)} \right] = \mathcal{L}(q).
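The inequality is easy to check numerically. Reusing the hypothetical binary-latent numbers from before, \log of the expectation of the ratio p/q (which is \log p(X)) always dominates the expectation of its \log (which is the ELBO):

```python
import math

p_joint = [0.3, 0.1]   # hypothetical joint p(X, Z), binary Z
q = [0.6, 0.4]         # an arbitrary q(Z)

ratio = [pj / qz for pj, qz in zip(p_joint, q)]

# log E_q[p(X, Z) / q(Z)] = log p(X)
log_of_exp = math.log(sum(qz * r for qz, r in zip(q, ratio)))
# E_q[log p(X, Z) / q(Z)] = ELBO
exp_of_log = sum(qz * math.log(r) for qz, r in zip(q, ratio))

print(log_of_exp >= exp_of_log)   # True, by Jensen (log is concave)
```

The gap between the two quantities is exactly KL(q(Z) || p(Z | X)) from the first derivation.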

Note that we arrived at the exact same expression as above, showing again that \mathcal{L}(q) is indeed a lower bound on \log p(X).



If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.
