This post describes two approaches for deriving the Evidence Lower Bound (ELBO) used in variational inference. Let us begin with a little bit of motivation.
Consider a probabilistic model with observations $x$ and latent variables $z$, where we are interested in maximizing the marginal likelihood $p(x) = \int p(x, z)\, dz$, for which direct optimization is difficult, but optimizing the complete-data likelihood $p(x, z)$ is significantly easier.
In a Bayesian setting, we condition on the data and compute the posterior distribution $p(z \mid x)$ over the latent variables given our observed data. This may, however, require approximate inference. There are two general approaches: sampling, using Markov chain Monte Carlo (MCMC), and optimization, using variational inference.
The main idea behind variational inference is to consider a family of densities $q(z)$ over the latent variables, and use optimization to find the member that best approximates our target posterior $p(z \mid x)$. We measure closeness using the Kullback-Leibler divergence, that is

$$ q^*(z) = \operatorname*{arg\,min}_{q} \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big). $$
However, optimizing the KL divergence directly is not tractable, because it requires us to compute the log posterior $\log p(z \mid x)$, specifically

$$ \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}\left[\log q(z)\right] - \mathbb{E}\left[\log p(z \mid x)\right]. $$
We can however do a bit of equation shuffling (note we omit the explicit density in the expectations, since all of them are taken w.r.t. $q(z)$)

$$ \begin{aligned} \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big) &= \mathbb{E}\left[\log q(z)\right] - \mathbb{E}\left[\log p(z \mid x)\right] \\ &= \mathbb{E}\left[\log q(z)\right] - \mathbb{E}\left[\log p(z, x)\right] + \mathbb{E}\left[\log p(x)\right] \\ &= \mathbb{E}\left[\log q(z)\right] - \mathbb{E}\left[\log p(z, x)\right] + \log p(x), \end{aligned} $$
where the second equality uses $p(z \mid x) = p(z, x) / p(x)$, and the last is a consequence of $\log p(x)$ being independent of $z$. Re-writing the equation and moving everything except for $\log p(x)$ to the right, we get

$$ \log p(x) = \mathbb{E}\left[\log p(z, x)\right] - \mathbb{E}\left[\log q(z)\right] + \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big). $$
The first two terms on the right, taken together, are usually called the evidence lower bound (ELBO, or variational lower bound). Let us denote it as

$$ \operatorname{ELBO}(q) = \mathbb{E}\left[\log p(z, x)\right] - \mathbb{E}\left[\log q(z)\right], $$
giving us the final equation

$$ \log p(x) = \operatorname{ELBO}(q) + \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big). $$
Now comes the interesting part. Because we optimize by changing $q$, the evidence $\log p(x)$ does not change when $q$ changes. And because the KL divergence between $q(z)$ and $p(z \mid x)$ is always non-negative, $\operatorname{ELBO}(q)$ must be a lower bound on $\log p(x)$. Since the left-hand side is constant with respect to $q$, the two terms on the right must always sum to the same value, which means that increasing $\operatorname{ELBO}(q)$ must decrease $\operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big)$. But this is what we wanted all along!
If we find a way to maximize the ELBO, we are effectively minimizing the KL divergence between our approximate distribution $q(z)$ and our target posterior distribution $p(z \mid x)$. If we were to choose $q(z) = p(z \mid x)$, the KL divergence would be zero, and $\operatorname{ELBO}(q) = \log p(x)$. This justifies maximizing the ELBO as an objective in variational inference.
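As a quick sanity check, the decomposition $\log p(x) = \operatorname{ELBO}(q) + \operatorname{KL}$ can be verified numerically on a tiny discrete model. The joint probabilities and the variational distribution below are made-up toy values, not from any real model:

```python
import numpy as np

# Toy joint p(z, x) over 3 latent states, for one fixed observation x.
# These numbers are arbitrary; they only need to be non-negative.
p_joint = np.array([0.10, 0.25, 0.15])   # p(z, x) for z = 0, 1, 2
p_x = p_joint.sum()                      # evidence p(x)
p_post = p_joint / p_x                   # posterior p(z | x)

# An arbitrary variational distribution q(z).
q = np.array([0.5, 0.3, 0.2])

# ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))
# KL(q(z) || p(z | x))
kl = np.sum(q * (np.log(q) - np.log(p_post)))

# The decomposition holds exactly ...
assert np.isclose(elbo + kl, np.log(p_x))
# ... and since KL >= 0, the ELBO lower-bounds the evidence.
assert elbo <= np.log(p_x)
```

The two assertions hold for any valid choice of `q`, which is exactly the point: the sum is pinned to $\log p(x)$, so pushing the ELBO up pushes the KL down.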
ELBO using Jensen’s inequality
Jensen’s inequality will give us a second, more direct route to the ELBO.
In simple terms, Jensen’s inequality states that for a convex function $f$ and a random variable $X$ we get

$$ \mathbb{E}\left[f(X)\right] \ge f\big(\mathbb{E}\left[X\right]\big). $$
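Since $\log$ is concave rather than convex, the inequality flips to $\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$, which is the direction we will use. A minimal numeric illustration, with an arbitrary toy random variable:

```python
import numpy as np

# Arbitrary discrete random variable: values and their probabilities (toy numbers).
x = np.array([0.5, 1.0, 4.0])
p = np.array([0.2, 0.5, 0.3])

lhs = np.log(np.sum(p * x))   # log E[X]
rhs = np.sum(p * np.log(x))   # E[log X]

# Jensen's inequality for the concave log: log E[X] >= E[log X]
assert lhs >= rhs
```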
Recall that we’re interested in

$$ \log p(x) = \log \int p(x, z)\, dz. $$
Introducing a new density $q(z)$ on the latent variable, we can re-write the last equation as

$$ \log p(x) = \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz = \log \mathbb{E}_{q}\left[\frac{p(x, z)}{q(z)}\right]. $$
We can now simply apply Jensen’s inequality (remembering that $\log$ is concave, which flips the direction of the inequality) and immediately arrive at the ELBO as a lower bound, since

$$ \log \mathbb{E}_{q}\left[\frac{p(x, z)}{q(z)}\right] \ge \mathbb{E}_{q}\left[\log \frac{p(x, z)}{q(z)}\right] = \mathbb{E}\left[\log p(x, z)\right] - \mathbb{E}\left[\log q(z)\right] = \operatorname{ELBO}(q). $$
Note that we got the exact same expression as above, showing that $\operatorname{ELBO}(q)$ is indeed a lower bound on $\log p(x)$.
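The Jensen form of the bound can also be checked numerically, together with the fact that it becomes tight when $q(z) = p(z \mid x)$. The joint probabilities below are the same arbitrary toy values as before:

```python
import numpy as np

# Toy joint p(z, x) over 3 latent states (arbitrary non-negative values).
p_joint = np.array([0.10, 0.25, 0.15])
p_x = p_joint.sum()          # evidence p(x)
p_post = p_joint / p_x       # posterior p(z | x)

def elbo(q):
    """ELBO in its Jensen form: E_q[log(p(x, z) / q(z))]."""
    return np.sum(q * np.log(p_joint / q))

# For an arbitrary q != posterior, the ELBO strictly lower-bounds the evidence ...
q = np.array([0.5, 0.3, 0.2])
assert elbo(q) < np.log(p_x)

# ... and the bound is tight exactly when q is the true posterior.
assert np.isclose(elbo(p_post), np.log(p_x))
```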
If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear, don't be afraid to mention it. The goal of these articles is to be as informative as possible.