In the previous article we derived a maximum likelihood estimate (MLE) for the
parameters of a Multinomial distribution. This time
we’re going to compute the full posterior of the Dirichlet-Categorical model as
well as derive the posterior predictive distribution. This will close our
exploration of the Bag of Words model.
As in the previous article, our likelihood is defined by a Multinomial distribution, that is

$$p(X \mid \pi) = \binom{n}{x_1, \ldots, x_K} \prod_{k=1}^{K} \pi_k^{x_k},$$

where $x_k$ is the number of times we observed the $k$-th outcome and $n = \sum_{k=1}^{K} x_k$ is the total number of trials.
Since the Dirichlet distribution is a conjugate prior to the Multinomial, we
can omit the normalization constants as we will be able to infer them
afterwards from the unnormalized posterior parameters. Knowing that the
posterior is again a Dirichlet distribution saves us a lot of tedious work.
Much like the model name would suggest, our prior will be the Dirichlet distribution, which defines a distribution over the probability simplex of the Multinomial's parameters. The prior has the form

$$p(\pi \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \qquad B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)},$$

where $\alpha = (\alpha_1, \ldots, \alpha_K)$ is the concentration parameter and $B(\alpha)$, the multivariate Beta function, acts as the normalization constant.
Because of the conjugacy, multiplying the likelihood by the prior will directly give us the shape of the posterior, and we don't have to care about the normalizing constant. As a result, we obtain

$$p(\pi \mid X) \propto p(X \mid \pi)\, p(\pi \mid \alpha) \propto \prod_{k=1}^{K} \pi_k^{x_k} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} = \prod_{k=1}^{K} \pi_k^{x_k + \alpha_k - 1}.$$

We can write this more succinctly as $p(\pi \mid X) = \mathrm{Dir}(\alpha + x)$, where $x = (x_1, \ldots, x_K)$ is the vector of counts of the observed data $X$.
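To make the conjugate update concrete, here is a minimal Python sketch of it (the concentration values and toy counts are my own illustration, not from the article):

```python
import numpy as np
from scipy.stats import dirichlet

# Prior concentration parameter for K = 3 categories (illustrative values).
alpha = np.array([2.0, 2.0, 2.0])

# Observed counts x_k, e.g. word counts in a Bag of Words model.
counts = np.array([5, 0, 3])

# Conjugacy: the posterior is again a Dirichlet with parameter alpha + x.
posterior = dirichlet(alpha + counts)

print(posterior.mean())  # posterior mean of pi
```

The entire update amounts to adding the observed counts to the prior's concentration parameter, which is exactly why the conjugate prior saves us so much tedious work.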
MAP estimate of the parameters
Since we have our posterior, we can take a small detour and compute the maximum a posteriori (MAP) estimate of the parameters, which is simply the mode of the posterior (its maximum). We can do this similarly to the previous article and use Lagrange multipliers to enforce the constraint that $\sum_{k=1}^{K} \pi_k = 1$. Since the Dirichlet distribution is again a member of the exponential family, it is easier to differentiate the log posterior, which in turn is the log likelihood plus the log prior (up to an additive constant):

$$\log p(\pi \mid X) = \sum_{k=1}^{K} (x_k + \alpha_k - 1) \log \pi_k + \text{const}.$$

The Lagrangian then has the following form:

$$\mathcal{L}(\pi, \lambda) = \sum_{k=1}^{K} (x_k + \alpha_k - 1) \log \pi_k + \lambda \left( 1 - \sum_{k=1}^{K} \pi_k \right).$$

Same as before, we differentiate the Lagrangian with respect to $\pi_k$ and set it equal to zero:

$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{x_k + \alpha_k - 1}{\pi_k} - \lambda = 0 \quad \Longrightarrow \quad \pi_k = \frac{x_k + \alpha_k - 1}{\lambda}.$$

Finally, we can apply the same trick as before and solve for $\lambda$ by summing both sides over $k$ and using the constraint $\sum_{k=1}^{K} \pi_k = 1$:

$$\lambda = \sum_{k=1}^{K} (x_k + \alpha_k - 1) = n + \alpha_0 - K, \qquad \text{where } \alpha_0 = \sum_{k=1}^{K} \alpha_k.$$

We can plug this back in to get the MAP estimate

$$\pi_k^{\text{MAP}} = \frac{x_k + \alpha_k - 1}{n + \alpha_0 - K}.$$
Comparing this with the MLE estimate, which was

$$\pi_k^{\text{MLE}} = \frac{x_k}{n},$$

we can see how the concentration parameter $\alpha$ affects the probability. If we were to set a uniform prior with $\alpha_k = 1$ for all $k$, we would recover the original MLE estimate.
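As a quick numerical illustration of the difference between the two estimates (again with made-up counts), consider:

```python
import numpy as np

counts = np.array([5, 0, 3])       # observed counts x_k
alpha = np.array([2.0, 2.0, 2.0])  # concentration parameter of the prior

n, K = counts.sum(), len(counts)

mle = counts / n
map_estimate = (counts + alpha - 1) / (n + alpha.sum() - K)

print(mle)           # [0.625 0. 0.375] -- unseen outcome gets zero probability
print(map_estimate)  # approx. [0.545 0.091 0.364] -- the prior smooths the estimate
```

Note how setting $\alpha_k > 1$ acts like adding pseudo-counts, which keeps unseen outcomes from being assigned exactly zero probability.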
The posterior predictive distribution gives us a distribution over the possible outcomes while taking into account our uncertainty in the parameters, as captured by the posterior distribution. For a general model with an outcome $\tilde{x}$ and a parameter vector $\pi$, the posterior predictive is given by the following:

$$p(\tilde{x} \mid X) = \int p(\tilde{x} \mid \pi)\, p(\pi \mid X)\, d\pi.$$
Before we can integrate this, let us introduce a small trick. For any $\pi_k$, let us define

$$\pi_{-k} = (\pi_1, \ldots, \pi_{k-1}, \pi_{k+1}, \ldots, \pi_K),$$

that is all of $\pi$ except for $\pi_k$. Using this we can write a marginal as

$$p(\pi_k) = \int_{\pi_{-k}} p(\pi_k, \pi_{-k})\, d\pi_{-k}.$$

The posterior predictive can then be re-written using this trick as a double integral

$$p(\tilde{x} \mid X) = \int_{\pi_k} \int_{\pi_{-k}} p(\tilde{x} \mid \pi)\, p(\pi_k, \pi_{-k} \mid X)\, d\pi_{-k}\, d\pi_k.$$
Posterior predictive for a single-trial Dirichlet-Categorical
If we're considering a single-trial Multinomial (Multinoulli), we have $p(\tilde{x} = k \mid \pi) = \pi_k$, which is independent of $\pi_{-k}$, simplifying the above expression to

$$p(\tilde{x} = k \mid X) = \int_{\pi_k} \pi_k \int_{\pi_{-k}} p(\pi_k, \pi_{-k} \mid X)\, d\pi_{-k}\, d\pi_k.$$

Now applying the marginalization trick we get

$$p(\tilde{x} = k \mid X) = \int_{\pi_k} \pi_k\, p(\pi_k \mid X)\, d\pi_k,$$

where $p(\pi_k \mid X)$ is a marginal of our posterior. Looking more closely at the formula, we can see this is an expectation of $\pi_k$ under the posterior, that is

$$p(\tilde{x} = k \mid X) = \mathbb{E}[\pi_k \mid X] = \frac{\hat{\alpha}_k}{\hat{\alpha}_0},$$

where $\hat{\alpha}_k = \alpha_k + x_k$ and $\hat{\alpha}_0 = \sum_{k'=1}^{K} \hat{\alpha}_{k'}$, using the known formula for the mean of a Dirichlet distribution. Repeating the result one more time for clarity, the posterior predictive for a single-trial Multinomial (Multinoulli) is given by

$$p(\tilde{x} = k \mid X) = \frac{\alpha_k + x_k}{\sum_{k'=1}^{K} (\alpha_{k'} + x_{k'})}.$$
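Since the single-trial predictive is just the posterior mean, it is easy to sanity-check with a quick Monte Carlo simulation in Python (a sketch continuing the toy numbers from above, not code from the article):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])
counts = np.array([5, 0, 3])
beta = alpha + counts  # posterior parameters

# Closed form: p(x_new = k | X) = beta_k / beta_0.
closed_form = beta / beta.sum()

# Monte Carlo: average pi over samples drawn from the posterior.
samples = dirichlet(beta).rvs(size=100_000, random_state=0)

print(closed_form)           # approx. [0.467 0.133 0.4]
print(samples.mean(axis=0))  # should closely match the closed form
```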
Posterior predictive for a general multi-trial Dirichlet-Multinomial
Generalizing the posterior predictive to a Dirichlet-Multinomial model with multiple trials is going to be a little bit more work. Let us begin by writing the posterior predictive in its full form (note we drop the conditioning on the number of trials in the likelihood for brevity, and because it is not needed). To avoid notation clashes, let us denote the parameters of the posterior by $\beta$, so we'll write $\beta_k$ and $\beta_0$ in place of $\alpha_k + x_k$ and $\sum_{k=1}^{K} (\alpha_k + x_k)$:

$$
\begin{aligned}
p(\tilde{x} \mid X) &= \int p(\tilde{x} \mid \pi)\, p(\pi \mid X)\, d\pi \\
&= \int \binom{\tilde{n}}{\tilde{x}_1, \ldots, \tilde{x}_K} \prod_{k=1}^{K} \pi_k^{\tilde{x}_k} \cdot \frac{1}{B(\beta)} \prod_{k=1}^{K} \pi_k^{\beta_k - 1}\, d\pi \\
&= \binom{\tilde{n}}{\tilde{x}_1, \ldots, \tilde{x}_K} \frac{1}{B(\beta)} \int \prod_{k=1}^{K} \pi_k^{\tilde{x}_k + \beta_k - 1}\, d\pi \\
&= \binom{\tilde{n}}{\tilde{x}_1, \ldots, \tilde{x}_K} \frac{B(\beta + \tilde{x})}{B(\beta)},
\end{aligned}
$$

where $\tilde{n} = \sum_{k=1}^{K} \tilde{x}_k$, and in the last equality we made use of knowing that the integral of an unnormalized Dirichlet distribution is $B(\beta + \tilde{x})$. Let us repeat the definition of the multivariate Beta function again, that is

$$B(\beta) = \frac{\prod_{k=1}^{K} \Gamma(\beta_k)}{\Gamma\left(\sum_{k=1}^{K} \beta_k\right)},$$

and plugging this back into the formula we computed

$$p(\tilde{x} \mid X) = \binom{\tilde{n}}{\tilde{x}_1, \ldots, \tilde{x}_K} \frac{\Gamma(\beta_0)}{\Gamma(\beta_0 + \tilde{n})} \prod_{k=1}^{K} \frac{\Gamma(\beta_k + \tilde{x}_k)}{\Gamma(\beta_k)}.$$
To move forward, we need to introduce a more general form of the Multinomial distribution which allows for non-integer counts. All it comes down to is replacing the factorials with the Gamma function, that is instead of

$$\binom{n}{x_1, \ldots, x_K} = \frac{n!}{x_1! \cdots x_K!}$$

we write

$$\frac{\Gamma(n + 1)}{\prod_{k=1}^{K} \Gamma(x_k + 1)}.$$

Since only the normalizing constant changed, we can plug it back into our posterior predictive formula

$$p(\tilde{x} \mid X) = \frac{\Gamma(\tilde{n} + 1)}{\prod_{k=1}^{K} \Gamma(\tilde{x}_k + 1)} \cdot \frac{\Gamma(\beta_0)}{\Gamma(\beta_0 + \tilde{n})} \prod_{k=1}^{K} \frac{\Gamma(\beta_k + \tilde{x}_k)}{\Gamma(\beta_k)},$$

which, although ugly, is the posterior predictive distribution in closed form :)
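To see the closed form in action, here is a small Python implementation using `scipy.special.gammaln` for numerical stability (the function name and toy numbers are mine, not the article's):

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x_new, beta):
    """Log posterior predictive of future counts x_new under a Dir(beta) posterior."""
    x_new = np.asarray(x_new, dtype=float)
    n_new = x_new.sum()
    beta0 = beta.sum()
    return (
        gammaln(n_new + 1) - gammaln(x_new + 1).sum()    # generalized multinomial coefficient
        + gammaln(beta0) - gammaln(beta0 + n_new)        # Gamma(beta_0) / Gamma(beta_0 + n)
        + (gammaln(beta + x_new) - gammaln(beta)).sum()  # prod_k Gamma(beta_k + x_k) / Gamma(beta_k)
    )

beta = np.array([7.0, 2.0, 6.0])  # posterior parameters alpha + x from before

# For a single trial (n = 1) we recover the Multinoulli predictive beta_k / beta_0.
print(np.exp(dirichlet_multinomial_logpmf([1, 0, 0], beta)))  # approx. 7/15

# Probability of seeing the counts (2, 1, 1) in four future trials.
print(np.exp(dirichlet_multinomial_logpmf([2, 1, 1], beta)))
```

Working in log space avoids overflow in the Gamma functions when the counts get large, which is the usual reason for reaching for `gammaln` instead of `gamma`.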
If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.