In the previous article we derived a maximum likelihood estimate (MLE) for the
parameters of a Multinomial distribution. This time
we’re going to compute the full posterior of the Dirichlet-Categorical model as
well as derive the posterior predictive distribution. This will close our
exploration of the Bag of Words model.
Likelihood
As in the previous article, our likelihood will be defined by a
Multinomial distribution, that is
$$p(D \mid \pi) \propto \prod_{i=1}^m \pi_i^{x_i}.$$
Since the Dirichlet distribution is a conjugate prior to the Multinomial, we
can omit the normalization constants as we will be able to infer them
afterwards from the unnormalized posterior parameters. Knowing that the
posterior is again a Dirichlet distribution saves us a lot of tedious work.
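To make the likelihood concrete, here is a minimal Python sketch; the vocabulary size, the counts x and the candidate π below are made up purely for illustration. It evaluates both the full Multinomial likelihood and the unnormalized product used above.

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical bag-of-words counts for a 3-word vocabulary (illustrative only).
x = np.array([2, 5, 3])           # observed counts x_i, n = 10
pi = np.array([0.2, 0.5, 0.3])    # a candidate parameter vector π

# Full Multinomial likelihood, including the normalization constant ...
likelihood = multinomial.pmf(x, n=x.sum(), p=pi)

# ... and the unnormalized version ∏_i π_i^{x_i} used above.
unnormalized = np.prod(pi ** x)

print(likelihood, unnormalized)
```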
Prior
Much like the model name would suggest, our prior will be the Dirichlet distribution,
which defines a distribution over the probability simplex of the Multinomial's m parameters.
The prior has the form
$$p(\pi \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^m \pi_i^{\alpha_i - 1}.$$
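As a quick illustration (the hyperparameters α below are an arbitrary choice), we can evaluate and sample this prior with SciPy:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])   # arbitrary symmetric hyperparameters

# Density of the prior at a point on the probability simplex ...
pi = np.array([0.2, 0.5, 0.3])
print(dirichlet.pdf(pi, alpha))

# ... and a few draws from it; each row sums to 1.
print(dirichlet.rvs(alpha, size=3))
```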
Posterior
Multiplying the likelihood by the prior will directly give us the shape of the posterior
because of the conjugacy. We don’t have to care about the normalizing constant. As a result, we obtain
$$
\begin{aligned}
p(\pi \mid D) &\propto p(D \mid \pi)\, p(\pi \mid \alpha) \\
&= \prod_{i=1}^m \pi_i^{x_i} \prod_{i=1}^m \pi_i^{\alpha_i - 1} \\
&\propto \prod_{i=1}^m \pi_i^{\alpha_i + x_i - 1} \\
&\propto \mathrm{Dir}(\pi \mid \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_m + x_m).
\end{aligned}
$$
We can write this more succinctly as $\mathrm{Dir}(\pi \mid \alpha + x)$, where x is the vector of counts of the observed data D.
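In code the conjugate update is just a vector addition. A small sketch, again with made-up prior hyperparameters and counts:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])   # prior hyperparameters (illustrative)
x = np.array([2, 5, 3])             # counts observed in D

# Conjugacy: the posterior is again Dirichlet, with parameters α_i + x_i.
alpha_post = alpha + x
posterior = dirichlet(alpha_post)

print(alpha_post)          # [4. 7. 5.]
print(posterior.mean())    # posterior mean: (α_i + x_i) / Σ_j (α_j + x_j)
```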
MAP estimate of the parameters
Since we have our posterior, we can take a small detour and compute the
maximum a posteriori (MAP) estimate of the parameters, which is simply the mode
of the posterior (its maximum). We can do this similarly to the previous article and use Lagrange multipliers to enforce the constraint that $\sum_{i=1}^m \pi_i = 1$. Since the Dirichlet distribution is again a member of the exponential family,
we differentiate the log posterior, which is just the log likelihood plus the log prior (up to an additive constant)
$$\log p(\pi \mid D) \propto \log p(D \mid \pi) + \log p(\pi \mid \alpha).$$
The Lagrangian then has the following form
$$\mathcal{L}(\pi, \lambda) = \sum_{i=1}^m x_i \log \pi_i + \sum_{i=1}^m (\alpha_i - 1) \log \pi_i + \lambda \left(1 - \sum_{i=1}^m \pi_i\right).$$
Same as before, we differentiate the Lagrangian with respect to $\pi_i$
$$\frac{\partial \mathcal{L}(\pi, \lambda)}{\partial \pi_i} = \frac{x_i}{\pi_i} + \frac{\alpha_i - 1}{\pi_i} - \lambda = \frac{x_i + \alpha_i - 1}{\pi_i} - \lambda$$
and set it equal to zero
$$
\begin{aligned}
0 &= \frac{x_i + \alpha_i - 1}{\pi_i} - \lambda \\
\lambda &= \frac{x_i + \alpha_i - 1}{\pi_i} \\
\pi_i &= \frac{x_i + \alpha_i - 1}{\lambda}.
\end{aligned}
$$
Finally, we can apply the same trick as before and solve for λ
$$
\begin{aligned}
\pi_i &= \frac{x_i + \alpha_i - 1}{\lambda} \\
\sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i + \alpha_i - 1}{\lambda} \\
1 &= \frac{1}{\lambda} \sum_{i=1}^m (x_i + \alpha_i - 1) \\
\lambda &= \sum_{i=1}^m (x_i + \alpha_i - 1) \\
&= n - m + \sum_{i=1}^m \alpha_i.
\end{aligned}
$$
We can plug this back in to get the MAP estimate
$$\pi_i = \frac{x_i + \alpha_i - 1}{n + \left(\sum_{i=1}^m \alpha_i\right) - m}.$$
Comparing this with the MLE estimate, which was
$$\pi_i = \frac{x_i}{n}$$
we can see how the concentration parameter α affects the
estimate. If we were to set a uniform prior with $\alpha_i = 1$, we would
recover the original MLE estimate.
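Here is a small numerical check of both estimates, using the same illustrative counts and hyperparameters as above; with $\alpha_i = 1$ the two coincide.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # prior hyperparameters (illustrative)
x = np.array([2, 5, 3])             # observed counts
n, m = x.sum(), len(x)

# MAP estimate: π_i = (x_i + α_i − 1) / (n + Σ_j α_j − m)
pi_map = (x + alpha - 1) / (n + alpha.sum() - m)

# MLE: π_i = x_i / n; setting α_i = 1 in the MAP recovers this.
pi_mle = x / n

print(pi_map, pi_map.sum())   # the MAP estimate still sums to 1
print(pi_mle)
```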
Posterior predictive
The posterior predictive distribution gives us a distribution over the possible
outcomes while taking into account our uncertainty about the parameters, as captured by
the posterior distribution. For a general model with an outcome X and a parameter
vector θ, the posterior predictive is given by the following
$$p(X \mid D) = \int p(X \mid \theta, D)\, p(\theta \mid D)\, d\theta.$$
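This integral can also be approximated numerically, which is a useful sanity check on the closed form we derive below. A minimal Monte Carlo sketch, assuming the illustrative posterior Dir(4, 7, 5) from the earlier example and a hypothetical new observation with four trials:

```python
import numpy as np
from math import factorial
from scipy.stats import dirichlet

alpha_post = np.array([4.0, 7.0, 5.0])   # illustrative posterior Dir(α + x)
x_new = np.array([1, 2, 1])              # hypothetical new observation, n = 4 trials

# Multinomial coefficient n! / (x_1! ... x_m!) of the new observation.
coef = factorial(x_new.sum()) / np.prod([factorial(k) for k in x_new])

# p(X | D) = ∫ p(X | π) p(π | D) dπ  ≈  average of p(X | π) over posterior draws of π.
samples = dirichlet.rvs(alpha_post, size=200_000, random_state=0)
mc_estimate = coef * np.prod(samples ** x_new, axis=1).mean()
print(mc_estimate)
```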
Before we can integrate this, let us introduce a small
trick. For any
$\theta = (\theta_1, \ldots, \theta_m)$ let us define
$\theta_{\neg j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_m)$, that is, all $\theta_i$ except for $\theta_j$.
Using this we can write the marginal $p(\theta_j)$ as
$$\int p(\theta_j, \theta_{\neg j})\, d\theta_{\neg j} = p(\theta_j).$$
The posterior predictive
$$p(X = j \mid D) = \int p(X = j \mid \theta)\, p(\theta \mid D)\, d\theta$$
can then be re-written using this trick as a double integral
$$\int_{\theta_j} \int_{\theta_{\neg j}} p(X = j \mid \theta)\, p(\theta \mid D)\, d\theta_{\neg j}\, d\theta_j.$$
Posterior predictive for a single-trial Dirichlet-Categorical
If we’re considering a single-trial Multinomial (Multinoulli) we have $p(X = j \mid \pi) = \pi_j$, which is independent of
$\pi_{\neg j}$, simplifying the above expression to
$$\int_{\pi_j} \pi_j \int_{\pi_{\neg j}} p(\pi \mid D)\, d\pi_{\neg j}\, d\pi_j.$$
Now applying the marginalization trick we get $\int_{\pi_{\neg j}} p(\pi \mid D)\, d\pi_{\neg j} = p(\pi_j \mid D)$ and our posterior predictive takes the
form
$$\int_{\pi_j} \pi_j\, p(\pi_j \mid D)\, d\pi_j.$$
Looking more closely at the formula, we can see this is the expectation of
$\pi_j$ under the posterior, that is
$$\int_{\pi_j} \pi_j\, p(\pi_j \mid D)\, d\pi_j = \mathbb{E}[\pi_j \mid D] = \frac{\alpha_j + x_j}{\sum_{i=1}^m (\alpha_i + x_i)} = \frac{\alpha_j + x_j}{\alpha_0 + N}$$
where $\alpha_0 = \sum_{i=1}^m \alpha_i$ and $N = \sum_{i=1}^m x_i$.
Repeating the result one more time for clarity, the posterior predictive
for a single trial Multinomial (Multinoulli) is given by
$$p(X = j \mid D) = \frac{\alpha_j + x_j}{\alpha_0 + N}.$$
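In code this is a one-liner; with the same illustrative numbers as before:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # prior hyperparameters (illustrative)
x = np.array([2, 5, 3])             # observed counts in D

alpha0, N = alpha.sum(), x.sum()

# Single-trial posterior predictive: p(X = j | D) = (α_j + x_j) / (α_0 + N)
pred = (alpha + x) / (alpha0 + N)
print(pred, pred.sum())   # a proper probability vector over the m outcomes
```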
Posterior predictive for a general multi-trial Dirichlet-Multinomial
Generalizing the posterior predictive to a Dirichlet-Multinomial model with
multiple trials is going to be a little bit more work. Let us begin by writing
the posterior predictive in its full form (note we drop the conditioning on D
in the likelihood for brevity, and because it is not needed). To avoid notation
clashes, let us replace the posterior parameter $\alpha + x$ by
$\alpha'$, so we'll write $\mathrm{Dir}(\alpha')$ and $\alpha_i'$
in place of $\mathrm{Dir}(\alpha + x)$ and $\alpha_i + x_i$.
$$
\begin{aligned}
p(X \mid D) &= \int p(X \mid \pi)\, p(\pi \mid D)\, d\pi \\
&= \int \mathrm{Mult}(X \mid \pi)\, \mathrm{Dir}(\pi \mid \alpha')\, d\pi \\
&= \int \left( \frac{n!}{x_1! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i} \right) \left( \frac{1}{B(\alpha')} \prod_{i=1}^m \pi_i^{\alpha_i' - 1} \right) d\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\alpha')} \int \prod_{i=1}^m \pi_i^{x_i} \prod_{i=1}^m \pi_i^{\alpha_i' - 1}\, d\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\alpha')} \int \prod_{i=1}^m \pi_i^{x_i + \alpha_i' - 1}\, d\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{B(\alpha' + x)}{B(\alpha')}
\end{aligned}
$$
where in the last equality we made use of the fact that the integral of an
unnormalized Dirichlet density is $B(\alpha)$. Let us repeat
the definition of $B(\alpha)$ again, that is
$$B(\alpha) = \frac{\prod_{i=1}^m \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^m \alpha_i\right)}$$
and plugging this back into the formula we computed
$$p(X \mid D) = \frac{n!}{x_1! \cdots x_m!} \frac{B(\alpha' + x)}{B(\alpha')} = \frac{n!}{x_1! \cdots x_m!} \cdot \frac{\Gamma\left(\sum_{i=1}^m \alpha_i'\right)}{\prod_{i=1}^m \Gamma(\alpha_i')} \cdot \frac{\prod_{i=1}^m \Gamma(\alpha_i' + x_i)}{\Gamma\left(\sum_{i=1}^m (\alpha_i' + x_i)\right)}$$
To move forward, we need to introduce a more general form of the multinomial distribution
which allows for non-integer counts. All it comes down to is replacing the factorials with
gamma functions, that is, instead of
$$p(x \mid \pi, n) = \frac{n!}{x_1! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i}$$
we write
$$p(x \mid \pi, n) = \frac{\Gamma\left(\sum_{i=1}^m x_i + 1\right)}{\prod_{i=1}^m \Gamma(x_i + 1)} \prod_{i=1}^m \pi_i^{x_i}.$$
Since only the normalizing constant changed, we can plug it back into our posterior predictive formula
$$p(X \mid D) = \frac{\Gamma\left(\sum_{i=1}^m x_i + 1\right)}{\prod_{i=1}^m \Gamma(x_i + 1)} \cdot \frac{\Gamma\left(\sum_{i=1}^m \alpha_i'\right)}{\prod_{i=1}^m \Gamma(\alpha_i')} \cdot \frac{\prod_{i=1}^m \Gamma(\alpha_i' + x_i)}{\Gamma\left(\sum_{i=1}^m (\alpha_i' + x_i)\right)}$$
which, although ugly, is the posterior predictive distribution in closed form :)
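Because the factorials and Beta functions overflow quickly, a practical sketch works with log-Gamma functions. The helper below is hypothetical (not from the article) and simply transcribes the closed form; the numbers are the same illustrative ones used above. Recent SciPy versions also ship a scipy.stats.dirichlet_multinomial distribution that implements the same pmf directly.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x_new, alpha_post):
    """Log posterior predictive of a new count vector x_new under a Dir(α') posterior.

    Transcribes  log p(X | D) = log Γ(Σ x_i + 1) − Σ_i log Γ(x_i + 1)
                              + log B(α' + x) − log B(α'),
    with  log B(α) = Σ_i log Γ(α_i) − log Γ(Σ_i α_i).
    """
    x_new = np.asarray(x_new, dtype=float)
    alpha_post = np.asarray(alpha_post, dtype=float)

    log_coef = gammaln(x_new.sum() + 1) - gammaln(x_new + 1).sum()
    log_beta_post = gammaln(alpha_post).sum() - gammaln(alpha_post.sum())
    log_beta_updated = gammaln(alpha_post + x_new).sum() - gammaln((alpha_post + x_new).sum())
    return log_coef + log_beta_updated - log_beta_post

# Same illustrative numbers as before: posterior α' = (4, 7, 5), new observation with 4 trials.
# The result should agree with the Monte Carlo estimate from the earlier sketch.
print(np.exp(dirichlet_multinomial_logpmf([1, 2, 1], [4.0, 7.0, 5.0])))
```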
If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.