Posterior Predictive Distribution for the Dirichlet-Categorical Model (Bag of Words)

7 min read • Published: December 04, 2018

In the previous article we derived a maximum likelihood estimate (MLE) for the parameters of a Multinomial distribution. This time we’re going to compute the full posterior of the Dirichlet-Categorical model as well as derive the posterior predictive distribution. This will close our exploration of the Bag of Words model.


As in the previous article, our likelihood will be defined by a Multinomial distribution, that is

$$p(D|\boldsymbol\pi) \propto \prod_{i=1}^m \pi_i^{x_i}.$$

Since the Dirichlet distribution is a conjugate prior to the Multinomial, we can omit the normalization constants as we will be able to infer them afterwards from the unnormalized posterior parameters. Knowing that the posterior is again a Dirichlet distribution saves us a lot of tedious work.


As the model name suggests, our prior will be the Dirichlet distribution, which defines a distribution over the probability simplex of the Multinomial's $m$ parameters. The prior has the form

$$p(\boldsymbol\pi|\boldsymbol\alpha) = \frac{1}{B(\boldsymbol\alpha)} \prod_{i=1}^m \pi_i^{\alpha_i - 1}.$$


Multiplying the likelihood by the prior will directly give us the shape of the posterior because of the conjugacy. We don't have to care about the normalizing constant. As a result, we obtain

$$
\begin{aligned}
p(\boldsymbol\pi | D) &\propto p(D|\boldsymbol\pi) p(\boldsymbol\pi | \boldsymbol\alpha) \\
&\propto \prod_{i=1}^m \pi_i^{x_i} \prod_{i=1}^m \pi_i^{\alpha_i - 1} \\
&= \prod_{i=1}^m \pi_i^{\alpha_i + x_i - 1} \\
&\propto \mathrm{Dir}(\boldsymbol\pi | \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_m + x_m)
\end{aligned}
$$

We can write this more succinctly as $\mathrm{Dir}(\boldsymbol\pi | \boldsymbol\alpha + \boldsymbol x)$ where $\boldsymbol x$ is the vector of counts of the observed data $D$.
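As a minimal sketch of this update, the posterior parameters are obtained by adding the observed counts to the prior concentrations. The three-word vocabulary, the counts, and the function name `dirichlet_posterior` are made up for illustration and do not come from the article.

```python
def dirichlet_posterior(alpha, counts):
    """Return the Dirichlet posterior parameters alpha_i + x_i."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + x for a, x in zip(alpha, counts)]


# Hypothetical example values: a vocabulary of three words.
alpha = [1.0, 1.0, 1.0]  # uniform prior concentrations
counts = [3, 0, 2]       # observed word counts x
posterior_alpha = dirichlet_posterior(alpha, counts)
print(posterior_alpha)
```

No normalizing constant has to be computed here: the updated parameters fully determine the posterior.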

MAP estimate of the parameters

Since we have our posterior, we can take a small detour and compute the maximum a posteriori (MAP) estimate of the parameters, which is simply the mode of the posterior (its maximum). We can do this similarly to the previous article and use Lagrange multipliers to enforce the constraint that $\sum_{i=1}^m \pi_i = 1$. Since the Dirichlet distribution is again of the exponential family, we differentiate the log posterior, which, up to an additive constant, is the log likelihood plus the log prior

$$\log p(\boldsymbol\pi | D) = \log p(D | \boldsymbol\pi) + \log p(\boldsymbol\pi | \boldsymbol\alpha) + \text{const}.$$

The Lagrangian then has the following form

$$L(\boldsymbol\pi, \lambda) = \sum_{i=1}^m x_i \log \pi_i + \sum_{i=1}^m (\alpha_i - 1) \log \pi_i + \lambda \left( 1 - \sum_{i=1}^m \pi_i \right).$$

Same as before, we differentiate the Lagrangian with respect to $\pi_i$

$$\frac{\partial}{\partial\pi_i} L(\boldsymbol\pi, \lambda) = \frac{x_i}{\pi_i} + \frac{\alpha_i - 1}{\pi_i} - \lambda = \frac{x_i + \alpha_i - 1}{\pi_i} - \lambda$$

and set it equal to zero

$$
\begin{aligned}
0 &= \frac{x_i + \alpha_i - 1}{\pi_i} - \lambda \\
\lambda &= \frac{x_i + \alpha_i - 1}{\pi_i} \\
\pi_i &= \frac{x_i + \alpha_i - 1}{\lambda}.
\end{aligned}
$$

Finally, we can apply the same trick as before and solve for $\lambda$

$$
\begin{aligned}
\pi_i &= \frac{x_i + \alpha_i - 1}{\lambda} \\
\sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i + \alpha_i - 1}{\lambda} \\
1 &= \sum_{i=1}^m \frac{x_i + \alpha_i - 1}{\lambda} \\
\lambda &= \sum_{i=1}^m \left( x_i + \alpha_i - 1 \right) \\
\lambda &= n - m + \sum_{i=1}^m \alpha_i.
\end{aligned}
$$

We can plug this back in to get the MAP estimate

$$\pi_i = \frac{x_i + \alpha_i - 1}{n + \left(\sum_{i=1}^m \alpha_i \right) - m}.$$

Comparing this with the MLE, which was

$$\pi_i = \frac{x_i}{n}$$

we can see that the concentration parameter $\boldsymbol\alpha$ shifts the estimate: each $\alpha_i - 1$ acts like a pseudo-count added to $x_i$. If we were to set a uniform prior with $\alpha_i = 1$, we would recover the original MLE.
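The comparison between the two estimates can be checked numerically. The sketch below uses made-up counts and hypothetical function names (`mle_estimate`, `map_estimate`), and it assumes every $\alpha_i + x_i \geq 1$ so that the MAP formula yields valid probabilities.

```python
def mle_estimate(counts):
    """Maximum likelihood estimate x_i / n."""
    n = sum(counts)
    return [x / n for x in counts]


def map_estimate(alpha, counts):
    """MAP estimate (x_i + alpha_i - 1) / (n + sum(alpha) - m).

    Assumes every alpha_i + x_i >= 1; otherwise the formula would
    produce negative values.
    """
    numer = [x + a - 1 for x, a in zip(counts, alpha)]
    denom = sum(numer)  # equals n + sum(alpha) - m
    if min(numer) < 0 or denom <= 0:
        raise ValueError("MAP formula needs alpha_i + x_i >= 1 for all i")
    return [v / denom for v in numer]


counts = [3, 0, 2]                               # hypothetical word counts
print(mle_estimate(counts))                      # plain relative frequencies
print(map_estimate([1.0, 1.0, 1.0], counts))     # uniform prior: same as MLE
print(map_estimate([2.0, 2.0, 2.0], counts))     # alpha_i = 2: one pseudo-count per word
```

With $\alpha_i = 2$ the word that was never observed gets a non-zero estimate, whereas the MLE assigns it zero.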

Posterior predictive

The posterior predictive distribution gives us a distribution over the possible outcomes while taking into account our uncertainty in the parameters given by the posterior distribution. For a general model with an outcome $X$ and a parameter vector $\boldsymbol\theta$ the posterior predictive is given by the following

$$p(X|D) = \int p(X | \boldsymbol\theta, D) p(\boldsymbol\theta | D)\ d\boldsymbol\theta$$

Before we can integrate this, let us introduce a small trick. For any $\boldsymbol\theta = (\theta_1,\ldots,\theta_m)$ let us define $\theta_{\neg j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_m)$, that is all $\theta_i$ except for $\theta_j$. Using this we can write a marginal $p(\theta_j)$ as

$$\int p(\theta_j, \theta_{\neg j})\ d \theta_{\neg j} = p(\theta_j)$$

The posterior predictive

$$p(X = j | D) = \int p(X = j | \boldsymbol\theta) p(\boldsymbol\theta | D)\ d\boldsymbol\theta$$

can then be re-written using this trick as a double integral

$$\int_{\theta_j} \int_{\theta_{\neg j}} p(X = j | \boldsymbol\theta) p(\boldsymbol\theta | D)\ d\theta_{\neg j}\ d\theta_j.$$

Posterior predictive for a single-trial Dirichlet-Categorical

If we're considering a single-trial multinomial (multinoulli) we have $p(X = j | \boldsymbol\pi) = \pi_j$, which is independent of $\pi_{\neg j}$, simplifying the above expression to

$$\int_{\pi_j} \pi_j \int_{\pi_{\neg j}} p(\boldsymbol\pi | D)\ d\pi_{\neg j}\ d\pi_j.$$

Now applying the marginalization trick we get $\int_{\pi_{\neg j}} p(\boldsymbol\pi | D)\ d\pi_{\neg j} = p(\pi_j | D)$ and our posterior predictive has the form

$$\int_{\pi_j} \pi_j p(\pi_j | D)\ d\pi_j.$$

Looking more closely at the formula, we can see this is an expectation of $\pi_j$ under the posterior, that is

$$\int_{\pi_j} \pi_j p(\pi_j | D)\ d\pi_j = E[\pi_j | D] = \frac{\alpha_j + x_j}{\sum_{i=1}^m \left( \alpha_i + x_i \right)} = \frac{\alpha_j + x_j}{\alpha_0 + N}$$

where $\alpha_0 = \sum_{i=1}^m \alpha_i$ and $N = \sum_{i=1}^m x_i$. Repeating the result one more time for clarity, the posterior predictive for a single-trial Multinomial (Multinoulli) is given by

$$p(X=j | D) = \frac{\alpha_j + x_j}{\alpha_0 + N}$$
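A small sketch of this result, with made-up prior and counts and a hypothetical function name `posterior_predictive`, shows how the predictive probabilities are computed for every outcome at once.

```python
def posterior_predictive(alpha, counts):
    """Single-trial posterior predictive (alpha_j + x_j) / (alpha_0 + N)."""
    total = sum(alpha) + sum(counts)  # alpha_0 + N
    return [(a + x) / total for a, x in zip(alpha, counts)]


# Hypothetical example values: three words, uniform prior.
alpha = [1.0, 1.0, 1.0]
counts = [3, 0, 2]
probs = posterior_predictive(alpha, counts)
print(probs)
```

Note that the second word, which never appeared in the data, still receives non-zero predictive probability because of the prior.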

Posterior predictive for a general multi-trial Dirichlet-Multinomial

Generalizing the posterior predictive to a Dirichlet-Multinomial model with multiple trials is going to be a little bit more work. Let us begin by writing the posterior predictive in its full form (note we drop the conditioning on $D$ in the likelihood for brevity, since given $\boldsymbol\pi$ it is not needed). To avoid notation clashes, let us replace the posterior parameters $\boldsymbol\alpha + \boldsymbol x$ by $\boldsymbol\alpha'$, so we'll write $\mathrm{Dir}(\boldsymbol\alpha')$ and $\alpha_i'$ in place of $\mathrm{Dir}(\boldsymbol\alpha + \boldsymbol x)$ and $\alpha_i + x_i$. This frees $\boldsymbol x$ to denote the counts of the new observation $X$, with $n = \sum_{i=1}^m x_i$ trials.

$$
\begin{aligned}
p(X|D) &= \int p(X | \boldsymbol\pi) p(\boldsymbol\pi | D)\ d\boldsymbol\pi \\
&= \int \mathrm{Mult}(X | \boldsymbol\pi) \mathrm{Dir}(\boldsymbol\pi | \boldsymbol\alpha') \ d\boldsymbol\pi \\
&= \int \left(\frac{n!}{x_1! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i} \right) \left(\frac{1}{B(\boldsymbol\alpha')} \prod_{i=1}^m \pi_i^{\alpha_i' - 1} \right) \ d\boldsymbol\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\boldsymbol\alpha')} \int \prod_{i=1}^m \pi_i^{x_i} \prod_{i=1}^m \pi_i^{\alpha_i' - 1} \ d\boldsymbol\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\boldsymbol\alpha')} \int \prod_{i=1}^m \pi_i^{x_i + \alpha_i' - 1} \ d\boldsymbol\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\boldsymbol\alpha')} B(\boldsymbol\alpha' + \boldsymbol x)
\end{aligned}
$$

where in the last equality we used the fact that the integral of an unnormalized Dirichlet density with parameters $\boldsymbol\alpha$ is $B(\boldsymbol\alpha)$. Let us repeat the definition of $B(\boldsymbol\alpha)$, that is

$$B(\boldsymbol\alpha) = \frac{\prod_{i=1}^m \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^m \alpha_i\right)}$$

and plug it back into the formula we computed

$$
\begin{aligned}
p(X|D) &= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\boldsymbol\alpha')} B(\boldsymbol\alpha' + \boldsymbol x) \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{\Gamma\left(\sum_{i=1}^m \alpha_i'\right)}{\prod_{i=1}^m \Gamma(\alpha_i')} \frac{\prod_{i=1}^m \Gamma(\alpha_i' + x_i)}{\Gamma\left(\sum_{i=1}^m (\alpha_i' + x_i)\right)}
\end{aligned}
$$

To move forward, we need to introduce a more general form of the multinomial distribution which allows for non-integer counts. It comes down to replacing factorials with the gamma function, that is instead of

$$p(\boldsymbol x | \boldsymbol\pi, n) = \frac{n!}{x_1! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i}$$

we write

$$p(\boldsymbol x | \boldsymbol\pi, n) = \frac{\Gamma\left(\left(\sum_{i=1}^m x_i\right) + 1\right)}{\prod_{i=1}^m \Gamma(x_i + 1)} \prod_{i=1}^m \pi_i^{x_i}.$$

Since only the normalizing constant changed, we can plug it back into our posterior predictive formula

$$p(X|D) = \frac{\Gamma\left(\left(\sum_{i=1}^m x_i\right) + 1\right)}{\prod_{i=1}^m \Gamma(x_i + 1)} \frac{\Gamma\left(\sum_{i=1}^m \alpha_i'\right)}{\prod_{i=1}^m \Gamma(\alpha_i')} \frac{\prod_{i=1}^m \Gamma(\alpha_i' + x_i)}{\Gamma\left(\sum_{i=1}^m (\alpha_i' + x_i)\right)}$$

which, although ugly, is the posterior predictive distribution in closed form :)
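To make the closed form concrete, here is a minimal sketch that evaluates it in log space with `math.lgamma`, which avoids overflow in the gamma functions. The posterior parameters are made-up values and the function name `log_dirichlet_multinomial` is hypothetical. Two sanity checks follow: a single trial should reduce to $\alpha_j' / \sum_i \alpha_i'$, and the probabilities of all count vectors with a fixed number of trials should sum to one.

```python
from itertools import product
from math import exp, lgamma


def log_dirichlet_multinomial(x, alpha_post):
    """Log of the closed-form posterior predictive p(X | D).

    x          -- counts of the new observation
    alpha_post -- posterior parameters alpha' (prior plus observed counts)
    """
    n = sum(x)
    a0 = sum(alpha_post)
    # Generalized multinomial coefficient: Gamma(n + 1) / prod Gamma(x_i + 1)
    log_coef = lgamma(n + 1) - sum(lgamma(xi + 1) for xi in x)
    # 1 / B(alpha')
    log_inv_beta = lgamma(a0) - sum(lgamma(a) for a in alpha_post)
    # B(alpha' + x)
    log_beta_new = sum(lgamma(a + xi) for a, xi in zip(alpha_post, x)) - lgamma(a0 + n)
    return log_coef + log_inv_beta + log_beta_new


alpha_post = [4.0, 1.0, 3.0]  # hypothetical posterior parameters

# Check 1: one trial landing on the first word.
single = exp(log_dirichlet_multinomial([1, 0, 0], alpha_post))

# Check 2: sum over every count vector with n = 3 trials.
n = 3
total = sum(
    exp(log_dirichlet_multinomial(x, alpha_post))
    for x in product(range(n + 1), repeat=len(alpha_post))
    if sum(x) == n
)
print(single, total)
```

Working with logarithms is a design choice for numerical stability rather than part of the derivation; exponentiating at the end recovers the probability.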

