In the previous article we derived a maximum likelihood estimate (MLE) for the
parameters of a Multinomial distribution. This time
we’re going to compute the full posterior of the Dirichlet-Categorical model as
well as derive the posterior predictive distribution. This will close our
exploration of the Bag of Words model.
Likelihood
As in the previous article, our likelihood will be defined by a
Multinomial distribution, that is
$$p(D \mid \pi) \propto \prod_{i=1}^m \pi_i^{x_i}.$$
Since the Dirichlet distribution is a conjugate prior to the Multinomial, we
can omit the normalization constants as we will be able to infer them
afterwards from the unnormalized posterior parameters. Knowing that the
posterior is again a Dirichlet distribution saves us a lot of tedious work.
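To make the likelihood concrete, here is a minimal Python sketch; the vocabulary size, the counts x and the candidate π below are made up purely for illustration. It evaluates both the full Multinomial likelihood and the unnormalized product used above.

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical bag-of-words counts for a 3-word vocabulary (illustrative only).
x = np.array([2, 5, 3])           # observed counts x_i, n = 10
pi = np.array([0.2, 0.5, 0.3])    # a candidate parameter vector π

# Full Multinomial likelihood, including the normalization constant ...
likelihood = multinomial.pmf(x, n=x.sum(), p=pi)

# ... and the unnormalized version ∏_i π_i^{x_i} used above.
unnormalized = np.prod(pi ** x)

print(likelihood, unnormalized)
```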
Prior
Much like the model name would suggest, our prior will be the Dirichlet distribution,
which defines a distribution over the probability simplex of the Multinomial's m parameters.
The prior has the form
$$p(\pi \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^m \pi_i^{\alpha_i - 1}.$$
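As a quick illustration (the hyperparameters α below are an arbitrary choice), we can evaluate and sample this prior with SciPy:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])   # arbitrary symmetric hyperparameters

# Density of the prior at a point on the probability simplex ...
pi = np.array([0.2, 0.5, 0.3])
print(dirichlet.pdf(pi, alpha))

# ... and a few draws from it; each row sums to 1.
print(dirichlet.rvs(alpha, size=3))
```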
Posterior
Multiplying the likelihood by the prior will directly give us the shape of the posterior
because of the conjugacy. We don’t have to care about the normalizing constant. As a result, we obtain
$$
\begin{aligned}
p(\pi \mid D) &\propto p(D \mid \pi)\, p(\pi \mid \alpha) \\
&= \prod_{i=1}^m \pi_i^{x_i} \prod_{i=1}^m \pi_i^{\alpha_i - 1} \\
&\propto \prod_{i=1}^m \pi_i^{\alpha_i + x_i - 1} \\
&\propto \mathrm{Dir}(\pi \mid \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_m + x_m).
\end{aligned}
$$
We can write this more succinctly as $\mathrm{Dir}(\pi \mid \alpha + x)$, where x is the vector of counts of the observed data D.
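In code the conjugate update is just a vector addition. A small sketch, again with made-up prior hyperparameters and counts:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 2.0, 2.0])   # prior hyperparameters (illustrative)
x = np.array([2, 5, 3])             # counts observed in D

# Conjugacy: the posterior is again Dirichlet, with parameters α_i + x_i.
alpha_post = alpha + x
posterior = dirichlet(alpha_post)

print(alpha_post)          # [4. 7. 5.]
print(posterior.mean())    # posterior mean: (α_i + x_i) / Σ_j (α_j + x_j)
```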
MAP estimate of the parameters
Since we have our posterior, we can take a small detour and compute the
maximum a posteriori (MAP) estimate of the parameters, which is simply the mode
of the posterior (its maximum). We can do this similarly to the previous article and use Lagrange multipliers to enforce the constraint that $\sum_{i=1}^m \pi_i = 1$. Since the Dirichlet distribution is again a member of the exponential family,
we differentiate the log posterior, which is just the log likelihood plus the log prior (up to an additive constant)
$$\log p(\pi \mid D) \propto \log p(D \mid \pi) + \log p(\pi \mid \alpha).$$
The Lagrangian then has the following form
$$\mathcal{L}(\pi, \lambda) = \sum_{i=1}^m x_i \log \pi_i + \sum_{i=1}^m (\alpha_i - 1) \log \pi_i + \lambda \left(1 - \sum_{i=1}^m \pi_i\right).$$
Same as before, we differentiate the Lagrangian with respect to $\pi_i$
$$\frac{\partial \mathcal{L}(\pi, \lambda)}{\partial \pi_i} = \frac{x_i}{\pi_i} + \frac{\alpha_i - 1}{\pi_i} - \lambda = \frac{x_i + \alpha_i - 1}{\pi_i} - \lambda$$
and set it equal to zero
$$
\begin{aligned}
0 &= \frac{x_i + \alpha_i - 1}{\pi_i} - \lambda \\
\lambda &= \frac{x_i + \alpha_i - 1}{\pi_i} \\
\pi_i &= \frac{x_i + \alpha_i - 1}{\lambda}.
\end{aligned}
$$
Finally, we can apply the same trick as before and solve for λ
$$
\begin{aligned}
\pi_i &= \frac{x_i + \alpha_i - 1}{\lambda} \\
\sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i + \alpha_i - 1}{\lambda} \\
1 &= \frac{1}{\lambda} \sum_{i=1}^m (x_i + \alpha_i - 1) \\
\lambda &= \sum_{i=1}^m (x_i + \alpha_i - 1) \\
&= n - m + \sum_{i=1}^m \alpha_i.
\end{aligned}
$$
We can plug this back in to get the MAP estimate
$$\pi_i = \frac{x_i + \alpha_i - 1}{n + \left(\sum_{i=1}^m \alpha_i\right) - m}.$$
Comparing this with the MLE estimate, which was
$$\pi_i = \frac{x_i}{n}$$
we can see how the concentration parameter α affects the
estimate. If we were to set a uniform prior with $\alpha_i = 1$, we would
recover the original MLE estimate.
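Here is a small numerical check of both estimates, using the same illustrative counts and hyperparameters as above; with $\alpha_i = 1$ the two coincide.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # prior hyperparameters (illustrative)
x = np.array([2, 5, 3])             # observed counts
n, m = x.sum(), len(x)

# MAP estimate: π_i = (x_i + α_i − 1) / (n + Σ_j α_j − m)
pi_map = (x + alpha - 1) / (n + alpha.sum() - m)

# MLE: π_i = x_i / n; setting α_i = 1 in the MAP recovers this.
pi_mle = x / n

print(pi_map, pi_map.sum())   # the MAP estimate still sums to 1
print(pi_mle)
```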
Posterior predictive
The posterior predictive distribution gives us a distribution over the possible
outcomes while taking into account our uncertainty about the parameters, as captured by
the posterior distribution. For a general model with an outcome X and a parameter
vector θ, the posterior predictive is given by the following
$$p(X \mid D) = \int p(X \mid \theta, D)\, p(\theta \mid D)\, d\theta.$$
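This integral can also be approximated numerically, which is a useful sanity check on the closed form we derive below. A minimal Monte Carlo sketch, assuming the illustrative posterior Dir(4, 7, 5) from the earlier example and a hypothetical new observation with four trials:

```python
import numpy as np
from math import factorial
from scipy.stats import dirichlet

alpha_post = np.array([4.0, 7.0, 5.0])   # illustrative posterior Dir(α + x)
x_new = np.array([1, 2, 1])              # hypothetical new observation, n = 4 trials

# Multinomial coefficient n! / (x_1! ... x_m!) of the new observation.
coef = factorial(x_new.sum()) / np.prod([factorial(k) for k in x_new])

# p(X | D) = ∫ p(X | π) p(π | D) dπ  ≈  average of p(X | π) over posterior draws of π.
samples = dirichlet.rvs(alpha_post, size=200_000, random_state=0)
mc_estimate = coef * np.prod(samples ** x_new, axis=1).mean()
print(mc_estimate)
```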
Before we can integrate this, let us introduce a small
trick. For any
$\theta = (\theta_1, \ldots, \theta_m)$ let us define
$\theta_{\neg j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_m)$, that is, all $\theta_i$ except for $\theta_j$.
Using this we can write the marginal $p(\theta_j)$ as
$$\int p(\theta_j, \theta_{\neg j})\, d\theta_{\neg j} = p(\theta_j).$$
The posterior predictive
$$p(X = j \mid D) = \int p(X = j \mid \theta)\, p(\theta \mid D)\, d\theta$$
can then be re-written using this trick as a double integral
$$\int_{\theta_j} \int_{\theta_{\neg j}} p(X = j \mid \theta)\, p(\theta \mid D)\, d\theta_{\neg j}\, d\theta_j.$$
Posterior predictive for a single-trial Dirichlet-Categorical
If we’re considering a single-trial Multinomial (Multinoulli) we have $p(X = j \mid \pi) = \pi_j$, which is independent of
$\pi_{\neg j}$, simplifying the above expression to
$$\int_{\pi_j} \pi_j \int_{\pi_{\neg j}} p(\pi \mid D)\, d\pi_{\neg j}\, d\pi_j.$$
Now applying the marginalization trick we get $\int_{\pi_{\neg j}} p(\pi \mid D)\, d\pi_{\neg j} = p(\pi_j \mid D)$ and our posterior predictive takes the
form
$$\int_{\pi_j} \pi_j\, p(\pi_j \mid D)\, d\pi_j.$$
Looking more closely at the formula, we can see this is the expectation of
$\pi_j$ under the posterior, that is
$$\int_{\pi_j} \pi_j\, p(\pi_j \mid D)\, d\pi_j = \mathbb{E}[\pi_j \mid D] = \frac{\alpha_j + x_j}{\sum_{i=1}^m (\alpha_i + x_i)} = \frac{\alpha_j + x_j}{\alpha_0 + N}$$
where $\alpha_0 = \sum_{i=1}^m \alpha_i$ and $N = \sum_{i=1}^m x_i$.
Repeating the result one more time for clarity, the posterior predictive
for a single trial Multinomial (Multinoulli) is given by
$$p(X = j \mid D) = \frac{\alpha_j + x_j}{\alpha_0 + N}.$$
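In code this is a one-liner; with the same illustrative numbers as before:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # prior hyperparameters (illustrative)
x = np.array([2, 5, 3])             # observed counts in D

alpha0, N = alpha.sum(), x.sum()

# Single-trial posterior predictive: p(X = j | D) = (α_j + x_j) / (α_0 + N)
pred = (alpha + x) / (alpha0 + N)
print(pred, pred.sum())   # a proper probability vector over the m outcomes
```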
Posterior predictive for a general multi-trial Dirichlet-Multinomial
Generalizing the posterior predictive to a Dirichlet-Multinomial model with
multiple trials is going to be a little bit more work. Let us begin by writing
the posterior predictive in its full form (note we drop the conditioning on D
in the likelihood for brevity, and because it is not needed). To avoid notation
clashes, let us replace the posterior parameter $\alpha + x$ by
$\alpha'$, so we'll write $\mathrm{Dir}(\alpha')$ and $\alpha_i'$
in place of $\mathrm{Dir}(\alpha + x)$ and $\alpha_i + x_i$.
$$
\begin{aligned}
p(X \mid D) &= \int p(X \mid \pi)\, p(\pi \mid D)\, d\pi \\
&= \int \mathrm{Mult}(X \mid \pi)\, \mathrm{Dir}(\pi \mid \alpha')\, d\pi \\
&= \int \left( \frac{n!}{x_1! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i} \right) \left( \frac{1}{B(\alpha')} \prod_{i=1}^m \pi_i^{\alpha_i' - 1} \right) d\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\alpha')} \int \prod_{i=1}^m \pi_i^{x_i} \prod_{i=1}^m \pi_i^{\alpha_i' - 1}\, d\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{1}{B(\alpha')} \int \prod_{i=1}^m \pi_i^{x_i + \alpha_i' - 1}\, d\pi \\
&= \frac{n!}{x_1! \cdots x_m!} \frac{B(\alpha' + x)}{B(\alpha')}
\end{aligned}
$$
where in the last equality we made use of the fact that the integral of an
unnormalized Dirichlet density is $B(\alpha)$. Let us repeat
the definition of $B(\alpha)$ again, that is
$$B(\alpha) = \frac{\prod_{i=1}^m \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^m \alpha_i\right)}$$
and plugging this back into the formula we computed
$$p(X \mid D) = \frac{n!}{x_1! \cdots x_m!} \frac{B(\alpha' + x)}{B(\alpha')} = \frac{n!}{x_1! \cdots x_m!} \cdot \frac{\Gamma\left(\sum_{i=1}^m \alpha_i'\right)}{\prod_{i=1}^m \Gamma(\alpha_i')} \cdot \frac{\prod_{i=1}^m \Gamma(\alpha_i' + x_i)}{\Gamma\left(\sum_{i=1}^m (\alpha_i' + x_i)\right)}$$
To move forward, we need to introduce a more general form of the multinomial distribution
which allows for non-integer counts. All it comes down to is replacing the factorials with
gamma functions, that is, instead of
$$p(x \mid \pi, n) = \frac{n!}{x_1! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i}$$
we write
$$p(x \mid \pi, n) = \frac{\Gamma\left(\sum_{i=1}^m x_i + 1\right)}{\prod_{i=1}^m \Gamma(x_i + 1)} \prod_{i=1}^m \pi_i^{x_i}.$$
Since only the normalizing constant changed, we can plug it back into our posterior predictive formula
$$p(X \mid D) = \frac{\Gamma\left(\sum_{i=1}^m x_i + 1\right)}{\prod_{i=1}^m \Gamma(x_i + 1)} \cdot \frac{\Gamma\left(\sum_{i=1}^m \alpha_i'\right)}{\prod_{i=1}^m \Gamma(\alpha_i')} \cdot \frac{\prod_{i=1}^m \Gamma(\alpha_i' + x_i)}{\Gamma\left(\sum_{i=1}^m (\alpha_i' + x_i)\right)}$$
which, although ugly, is the posterior predictive distribution in closed form :)
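Because the factorials and Beta functions overflow quickly, a practical sketch works with log-Gamma functions. The helper below is hypothetical (not from the article) and simply transcribes the closed form; the numbers are the same illustrative ones used above. Recent SciPy versions also ship a scipy.stats.dirichlet_multinomial distribution that implements the same pmf directly.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x_new, alpha_post):
    """Log posterior predictive of a new count vector x_new under a Dir(α') posterior.

    Transcribes  log p(X | D) = log Γ(Σ x_i + 1) − Σ_i log Γ(x_i + 1)
                              + log B(α' + x) − log B(α'),
    with  log B(α) = Σ_i log Γ(α_i) − log Γ(Σ_i α_i).
    """
    x_new = np.asarray(x_new, dtype=float)
    alpha_post = np.asarray(alpha_post, dtype=float)

    log_coef = gammaln(x_new.sum() + 1) - gammaln(x_new + 1).sum()
    log_beta_post = gammaln(alpha_post).sum() - gammaln(alpha_post.sum())
    log_beta_updated = gammaln(alpha_post + x_new).sum() - gammaln((alpha_post + x_new).sum())
    return log_coef + log_beta_updated - log_beta_post

# Same illustrative numbers as before: posterior α' = (4, 7, 5), new observation with 4 trials.
# The result should agree with the Monte Carlo estimate from the earlier sketch.
print(np.exp(dirichlet_multinomial_logpmf([1, 2, 1], [4.0, 7.0, 5.0])))
```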
If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.