In the previous article we looked at the Beta-Bernoulli model. This time we’ll extend it to a model with multiple possible outcomes. We’ll also take a look at the Dirichlet, Categorical and Multinomial distributions.
After this, we’ll be quite close to implementing interesting models such as Latent Dirichlet Allocation (LDA). But for now, we have to understand the basics first.
Multinomial coefficients
Before we can dive into the Dirichlet-Categorical model we have to briefly look at the multinomial coefficient, which is the generalization of the binomial coefficient. First, here’s the definition of the binomial coefficient
$$ \binom{n}{k} = \frac{n!}{k! (n - k)!} $$
which represents the number of ways we can choose $k$ items out of $n$ total.
We can generalize this to more than two types of items using the multinomial coefficient defined as
$$ \binom{n}{k_1, k_2, \ldots, k_m} = \frac{n!}{k_1! k_2! \ldots k_m!} $$
which represents the number of ways we can split $n$ items into $m$ groups, with $k_1$ items in the first group, $k_2$ items in the second group, and so on.
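To make this concrete, here is a minimal Python sketch that computes the multinomial coefficient directly from the factorial definition above (the helper name `multinomial_coef` is just for illustration):

```python
from math import factorial

def multinomial_coef(ks):
    """Number of ways to split sum(ks) items into groups of sizes ks."""
    n = sum(ks)
    result = factorial(n)
    for k in ks:
        result //= factorial(k)  # divide out each group's internal orderings
    return result

# Splitting 6 items into groups of sizes 3, 2 and 1:
print(multinomial_coef([3, 2, 1]))  # 6! / (3! 2! 1!) = 60
```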
Categorical distribution
Now that we are comfortable with multinomial coefficients, let us continue with the generalization of the Bernoulli distribution, that is the Categorical distribution, denoted as $Cat(\boldsymbol{\pi})$, where $\boldsymbol\pi$ is a vector of probabilities for each possible outcome. The probability mass function (PMF) is simply
$$ p(x|\boldsymbol \pi) = \prod_{i=1}^k \pi_i^{I[x = i]} $$
where $I[x = i]$ is an indicator variable which evaluates to $1$ if $x=i$, or to $0$ otherwise. Note that we require $\sum_{i=1}^k \pi_i = 1$.
We can also re-formulate this using one-of-K encoding, where $\boldsymbol x$ is a vector in which exactly one element equals $1$ and the remaining elements equal $0$. The distribution then becomes
$$ p(x|\boldsymbol\pi) = \prod_{i=1}^k \pi_i^{x_i}. $$
An example of this would be a single roll of a die, where exactly one of the outcomes occurs, but each might have a different probability (an unfair die).
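As a quick illustration, here is how a single Categorical draw might look in NumPy; the probabilities below are made-up values for an unfair six-sided die:

```python
import numpy as np

rng = np.random.default_rng(0)

# Probabilities of each face of an unfair six-sided die; they must sum to 1.
pi = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# A single draw from Cat(pi): one face index in {0, ..., 5}.
x = rng.choice(len(pi), p=pi)

# The same draw in one-of-K encoding.
one_hot = np.zeros(len(pi), dtype=int)
one_hot[x] = 1
print(x, one_hot)
```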
Multinomial distribution
Having understood the Categorical distribution, we can now move to the generalization of the Binomial distribution to multiple outcomes, that is the Multinomial distribution. An easy way to think of it is $n$ rolls of a $k$-sided die.
- When $n = 1$ and $k = 2$ we have a Bernoulli distribution.
- When $n = 1$ and $k > 2$ we have a Categorical distribution.
- When $n > 1$ and $k = 2$ we have a Binomial distribution.
- And finally, when $n > 1$ and $k > 2$ we have a Multinomial distribution.
Of course, we can simply always use the Multinomial distribution, as it is the most general of the four. Its PMF is then simply
$$ p(\boldsymbol{x} | \boldsymbol{\pi}, n) = \binom{n}{x_1, x_2, \ldots, x_k} \prod_{i=1}^k \pi_i^{x_i} = \frac{n!}{x_1! x_2! \ldots x_k!} \prod_{i=1}^k \pi_i^{x_i} $$
In this case $\boldsymbol{x} = (x_1, \ldots, x_k)$ represents the number of times each outcome was observed, with $\sum_{i=1}^k x_i = n$, while again $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_k)$ represents the probabilities of each outcome.
An example of the Multinomial distribution is the Bag of Words model, which describes the number of occurrences of each word in a document. There are $k$ possible words in the dictionary and the document consists of $n$ words in total.
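To see the PMF in action, here is a small sketch using `scipy.stats.multinomial`; the counts and probabilities are again made-up values for an unfair die:

```python
from scipy.stats import multinomial

# An unfair six-sided die rolled n = 10 times.
pi = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
counts = [1, 0, 2, 1, 1, 5]  # observed counts per face, summing to 10

# Probability of observing exactly these counts.
print(multinomial.pmf(counts, n=10, p=pi))
```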
Dirichlet distribution
Lastly, let us consider the Dirichlet distribution, which is a generalization of the Beta distribution to more than two outcomes. The Dirichlet distribution is to the Categorical/Multinomial what the Beta is to the Bernoulli/Binomial.
A random vector $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_k)$ with $\sum_{i=1}^k \pi_i = 1$ and $\pi_i \in (0, 1)$ has a Dirichlet distribution with the probability density function (PDF)
$$ Dir(\boldsymbol{\pi} | \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k \pi_i^{\alpha_i - 1}. $$
Just like we did with the Beta distribution, we can simplify things by naming the normalization constant, as it can be computed in closed form from the parameters, that is
$$ Dir(\boldsymbol{\pi} | \alpha_1, \ldots, \alpha_k) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^k \pi_i^{\alpha_i - 1} $$
where
$$ B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^k \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^k \alpha_i)}. $$
Note that as a shorthand we will also write $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_k)$, giving us the shorter notation $Dir(\boldsymbol\pi | \boldsymbol\alpha)$.
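As a small sketch, we can draw samples from a Dirichlet distribution with NumPy and evaluate the normalization constant $B(\boldsymbol\alpha)$ in closed form; the parameter values here are arbitrary:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 5.0])

# Draws from Dir(alpha) live on the probability simplex: each sample
# is a vector of k probabilities summing to 1.
samples = rng.dirichlet(alpha, size=4)
print(samples.sum(axis=1))  # all 1.0

# log B(alpha) computed directly from the definition above,
# using log-gamma to avoid overflow.
log_B = gammaln(alpha).sum() - gammaln(alpha.sum())
print(np.exp(log_B))
```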
Dirichlet-Categorical Model
Similarly to the previous article about the Beta-Bernoulli model, we will now introduce the Dirichlet-Categorical model. Since everything is analogous we won’t go into that much detail. The Dirichlet distribution is a conjugate prior to the Categorical and Multinomial distributions, which means that if we set our prior to a Dirichlet and our likelihood to a Categorical or Multinomial, the resulting posterior will again be a Dirichlet distribution.
We can observe this easily by just multiplying out the probability mass functions for $Cat(\boldsymbol x | \boldsymbol \pi)$ and $Dir(\boldsymbol\pi|\boldsymbol\alpha)$, that is
$$ Cat(\boldsymbol x | \boldsymbol \pi) Dir(\boldsymbol\pi|\boldsymbol\alpha) \propto \prod_{i=1}^k \pi_i^{x_i} \prod_{i=1}^k \pi_i^{\alpha_i - 1}. $$
Since only one of the $x_i$ in the Categorical distribution can be $1$ and the rest are $0$, say $x_j = 1$, only the corresponding $\pi_j$ in the Dirichlet distribution gets its exponent increased, and we can immediately see that $\alpha_j$ will be increased by one, giving us a new Dirichlet distribution with parameters $(\alpha_1, \ldots, \alpha_j + 1, \ldots, \alpha_k)$.
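To wrap up, here is a minimal sketch of this conjugate update in NumPy: we start from a symmetric Dirichlet prior, observe a batch of categorical draws, and obtain the posterior simply by adding the observed counts to $\boldsymbol\alpha$. The prior and the true probabilities are made-up example values:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.ones(6)            # symmetric Dirichlet prior over die faces
true_pi = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

# Observe 100 categorical draws and accumulate counts per face.
rolls = rng.choice(6, size=100, p=true_pi)
counts = np.bincount(rolls, minlength=6)

# Conjugacy: the posterior is Dir(alpha + counts).
posterior_alpha = alpha + counts

# Posterior mean of pi, which should be close to true_pi.
print(posterior_alpha / posterior_alpha.sum())
```

Each observation simply increments the corresponding $\alpha_j$, which is exactly the update we derived above, applied once per draw.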