
Maximum Likelihood for the Multinomial Distribution (Bag of Words)

In this short article we’ll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.

Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model, in which we count the number of occurrences of each word in a document, with the words drawn from a fixed dictionary. The probability mass function (PMF) is defined as

$$ p(\boldsymbol x | \boldsymbol \pi, n) = \frac{n!}{x_1! x_2! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i} = n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} $$

where $\pi_i$ is the probability of the $i$-th word, $x_i$ is the number of occurrences of that word, $m$ is the number of words in the dictionary, and $n$ is the total number of occurrences of all words.
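To make the definition concrete, here is a small numeric sketch (a made-up three-word dictionary and counts, nothing from a real corpus) that evaluates the PMF directly from the formula and cross-checks it against `scipy.stats.multinomial`.

```python
import math

import numpy as np
from scipy.stats import multinomial

# Toy example: a dictionary of m = 3 words and a document with n = 6 word occurrences.
x = np.array([3, 2, 1])          # counts x_i of each dictionary word in the document
pi = np.array([0.5, 0.3, 0.2])   # assumed word probabilities pi_i
n = int(x.sum())

# PMF computed directly from the formula: n! * prod_i pi_i^x_i / x_i!
coefficient = math.factorial(n) / np.prod([math.factorial(k) for k in x])
pmf_manual = coefficient * np.prod(pi ** x)

# Cross-check against scipy's implementation.
pmf_scipy = multinomial.pmf(x, n=n, p=pi)

print(pmf_manual, pmf_scipy)  # both ≈ 0.135
```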

Since the Multinomial distribution comes from the exponential family, we know computing the log-likelihood will give us a simpler expression, and since $\log$ is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the original likelihood function.

Now taking the log-likelihood

$$ \begin{align} \log L(\boldsymbol \pi) &= \log \left( n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} \right) \\\\
&= \log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i!. \end{align} $$
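This expression translates almost directly into code. Below is a minimal sketch of the log-likelihood (toy numbers and a function name of my own choosing), using `scipy.special.gammaln(k + 1)` in place of $\log k!$ so it stays numerically stable for large counts.

```python
import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(x, pi):
    """log L(pi) = log n! + sum_i x_i log pi_i - sum_i log x_i!"""
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n = x.sum()
    return gammaln(n + 1) + np.sum(x * np.log(pi)) - np.sum(gammaln(x + 1))

print(multinomial_log_likelihood([3, 2, 1], [0.5, 0.3, 0.2]))  # ≈ -2.00, i.e. log(0.135)
```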

Before we can differentiate the log-likelihood to find the maximum, we need to introduce the constraint that all probabilities $\pi_i$ sum up to $1$, that is

$$ \sum_{i=1}^m \pi_i = 1. $$

The Lagrangian with this constraint then has the following form

$$ \mathcal{L}(\boldsymbol \pi, \lambda) = \log L(\boldsymbol \pi) + \lambda (1 - \sum_{i=1}^m \pi_i). $$

To find the maximum, we differentiate the Lagrangian w.r.t. $\pi_i$ as follows

$$ \begin{align} \frac{\partial}{\partial\pi_i} \mathcal{L}(\boldsymbol\pi, \lambda) &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) + \frac{\partial}{\partial\pi_i} \lambda (1 - \sum_{i=1}^m \pi_i) \\\\
&= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) - \lambda \\\\
&= \frac{\partial}{\partial\pi_i} \left(\log n! + \sum_{j=1}^m x_j \log \pi_j - \sum_{j=1}^m \log x_j! \right) - \lambda \\\\
&= \frac{x_i}{\pi_i} - \lambda. \end{align} $$
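A quick way to convince ourselves that the $\frac{x_i}{\pi_i}$ term is right (the constraint contributes only the constant $-\lambda$) is to compare it against a central finite difference of the log-likelihood. A small sketch with the same toy numbers as above:

```python
import numpy as np
from scipy.special import gammaln

def log_lik(x, pi):
    # log n! + sum_i x_i log pi_i - sum_i log x_i!
    x, pi = np.asarray(x, dtype=float), np.asarray(pi, dtype=float)
    return gammaln(x.sum() + 1) + np.sum(x * np.log(pi)) - np.sum(gammaln(x + 1))

x = np.array([3.0, 2.0, 1.0])
pi = np.array([0.5, 0.3, 0.2])
eps = 1e-6

for i in range(len(x)):
    step = np.zeros_like(pi)
    step[i] = eps
    numeric = (log_lik(x, pi + step) - log_lik(x, pi - step)) / (2 * eps)
    analytic = x[i] / pi[i]
    print(i, numeric, analytic)  # the two values agree to several decimal places
```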

Finally, setting the derivative of the Lagrangian equal to zero allows us to compute the extremum as

$$ \pi_i = \frac{x_i}{\lambda}. $$

To solve for $\lambda$, we sum both sides over $i$ and make use of our initial constraint

$$ \begin{align} \pi_i &= \frac{x_i}{\lambda} \\\\
\sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i}{\lambda} \\\\
1 &= \frac{1}{\lambda }\sum_{i=1}^m x_i \\\\
1 &= \frac{1}{\lambda} n \\\\
\lambda &= n \\\\
\end{align} $$

giving us the final form of the MLE for $\pi_i$, that is

$$ \pi_i = \frac{x_i}{n} $$

which is what we would expect: the MLE for a word's probability is exactly its relative frequency in the document.
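Putting everything together, here is a minimal end-to-end sketch (hypothetical mini-dictionary and document, names my own): count the words, divide by the total count, and sanity-check that a different choice of $\boldsymbol\pi$ gives a lower log-likelihood.

```python
from collections import Counter

import numpy as np
from scipy.stats import multinomial

# Hypothetical dictionary and document.
dictionary = ["cat", "dog", "fish"]
document = "cat dog cat cat fish dog".split()

counts = Counter(document)
x = np.array([counts[word] for word in dictionary])   # x = [3, 2, 1]
n = int(x.sum())

pi_mle = x / n                                        # MLE: relative word frequencies
print(pi_mle)                                         # [0.5, 0.333..., 0.1666...]

# Sanity check: another valid pi should give a smaller log-likelihood.
pi_other = np.array([0.4, 0.4, 0.2])
print(multinomial.logpmf(x, n=n, p=pi_mle) > multinomial.logpmf(x, n=n, p=pi_other))  # True
```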