
Maximum Likelihood for the Multinomial Distribution (Bag of Words)

In this short article we’ll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.

Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model, in which we count the number of occurrences of each word in a document, with the words drawn from a fixed dictionary. The probability mass function (PMF) is defined as

$$ p(\boldsymbol x | \boldsymbol \pi, n) = \frac{n!}{x_1! x_2! \cdots x_m!} \prod_{i=1}^m \pi_i^{x_i} = n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} $$

where $\pi_i$ is the probability of the $i$-th word, $x_i$ is the number of occurrences of that word, $m$ is the number of words in the dictionary, and $n$ is the total number of occurrences of all words.
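To make the definition concrete, here is a small numeric sketch (a made-up three-word dictionary and counts, nothing from a real corpus) that evaluates the PMF directly from the formula and cross-checks it against `scipy.stats.multinomial`.

```python
import math

import numpy as np
from scipy.stats import multinomial

# Toy example: a dictionary of m = 3 words and a document with n = 6 word occurrences.
x = np.array([3, 2, 1])          # counts x_i of each dictionary word in the document
pi = np.array([0.5, 0.3, 0.2])   # assumed word probabilities pi_i
n = int(x.sum())

# PMF computed directly from the formula: n! * prod_i pi_i^x_i / x_i!
coefficient = math.factorial(n) / np.prod([math.factorial(k) for k in x])
pmf_manual = coefficient * np.prod(pi ** x)

# Cross-check against scipy's implementation.
pmf_scipy = multinomial.pmf(x, n=n, p=pi)

print(pmf_manual, pmf_scipy)  # both ≈ 0.135
```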

Since the Multinomial distribution comes from the exponential family, we know computing the log-likelihood will give us a simpler expression, and since $\log$ is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the original likelihood function.

Now taking the log-likelihood

$$ \begin{align} \log L(\boldsymbol \pi) &= \log \left( n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} \right) \\\\
&= \log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i!. \end{align} $$
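This expression translates almost directly into code. Below is a minimal sketch of the log-likelihood (toy numbers and a function name of my own choosing), using `scipy.special.gammaln(k + 1)` in place of $\log k!$ so it stays numerically stable for large counts.

```python
import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(x, pi):
    """log L(pi) = log n! + sum_i x_i log pi_i - sum_i log x_i!"""
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n = x.sum()
    return gammaln(n + 1) + np.sum(x * np.log(pi)) - np.sum(gammaln(x + 1))

print(multinomial_log_likelihood([3, 2, 1], [0.5, 0.3, 0.2]))  # ≈ -2.00, i.e. log(0.135)
```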

Before we can differentiate the log-likelihood to find the maximum, we need to introduce the constraint that all probabilities $\pi_i$ sum up to $1$, that is

$$ \sum_{i=1}^m \pi_i = 1. $$

The Lagrangian with this constraint then has the following form

$$ \mathcal{L}(\boldsymbol \pi, \lambda) = \log L(\boldsymbol \pi) + \lambda (1 - \sum_{i=1}^m \pi_i). $$

To find the maximum, we differentiate the Lagrangian w.r.t. $\pi_i$ as follows

$$ \begin{align} \frac{\partial}{\partial\pi_i} \mathcal{L}(\boldsymbol\pi, \lambda) &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) + \frac{\partial}{\partial\pi_i} \lambda (1 - \sum_{i=1}^m \pi_i) \\\\
&= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) - \lambda \\\\
&= \frac{\partial}{\partial\pi_i} \left(\log n! + \sum_{j=1}^m x_j \log \pi_j - \sum_{j=1}^m \log x_j! \right) - \lambda \\\\
&= \frac{x_i}{\pi_i} - \lambda. \end{align} $$
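A quick way to convince ourselves that the $\frac{x_i}{\pi_i}$ term is right (the constraint contributes only the constant $-\lambda$) is to compare it against a central finite difference of the log-likelihood. A small sketch with the same toy numbers as above:

```python
import numpy as np
from scipy.special import gammaln

def log_lik(x, pi):
    # log n! + sum_i x_i log pi_i - sum_i log x_i!
    x, pi = np.asarray(x, dtype=float), np.asarray(pi, dtype=float)
    return gammaln(x.sum() + 1) + np.sum(x * np.log(pi)) - np.sum(gammaln(x + 1))

x = np.array([3.0, 2.0, 1.0])
pi = np.array([0.5, 0.3, 0.2])
eps = 1e-6

for i in range(len(x)):
    step = np.zeros_like(pi)
    step[i] = eps
    numeric = (log_lik(x, pi + step) - log_lik(x, pi - step)) / (2 * eps)
    analytic = x[i] / pi[i]
    print(i, numeric, analytic)  # the two values agree to several decimal places
```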

Finally, setting the derivative of the Lagrangian equal to zero allows us to compute the extremum as

$$ \pi_i = \frac{x_i}{\lambda}. $$

To solve for $\lambda$, we sum both sides over $i$ and make use of our initial constraint

$$ \begin{align} \pi_i &= \frac{x_i}{\lambda} \\\\
\sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i}{\lambda} \\\\
1 &= \frac{1}{\lambda }\sum_{i=1}^m x_i \\\\
1 &= \frac{1}{\lambda} n \\\\
\lambda &= n \\\\
\end{align} $$

giving us the final form of the MLE for $\pi_i$, that is

$$ \pi_i = \frac{x_i}{n} $$

which is what we would expect: the MLE for a word's probability is exactly its relative frequency in the document.
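Putting everything together, here is a minimal end-to-end sketch (hypothetical mini-dictionary and document, names my own): count the words, divide by the total count, and sanity-check that a different choice of $\boldsymbol\pi$ gives a lower log-likelihood.

```python
from collections import Counter

import numpy as np
from scipy.stats import multinomial

# Hypothetical dictionary and document.
dictionary = ["cat", "dog", "fish"]
document = "cat dog cat cat fish dog".split()

counts = Counter(document)
x = np.array([counts[word] for word in dictionary])   # x = [3, 2, 1]
n = int(x.sum())

pi_mle = x / n                                        # MLE: relative word frequencies
print(pi_mle)                                         # [0.5, 0.333..., 0.1666...]

# Sanity check: another valid pi should give a smaller log-likelihood.
pi_other = np.array([0.4, 0.4, 0.2])
print(multinomial.logpmf(x, n=n, p=pi_mle) > multinomial.logpmf(x, n=n, p=pi_other))  # True
```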