# Maximum Likelihood for the Multinomial Distribution (Bag of Words)

3 min read • Published: December 03, 2018

In this short article we’ll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.

Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model, where we count the number of occurrences of each word in a document and the words are generated from a fixed dictionary. The probability mass function (PMF) is defined as

$p(\boldsymbol x \mid \boldsymbol \pi, n) = \frac{n!}{x_1! x_2! \ldots x_m!} \prod_{i=1}^m \pi_i^{x_i} = n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!}$

where $\pi_i$ is the probability of the $i$-th word, $x_i$ is the number of occurrences of that word, $m$ is the number of words in the dictionary, and $n$ is the total number of occurrences of all words.
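As a quick sanity check, the PMF above is easy to evaluate directly. The sketch below uses a hypothetical three-word dictionary; the counts and probabilities are made up for illustration.

```python
from math import factorial, prod

def multinomial_pmf(x, pi):
    """Multinomial PMF: n! * prod(pi_i^x_i / x_i!)."""
    n = sum(x)
    return factorial(n) * prod(p ** k / factorial(k) for p, k in zip(pi, x))

# Hypothetical 3-word dictionary, counts x = (2, 1, 1), n = 4:
print(multinomial_pmf((2, 1, 1), (0.5, 0.25, 0.25)))  # 0.1875
```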

Since the Multinomial distribution comes from the exponential family, we know computing the log-likelihood will give us a simpler expression, and since $\log$ is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the original likelihood function.

Now taking the log-likelihood

\begin{aligned} \log L(\boldsymbol \pi) &= \log \left( n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} \right) \\ &= \log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i!. \end{aligned}
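The expanded log-likelihood translates directly into code. A minimal sketch, using `lgamma(k + 1)` for $\log k!$ to stay numerically stable for large counts (the counts and probabilities are the same illustrative values as before):

```python
from math import lgamma, log

def log_likelihood(x, pi):
    """log L(pi) = log n! + sum_i x_i log pi_i - sum_i log x_i!."""
    n = sum(x)
    return (lgamma(n + 1)
            + sum(k * log(p) for k, p in zip(x, pi))
            - sum(lgamma(k + 1) for k in x))

# exp of this value should match the PMF for the same counts
print(log_likelihood((2, 1, 1), (0.5, 0.25, 0.25)))
```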

Before we can differentiate the log-likelihood to find the maximum, we need to introduce the constraint that all probabilities $\pi_i$ sum up to $1$, that is

$\sum_{i=1}^m \pi_i = 1.$

The Lagrangian with the constraint then has the following form

$\mathcal{L}(\boldsymbol \pi, \lambda) = \log L(\boldsymbol \pi) + \lambda (1 - \sum_{i=1}^m \pi_i).$

To find the maximum, we differentiate the Lagrangian w.r.t. $\pi_i$ as follows

\begin{aligned} \frac{\partial}{\partial\pi_i} \mathcal{L}(\boldsymbol\pi, \lambda) &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) + \frac{\partial}{\partial\pi_i} \lambda (1 - \sum_{i=1}^m \pi_i) \\ &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) - \lambda \\ &= \frac{\partial}{\partial\pi_i} \left(\log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i! \right) - \lambda \\ &= \frac{x_i}{\pi_i} - \lambda. \end{aligned}

Finally, setting the derivative equal to zero and solving for $\pi_i$ gives us the extremum

$\pi_i = \frac{x_i}{\lambda}.$

To solve for $\lambda$, we sum both sides over $i$ and make use of our initial constraint

\begin{aligned} \pi_i &= \frac{x_i}{\lambda} \\ \sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i}{\lambda} \\ 1 &= \frac{1}{\lambda}\sum_{i=1}^m x_i \\ 1 &= \frac{1}{\lambda} n \\ \lambda &= n \end{aligned}

giving us the final form of the MLE for $\pi_i$, that is

$\pi_i = \frac{x_i}{n}$

which is what we would expect. The MLE for a word is exactly its relative frequency in the document.
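In code, the estimator is just counting and dividing. A minimal sketch with a made-up example document:

```python
from collections import Counter

def mle_estimate(words):
    """MLE of pi: each word's count x_i divided by the total count n."""
    counts = Counter(words)
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

doc = "the cat sat on the mat".split()
print(mle_estimate(doc))  # 'the' gets 2/6, every other word 1/6
```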
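We can also verify numerically that $\pi_i = x_i / n$ is indeed the maximizer, by comparing its log-likelihood against many random probability vectors on the simplex. The counts below are hypothetical:

```python
import random
from math import lgamma, log

def log_likelihood(x, pi):
    n = sum(x)
    return lgamma(n + 1) + sum(k * log(p) - lgamma(k + 1) for k, p in zip(x, pi))

x = (3, 1, 6)                     # hypothetical word counts
mle = [k / sum(x) for k in x]     # pi_i = x_i / n

random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in x]
    pi = [r / sum(raw) for r in raw]   # random point on the simplex
    assert log_likelihood(x, pi) <= log_likelihood(x, mle)
print("x_i / n beat every random probability vector")
```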