# Maximum Likelihood for the Multinomial Distribution (Bag of Words)

3 min read • Published: December 03, 2018

In this short article we’ll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.

Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model, where we count the number of occurrences of each word in a document and the words are drawn from a fixed dictionary. The probability mass function (PMF) is defined as

$p(\boldsymbol x | \boldsymbol \pi, n) = \frac{n!}{x_1! x_2! \ldots x_m!} \prod_{i=1}^m \pi_i^{x_i} = n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!}$

where $\pi_i$ is the probability of the $i$-th word, $x_i$ is the number of occurrences of that word, $m$ is the number of words in the dictionary, and $n$ is the total number of occurrences of all words.
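To make the definition concrete, the PMF above can be evaluated directly with the standard library; a minimal sketch (the function name is just for illustration):

```python
from math import factorial

def multinomial_pmf(x, pi):
    """PMF of the Multinomial: n!/(x_1!...x_m!) * prod pi_i^x_i."""
    n = sum(x)
    # Multinomial coefficient n! / (x_1! x_2! ... x_m!)
    coeff = factorial(n)
    for xi in x:
        coeff //= factorial(xi)
    # Multiply in the probability terms pi_i^x_i
    p = float(coeff)
    for xi, pi_i in zip(x, pi):
        p *= pi_i ** xi
    return p

# A 2-word dictionary with pi = (0.5, 0.5) and counts x = (2, 1):
# 3!/(2! 1!) * 0.5^2 * 0.5^1 = 3 * 0.125 = 0.375
print(multinomial_pmf((2, 1), (0.5, 0.5)))  # 0.375
```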

Since the Multinomial distribution comes from the exponential family, we know computing the log-likelihood will give us a simpler expression, and since $\log$ is monotonically increasing, maximizing the log-likelihood is equivalent to maximizing the original likelihood function.

Now taking the log-likelihood

\begin{aligned} \log L(\boldsymbol \pi) &= \log \left( n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} \right) \\ &= \log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i!. \end{aligned}
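As a quick sanity check, this expansion can be evaluated numerically with `math.lgamma`, using the identity $\log k! = \operatorname{lgamma}(k+1)$; a minimal sketch (the function name is an illustration, not from the article):

```python
from math import lgamma, log

def multinomial_log_likelihood(x, pi):
    """log L(pi) = log n! + sum_i x_i log pi_i - sum_i log x_i!."""
    n = sum(x)
    ll = lgamma(n + 1)  # log n!
    for xi, pi_i in zip(x, pi):
        ll += xi * log(pi_i) - lgamma(xi + 1)
    return ll

# Agrees with the log of the PMF: for x = (2, 1) and pi = (0.5, 0.5)
# the PMF is 3 * 0.5^3 = 0.375, so the log-likelihood is log(0.375).
print(multinomial_log_likelihood((2, 1), (0.5, 0.5)))
```

Working with `lgamma` instead of `factorial` avoids overflow for large word counts.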

Before we can differentiate the log-likelihood to find the maximum, we need to introduce the constraint that all probabilities $\pi_i$ sum up to $1$, that is

$\sum_{i=1}^m \pi_i = 1.$

The Lagrangian with this constraint then has the following form

$\mathcal{L}(\boldsymbol \pi, \lambda) = \log L(\boldsymbol \pi) + \lambda (1 - \sum_{i=1}^m \pi_i).$

To find the maximum, we differentiate the Lagrangian w.r.t. $\pi_i$ as follows

\begin{aligned} \frac{\partial}{\partial\pi_i} \mathcal{L}(\boldsymbol\pi, \lambda) &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) + \frac{\partial}{\partial\pi_i} \lambda (1 - \sum_{i=1}^m \pi_i) \\ &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) - \lambda \\ &= \frac{\partial}{\partial\pi_i} \left(\log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i! \right) - \lambda \\ &= \frac{x_i}{\pi_i} - \lambda. \end{aligned}

Finally, setting this derivative equal to zero allows us to compute the extremum as

$\pi_i = \frac{x_i}{\lambda}.$

To solve for $\lambda$, we sum both sides over $i$ and make use of our initial constraint

\begin{aligned} \pi_i &= \frac{x_i}{\lambda} \\ \sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i}{\lambda} \\ 1 &= \frac{1}{\lambda} \sum_{i=1}^m x_i \\ 1 &= \frac{1}{\lambda} n \\ \lambda &= n \end{aligned}

giving us the final form of the MLE for $\pi_i$, that is

$\pi_i = \frac{x_i}{n}$

which is what we would expect. The MLE for a word's probability is exactly its relative frequency in the document.
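The result is easy to verify numerically: computing $\pi_i = x_i / n$ from a count vector and checking that any perturbation of the probabilities scores a lower log-likelihood. A small sketch (variable names are illustrative):

```python
from math import lgamma, log

def log_likelihood(x, pi):
    """log L(pi) = log n! + sum_i x_i log pi_i - sum_i log x_i!."""
    n = sum(x)
    return lgamma(n + 1) + sum(xi * log(p) - lgamma(xi + 1) for xi, p in zip(x, pi))

counts = [5, 3, 2]
n = sum(counts)
pi_mle = [xi / n for xi in counts]  # [0.5, 0.3, 0.2] -- relative frequencies

# Shift some probability mass between two words; the MLE should score higher.
eps = 0.05
pi_alt = [pi_mle[0] + eps, pi_mle[1] - eps, pi_mle[2]]
print(log_likelihood(counts, pi_mle) > log_likelihood(counts, pi_alt))  # True
```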

### Discussion of "Maximum Likelihood for the Multinomial Distribution (Bag of Words)"

If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.