In this short article we'll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.

Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model where we're counting the nubmer of words in a document, where the words are generated from a fixed dictionary. The probability mass function (PMF) is defined as

$$ p(\boldsymbol x | \boldsymbol \pi, n) = \binom{n!}{x_1! x_2! \ldots x_m!} \prod_{i=1}^m \pi_i^{x_i} = n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} $$

where $\pi_i$ is the probability of $i-th$ word, $x_i$ is the nubmer of occurences of that word, $m$ is the number of words in the dictionary, and $n$ is the total number of occurences of all words.

Since the Multinomial distribution comes from the exponential family, we know computing the log-likelihood will give us a simpler expression, and since $\log$ is concave computing the MLE on the log-likelihood will be equivalent as computing it on the original likelihood function.

Now taking the log-likelihood

$$
\begin{align}
\log L(\boldsymbol \pi) &= \log n! \left( \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} \right) \\\\

&= \log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i!.
\end{align}
$$

Before we can differentiate the log-likelihood to find the maximum, we need to introduce the constraint that all probabilities $\pi_i$ sum up to $1$, that is

$$ \sum_{i=1}^m \pi_i = 1. $$

The lagrangian with the constraint than has the following form

$$ \mathcal{L}(\boldsymbol \pi, \lambda) = \log L(\boldsymbol \pi) + \lambda (1 - \sum_{i=1}^m \pi_i). $$

To find the maximum, we differentiate the lagrangian w.r.t. $\pi_i$ as follows

$$
\begin{align}
\frac{\partial}{\partial\pi_i} \mathcal{L}(\boldsymbol\pi, \lambda) &=
\frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) + \frac{\partial}{\partial\pi_i} \lambda (1 - \sum_{i=1}^m \pi_i) \\\\

&= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) - \lambda \\\\

&= \frac{\partial}{\partial\pi_i} \left(\log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i! \right) - \lambda \\\\

&= \frac{x_i}{\pi_i} - \lambda.
\end{align}
$$

Finally, setting the lagrangian equal to zero allows us to compute the extremum as

$$ \pi_i = \frac{x_i}{\lambda}. $$

To solve for $\lambda$, we sum both sides and make use of our initial constraint

$$
\begin{align}
\pi_i &= \frac{x_i}{\lambda} \\\\

\sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i}{\lambda} \\\\

1 &= \frac{1}{\lambda }\sum_{i=1}^m x_i \\\\

1 &= \frac{1}{\lambda} n \\\\

\lambda &= n \\\\

\end{align}
$$

giving us the final form of the MLE for $\pi_i$, that is

$$ \pi_i = \frac{x_i}{n} $$

which is what we would expect. The MLE for a word is exactly its frequency in the document.

#### See also

- Dirichlet-Categorical Model
- Beta Distribution and the Beta-Bernoulli Model
- The Gaussian Distribution - Basic Properties

#### Liked this blog post? Share it!

comments powered by Disqus