Maximum Likelihood for the Multinomial Distribution (Bag of Words)

3 min read • Published: December 03, 2018

In this short article we’ll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.

Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model where we’re counting the nubmer of words in a document, where the words are generated from a fixed dictionary. The probability mass function (PMF) is defined as

p(xπ,n)=(n!x1!x2!xm!)i=1mπixi=n!i=1mπixixi!p(\boldsymbol x | \boldsymbol \pi, n) = \binom{n!}{x_1! x_2! \ldots x_m!} \prod_{i=1}^m \pi_i^{x_i} = n! \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!}

where πi\pi_i is the probability of ithi-th word, xix_i is the nubmer of occurences of that word, mm is the number of words in the dictionary, and nn is the total number of occurences of all words.

Since the Multinomial distribution comes from the exponential family, we know computing the log-likelihood will give us a simpler expression, and since log\log is concave computing the MLE on the log-likelihood will be equivalent as computing it on the original likelihood function.

Now taking the log-likelihood

logL(π)=logn!(i=1mπixixi!)=logn!+i=1mxilogπii=1mlogxi!.\begin{aligned} \log L(\boldsymbol \pi) &= \log n! \left( \prod_{i=1}^m \frac{\pi_i^{x_i}}{x_i!} \right) \\ &= \log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i!. \end{aligned}

Before we can differentiate the log-likelihood to find the maximum, we need to introduce the constraint that all probabilities πi\pi_i sum up to 11, that is

i=1mπi=1.\sum_{i=1}^m \pi_i = 1.

The lagrangian with the constraint than has the following form

L(π,λ)=logL(π)+λ(1i=1mπi).\mathcal{L}(\boldsymbol \pi, \lambda) = \log L(\boldsymbol \pi) + \lambda (1 - \sum_{i=1}^m \pi_i).

To find the maximum, we differentiate the lagrangian w.r.t. πi\pi_i as follows

πiL(π,λ)=πilogL(π)+πiλ(1i=1mπi)=πilogL(π)λ=πi(logn!+i=1mxilogπii=1mlogxi!)λ=xiπiλ.\begin{aligned} \frac{\partial}{\partial\pi_i} \mathcal{L}(\boldsymbol\pi, \lambda) &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) + \frac{\partial}{\partial\pi_i} \lambda (1 - \sum_{i=1}^m \pi_i) \\ &= \frac{\partial}{\partial\pi_i}\log L(\boldsymbol \pi) - \lambda \\ &= \frac{\partial}{\partial\pi_i} \left(\log n! + \sum_{i=1}^m x_i \log \pi_i - \sum_{i=1}^m \log x_i! \right) - \lambda \\ &= \frac{x_i}{\pi_i} - \lambda. \end{aligned}

Finally, setting the lagrangian equal to zero allows us to compute the extremum as

πi=xiλ.\pi_i = \frac{x_i}{\lambda}.

To solve for λ\lambda, we sum both sides and make use of our initial constraint

πi=xiλi=1mπi=i=1mxiλ1=1λi=1mxi1=1λnλ=n\begin{aligned} \pi_i &= \frac{x_i}{\lambda} \\ \sum_{i=1}^m \pi_i &= \sum_{i=1}^m \frac{x_i}{\lambda} \\ 1 &= \frac{1}{\lambda }\sum_{i=1}^m x_i \\ 1 &= \frac{1}{\lambda} n \\ \lambda &= n \\ \end{aligned}

giving us the final form of the MLE for πi\pi_i, that is

πi=xin\pi_i = \frac{x_i}{n}

which is what we would expect. The MLE for a word is exactly its frequency in the document.

Share on Twitter and Facebook

Discussion of "Maximum Likelihood for the Multinomial Distribution (Bag of Words)"

If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear don't be afraid to mention it. The goal of these articles is to be as informative as possible.

If you'd prefer to reach out to me via email, my address is loading ..