Mixture of Categoricals and Latent Dirichlet Allocation (LDA)

9 min read • Published: December 05, 2018

Now that we’ve worked through the Dirichlet-Categorical model in quite a bit of detail, we can move on to document modeling.

Let us begin with a very simple document model in which we consider only a single distribution over words across all documents. We have the following variables:

  • $N_d$: number of words in the $d$-th document.
  • $D$: number of documents.
  • $M$: number of words in the dictionary.
  • $\boldsymbol\beta = (\beta_1, \ldots, \beta_M)$: probabilities of each word.
  • $w_{nd} \sim Cat(\boldsymbol\beta)$: the $n$-th word in the $d$-th document.
  • $I(w_{nd} = m)$: indicator variable saying that the $n$-th word in the $d$-th document is $m$.

To summarize, there is only one random variable $w_{nd}$, which is observed and has a Categorical distribution with parameter $\boldsymbol\beta$. We can fit this model [using maximum likelihood estimation (MLE) directly as we’ve shown before]({{< relref "maximum-likelihood-for-multinomial-distribution.markdown" >}}), that is

$$\beta_m = \frac{c_m}{N}$$

where $c_m$ is the number of occurrences of the $m$-th word and $N$ is the total number of word occurrences in all documents. It is important to distinguish between $M$ and $N$: $M$ is the number of words in the dictionary, that is unique words, while $N$ is the sum of all the counts. Specifically:

$$\begin{aligned} N &= \sum_{d=1}^D N_d \\ c_m &= \sum_{d=1}^D \sum_{n=1}^{N_d} I(w_{nd} = m) \end{aligned}$$

In this case $c_m$ sums over documents, and within each document it sums up all the occurrences of the $m$-th word, that is summing over all positions in that document and adding $1$ for each occurrence of $m$.
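Since the MLE is just normalized counts, it is easy to compute directly. Here is a minimal sketch in Python, using a tiny made-up corpus:

```python
from collections import Counter

# Toy corpus: each document is a list of word tokens (hypothetical data).
documents = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
]

# c_m: number of occurrences of each word across all documents.
counts = Counter(word for doc in documents for word in doc)

# N: total number of word occurrences; M would be len(counts).
N = sum(counts.values())

# MLE: beta_m = c_m / N.
beta = {word: c / N for word, c in counts.items()}

print(beta)  # {'apple': 0.4, 'banana': 0.4, 'cherry': 0.2}
```

Note that `beta` is one shared distribution: the per-document structure of the corpus plays no role beyond contributing counts.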

There are downsides to this simple model though. Sharing one $\boldsymbol\beta$ between all documents means that all documents have the same distribution of words. What we would like instead is to allow each document to be about a different topic, with a different distribution of words for each topic.

Mixture of Categoricals

We’ll modify our model to allow it to have multiple topics, each of which will have its own categorical distribution over words. We introduce a second random variable $z_d \sim Cat(\boldsymbol\theta)$, where $\boldsymbol\theta$ is a vector of topic probabilities.

  • $z_d$: assigns document $d$ to one of the $K$ categories.
  • $\theta_k = p(z_d = k)$: probability that document $d$ is assigned to category $k$.
  • $w_{nd} | z_d \sim Cat(\boldsymbol\beta_{z_d})$: distribution over words at the $n$-th position in document $d$ given we have observed its topic $z_d$.

A small note: the distribution of $w_{nd}$ now depends on the document $d$ (through its topic), while in the previous model it was the same for all documents. We can write this model in its generative process specification as follows:

$$\begin{aligned} z_d &\sim Cat(\boldsymbol\theta) \\ w_{nd} | z_d &\sim Cat(\boldsymbol\beta_{z_d}) \end{aligned}$$
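The generative process translates almost line by line into code. A minimal numpy sketch, with made-up values for $\boldsymbol\theta$ and $\boldsymbol\beta$ ($K = 2$ topics, $M = 3$ dictionary words):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed parameters: K = 2 topics over a dictionary of M = 3 words.
theta = np.array([0.5, 0.5])        # topic probabilities
beta = np.array([[0.8, 0.1, 0.1],   # word probabilities for topic 0
                 [0.1, 0.1, 0.8]])  # word probabilities for topic 1

def sample_document(n_words):
    """Sample one document: a single topic z_d, then every word from beta[z_d]."""
    z_d = rng.choice(len(theta), p=theta)                    # z_d ~ Cat(theta)
    words = rng.choice(beta.shape[1], size=n_words, p=beta[z_d])  # w_nd ~ Cat(beta_{z_d})
    return z_d, words

z_d, words = sample_document(10)
```

Note that the topic is drawn once per document, so all ten words come from the same row of `beta`.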

In order to allow our model to capture different topics, we introduced a set of latent (hidden) variables $z_d$. Now the problem becomes: how do we perform maximum likelihood estimation with these hidden variables? Let us write out the likelihood

$$\begin{aligned} p(\boldsymbol w | \boldsymbol\theta, \boldsymbol\beta) &= \prod_{d=1}^D p(\boldsymbol w_d | \boldsymbol\theta, \boldsymbol\beta) \qquad\text{expanding the marginal}\\ &= \prod_{d=1}^D \sum_{k=1}^K p(\boldsymbol w_d, z_d = k | \boldsymbol\theta, \boldsymbol\beta) \\ &= \prod_{d=1}^D \sum_{k=1}^K p(\boldsymbol w_d | z_d = k, \boldsymbol\beta)\, p(z_d = k | \boldsymbol\theta) \\ &= \prod_{d=1}^D \sum_{k=1}^K \left( \prod_{n=1}^{N_d} p(w_{nd} | z_d = k, \boldsymbol\beta) \right) p(z_d = k | \boldsymbol\theta) \end{aligned}$$

We cannot easily optimize through the marginalization, and even if we were to take the log likelihood

$$\log p(\boldsymbol w | \boldsymbol\theta, \boldsymbol\beta) = \sum_{d=1}^D \log \left( \sum_{k=1}^K \left( \prod_{n=1}^{N_d} p(w_{nd} | z_d = k, \boldsymbol\beta) \right) p(z_d = k | \boldsymbol\theta) \right)$$

we run into the problem of a sum inside a log, that is $\sum \log \left( \sum \ldots \right)$. We cannot move the log inside the sum, and solving the equation analytically is not possible, at least in the general form. This is where the Expectation-Maximization (EM) algorithm comes in and allows us to find a local optimum using MLE. But for now we’ll move on to the Bayesian approach and cover EM in a separate article in more depth.
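Even without a closed-form maximum, evaluating this log likelihood is straightforward. Here is a minimal numpy sketch with a hand-rolled log-sum-exp over topics (the sum inside the log) for numerical stability; the inputs are assumed to be log probabilities:

```python
import numpy as np

def mixture_log_likelihood(docs, log_theta, log_beta):
    """log p(w | theta, beta) for the mixture of categoricals.

    docs: list of 1-D integer arrays of word indices, one array per document.
    log_theta: shape (K,), log topic probabilities.
    log_beta: shape (K, M), log word probabilities per topic.
    """
    total = 0.0
    for words in docs:
        # log p(w_d, z_d = k) = sum_n log beta[k, w_nd] + log theta[k], for every k.
        per_topic = log_beta[:, words].sum(axis=1) + log_theta
        # Log-sum-exp over the K topics: the sum inside the log from the text.
        m = per_topic.max()
        total += m + np.log(np.exp(per_topic - m).sum())
    return total
```

With $K = 1$ this reduces to the simple model above, which gives an easy sanity check: a two-word document under a uniform two-word dictionary has log likelihood $\log 0.25$.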

Bayesian Mixture of Categoricals

To move from the point estimate of EM to a fully Bayesian treatment, we’ll introduce prior distributions over the parameters $\boldsymbol\theta$ and $\boldsymbol\beta$. Since both are parameters of a Categorical distribution, it is no surprise that our priors will be Dirichlet distributions.

The generative model specification then becomes

$$\begin{aligned} \boldsymbol\theta &\sim Dir(\boldsymbol\alpha) \\ \boldsymbol\beta_k &\sim Dir(\boldsymbol\gamma) \\ z_d | \boldsymbol\theta &\sim Cat(\boldsymbol\theta) \\ w_{nd} | z_d, \boldsymbol\beta_{z_d} &\sim Cat(\boldsymbol\beta_{z_d}) \end{aligned}$$

where $\boldsymbol\alpha$ is the hyperparameter over topic probabilities, while $\boldsymbol\gamma$ is the hyperparameter over dictionary probabilities.
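Sampling from this model only adds two Dirichlet draws in front of the mixture of categoricals. A minimal numpy sketch, with made-up sizes and symmetric hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, D = 3, 5, 4                    # topics, dictionary size, documents (made up)
alpha = np.ones(K)                   # symmetric hyperparameter over topics
gamma = np.ones(M)                   # symmetric hyperparameter over words

theta = rng.dirichlet(alpha)         # theta ~ Dir(alpha)
beta = rng.dirichlet(gamma, size=K)  # beta_k ~ Dir(gamma), one row per topic

docs = []
for d in range(D):
    z_d = rng.choice(K, p=theta)                      # z_d | theta ~ Cat(theta)
    docs.append(rng.choice(M, size=10, p=beta[z_d]))  # w_nd ~ Cat(beta_{z_d})
```

The parameters are no longer fixed constants but are themselves drawn once per corpus; each `rng.dirichlet` draw is a valid probability vector summing to one.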

Now that we have our priors, we could compute a MAP estimate using EM, but we’ll instead go one step further, extend our model to Latent Dirichlet Allocation (LDA), and then cover full Bayesian inference using Gibbs sampling.

Latent Dirichlet Allocation (LDA)

One limitation of the mixture of categoricals model is that all words in a document are drawn from one specific topic. The problem arises when we have documents that span more than one topic, in which case we need to learn a mixture of those topics. We also want to allow the distribution of topics to vary across documents.

In LDA, each document becomes a mixture of topics, but each word is still drawn from one of those topics. We don’t need to introduce new kinds of random variables; we’ll simply create more of what we have. In the mixture of categoricals we had $z_d$ be a distribution over possible topics of the $d$-th document. We replace it with $z_{nd}$, which becomes a distribution over possible topics of the $n$-th word in the $d$-th document. We also introduce $\boldsymbol\theta_d$, which is a distribution over topics for the $d$-th document.

The generative model then becomes:

$$\begin{aligned} \boldsymbol\theta_d &\sim Dir(\boldsymbol\alpha) \\ \boldsymbol\beta_k &\sim Dir(\boldsymbol\gamma) \\ z_{nd} | \boldsymbol\theta_d &\sim Cat(\boldsymbol\theta_d) \\ w_{nd} | z_{nd}, \boldsymbol\beta_{z_{nd}} &\sim Cat(\boldsymbol\beta_{z_{nd}}) \end{aligned}$$

We can view this generative process as a sequential recipe for generating a set of documents:

  • For each document $d$, draw $\boldsymbol\theta_d$ from the prior distribution $Dir(\boldsymbol\alpha)$. This is the parameter of our Categorical distribution over topics.
  • For each topic $k$, draw $\boldsymbol\beta_k$ from the prior $Dir(\boldsymbol\gamma)$. This is the parameter of our Categorical distribution over words in that topic.
  • For each word position $n$ in each document $d$, draw the topic $z_{nd}$ from $Cat(\boldsymbol\theta_d)$. This allows each word in a document to be drawn from a different topic.
  • Draw each word $w_{nd}$ from its corresponding $Cat(\boldsymbol\beta_{z_{nd}})$.
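The recipe above can be sketched directly in numpy. All sizes and hyperparameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 8                          # number of topics and dictionary size (made up)
alpha, gamma = np.ones(K), np.ones(M)

beta = rng.dirichlet(gamma, size=K)  # per-topic word distributions beta_k

def sample_lda_document(n_words):
    theta_d = rng.dirichlet(alpha)              # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=n_words, p=theta_d)  # z_nd | theta_d ~ Cat(theta_d)
    # Each word is drawn from the topic chosen for ITS position, beta[z_nd].
    w = np.array([rng.choice(M, p=beta[z_nd]) for z_nd in z])
    return theta_d, z, w

theta_d, z, w = sample_lda_document(12)
```

Contrast this with the mixture of categoricals sketch earlier: the topic draw has moved inside the per-word loop, so a single document can mix words from several topics.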

The critical part here is in the last step, where each word at position $n$ in document $d$ is drawn from $Cat(\boldsymbol\beta_{z_{nd}})$. The parameter $\boldsymbol\beta_{z_{nd}}$ of the distribution of $w_{nd}$ depends on $z_{nd}$, that is $p(w_{nd} | z_{nd})$. Even though each document $d$ has only a single distribution over topics $Cat(\boldsymbol\theta_d)$, we draw a topic $z_{nd}$ for each word position $n$ in the document $d$.

We can think of $z_{nd}$ as defining a distribution over words at position $n$ in document $d$. The problem with viewing it directly as a distribution is that we would then ignore the topics and fail to share word probabilities among different positions with the same topic. That’s why we keep a separate set of random variables $\boldsymbol\beta_k$, which define the distribution over words in topic $k$, and $z_{nd}$ simply acts as an index into one of those $\boldsymbol\beta_k$.

This allows us to draw the topic of each position in each document independently, while sharing the probabilities of words within the same topic between documents.

Just to summarize, we have the following count constants:

  • $D$ is the number of documents.
  • $N_d$ is the number of words in document $d$.
  • $K$ is the number of topics.

and the following random variables:

  • One hyperparameter $\boldsymbol\alpha$ for the prior distribution on topics in each document.
  • One hyperparameter $\boldsymbol\gamma$ for the prior distribution on words in each topic.
  • A set of $D$ (number of documents) parameters $\boldsymbol\theta_{1:D}$ for our per-document topic distributions.
  • A set of $K$ parameters $\boldsymbol\beta_{1:K}$ for our per-topic word distributions.
  • A set of $\sum_{d=1}^D N_d$ random variables $z_{nd}$, one for each position in each document, signifying the topic of the word at that position.
  • A set of $\sum_{d=1}^D N_d$ random variables $w_{nd}$ representing the actual word at a given position in each document, drawn from the $z_{nd}$-th topic.

Inference in LDA

Similarly to the mixture of categoricals, if we wanted to compute the posterior over our parameters $\boldsymbol\beta_{1:K}$ and $\boldsymbol\theta_{1:D}$ given the observed words $w_{nd}$, we’d need to marginalize out the latent variables $z_{nd}$. Let us write out the posterior first:

$$p(\boldsymbol\beta, \boldsymbol\theta, \boldsymbol z | \boldsymbol w, \boldsymbol\alpha, \boldsymbol\gamma) = \frac{p(\boldsymbol\beta, \boldsymbol\theta, \boldsymbol z, \boldsymbol w | \boldsymbol\alpha, \boldsymbol\gamma)}{ p(\boldsymbol w | \boldsymbol\alpha, \boldsymbol\gamma) }$$


$$p(\boldsymbol\beta, \boldsymbol\theta, \boldsymbol z, \boldsymbol w | \boldsymbol\gamma, \boldsymbol\alpha) = \prod_{k=1}^K p(\boldsymbol\beta_k | \boldsymbol\gamma) \prod_{d=1}^D \left( p(\boldsymbol\theta_d | \boldsymbol\alpha) \prod_{n=1}^{N_d} p(z_{nd} | \boldsymbol\theta_d)\, p(w_{nd} | \boldsymbol\beta, z_{nd}) \right)$$

and the normalization constant $p(\boldsymbol w | \boldsymbol\alpha, \boldsymbol\gamma)$, written out as

$$p(\boldsymbol w | \boldsymbol\alpha, \boldsymbol\gamma) = \int \int \sum_{\boldsymbol z} \prod_{d=1}^D \prod_{k=1}^K \prod_{n=1}^{N_d} p(z_{nd} | \boldsymbol\theta_d)\, p(\boldsymbol\theta_d | \boldsymbol\alpha)\, p(w_{nd} | \boldsymbol\beta, z_{nd})\, p(\boldsymbol\beta_k | \boldsymbol\gamma)\ d\boldsymbol\beta_k\ d\boldsymbol\theta_d$$

is intractable, since marginalizing out the latent variables $z_{nd}$ requires summing over every possible topic assignment. If every document had $N$ words, that means $K^N$ configurations per document.
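To see the blow-up concretely, here is a tiny enumeration of all per-document topic assignments with made-up sizes; even $K = 3$ topics and $N = 10$ words already give almost sixty thousand configurations for a single document:

```python
from itertools import product

# Every assignment of one of K topics to each of N word positions in a document.
K, N = 3, 10
configs = list(product(range(K), repeat=N))
print(len(configs))  # 59049 == 3**10
```

A realistic document with hundreds of words makes this enumeration hopeless, which is why we turn to approximate inference.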

Although the posterior is intractable for exact inference, we can use many approximate inference algorithms, e.g. Markov chain Monte Carlo (MCMC) or variational inference. In the next article we’ll see how to apply Gibbs sampling to LDA.

