Now that we’ve worked through the Dirichlet-Categorical model in quite a bit of detail, we can move on to document modeling.
Let us begin with a very simple document model in which we consider only a single distribution over words across all documents. We have the following variables:
- $N_d$: number of words in the $d$-th document.
- $D$: number of documents.
- $M$: number of words in the dictionary.
- $\boldsymbol\beta = (\beta_1,\ldots,\beta_M)$: probabilities of each word.
- $w_{nd} \sim Cat(\boldsymbol\beta)$: the $n$-th word in the $d$-th document.
- $I(w_{nd} = m)$: indicator variable equal to $1$ when the $n$-th word in the $d$-th document is $m$, and $0$ otherwise.
To summarize, there is only one kind of random variable, $w_{nd}$, which is observed and has a Categorical distribution with parameter $\boldsymbol\beta$. We can fit this model using maximum likelihood estimation (MLE) directly, as we’ve shown before, that is
$$ \beta_m = \frac{c_m}{N} $$
where $c_m$ is the number of occurrences of the $m$-th word and $N$ is the total number of word occurrences in all documents. It is important to distinguish between $M$ and $N$: $M$ is the number of words in the dictionary (that is, unique words), while $N$ is the sum of all the counts. Specifically:
$$ \begin{align} N &= \sum_{d=1}^D N_d \\\\ c_m &= \sum_{d=1}^D \sum_{n=1}^{N_d} I(w_{nd} = m) \end{align} $$
In this case $c_m$ sums over documents, and within each document it counts all the occurrences of the $m$-th word, that is, it iterates over every position in that document and adds $1$ for each occurrence of $m$.
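To make the counting concrete, here is a minimal numpy sketch of this MLE. It assumes the documents are already given as lists of integer word ids; the toy corpus and dictionary size are made up purely for illustration.

```python
import numpy as np

# Toy corpus: each document is a list of word ids into a dictionary of size M.
# Both the documents and M are illustrative assumptions.
docs = [[0, 1, 1, 3], [2, 2, 0, 1, 4]]
M = 5

# c_m: number of occurrences of each word across all documents.
counts = np.zeros(M)
for doc in docs:
    for w in doc:
        counts[w] += 1

# N: total number of word occurrences; the MLE is beta_m = c_m / N.
N = counts.sum()
beta = counts / N
print(beta)  # approximately [0.22, 0.33, 0.22, 0.11, 0.11]
```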
This simple model has downsides though. Sharing one $\boldsymbol\beta$ between all documents means that all documents have the same distribution over words. What we would like instead is to allow each document to be about a different topic, and to have a different distribution over words for each topic.
Mixture of Categoricals
We’ll modify our model to allow it to have multiple topics, each of which will have its own Categorical distribution over words. We introduce a second random variable $z_d \sim Cat(\boldsymbol\theta)$ where $\boldsymbol\theta$ is a vector of topic probabilities.
- $z_d$: assigns document $d$ to one of the $K$ categories.
- $\theta_k = p(z_d = k)$: probability that document $d$ is assigned to category $k$.
- $w_{nd} | z_d \sim Cat(\boldsymbol\beta_{z_d})$: distribution over words at the $n$-th position in document $d$ given that we have observed its topic $z_d$.
A small note: the distribution of $w_{nd}$ now depends on the document $d$ (through its topic), while in the previous model it was the same for all documents. We can write this model in its generative process specification as follows:
$$ \begin{align} z_d &\sim Cat(\boldsymbol\theta) \\\\ w_{nd} | z_d &\sim Cat(\boldsymbol\beta_{z_d}) \end{align} $$
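To make the generative story concrete, here is a small sketch that samples documents from this mixture; the values of $\boldsymbol\theta$, $\boldsymbol\beta$ and the document lengths are hand-picked assumptions, not fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: K = 2 topics, M = 4 dictionary words, D = 3 documents.
theta = np.array([0.7, 0.3])                    # topic probabilities
beta = np.array([[0.5, 0.3, 0.1, 0.1],          # word probabilities for topic 0
                 [0.1, 0.1, 0.4, 0.4]])         # word probabilities for topic 1
N_d = [5, 3, 6]                                 # number of words per document

docs, topics = [], []
for n in N_d:
    z_d = rng.choice(len(theta), p=theta)                  # z_d ~ Cat(theta)
    w_d = rng.choice(beta.shape[1], size=n, p=beta[z_d])   # w_nd ~ Cat(beta_{z_d})
    topics.append(z_d)
    docs.append(w_d.tolist())
```

Note that every word in a given document is drawn from the same $\boldsymbol\beta_{z_d}$; relaxing this restriction is exactly what LDA will do later.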
In order to allow our model to capture different topics, we introduced a set of latent (hidden) variables $z_d$. Now the problem becomes, how do we perform maximum likelihood estimation with these hidden variables? Let us write out the likelihood
$$ \begin{align} p(\boldsymbol w | \boldsymbol\theta, \boldsymbol\beta) &= \prod_{d=1}^D p(\boldsymbol w_d | \boldsymbol\theta, \boldsymbol\beta) \qquad\text{expanding the marginal}\\\\ &= \prod_{d=1}^D \sum_{k=1}^K p(\boldsymbol w_d, z_d = k | \boldsymbol\theta, \boldsymbol\beta) \\\\ &= \prod_{d=1}^D \sum_{k=1}^K p(\boldsymbol w_d | z_d = k, \boldsymbol\beta) p(z_d = k | \boldsymbol\theta) \\\\ &= \prod_{d=1}^D \sum_{k=1}^K \left( \prod_{n=1}^{N_d} p(w_{nd} | z_d = k, \boldsymbol\beta) \right) p(z_d = k | \boldsymbol\theta) \\\\ \end{align} $$
We cannot easily optimize through the marginalization, and even if we were to take the log likelihood
$$ \log p(\boldsymbol w | \boldsymbol\theta, \boldsymbol\beta) = \sum_{d=1}^D \log \left( \sum_{k=1}^K \left( \prod_{n=1}^{N_d} p(w_{nd} | z_d = k, \boldsymbol\beta) \right) p(z_d = k | \boldsymbol\theta) \right) $$
we run into a problem with a sum inside a log, that is $\sum \log \left(\sum \ldots \right)$. We cannot move the log inside the sum, so there is no closed-form solution in the general case. This is where the Expectation-Maximization (EM) algorithm comes in, allowing us to find a local optimum of the likelihood. But for now we’ll move on to the Bayesian approach and cover EM in more depth in a separate article.
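To see the sum-inside-the-log concretely, here is a sketch that evaluates this log-likelihood numerically, using `scipy`’s `logsumexp` for the inner sum over topics. The function and its argument layout are illustrative assumptions that mirror the notation above.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(docs, theta, beta):
    """log p(w | theta, beta) for the mixture of categoricals.

    docs  : list of documents, each a list of word ids
    theta : (K,) topic probabilities
    beta  : (K, M) per-topic word probabilities
    """
    total = 0.0
    for doc in docs:
        # log p(w_d | z_d = k, beta) = sum_n log beta[k, w_nd], computed for all k at once
        log_words = np.log(beta[:, doc]).sum(axis=1)        # shape (K,)
        # log sum_k p(z_d = k) p(w_d | z_d = k): the sum sits inside the log
        total += logsumexp(np.log(theta) + log_words)
    return total
```

The `logsumexp` call is exactly the $\log \sum_k$ term that prevents a closed-form solution.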
Bayesian Mixture of Categoricals
To move from the point estimates of EM to a fully Bayesian treatment, we’ll introduce prior distributions over the parameters $\boldsymbol\theta$ and $\boldsymbol\beta$. Since both are parameters of Categorical distributions, it is no surprise that our priors will be Dirichlet distributions.
The generative model specification then becomes
$$ \begin{align} \boldsymbol\theta &\sim Dir(\boldsymbol\alpha) \\\\ \boldsymbol\beta_k &\sim Dir(\boldsymbol\gamma) \\\\ z_d | \boldsymbol\theta &\sim Cat(\boldsymbol\theta) \\\\ w_{nd} | z_d,\boldsymbol\beta_{z_d} &\sim Cat(\boldsymbol\beta_{z_d}) \end{align} $$
where $\boldsymbol\alpha$ is the hyperparameter over topic probabilities, while $\boldsymbol\gamma$ is the hyperparameter over dictionary probabilities.
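The specification translates almost line by line into a sampler. The sketch below draws a small corpus from the Bayesian mixture; the sizes and the symmetric hyperparameter values are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and symmetric hyperparameters.
K, M, N_d = 3, 8, [5, 4, 6]
alpha = np.full(K, 1.0)    # hyperparameter over topic probabilities
gamma = np.full(M, 0.5)    # hyperparameter over dictionary probabilities

theta = rng.dirichlet(alpha)                                # theta ~ Dir(alpha)
beta = np.stack([rng.dirichlet(gamma) for _ in range(K)])   # beta_k ~ Dir(gamma) for each topic k

docs = []
for n in N_d:
    z_d = rng.choice(K, p=theta)                              # z_d | theta ~ Cat(theta)
    docs.append(rng.choice(M, size=n, p=beta[z_d]).tolist())  # w_nd | z_d ~ Cat(beta_{z_d})
```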
Now that we have our priors, we could compute a MAP estimate using EM, but we’ll instead go one step further, extend our model to Latent Dirichlet Allocation (LDA), and then cover full Bayesian inference using Gibbs sampling.
Latent Dirichlet Allocation (LDA)
One limitation of the mixture of categoricals model is that all words in each document are drawn from one specific topic. This becomes a problem when documents span more than one topic, in which case we would like to learn a mixture of those topics. We would also like the distribution of topics to vary across documents.
In LDA, each document becomes a mixture of topics, but each word is still drawn from one of those topics. We don’t need to introduce new kinds of random variables; we simply create more of what we already have. In the mixture of categoricals, $z_d$ was the topic assignment of the $d$-th document. We replace it with $z_{nd}$, the topic assignment of the $n$-th word in the $d$-th document. We also introduce $\boldsymbol\theta_d$, which is a distribution over topics for the $d$-th document.
A generative model then becomes:
$$ \begin{align} \boldsymbol\theta_d &\sim Dir(\boldsymbol\alpha) \\\\ \boldsymbol\beta_k &\sim Dir(\boldsymbol\gamma) \\\\ z_{nd} | \boldsymbol\theta_d &\sim Cat(\boldsymbol\theta_d) \\\\ w_{nd} | z_{nd}, \boldsymbol\beta_{z_{nd}} &\sim Cat(\boldsymbol\beta_{z_{nd}}) \\\\ \end{align} $$
We can view this generative process as a sequential recipe for generating a set of documents:
- For each document $d$, draw $\boldsymbol\theta_d$ from the prior distribution $Dir(\boldsymbol\alpha)$. This is the parameter for our Categorical distribution over topics.
- For each topic $k$, draw $\boldsymbol\beta_k$ from the prior $Dir(\boldsymbol\gamma)$. This is the parameter for our Categorical distribution over words in that topic.
- For each $z_{nd}$, that is each word position $n$ in document $d$, draw its topic from $Cat(\boldsymbol\theta_d)$. This allows each word in a document to be drawn from a different topic.
- Draw each word $w_{nd}$ from its corresponding $Cat(\boldsymbol\beta_{z_{nd}})$.
The critical part here is in the last step, where each word at position $n$ in document $d$ is drawn from $Cat(\boldsymbol\beta_{z_{nd}})$. The parameter $\boldsymbol\beta_{z_{nd}}$ of the distribution of $w_{nd}$ depends on $z_{nd}$, that is $p(w_{nd} | z_{nd})$. Even though each document $d$ has only a single distribution over topics $Cat(\boldsymbol\theta_d)$, we draw a topic $z_{nd}$ for each word position $n$ in the document $d$.
We can think of $z_{nd}$ as defining a distribution over words at position $n$ in document $d$. The problem with viewing it directly as a distribution is that such a view ignores the topics and doesn’t share word probabilities among different positions belonging to the same topic. That’s why we keep a separate set of random variables $\boldsymbol\beta_k$, which define the distribution over words in topic $k$, while $z_{nd}$ simply acts as an index into one of those $\boldsymbol\beta_k$.
This allows us to draw the topic of each position in each document independently, while sharing the word probabilities within a topic across documents.
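Putting the recipe above into code, the following sketch samples a toy corpus from the LDA generative process; the sizes and hyperparameter values are again made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters: K topics, M dictionary words, document lengths N_d.
K, M, N_d = 3, 10, [6, 4, 7]
alpha = np.full(K, 0.5)
gamma = np.full(M, 0.1)

# One word distribution per topic: beta_k ~ Dir(gamma).
beta = np.stack([rng.dirichlet(gamma) for _ in range(K)])

docs, assignments = [], []
for n_words in N_d:
    theta_d = rng.dirichlet(alpha)                   # theta_d ~ Dir(alpha)
    z_d = rng.choice(K, size=n_words, p=theta_d)     # z_nd | theta_d ~ Cat(theta_d), one topic per word
    w_d = [rng.choice(M, p=beta[z]) for z in z_d]    # w_nd | z_nd ~ Cat(beta_{z_nd})
    assignments.append(z_d.tolist())
    docs.append([int(w) for w in w_d])
```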
Just to summarize, we have the following count constants:
- $D$ is the number of documents.
- $N_d$ is the number of words in document $d$.
- $K$ is the number of topics.
and the following parameters and random variables:
- One hyperparameter $\boldsymbol\alpha$ for the prior distribution on topics for each document.
- One hyperparameter $\boldsymbol\gamma$ for the prior distribution on words in each topic.
- A set of $D$ (number of documents) parameters $\boldsymbol\theta_{1:D}$ for our per-document topic distribution.
- A set of $K$ parameters $\boldsymbol\beta_{1:K}$ for our per-topic word distribution.
- A set of $\sum_{d=1}^D N_d$ random variables $z_{nd}$, one for each position in each document, signifying the topic of the word at that position.
- A set of $\sum_{d=1}^D N_d$ random variables $w_{nd}$ representing the actual word at a given position in each document, drawn from the $z_{nd}$-th topic.
Inference in LDA
Similarly to the mixture of categoricals, if we wanted to compute the posterior over our parameters $\boldsymbol\beta_{1:K}$ and $\boldsymbol\theta_{1:D}$ given the observed words $w_{nd}$, we’d need to marginalize out the latent variables $z_{nd}$. Let us write out the posterior first:
$$ p(\boldsymbol\beta, \boldsymbol\theta, \boldsymbol z | \boldsymbol w, \alpha, \gamma) = \frac{p(\boldsymbol\beta, \boldsymbol\theta, \boldsymbol z, \boldsymbol w | \alpha, \gamma)}{ p(\boldsymbol w | \alpha, \gamma) } $$
where
$$ p(\boldsymbol\beta, \boldsymbol\theta, \boldsymbol z, \boldsymbol w | \gamma, \alpha) = \prod_{k=1}^K p(\boldsymbol\beta_k | \gamma) \prod_{d=1}^D \left( p(\boldsymbol\theta_d | \alpha) \prod_{n=1}^{N_d} \left( p(z_{nd} | \boldsymbol\theta_d) p(w_{nd} | \boldsymbol\beta, z_{nd}) \right) \right) $$
and the normalization constant $p(\boldsymbol w | \alpha, \gamma)$ written out as
$$ p(\boldsymbol w | \alpha, \gamma) = \int \int \sum_{\boldsymbol z} \left( \prod_{k=1}^K p(\boldsymbol\beta_k | \gamma) \right) \prod_{d=1}^D \left( p(\boldsymbol\theta_d | \alpha) \prod_{n=1}^{N_d} p(z_{nd} | \boldsymbol\theta_d)\, p(w_{nd} | \boldsymbol\beta, z_{nd}) \right) d\boldsymbol\beta\ d\boldsymbol\theta $$
is intractable, since we’d need to marginalize out the latent variables $z_{nd}$: the sum runs over all possible topic assignments $\boldsymbol z$, and if every document had $N$ words, that would mean $K^N$ configurations per document.
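Even though the normalization constant is intractable, the joint density in the numerator is cheap to evaluate for any fixed configuration of $\boldsymbol\beta$, $\boldsymbol\theta$ and $\boldsymbol z$, which is what sampling-based inference will build on. Below is a sketch of that computation; the function name and the data layout are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import dirichlet

def log_joint(w, z, theta, beta, alpha, gamma):
    """log p(beta, theta, z, w | alpha, gamma) for fixed values of everything.

    w, z  : lists of per-document arrays of word ids and topic assignments
    theta : (D, K) per-document topic distributions
    beta  : (K, M) per-topic word distributions
    alpha : (K,) and gamma : (M,) Dirichlet hyperparameters
    """
    lp = sum(dirichlet.logpdf(beta_k, gamma) for beta_k in beta)   # prod_k p(beta_k | gamma)
    for d, (w_d, z_d) in enumerate(zip(w, z)):
        lp += dirichlet.logpdf(theta[d], alpha)                    # p(theta_d | alpha)
        lp += np.log(theta[d][z_d]).sum()                          # prod_n p(z_nd | theta_d)
        lp += np.log(beta[z_d, w_d]).sum()                         # prod_n p(w_nd | beta, z_nd)
    return lp
```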
Although the posterior is intractable for exact inference, we can use many approximate inference algorithms, e.g. Markov chain Monte Carlo (MCMC) and variational inference. In the next article we’ll see how to apply Gibbs sampling to LDA.