Eigenvalues and eigenvectors of a matrix $\boldsymbol A$ tell us a lot about the matrix. Conversely, if we know that $\boldsymbol A$ is special in some way (say, symmetric), that tells us something about what its eigenvalues and eigenvectors look like.
Let us begin with a definition. Given a matrix $\boldsymbol A$, the vector $x$ is an eigenvector of $\boldsymbol A$ and has a corresponding eigenvalue $\lambda$, if
Read More →
Now that we've worked through the Dirichlet-Categorical model in quite a bit of detail, we can move on to document modeling.
Let us begin with a very simple document model in which we consider only a single distribution over words across all documents. We have the following variables:
$N_d$: number of words in the $d$-th document. $D$: number of documents. $M$: number of words in the dictionary. $\boldsymbol\beta = (\beta_1,\ldots,\beta_M)$: probabilities of each word.
Read More →
In the previous article we derived a maximum likelihood estimate (MLE) for the parameters of a Multinomial distribution. This time we're going to compute the full posterior of the Dirichlet-Categorical model as well as derive the posterior predictive distribution. This will close our exploration of the Bag of Words model.
Likelihood
As in the previous article, our likelihood is defined by a Multinomial distribution, that is
$$ p(D|\boldsymbol\pi) \propto \prod_{i=1}^m \pi_i^{x_i}. $$
Read More →
In this short article we'll derive the maximum likelihood estimate (MLE) of the parameters of a Multinomial distribution. If you need a refresher on the Multinomial distribution, check out the previous article.
Let us begin by repeating the definition of a Multinomial random variable. Consider the bag of words model, where we count the number of words in a document and the words are generated from a fixed dictionary. The probability mass function (PMF) is defined as
Read More →
In the previous article we looked at the Beta-Bernoulli model. This time we'll extend it to a model with multiple possible outcomes. We'll also take a look at the Dirichlet, Categorical and Multinomial distributions.
After this, we'll be quite close to implementing interesting models such as Latent Dirichlet Allocation (LDA). But for now, we have to understand the basics first.
Multinomial coefficients
Before we can dive into the Dirichlet-Categorical model, we have to briefly look at the multinomial coefficient, which is the generalization of the binomial coefficient.
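As a quick illustration of the multinomial coefficient, here is a minimal Python sketch. It assumes the standard definition $n! / (k_1! \cdots k_m!)$ with $n = k_1 + \cdots + k_m$; the function name `multinomial_coefficient` is just a hypothetical helper for this example.

```python
from math import comb, factorial


def multinomial_coefficient(counts):
    """Return n! / (k_1! * ... * k_m!) where n = sum(counts).

    `counts` is a sequence of non-negative integers (the group sizes).
    """
    total = sum(counts)
    result = factorial(total)
    for k in counts:
        # Each intermediate quotient is still an integer, so // is exact.
        result //= factorial(k)
    return result


# With only two groups this reduces to the binomial coefficient.
two_groups = multinomial_coefficient([3, 2])
binomial = comb(5, 3)

# Arrangements of 4 items split into groups of sizes 2, 1 and 1.
three_groups = multinomial_coefficient([2, 1, 1])
```

With two groups the result matches `math.comb`, which is the sense in which the multinomial coefficient generalizes the binomial one.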
Read More →
The Beta distribution is a parametric distribution defined on the interval $[0; 1]$ with two positive shape parameters, denoted $\alpha$ and $\beta$. Probably the most common use case is using Beta as a distribution over probabilities, as in the case of the parameter of a Bernoulli random variable. Even more importantly, the Beta distribution is a conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions.
The PDF of the Beta distribution, for $x \in [0; 1]$ is defined as
Read More →
The Gaussian distribution has many interesting properties, many of which make it useful across a wide range of applications. Before moving further, let us define the univariate PDF with mean $\mu$ and variance $\sigma^2$
$$ \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x - \mu)^2}{2 \sigma^2} \right). $$
In the general multi-dimensional case, the mean becomes a mean vector, and the variance turns into a $D \times D$ covariance matrix.
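The univariate density above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `gaussian_pdf` and its argument names are assumptions made for this example, and `sigma2` is the variance $\sigma^2$, not the standard deviation.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Evaluate N(x | mu, sigma2) for a univariate Gaussian.

    sigma2 is the variance and must be strictly positive.
    """
    if sigma2 <= 0:
        raise ValueError("variance must be positive")
    normalizer = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma2)
    return normalizer * math.exp(exponent)


# Density of the standard normal at its mean, i.e. 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)
```

The density is symmetric around `mu`, so `gaussian_pdf(mu + a, mu, sigma2)` and `gaussian_pdf(mu - a, mu, sigma2)` agree for any `a`.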
Read More →
$$ \newcommand{\bigci}{\perp\mkern-10mu\perp} $$
This article is a brief overview of conditional independence in graphical models, and the related d-separation. Let us begin with a definition.
For three random variables $X$, $Y$ and $Z$, we say $X$ is conditionally independent of $Y$ given $Z$ iff
$$ p(X, Y | Z) = p(X | Z) p(Y | Z). $$
We can use a shorthand notation
$$ X \bigci Y | Z $$
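For discrete variables, the definition can be checked numerically. The sketch below is only an illustration: the helper `is_conditionally_independent`, the dictionary representation of the joint distribution, and the example probabilities are all assumptions made for this example.

```python
from itertools import product


def is_conditionally_independent(joint, tol=1e-12):
    """Check p(x, y | z) == p(x | z) * p(y | z) for every x, y, z.

    `joint` maps (x, y, z) tuples to probabilities p(x, y, z).
    """
    xs = sorted({k[0] for k in joint})
    ys = sorted({k[1] for k in joint})
    zs = sorted({k[2] for k in joint})
    for z in zs:
        p_z = sum(joint.get((x, y, z), 0.0) for x, y in product(xs, ys))
        if p_z == 0.0:
            continue  # conditioning on a zero-probability event is undefined
        for x, y in product(xs, ys):
            p_xy_z = joint.get((x, y, z), 0.0) / p_z
            p_x_z = sum(joint.get((x, y2, z), 0.0) for y2 in ys) / p_z
            p_y_z = sum(joint.get((x2, y, z), 0.0) for x2 in xs) / p_z
            if abs(p_xy_z - p_x_z * p_y_z) > tol:
                return False
    return True


# A joint built as p(z) p(x | z) p(y | z), so X and Y are
# conditionally independent given Z by construction.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}
joint_ci = {
    (x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
    for x, y, z in product([0, 1], repeat=3)
}

# A joint where X always equals Y, so they are dependent given Z.
joint_dep = {(0, 0, 0): 0.5, (1, 1, 0): 0.5}
```

Here `is_conditionally_independent(joint_ci)` holds by construction, while `joint_dep` fails the check because knowing $X$ fully determines $Y$.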
Read More →
This post describes two approaches for deriving the Evidence Lower Bound (ELBO) used in variational inference. Let us begin with a little bit of motivation.
Consider a probabilistic model where we are interested in maximizing the marginal likelihood $p(X)$, for which direct optimization is difficult, but optimizing the complete-data likelihood $p(X, Z)$ is significantly easier.
In a Bayesian setting, we condition on the data $X$ and compute the posterior distribution $p(Z | X)$ over the latent variables given our observed data.
Read More →
Before we begin, let me just define a few terms:
$S_t$ is the state at time $t$. $A_t$ is the action performed at time $t$. $R_t$ is the reward received at time $t$. $G_t$ is the return, that is the sum of discounted rewards received from time $t$ onwards, defined as $G_t = \sum_{i=0}^\infty \gamma^i R_{t+i+1}$. $V^\pi(s)$ is the value of a state when following a policy $\pi$, that is the expected return when starting in state $s$ and following a policy $\pi$, defined as $V^\pi(s) = E[G_t | S_t = s]$.
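To make the definition of the return $G_t$ concrete, here is a minimal Python sketch. It assumes a finite list of rewards, which truncates the infinite sum (equivalently, all later rewards are treated as zero); the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma**i * rewards[i] for a finite reward sequence.

    rewards[i] plays the role of R_{t+i+1}, and gamma is the discount factor.
    """
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

The backward loop uses the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which follows directly from the definition of $G_t$ and avoids computing powers of $\gamma$ explicitly.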
Read More →