The Beta distribution is a parametric distribution defined on the interval $[0; 1]$ with two positive shape parameters, denoted $\alpha$ and $\beta$. Probably the most common use case is using Beta as a distribution over probabilities, as in the case of the parameter of a Bernoulli random variable. Even more importantly, the Beta distribution is a conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions.

The PDF of the Beta distribution, for $x \in [0; 1]$, is defined as

$$ p(x | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} $$

where $B(\alpha, \beta)$ is the normalizing constant (the Beta function), which can be computed directly from the parameters using the gamma function (denoted $\Gamma$ and defined via the integral $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\ dt$) as follows

$$ B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}. $$

This gives us the complete form of the PDF

$$ Beta(x | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}. $$
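
To make the formula concrete, here is a minimal sketch that evaluates the density using only the Python standard library. The names `log_beta_fn` and `beta_pdf` are illustrative choices, not part of any library, and the sketch assumes $x$ lies strictly inside $(0, 1)$ so that the logarithms are defined. Working in log space through `math.lgamma` is one way to avoid overflow in the gamma function for large parameters.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """Log of the normalizing constant B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x, for 0 < x < 1 and a, b > 0."""
    if not 0.0 < x < 1.0:
        raise ValueError("this sketch only handles x strictly inside (0, 1)")
    log_density = (
        (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta_fn(a, b)
    )
    return math.exp(log_density)


# B(2, 2) = 1/6, so the density at 0.5 is 6 * 0.5 * 0.5, i.e. close to 1.5.
print(beta_pdf(0.5, 2.0, 2.0))
```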

Because of conjugacy, we rarely have to compute the normalizing constant by integration. Once we know that a posterior is a Beta distribution, we can simply read the constant off from its parameters in closed form.

As a small aside, let us compute the expectation of a Beta random variable $X \sim Beta(\alpha, \beta)$. Note that the support of the Beta distribution is $[0; 1]$, which means we’re only integrating over that interval.

$$
\begin{align}
\mu = E[X] &= \int_0^1 x p(x | \alpha, \beta)\ dx \\\\

&= \int_0^1 x \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}\ dx \\\\

&= \frac{1}{B(\alpha, \beta)}\int_0^1 x^{\alpha} (1 - x)^{\beta - 1}\ dx \\\\

\end{align}
$$

Here we make use of a simple trick. Since $B(\alpha, \beta)$ is the normalizing constant, it must hold that the integral over an unnormalized $Beta(\alpha, \beta)$ distribution is exactly $B(\alpha, \beta)$, that is

$$ \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1}\ dx = B(\alpha, \beta). $$

The integral we got in the previous expression is very similar, except that the exponent of $x$ is $\alpha$ instead of $\alpha - 1$. That is fine, because $x^{\alpha} = x^{(\alpha + 1) - 1}$, so the integral simply equals $B(\alpha + 1, \beta)$. We can plug this back in and get

$$
\begin{align}
\mu &= \frac{B(\alpha + 1, \beta)}{B(\alpha, \beta)} \\\\

&= \frac{\Gamma(\alpha + 1)\Gamma(\beta)}{\Gamma(\alpha + 1 + \beta)}
\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \\\\

&= \frac{\alpha \Gamma(\alpha)\Gamma(\beta)}{(\alpha + \beta)\Gamma(\alpha +
\beta)} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \\\\

&= \frac{\alpha}{\alpha + \beta} \\\\

\end{align}
$$

using the identity $\Gamma(x + 1) = x \Gamma(x)$.
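
As a quick sanity check of the derivation, the sketch below compares three ways of getting the mean: the closed form $\alpha / (\alpha + \beta)$, the ratio $B(\alpha + 1, \beta) / B(\alpha, \beta)$, and a plain midpoint-rule approximation of the original integral. The function names and the example values $\alpha = 2$, $\beta = 5$ are assumptions made for illustration; the numerical integration is only meant for parameters of at least 1, where the integrand stays bounded.

```python
import math


def log_beta_fn(a, b):
    """Log of B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def mean_from_beta_ratio(a, b):
    """E[X] as B(a + 1, b) / B(a, b), mirroring the derivation."""
    return math.exp(log_beta_fn(a + 1, b) - log_beta_fn(a, b))


def mean_by_midpoint_rule(a, b, n=10_000):
    """E[X] by numerically integrating x * p(x | a, b) over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h  # midpoints avoid the endpoints 0 and 1
        total += x * x ** (a - 1) * (1 - x) ** (b - 1)
    return total * h / math.exp(log_beta_fn(a, b))


a, b = 2.0, 5.0
# All three values should agree up to small numerical error.
print(a / (a + b), mean_from_beta_ratio(a, b), mean_by_midpoint_rule(a, b))
```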

## Beta-Bernoulli model

Let us now show a simple example where we make use of the conjugacy between Beta and Bernoulli distributions.

Consider a random variable representing the outcome of a single coin toss, which has a Bernoulli distribution with a parameter $\theta$ (probability of heads). Before we observe the coin toss, we might have some prior belief about the fairness of the coin. Let us set the prior belief as if we’ve seen 1 head and 1 tail before tossing the coin, that is $Beta(1, 1)$.

Because Bayesian inference models uncertainty directly, this does not mean that we believe the coin is fair, even though the maximum likelihood estimate of $\theta$ for these two coin tosses would be $0.5$. In fact, $Beta(1, 1)$ is the uniform distribution on $[0; 1]$, so every value of $\theta$ is equally plausible a priori. We are, however, interested in computing the full posterior over $\theta$, that is $p(\theta | D)$ where $D$ is our observed data. Using Bayes’ theorem we get

$$ p(\theta | D) = \frac{p(\theta | \alpha, \beta) p(D | \theta)}{p(D)}. $$

Since the Beta distribution is a conjugate prior for the Bernoulli distribution, and our prior is Beta and our likelihood is Bernoulli, the posterior must be a Beta distribution as well. We can thus omit the normalizing constant $p(D)$ while multiplying the prior by the likelihood, and recover it afterwards from the parameters of the resulting Beta distribution.

Let’s say we toss the coin once and observe heads. We can write the likelihood

$$ p(D | \theta) = \theta $$

and putting this together with the prior

$$ p(\theta | \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} $$

we can compute the posterior

$$ p(\theta | D) \propto \theta\theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{(\alpha - 1) + 1} (1 - \theta)^{\beta - 1} \propto Beta(\theta | \alpha + 1, \beta). $$

As you can see, multiplying the likelihood and the prior again gives a function with exactly the same shape as a Beta density. We can thus infer the normalizing constant to be $B(\alpha + 1, \beta)$ and write our full posterior in closed form

$$ p(\theta | D) = \frac{1}{B(\alpha + 1, \beta)} \theta^{\alpha} (1 - \theta)^{\beta - 1}. $$

If we observed tails, the likelihood would be $p(D | \theta) = 1 - \theta$ since $\theta$ is the probability of heads. Plugging this back into the previous formula we can easily see that the resulting distribution would be $Beta(\alpha, \beta + 1)$.
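
The conjugate update can also be checked numerically. The sketch below multiplies the prior by the likelihood, normalizes by a midpoint-rule estimate of $p(D)$, and compares the result with the closed-form $Beta(\alpha + 1, \beta)$ or $Beta(\alpha, \beta + 1)$ density. All function names and the evaluation point $\theta = 0.7$ are assumptions chosen for illustration, and the brute-force normalization is only there to confirm what conjugacy gives us for free.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(
        (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_norm
    )


def bernoulli_likelihood(theta, heads):
    """p(D | theta) for a single toss: theta for heads, 1 - theta for tails."""
    return theta if heads else 1.0 - theta


def posterior_by_normalizing(theta, a, b, heads, n=10_000):
    """Prior times likelihood, divided by a midpoint-rule estimate of p(D)."""
    h = 1.0 / n
    evidence = h * sum(
        beta_pdf((i + 0.5) * h, a, b) * bernoulli_likelihood((i + 0.5) * h, heads)
        for i in range(n)
    )
    return beta_pdf(theta, a, b) * bernoulli_likelihood(theta, heads) / evidence


a, b, theta = 1.0, 1.0, 0.7
# Each pair should agree up to small numerical error.
print(posterior_by_normalizing(theta, a, b, heads=True), beta_pdf(theta, a + 1, b))
print(posterior_by_normalizing(theta, a, b, heads=False), beta_pdf(theta, a, b + 1))
```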

The Beta distribution basically acts as a *counter*. Every newly observed
coin toss increments one of its parameters, turning the prior into the posterior,
which can then serve as the prior for the next coin toss, with our belief updated.
This is a simple example of how Bayesian models can be updated on-line as new data
comes in.
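
The counting behaviour can be sketched in a few lines. The helper names `update` and `posterior_after`, as well as the toss sequence, are hypothetical and chosen only for illustration; the prior defaults to the $Beta(1, 1)$ used in this section.

```python
def update(alpha, beta, heads):
    """One conjugate update: heads increments alpha, tails increments beta."""
    return (alpha + 1, beta) if heads else (alpha, beta + 1)


def posterior_after(tosses, alpha=1, beta=1):
    """Fold a sequence of tosses (True = heads) into a Beta(alpha, beta) prior."""
    for heads in tosses:
        # The posterior after this toss becomes the prior for the next one.
        alpha, beta = update(alpha, beta, heads)
    return alpha, beta


tosses = [True, True, False, True]  # hypothetical data: 3 heads, 1 tail
alpha, beta = posterior_after(tosses)
print(alpha, beta)             # 4 2
print(alpha / (alpha + beta))  # posterior mean alpha / (alpha + beta), here 4/6
```

Because each update only touches the two parameters, the order of the tosses does not matter and the data never has to be stored.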