Beta Distribution and the Beta-Bernoulli Model

5 min read • Published: December 01, 2018

The Beta distribution is a parametric distribution defined on the interval $[0; 1]$ with two positive shape parameters, denoted $\alpha$ and $\beta$. Probably the most common use case is using Beta as a distribution over probabilities, as in the case of the parameter of a Bernoulli random variable. Even more importantly, the Beta distribution is a conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions.

The PDF of the Beta distribution, for $x \in [0; 1]$, is defined as

$$p(x | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$

where $B(\alpha, \beta)$ is the normalizing constant, which can be computed directly from the parameters using the gamma function (denoted $\Gamma$ and defined via the integral $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\ dx$) as follows

$$B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$

This gives us the complete form of the PDF

$$Beta(x | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}.$$

Because of the conjugacy, we rarely have to worry about the normalizing constant and can simply compute it in closed form.
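
To make the definition concrete, here is a minimal sketch that evaluates the PDF directly from the gamma-function form above and compares it against `scipy.stats.beta`. It assumes SciPy is available; the parameter values are arbitrary examples.

```python
from math import gamma

from scipy import stats


def beta_pdf(x, alpha, beta):
    """Beta PDF computed directly from the gamma-function form of B(alpha, beta)."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * x ** (alpha - 1) * (1 - x) ** (beta - 1)


# Arbitrary example parameters; both evaluations should agree.
alpha, beta = 2.0, 5.0
for x in (0.1, 0.5, 0.9):
    print(beta_pdf(x, alpha, beta), stats.beta.pdf(x, alpha, beta))
```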

As a small aside, let us compute the expectation of a Beta random variable $X \sim Beta(\alpha, \beta)$. Note that the support of the Beta distribution is $[0; 1]$, which means we’re only integrating over that interval.

$$\begin{aligned}
\mu = E[X] &= \int_0^1 x \, p(x | \alpha, \beta)\ dx \\
&= \int_0^1 x \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}\ dx \\
&= \frac{1}{B(\alpha, \beta)} \int_0^1 x^{\alpha} (1 - x)^{\beta - 1}\ dx
\end{aligned}$$

Here we make use of a simple trick. Since $B(\alpha, \beta)$ is the normalizing constant, it must hold that the integral over an unnormalized $Beta(\alpha, \beta)$ distribution is exactly $B(\alpha, \beta)$, that is

$$\int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1}\ dx = B(\alpha, \beta).$$

If we look at the integral we got in the previous expression, it is very similar, except it has $\alpha$ in the exponent instead of $\alpha - 1$. But that is ok, it simply corresponds to $B(\alpha + 1, \beta)$. We can plug this back in and get

$$\begin{aligned}
\mu &= \frac{B(\alpha + 1, \beta)}{B(\alpha, \beta)} \\
&= \frac{\Gamma(\alpha + 1)\Gamma(\beta)}{\Gamma(\alpha + 1 + \beta)} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \\
&= \frac{\alpha \Gamma(\alpha)\Gamma(\beta)}{(\alpha + \beta)\Gamma(\alpha + \beta)} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \\
&= \frac{\alpha}{\alpha + \beta}
\end{aligned}$$

using the identity $\Gamma(x + 1) = x \Gamma(x)$.
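
As a quick sanity check of the two identities used here (the value of the normalizing integral and the resulting mean), the following sketch evaluates both numerically. It assumes SciPy is available and uses arbitrary example parameters.

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn

alpha, beta = 3.0, 7.0  # arbitrary example parameters

# The integral of the unnormalized density over [0, 1] should equal B(alpha, beta).
unnormalized, _ = quad(lambda x: x ** (alpha - 1) * (1 - x) ** (beta - 1), 0, 1)
print(unnormalized, beta_fn(alpha, beta))

# Integrating x * p(x | alpha, beta) should give alpha / (alpha + beta).
mean, _ = quad(lambda x: x * x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_fn(alpha, beta), 0, 1)
print(mean, alpha / (alpha + beta))
```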

Beta-Bernoulli model

Let us now show a simple example where we make use of the conjugacy between Beta and Bernoulli distributions.

Consider a random variable representing the outcome of a single coin toss, which has a Bernoulli distribution with a parameter $\theta$ (the probability of heads). Before we observe the coin toss, we might have some prior belief about the fairness of the coin. Let us set the prior belief as if we’ve seen 1 head and 1 tail before tossing the coin, that is $Beta(1, 1)$.

Because Bayesian inference models uncertainty directly, this does not mean that we believe the coin is fair, even though the maximum likelihood estimate of $\theta$ for these two coin tosses would be $0.5$. We are, however, interested in computing the full posterior over $\theta$, that is $p(\theta | D)$ where $D$ is our observed data. Using Bayes' theorem we get

$$p(\theta | D) = \frac{p(\theta | \alpha, \beta)\, p(D | \theta)}{p(D)}.$$

Now, knowing that the Beta distribution is a conjugate prior for the Bernoulli distribution, and given that our prior is Beta and our likelihood is Bernoulli, we know that our posterior must be a Beta distribution as well. We can thus omit the normalizing constant $p(D)$, since we can infer it from the parameters we obtain by multiplying the prior by the likelihood.

Let’s say we toss the coin once and observe heads. We can write the likelihood

$$p(D | \theta) = \theta$$

and putting this together with the prior

$$p(\theta | \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

we can compute the posterior

$$p(\theta | D) \propto \theta \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{(\alpha - 1) + 1} (1 - \theta)^{\beta - 1} \propto Beta(\theta | \alpha + 1, \beta).$$

As you can see, multiplying the likelihood and the prior again gives a distribution with exactly the same shape as a Beta distribution. We can thus infer the normalizing constant to be $B(\alpha + 1, \beta)$ and write our full posterior in closed form

$$p(\theta | D) = \frac{1}{B(\alpha + 1, \beta)} \theta^{\alpha} (1 - \theta)^{\beta - 1}.$$

If we observed tails, the likelihood would be $p(D | \theta) = 1 - \theta$, since $\theta$ is the probability of heads. Plugging this back into the previous formula, we can easily see that the resulting distribution would be $Beta(\alpha, \beta + 1)$.
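
Numerically, this update is easy to verify. The sketch below (assuming SciPy, with an arbitrarily chosen evaluation point $\theta$) normalizes the product of the prior and the likelihood by hand and compares it to the conjugate-update result, for both a heads and a tails observation.

```python
from scipy import stats
from scipy.special import beta as beta_fn

alpha, beta = 1.0, 1.0  # prior pseudo-counts: 1 head, 1 tail
theta = 0.3             # an arbitrary point in [0, 1] at which to compare densities

# Heads: likelihood is theta, so the normalized product should equal Beta(theta | alpha + 1, beta).
unnorm_heads = theta * theta ** (alpha - 1) * (1 - theta) ** (beta - 1)
print(unnorm_heads / beta_fn(alpha + 1, beta), stats.beta.pdf(theta, alpha + 1, beta))

# Tails: likelihood is 1 - theta, giving Beta(theta | alpha, beta + 1).
unnorm_tails = (1 - theta) * theta ** (alpha - 1) * (1 - theta) ** (beta - 1)
print(unnorm_tails / beta_fn(alpha, beta + 1), stats.beta.pdf(theta, alpha, beta + 1))
```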

The Beta distribution basically acts as a counter. Every newly observed coin toss gets added to our existing prior belief to compute the posterior, which can then become the prior for the next coin toss, with our belief updated. This is a simple example of how Bayesian models can be updated on-line as new data comes in.
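
Here is a minimal sketch of that on-line updating loop, assuming NumPy is available; the true coin bias and the number of tosses are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7        # hypothetical true probability of heads
alpha, beta = 1.0, 1.0  # prior: the "1 head and 1 tail" belief from above

for toss in rng.binomial(1, true_theta, size=100):
    # Each heads (1) increments alpha, each tails (0) increments beta;
    # the posterior after a toss becomes the prior for the next one.
    alpha += toss
    beta += 1 - toss

print("posterior:", f"Beta({alpha}, {beta})")
print("posterior mean:", alpha / (alpha + beta))  # should be close to true_theta
```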


