The Beta distribution is a parametric distribution defined on the interval $[0, 1]$ with two positive shape parameters, denoted $\alpha$ and $\beta$. Probably the most common use case is using Beta as a distribution over probabilities, as in the case of the parameter of a Bernoulli random variable. Even more importantly, the Beta distribution is a conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions.
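The PDF of the Beta distribution, for $x \in [0, 1]$, is defined as

$$p(x \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$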
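where $B(\alpha, \beta)$ is the normalizing constant, which can be directly computed from the parameters using the gamma function (denoted $\Gamma$ and defined via an integral $\Gamma(z) = \int_0^\infty t^{z - 1} e^{-t} \, dt$) as follows

$$B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$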
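This gives us the complete form of the PDF

$$p(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$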
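As a quick sanity check, the PDF can be sketched in a few lines of Python using only the standard library. The function names `beta_function`, `beta_pdf` and `integrate` are made up for this illustration, and the example parameters $\alpha = 2$, $\beta = 3$ are arbitrary. The sketch assumes $\alpha, \beta \geq 1$ or evaluation at interior points, since for smaller parameters the density diverges at the endpoints.

```python
import math


def beta_function(alpha, beta):
    """Normalizing constant B(alpha, beta), computed from the gamma function."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)


def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at x; zero outside the support [0, 1]."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_function(alpha, beta)


def integrate(f, n=10_000):
    """Midpoint rule on [0, 1]; it never evaluates f at the endpoints."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h


# A properly normalized density should integrate to (approximately) one.
total = integrate(lambda x: beta_pdf(x, 2.0, 3.0))
print(total)
```

For large parameters `math.gamma` overflows, so a more careful version would probably work with `math.lgamma` in log space instead.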
Because of the conjugacy, we rarely have to worry about the normalizing constant and can simply compute it in closed form.
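As a small aside, let us compute the expectation of a Beta random variable $X \sim \text{Beta}(\alpha, \beta)$. Note that the support of the Beta distribution is $[0, 1]$, which means we’re only integrating over that interval.

$$\mathbb{E}[X] = \int_0^1 x \, p(x \mid \alpha, \beta) \, dx = \int_0^1 x \, \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx = \frac{1}{B(\alpha, \beta)} \int_0^1 x^{\alpha} (1 - x)^{\beta - 1} \, dx$$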
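Here we make use of a simple trick. Since $B(\alpha, \beta)$ is the normalizing constant, it must hold that the integral of the unnormalized density is exactly $B(\alpha, \beta)$, that is

$$B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx$$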
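If we look at the integral we got in the previous expression, it is very similar, except that it contains $x^{\alpha}$ instead of $x^{\alpha - 1}$. But that is ok, it simply corresponds to $B(\alpha + 1, \beta)$. We can plug this back in and get

$$\mathbb{E}[X] = \frac{B(\alpha + 1, \beta)}{B(\alpha, \beta)} = \frac{\Gamma(\alpha + 1) \Gamma(\beta)}{\Gamma(\alpha + \beta + 1)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} = \frac{\alpha \, \Gamma(\alpha)}{(\alpha + \beta) \, \Gamma(\alpha + \beta)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)} = \frac{\alpha}{\alpha + \beta}$$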
using the identity $\Gamma(z + 1) = z \, \Gamma(z)$.
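The result $\mathbb{E}[X] = \alpha / (\alpha + \beta)$ is easy to check numerically. The following is a minimal sketch with illustrative function names and arbitrary example parameters $\alpha = 2$, $\beta = 3$. It computes the mean in three ways: by direct numerical integration, as the ratio $B(\alpha + 1, \beta) / B(\alpha, \beta)$, and with the closed form.

```python
import math


def beta_function(a, b):
    """B(a, b) via log-gamma, which is numerically safer than gamma itself."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def beta_mean_numeric(a, b, n=10_000):
    """Midpoint-rule estimate of (1 / B(a, b)) * integral of x^a (1 - x)^(b - 1)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x ** a * (1 - x) ** (b - 1)
    return total * h / beta_function(a, b)


def beta_mean_ratio(a, b):
    """The same expectation written as B(a + 1, b) / B(a, b)."""
    return beta_function(a + 1, b) / beta_function(a, b)


def beta_mean_closed_form(a, b):
    """Closed form obtained from the identity Gamma(z + 1) = z * Gamma(z)."""
    return a / (a + b)


a, b = 2.0, 3.0
print(beta_mean_numeric(a, b), beta_mean_ratio(a, b), beta_mean_closed_form(a, b))
```

All three values should agree up to numerical error, which for these parameters means they are all very close to $0.4$.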
Let us now show a simple example where we make use of the conjugacy between Beta and Bernoulli distributions.
Consider a random variable $X$ representing the outcome of a single coin toss, which has a Bernoulli distribution with a parameter $\theta$ (probability of heads). Before we observe the coin toss, we might have some prior belief about the fairness of the coin. Let us set the prior belief as if we’ve seen 1 head and 1 tail before tossing the coin, that is $p(\theta) = \text{Beta}(\theta \mid 1, 1)$.
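Because Bayesian inference models uncertainty directly, this does not mean that we believe the coin is fair, even though the maximum likelihood estimate of $\theta$ for these two coin tosses would be $0.5$. We are however interested in computing the full posterior over $\theta$, that is $p(\theta \mid \mathcal{D})$, where $\mathcal{D}$ is our observed data. Using Bayes' theorem we get

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}$$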
Now knowing that the Beta distribution is a conjugate prior for the Bernoulli distribution, and given that our prior is Beta and our likelihood is Bernoulli, we know that our posterior must be a Beta distribution as well. We can thus ignore the normalizing constant while multiplying the prior by the likelihood, and infer it afterwards from the parameters of the resulting Beta distribution.
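Let’s say we toss the coin once and observe heads. We can write the likelihood

$$p(X = 1 \mid \theta) = \theta$$

and putting this together with the prior

$$p(\theta) = \text{Beta}(\theta \mid 1, 1) = \frac{1}{B(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1}$$

we can compute the posterior

$$p(\theta \mid X = 1) \propto p(X = 1 \mid \theta) \, p(\theta) \propto \theta \cdot \theta^{1 - 1} (1 - \theta)^{1 - 1} = \theta^{2 - 1} (1 - \theta)^{1 - 1}$$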
As you can see, multiplying the likelihood and the prior again gives a function which has exactly the same shape as a Beta density. We can thus infer back the normalizing constant to be $B(2, 1)$ and write our full posterior in closed form

$$p(\theta \mid X = 1) = \frac{1}{B(2, 1)} \theta^{2 - 1} (1 - \theta)^{1 - 1} = \text{Beta}(\theta \mid 2, 1)$$
If we observed tails, the likelihood would be $p(X = 0 \mid \theta) = 1 - \theta$, since $\theta$ is the probability of heads. Plugging this back into the previous formula we can easily see that the resulting distribution would be $\text{Beta}(\theta \mid 1, 2)$.
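The conjugate update can also be checked by brute force. The sketch below assumes the prior is $\text{Beta}(1, 1)$, that is, one pseudo-count for heads and one for tails. It multiplies the likelihood by the prior on a grid, normalizes numerically, and compares the result with the closed-form Beta posterior. All function names are illustrative.

```python
import math


def beta_function(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_pdf(theta, a, b):
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_function(a, b)


def bernoulli_likelihood(x, theta):
    """p(X = x | theta), where x = 1 means heads and x = 0 means tails."""
    return theta if x == 1 else 1 - theta


def grid_posterior(x, a, b, n=1000):
    """Likelihood times prior on a midpoint grid, normalized numerically."""
    h = 1.0 / n
    grid = [(i + 0.5) * h for i in range(n)]
    unnormalized = [bernoulli_likelihood(x, t) * beta_pdf(t, a, b) for t in grid]
    evidence = sum(unnormalized) * h  # numerical estimate of p(X = x)
    return grid, [u / evidence for u in unnormalized]


def max_gap(x, a, b):
    """Largest pointwise gap between the grid posterior and Beta(a + x, b + 1 - x)."""
    grid, numeric = grid_posterior(x, a, b)
    return max(abs(p - beta_pdf(t, a + x, b + 1 - x)) for t, p in zip(grid, numeric))


gap_heads = max_gap(1, 1, 1)  # compare against Beta(2, 1)
gap_tails = max_gap(0, 1, 1)  # compare against Beta(1, 2)
print(gap_heads, gap_tails)
```

Both gaps should be tiny, which is just the numerical counterpart of reading the normalizing constant off the posterior's parameters.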
The Beta distribution basically acts as a counter. Each newly observed coin toss is added to our existing prior belief to give the posterior, which then becomes the prior for the next coin toss, with our belief updated. This is a simple example of how Bayesian models can be updated on-line as new data comes in.
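In terms of the parameters, observing a toss $x \in \{0, 1\}$ maps $(\alpha, \beta)$ to $(\alpha + x, \beta + 1 - x)$. Here is a minimal sketch of this on-line update, assuming a $\text{Beta}(1, 1)$ prior and a made-up sequence of tosses. The names `update` and `posterior_mean` are only for illustration.

```python
def update(alpha, beta, toss):
    """One Beta-Bernoulli update: heads (1) increments alpha, tails (0) increments beta."""
    if toss not in (0, 1):
        raise ValueError("toss must be 0 (tails) or 1 (heads)")
    return alpha + toss, beta + (1 - toss)


def posterior_mean(alpha, beta):
    """Mean of Beta(alpha, beta), i.e. alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


tosses = [1, 0, 1, 1]  # hypothetical data: heads, tails, heads, heads
alpha, beta = 1, 1     # assumed prior: Beta(1, 1)

for toss in tosses:
    # The posterior after this toss becomes the prior for the next one.
    alpha, beta = update(alpha, beta, toss)
    print(f"toss={toss} -> Beta({alpha}, {beta}), mean={posterior_mean(alpha, beta):.3f}")

# Processing all tosses at once gives the same parameters as the sequential loop.
batch = (1 + sum(tosses), 1 + len(tosses) - sum(tosses))
print(batch == (alpha, beta))
```

Since the update only adds counts, the order in which the tosses arrive does not matter, and the data can be processed one toss at a time or all at once.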