Conjugate priors, Bayesian updating, and decision theory on a globe toss.
bayesian
probability
models
Develops the beta-binomial conjugate model for estimating the fraction of water on a globe. Covers prior selection, sequential Bayesian updating, loss functions, decision theory, and sampling-based marginalization for posterior predictive inference.
Published
February 19, 2025
%matplotlib inlineimport numpy as npfrom scipy import statsfrom scipy.stats import normimport matplotlib.pyplot as pltimport seaborn as snssns.set_style("whitegrid")
//anaconda/envs/py35/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
Formulation of the problem
This problem, taken from McElreath’s book, involves a seal (or a well trained human) tossing a globe, catching it on the nose, and noting down if the globe came down on water or land.
The seal tells us that the first 9 samples were:
WLWWWLWLW.
We wish to understand the evolution of belief in the fraction of water on earth as the seal tosses the globe.
Suppose \(\theta\) is the true fraction of water covering the globe. Our data story if that \(\theta\) then is the probability of the nose landing on water, with each throw or toss of the globe being independent.
Now we build a probabilistic model for the problem, which we shall use to guide a process of Bayesian updating of the model as data comes in.
Since our seal hasnt really seen any water or land, (strange, I know), it assigns equal probabilities, ie uniform probability to any value of \(\theta\).
This is our prior information
For reasons of conjugacy we choose as prior the beta distribution, with \(Beta(1,1)\) being the uniform prior.
Choosing a prior and posterior
The mean of \(Beta(\alpha, \beta)\) is \(\mu = \frac{\alpha}{\alpha+\beta}\) while the variance is
We shall choose \(\alpha=1\) and \(\beta=1\) to be uniform.
\[ p(\theta) = {\rm Beta}(\theta,\alpha, \beta) = \frac{\theta^{\alpha-1} (1-x)^{\beta-1} }{B(\alpha, \beta)} \] where \(B(\alpha, \beta)\) is independent of \(\theta\) and it is the normalization factor.
From Bayes theorem, the posterior for \(\theta\) is
//anaconda/envs/py35/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Now we can calculate all sorts of stuff.
The probability that the amount of water is less than 50%
np.mean(samples <0.5)
0.17299999999999999
The probability by which we get 80% of the samples.
np.percentile(samples, 80)
0.76255263476156399
You might try and find a credible interval. This, unlike the wierd definition of confidence intervals, is exactly what you think it is, the amount of probability mass between certain percentages, like the middle 80%
np.percentile(samples, [10, 90])
array([ 0.44604094, 0.81516349])
You can make various point estimates: mean, median
np.mean(samples), np.median(samples), np.percentile(samples, 50) #last 2 are same
A particularly important and useful point estimate is the MAP, or “maximum a-posteriori” estimate, the value of the parameter at which the pdf (num-samples) reach a maximum.
A principled way to get these point estimates is a loss function. This is the subject of decision theory, and we shall come to it soon. Different losses correspond to different well known point estimates, as we shall see.
But as a quick idea of this, consider the squared error decision loss:
seems to be a complex integral. But if you parse it, its not so complex. This diagram from McElreath helps:
The posterior predictive distribution as a mixture: each parameter value implies a sampling distribution, weighted by the posterior probability, producing the marginal prediction. From McElreath, Statistical Rethinking.
A similar risk-minimization holds for the posterior-predictive so that
\[y_{min mse} = \int dy \, y \, p(y \vert D)\]
which is indeed what we would use in a regression scenario…
Plug-in Approximation
Also, often, people will use the plug-in approximation by putting the posterior mean or MAP value
This approximation is just sampling from the likelihood(sampling distribution), at a posterior-obtained value of \(\theta\). It might be useful if the posterior is an expensive MCMC and the MAP is easier to find by optimization, and can be used in conjunction with quadratic (gaussian) approximations to the posterior, as we will see in variational inference. But for now we have all the samples, and it would be inane not to use them…
The posterior predictive from sampling
But really from the perspective of sampling, all we have to do is to first draw the thetas from the posterior, then draw y’s from the likelihood, and histogram the likelihood. This is the same logic as marginal posteriors, with the addition of the fact that we must draw y from the likelihood once we drew \(\theta\). You might think that we have to draw multiple \(y\)s at a theta, but this is already taken care of for us because of the nature of sampling. We already have multiple \(\theta\)a in a bin.
You can interrogate the posterior-predictive, or simulated samples in other ways, asking about the longest run of water tosses, or the number of times the water/land switched. This is left as an exercise. In particular, you will find that the number of switches is not consistent with what you see in our data. This might lead you to question our model…always a good thing..but note that we have very little data as yet to go on