When less data tells you everything, and order doesn’t matter.
bayesian
probability
statistics
Connects the exponential family to conjugate priors through sufficient statistics. Introduces exchangeability and de Finetti’s theorem, then develops the Poisson-Gamma conjugate model as a worked example.
Published
February 19, 2025
%matplotlib inline
import numpy as np
import matplotlib.pylab as plt
import seaborn as sn
from scipy.stats import norm
Sufficient Statistics and the Exponential Family
Probability distributions that belong to an exponential family have natural conjugate prior distributions. The form of the exponential family is:

\[ p(y_i \vert \theta) = f(y_i)\, g(\theta)\, e^{\phi(\theta)^T u(y_i)} \]
Now notice that the product of \(y\)-dependent factors in front is irrelevant as far as sampling goes: it does not interact with \(\theta\) in any way! If we wanted the actual value of the likelihood, it would be important to model it well. But if all we want is to use this expression in a sampler, we don't care. This kind of observation will be critical for us as we sample from ever more complex models: indeed, isolating such dependencies is a cornerstone of the Gibbs sampling method.
Thus one can say that, for all \(n\) and \(y\), the likelihood has a fixed form as a function of \(\theta\):

\[ p(y_1, \ldots, y_n \vert \theta) \propto g(\theta)^n\, e^{\phi(\theta)^T t(y)} \]
where \(t(y) = \sum_{i=1}^{n} u(y_i)\) is said to be a sufficient statistic for \(\theta\), because the likelihood for \(\theta\) depends on \(y\) only through \(t(y)\).
In general the exponential families are the only classes of distributions that have natural conjugate prior distributions, since, apart from some special cases, they are the only distributions having a fixed number of sufficient statistics for all \(n\).
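To make sufficiency concrete, here is a small numerical sketch (with made-up counts, not the data from this chapter): for a Poisson likelihood, two datasets of the same size with the same total produce identical likelihood ratios in \(\theta\), because \(\theta\) enters only through \(\sum y_i\).

```python
import numpy as np
from scipy.stats import poisson

# two made-up datasets with the same size (5) and the same total (10)
y_a = np.array([0, 1, 2, 3, 4])
y_b = np.array([2, 2, 2, 2, 2])

def loglike(y, theta):
    return poisson.logpmf(y, theta).sum()

# the log-likelihood *difference* between two theta values depends on y
# only through n and sum(y): here sum(y)*log(2/1) - n*(2 - 1)
ratio_a = loglike(y_a, 2.0) - loglike(y_a, 1.0)
ratio_b = loglike(y_b, 2.0) - loglike(y_b, 1.0)
print(np.isclose(ratio_a, ratio_b))  # True
```

The \(y_i!\) terms differ between the two datasets, but they cancel in any \(\theta\)-ratio, which is exactly the point made above.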
This family includes the exponential, Poisson, gamma, beta, Pareto, binomial, and Gaussian distributions, among others.
An example with Poissons and Gammas
Consider some data gathered in the 1990s on educational attainment. The data consists of 155 women who were 40 years old. We are interested in the birth rate of women with a college degree and women without. We are told that 111 women without college degrees have 217 children, while 44 women with college degrees have 66 children.
Let \(Y_{1,1}, \ldots, Y_{n_1,1}\) denote the number of children for the \(n_1\) women without college degrees and \(Y_{1,2}, \ldots, Y_{n_2,2}\) be the data for \(n_2\) women with college degrees.
Exchangeability
Let's assume that the number of children of a woman in any one of these classes can be modelled as coming from ONE birth rate (we don't know anything about their individual situations, so we treat each woman as interchangeable, or exchangeable, with another within the same class). This is the basis for the IID assumption that we generally use.
Another way to think about it is that the in-class likelihood for these women is invariant to a permutation of the variables. If we assume a Poisson likelihood (appropriate for low counts) for the number of births for each woman, we have, for each woman:

\[ p(Y_i \vert \theta) = \frac{\theta^{Y_i} e^{-\theta}}{Y_i!} \]

so that the joint likelihood for \(n\) women is

\[ p(Y_1, \ldots, Y_n \vert \theta) \propto \theta^{\sum Y_i} e^{-n\theta}. \]
The quantity \(\sum Y_i\) contains all the information about \(\theta\), and thus \(\sum Y_i\) is a sufficient statistic. Indeed, all you need is the total number of children in each class of mothers as far as inferences about \(\theta_1\) and \(\theta_2\) are concerned.
So, as long as we don't need the exact value of the likelihood, we are free to treat the likelihood as simply proportional to \(\theta^{\sum Y_i} e^{-n\theta}\).
For our example we have \(n_1 =111\), \(\sum_i^{n_1} Y_{i,1} =217\) and \(n_2=44\), \(\sum_i^{n_2} Y_{i,2} =66\).
Conjugate Priors
Let's now choose priors. A class of priors is said to be conjugate for a sampling distribution \(p(y_1, \ldots, y_n \vert \theta)\) if the posterior is also in that class. For the Poisson likelihood the conjugate prior is a Gamma distribution: a \(\textrm{Gamma}(a, b)\) prior combined with \(n\) Poisson observations yields a \(\textrm{Gamma}(a + \sum y_i,\; b + n)\) posterior.
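We can check the Poisson-Gamma conjugacy numerically. The sketch below (made-up counts again) multiplies a \(\textrm{Gamma}(a,b)\) prior by the Poisson likelihood on a grid and compares the result against the closed-form \(\textrm{Gamma}(a + \sum y_i, b + n)\) density.

```python
import numpy as np
from scipy.stats import gamma, poisson

a, b = 2.0, 1.0                   # Gamma prior: shape a, rate b
y = np.array([2, 1, 3, 0, 2])     # made-up counts
n, sy = len(y), y.sum()

theta = np.linspace(0.01, 10, 2000)
dtheta = theta[1] - theta[0]

# unnormalized posterior: prior times Poisson likelihood, evaluated on the grid
unnorm = gamma.pdf(theta, a, scale=1.0/b) * np.prod(poisson.pmf(y[:, None], theta), axis=0)
post_grid = unnorm / (unnorm.sum() * dtheta)   # normalize numerically

# closed-form conjugate posterior: Gamma(a + sum(y), b + n)
post_exact = gamma.pdf(theta, a + sy, scale=1.0/(b + n))
print(np.max(np.abs(post_grid - post_exact)))  # tiny: just grid-integration error
```

The two curves agree up to the discretization error of the grid, which is the conjugacy result in action.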
The mean of our posterior is then a ratio of posterior kids to moms:
\[ E[\theta] = \frac{a + \sum y_i}{b + n}, \qquad \textrm{var}[\theta] = \frac{a + \sum y_i}{(b + n)^2}. \]
In this case (with \(a=2, b=1\)) the means are 219/112 and 68/45, which are not very sensitive to our prior, as you might expect.
219/112, 68/45
(1.9553571428571428, 1.511111111111111)
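The variance formula can be evaluated the same way. Assuming the \(a=2, b=1\) prior behind the 219/112 and 68/45 means above:

```python
# posterior variances from var[theta] = (a + sum y)/(b + n)^2, with a=2, b=1
a, b = 2, 1
var1 = (a + 217) / (b + 111)**2   # no college degree: 219/112^2
var2 = (a + 66) / (b + 44)**2     # college degree: 68/45^2
print(var1, var2)                 # about 0.0175 and 0.0336
```

These should be close to the Monte Carlo variances of the posterior draws computed from samples.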
We can calculate and plot the posteriors for \(\theta_1\) and \(\theta_2\). We also show the (lack of) sensitivity to the prior by considering a weird prior with \(a=20, b=2\).
from scipy.stats import gamma

# Gamma prior, a, b values
a = 2
b = 1
n1 = 111
sy1 = 217  # sum of y1
n2 = 44
sy2 = 66   # sum of y2
N = 1000   # number of posterior samples

# ACTUAL VALUES: posterior means (a+sy1)/(b+n1), (a+sy2)/(b+n2)

# EXACT POSTERIORS
theta1 = gamma.rvs(a + sy1, scale=1.0/(b + n1), size=N)
q = plt.hist(theta1, 50, linewidth=1.5, density=True, histtype='step', label=u'posterior for theta1')
theta2 = gamma.rvs(a + sy2, scale=1.0/(b + n2), size=N)
q = plt.hist(theta2, 50, linewidth=1, histtype='step', alpha=1.0, density=True, label=u'posterior for theta2')
th_prior = gamma.rvs(2.0, scale=1.0/b, size=N)
plt.hist(th_prior, 50, linewidth=1, histtype='step', alpha=1.0, density=True, label=u'prior')

# just for theta1, try a weird prior (a=20, b=2)
th_prior_weird = gamma.rvs(20.0, scale=1.0/2.0, size=N)
theta1_weird = gamma.rvs(20 + sy1, scale=1.0/(2 + n1), size=N)
plt.hist(th_prior_weird, 50, linewidth=1, histtype='step', alpha=1.0, density=True, label=u'prior 20,2')
plt.hist(theta1_weird, 50, linewidth=1, histtype='step', alpha=1.0, density=True, label=u'posterior for theta1 with 20,2')

plt.xlabel(r'$\theta$')

# MONTE CARLO APPROACH - REJECTION METHOD
# a = 2.0
# b = 1.0
# prior = lambda theta: gamma.pdf(theta, a, scale=1.0/b)
# pdf_s1 = lambda theta: prior(theta) * poisson.pmf(sy1, n1*theta)
# pdf_s2 = lambda theta: prior(theta) * poisson.pmf(sy2, n2*theta)

plt.xlim([0, 8])
plt.legend()
# Finally we can do inference as we wish
The mean birth rates can be calculated from the samples, as can the variances, which are also given to us by the formulae from above:

\[ E[\theta] = \frac{a + \sum y_i}{b + n}, \qquad \textrm{var}[\theta] = \frac{a + \sum y_i}{(b + n)^2}. \]
np.mean(theta1), np.var(theta1)
(1.9516881521791478, 0.018527204185785785)
np.mean(theta2), np.var(theta2)
(1.5037252100213609, 0.034220717257786061)
It's easy to get the posterior birth-rate difference from the samples:
np.mean(theta1 - theta2), np.mean(theta2 - theta1)
(0.43687373794997003, -0.43687373794997003)
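The samples support other summaries too, for instance the posterior probability that the no-degree birth rate exceeds the degree birth rate. Here is a self-contained sketch that redraws the posterior samples (same Gamma parameters as above, with a larger sample size for stability):

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
# posterior draws: Gamma(a + sum(y), b + n) with a=2, b=1
theta1 = gamma.rvs(2 + 217, scale=1.0/(1 + 111), size=100_000, random_state=rng)
theta2 = gamma.rvs(2 + 66, scale=1.0/(1 + 44), size=100_000, random_state=rng)

p_greater = np.mean(theta1 > theta2)
print(p_greater)  # roughly 0.97: most posterior mass has theta1 > theta2
```

This is a quantity that would be awkward to get analytically but is one line with samples.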
Posterior predictives
Remember that the posterior predictive is the following integral:

\[ p(y^* \vert y) = \int p(y^* \vert \theta)\, p(\theta \vert y)\, d\theta \]
From the perspective of sampling, all we have to do is first draw the \(\theta\)s from the posterior, then draw \(y\)s from the likelihood, and histogram those \(y\)s. This is the same logic as for marginal posteriors, with the addition that we must draw a \(y\) from the likelihood once we have drawn \(\theta\). You might think that we have to draw multiple \(y\)s at each \(\theta\), but this is already taken care of by the nature of sampling: we already have multiple \(\theta\)s in each bin.
from scipy.stats import poisson
postpred1 = poisson.rvs(theta1)
postpred2 = poisson.rvs(theta2)
It turns out that the distribution characterizing the posterior predictive is a negative binomial (this requires some manipulations of gamma functions which we shall not reproduce here; see Wikipedia). The mean of the posterior predictive distribution is the same as that of the posterior.
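We can check this claim numerically: Poisson draws whose rates come from a \(\textrm{Gamma}(\alpha, \beta)\) posterior should match a negative binomial with \(r = \alpha\) and \(p = \beta/(\beta+1)\) (in scipy's parameterization). A sketch using the no-degree group's numbers:

```python
import numpy as np
from scipy.stats import gamma, poisson, nbinom

a, b, n1, sy1 = 2, 1, 111, 217
alpha, beta = a + sy1, b + n1          # posterior Gamma shape and rate

rng = np.random.default_rng(2)
theta = gamma.rvs(alpha, scale=1.0/beta, size=200_000, random_state=rng)
draws = poisson.rvs(theta, random_state=rng)   # posterior predictive samples

# empirical frequencies vs the exact negative binomial pmf
ks = np.arange(8)
emp = np.array([(draws == k).mean() for k in ks])
exact = nbinom.pmf(ks, alpha, beta / (beta + 1.0))
print(np.max(np.abs(emp - exact)))     # small: just Monte Carlo noise
```

The empirical frequencies agree with the negative binomial pmf up to sampling error.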
Notice that the spread of the posterior predictive is much larger, even though the means are the same. The reason is that the posterior predictive smears out the posterior: at each point of the posterior there is additional smearing from the sampling distribution of \(y^* \vert \theta\), and thus the posterior predictive is conservative.
np.mean(postpred1 - postpred2)
0.47399999999999998
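The widening is easy to quantify: the posterior predictive variance is \(E[\theta] + \textrm{var}[\theta]\) (Poisson noise on top of posterior spread), dominated by the Poisson term, while the posterior's own variance is tiny by comparison. A self-contained sketch for \(\theta_1\):

```python
import numpy as np
from scipy.stats import gamma, poisson

rng = np.random.default_rng(1)
# posterior draws for theta1, then one Poisson draw per theta
theta1 = gamma.rvs(2 + 217, scale=1.0/(1 + 111), size=100_000, random_state=rng)
postpred1 = poisson.rvs(theta1, random_state=rng)

# posterior variance vs posterior predictive variance
print(np.var(theta1), np.var(postpred1))
# predictive variance is near E[theta] + var[theta] ~ 1.97, ~100x the posterior's
```

This is the law of total variance at work: \(\textrm{var}[y^*] = E[\textrm{var}[y^* \vert \theta]] + \textrm{var}[E[y^* \vert \theta]] = E[\theta] + \textrm{var}[\theta]\).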
Why bother with the posterior predictive?
you might want to make predictions
model checking: is the model kosher?
There are multiple ways of accomplishing the latter (all of which we shall see).
future observations could be compared to the posterior predictive
cross-validation could be used for this purpose to calculate a prediction error
just plotting posterior predictives can be useful since sometimes a visual inspection of simulation data gives away the fact that it looks nothing like actual data.