## Understanding coverage: a big ‘tension headache’

If you selected a sample of 100 astronomers/cosmologists who had previously published a paper setting forth a Bayesian analysis of some data, and you asked each of them to explain what a 95% posterior credible interval was, I would be willing to bet that 99 of them would come up with something along the lines of an exact Frequentist confidence interval.  That is, I would predict 99 of them to give an answer like “if we repeated this experiment 100 times and constructed the 95% credible interval then in about 95% of these experiments (or more) the 95% credible interval will include the true parameter value”. Isn’t this exactly the thinking behind statements like “the posteriors of $H_0$ from these two experiments are in disagreement at the 3 sigma level, therefore there must be systematic biases in one or both of the experiments”?  The idea is that if there are no systematic biases in the experiments, and we were to repeat these two experiments (for hypothetical replications of the Universe generated under a fixed set of cosmological parameters) and compare posteriors for each repetition, then most of the time their 95% credible intervals should seriously overlap.

Certainly it does seem to be taken for granted by key astrostatistical studies proposing measures and checks of tension.  For instance, in Marshall et al. (2006) a Bayes factor is proposed to check for the presence of systematic errors, with the Bayes factor there comparing the hypothesis that the two experimental systems share identical values for their common parameters against the hypothesis that they do not.  (Ditto here & here.)  A recent H0liCow paper gives an example application to this effect: “We check that all our lenses can be combined without any loss of consistency by comparing their $D_{\Delta t}$ posteriors in the full cosmological parameter space and measuring the degree to which they overlap.”

Unfortunately, this isn’t how Bayesian credible intervals work.

Bayesian credible intervals represent a summary of the Bayesian posterior, which is itself a principled update of a set of prior distributions subjected to the likelihood of the observed data.  Since they are not constructed with the aim of satisfying a Frequentist coverage definition we can’t assume they will have any particular coverage properties in general.  For nice models with nice data and identifiable parameters there will likely be a Bernstein–von Mises result that indicates an asymptotic convergence of the Bayesian intervals to Frequentist ones.  But in those cases we’re usually talking experiments with lots of independent data; when the models involve random fields or other ‘non-parametric’ and ‘semi-parametric’ stochastic processes within the hierarchy or likelihood such results are much harder to come by.  Or the limits of the possible data collection (e.g. infill asymptotics for random fields) might exclude any such identifiability or restrict posterior concentration.  So, as a very general rule of thumb, the asymptotic studies of Bayesian coverage tell us that models involving, say, observations of a single realisation (one Universe) of a log-Normal random density field, probably won’t be saved by Bernstein–von Mises.
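
To pin down the distinction, here is a sketch of the two definitions side by side. The notation is my own shorthand rather than anything from the papers above: $\theta$ is the parameter, $x$ the observed data, $\pi(\theta)$ a proper prior, $p(x \mid \theta)$ the likelihood, and $C(x)$ the 95% credible region computed from $x$. The credible region is defined conditionally on the one dataset we actually have:

$$
\Pr\big(\theta \in C(x) \mid x\big) = \int_{C(x)} \pi(\theta \mid x)\, d\theta = 0.95 .
$$

Frequentist coverage instead fixes a true value $\theta_0$ and averages over hypothetical replications of the data:

$$
c(\theta_0) = \Pr\big(\theta_0 \in C(X) \mid \theta_0\big) = \int \mathbb{1}\{\theta_0 \in C(x)\}\, p(x \mid \theta_0)\, dx .
$$

A confidence procedure demands $c(\theta_0) \ge 0.95$ for every $\theta_0$. Assuming a proper prior and a well-specified model, the only coverage statement the credible region gives for free is the prior-averaged one,

$$
\int c(\theta)\, \pi(\theta)\, d\theta = \int p(x) \left[ \int_{C(x)} \pi(\theta \mid x)\, d\theta \right] dx = 0.95 ,
$$

which says nothing about $c(\theta_0)$ at any particular fixed $\theta_0$: it can sit well above 0.95 in some parts of parameter space and well below it in others.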

Stepping back from the asymptotic regime, what additional factors (beyond having a stochastic process somewhere in our model) tend to shape the Frequentist properties of Bayesian posteriors?  Generally speaking, in the well-specified case: (i) priors, (ii) the presence of hard constraints in the parameter space, and/or (iii) discrete data.  Pretty much in that order.  A great example of the impact of prior choice on Bayesian coverage is the inference of the hyper-parameters of a spatial Gaussian process.  A classic study is the objective Bayes analysis by Berger et al. in which some truly woeful coverage is demonstrated for seemingly innocent prior choices; see also some coverage calculations for PC priors.
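
For a feel of how factor (ii) bites, here is a minimal toy sketch of my own (not taken from any of the studies mentioned). It assumes a single observation $x \sim N(\mu, 1)$, a hard constraint $\mu \ge 0$, and a flat prior on $[0, \infty)$, so the posterior is a Normal truncated at zero. The function names `credible_interval` and `coverage` are made up for illustration, and the equal-tailed interval is just one possible choice of summary.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def credible_interval(x, level=0.95):
    """Equal-tailed credible interval for mu >= 0 given one draw x ~ N(mu, 1).

    Assumes a flat prior on [0, inf), so the posterior is N(x, 1) truncated
    at zero and its quantiles have a closed form.
    """
    mass_below_zero = STD_NORMAL.cdf(-x)  # untruncated posterior mass on mu < 0
    tail = (1.0 - level) / 2.0

    def quantile(q):
        # Invert the truncated-Normal CDF at probability q.
        p = mass_below_zero + q * (1.0 - mass_below_zero)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < p < 1
        return x + STD_NORMAL.inv_cdf(p)

    return quantile(tail), quantile(1.0 - tail)


def coverage(mu_true, n_rep=5000, seed=1):
    """Fraction of simulated datasets whose interval contains mu_true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        x = rng.gauss(mu_true, 1.0)
        lo, hi = credible_interval(x)
        hits += lo <= mu_true <= hi
    return hits / n_rep


if __name__ == "__main__":
    for mu_true in (0.0, 0.5, 3.0):
        print(f"mu_true = {mu_true}: coverage = {coverage(mu_true):.3f}")
```

Far from the boundary the interval behaves much like the familiar $x \pm 1.96$ and covers at close to the nominal rate. At the boundary itself the lower endpoint of an equal-tailed interval is always strictly positive, so in this toy setup the interval never contains $\mu = 0$, however well-behaved the data.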

In their own way, cosmologists have some intuitive understanding of these issues.  For instance, this recent study looks at the impact of a local under-density on the estimation of the Hubble constant from the local distance ladder.  But the discussion becomes confused when this is talked about as a systematic effect, because it suggests something erroneous in the experimental system or the inference procedure.  Whereas in fact the existence of a local under-density is not unexpected in a random Universe and the resulting Bayesian posterior ain’t misbehaving.  If you want to use Bayesian models and you want to interpret them like Frequentists (which I am not here to judge; it’s okay!) then please test the coverage at some fiducial/interesting values of the cosmological parameters, and if the coverage is less than you desire then figure out why and fix it: which is probably going to involve tweaking your priors or blowing up your credible intervals by a certain amount.  And if you want to check for actual systematic errors in your data or inadequacy of your adopted model, do some posterior predictive checks and visualisations.
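
As a rough sketch of what such a coverage test might look like, here is a toy conjugate example of my own. It assumes $n$ Normal observations with known $\sigma$ and a $N(0, 0.5^2)$ prior on the mean; all names and default values (`posterior`, `coverage`, `calibrate`, the grid step) are hypothetical choices for illustration, not a recipe from any particular analysis.

```python
import random
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% Normal quantile


def posterior(xbar, n, sigma, prior_mean, prior_sd):
    """Conjugate Normal posterior (mean, sd) for mu given the sample mean."""
    precision = 1.0 / prior_sd**2 + n / sigma**2
    mean = (prior_mean / prior_sd**2 + n * xbar / sigma**2) / precision
    return mean, precision ** -0.5


def coverage(mu_true, inflate=1.0, n=10, sigma=1.0,
             prior_mean=0.0, prior_sd=0.5, n_rep=20000, seed=0):
    """Frequentist coverage of the (optionally inflated) 95% credible interval
    at a fiducial mu_true, estimated by simulating replicate datasets."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        # The sample mean is sufficient, so simulate it directly.
        xbar = rng.gauss(mu_true, sigma / n**0.5)
        mean, sd = posterior(xbar, n, sigma, prior_mean, prior_sd)
        hits += abs(mean - mu_true) <= inflate * Z95 * sd
    return hits / n_rep


def calibrate(mu_true, target=0.95, step=0.05, max_steps=80):
    """Smallest inflation factor on a grid that reaches the target coverage."""
    for k in range(max_steps + 1):
        inflate = round(1.0 + k * step, 10)
        if coverage(mu_true, inflate) >= target:
            return inflate
    raise ValueError("target coverage not reached on the grid")


if __name__ == "__main__":
    for mu_true in (0.0, 1.5):
        print(f"mu_true = {mu_true}: coverage = {coverage(mu_true):.3f}")
    factor = calibrate(1.5)
    print(f"inflation factor at mu_true = 1.5: {factor}")
    print(f"coverage after inflating: {coverage(1.5, factor):.3f}")
```

With these settings the interval over-covers when the truth coincides with the prior mean and under-covers badly when the truth sits a few prior standard deviations away. Note that an inflation factor calibrated at one fiducial value carries no guarantee elsewhere in parameter space, so in practice one would repeat the check over the range of values one cares about (or loosen the prior instead).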