## Understanding coverage: a big ‘tension headache’

If you selected a sample of 100 astronomers/cosmologists who had previously published a paper in which was set forth a Bayesian analysis of some data, and you asked each of them to explain what a 95% posterior credible interval was, I would be willing to be that 99 of them would come up with something along the lines of an exact Frequentist confidence interval.  That is, I would predict 99 of them to give an answer like “if we repeated this experiment 100 times and constructed the 95% credible interval then in about 95% of these experiments (or more) the 95% credible interval will include the true parameter value”. Isn’t this exactly the thinking behind statements like “the posteriors of $H_0$ from these two experiments are in disagreement at the 3 sigma level, therefore there must be systematic biases in one or both of the experiments”?  The idea that if there is no systematic biases in the experiments then if we were to repeat these two experiments (for hypothetical replications of the Universe generated under a fixed set of cosmological parameters) and for each repetition compare posteriors then most of the time their 95% credible intervals should seriously overlap.

Certainly it does seems to be taken for granted by key astrostatistical studies proposing measures and checks of tension.  For instance, in Marshall et al. (2006) a Bayes factor is proposed to check for the presence of systematic errors, with the Bayes factor there comparing between the hypothesis that the two experimental systems share identical values for their common parameters and the hypothesis that they do not.  (Ditto here & here.)  A recent H0liCow paper gives an example application to this effect: “We check that all our lenses can be combined without any loss of consistency by comparing their $D_{\Delta t}$ posteriors in the full cosmological parameter space and measuring the degree to which they overlap.”.

Unfortunately, this isn’t how Bayesian credible intervals work.

Bayesian credible intervals represent a summary of the Bayesian posterior, which is itself a principled update of a set of prior distributions subjected to the likelihood of the observed data.  Since they are not constructed with the aim of satisfying a Frequentist coverage definition we can’t assume they will have any particular coverage properties in general.  For nice models with nice data and identifiable parameters there will likely be a Bernstein–von Mises result that indicates an asymptotic convergence of the Bayesian intervals to Frequentist ones.  But in those cases we’re usually talking experiments with lots of independent data; when the models involve random fields or other ‘non-parametric’ and ‘semi-parametric’ stochastic processes within the hierarchy or likelihood such results are much harder to come by.  Or the limits of the possible data collection (e.g. infill asymptotics for random fields) might exclude any such identifiability or restrict posterior concentration.  So, as a very general rule of thumb, the asymptotic studies of Bayesian coverage tell us that models involving, say, observations of a single realisation (one Universe) of a log-Normal random density field, probably won’t be saved by Bernstein–von Mises.

Stepping back from the asymptotic regime, what additional factors (beyond having a stochastic process somewhere in our model) tend to shape the Frequentist properties of Bayesian posteriors?  Generally speaking, in the well-specified case: (i) priors, (ii) the presence of hard constraints in the parameter space, and/or (iii) discrete data.  Pretty much in that order.  A great example for the impact of prior choice on Bayesian coverage is in regard to inference of the hyper-parameters of a spatial Gaussian process.  A classic study is the objective Bayes analysis by Berger et al. in which some truly woeful coverage is demonstrated for seemingly innocent prior choices; see also some coverage calculations for PC priors.

In their own way, cosmologists have some intuitive understanding of these issues.  For instance, this recent study looks at the impact of a local under-density on the estimation of the Hubble constant from the local distance ladder.  But the discussion becomes confused when this is talked about as a systematic effect, because it suggests something erroneous in the experimental system or the inference procedure.  Whereas in fact the existence of a local under-density is not unexpected in a random Universe and the resulting Bayesian posterior ain’t misbehaving.  If you want to use Bayesian models and you want to interpret them like Frequentists (which I am not here to judge; it’s okay!) then please test the coverage at some fiducial/interesting values of the cosmological parameters, and if the coverage is less than you desire then figure out why and fix it: which is probably going to involve tweaking your priors or blowing up your credible intervals by a certain amount.  And if you want to check for actual systematic errors in your data or inadequacy of your adopted model, do some posterior predictive checks and visualisations.

This entry was posted in Uncategorized. Bookmark the permalink.

### 5 Responses to Understanding coverage: a big ‘tension headache’

1. Tom Loredo says:

I agree with you that credible intervals are misunderstood. However, I don’t think you’d win this bet: “I would be willing to be that 99 of them would come up with something along the lines of an exact Frequentist confidence interval.” I think very few astronomers could accurately describe what an exact frequentist confidence interval is or means. I suspect few astronomers understand the difference between coverage and confidence level. Often when I teach basic Bayesian estimation and credible intervals, I also teach what a confidence interval is, and most students seem shocked to learn what’s involved in computing an exact confidence interval.

There’s a simple and intuitively appealing quasi-frequentist interpretation of credible intervals that I know you know of, but that I was surprised to not see mentioned here. A Bayesian credible interval procedure will have exact *average* coverage, if in your simulated experiments you draw “true” parameter values from the prior for the replications. Put another way, a frequentist confidence interval seeks to guarantee *worst-case* coverage (where “case” refers to choice of true parameter values), while a Bayesian credible interval guarantees *average* coverage (with respect to the specified prior!). It gives another way—quasi-frequentist—to think about priors over parameters: they express how much you want to worry about coverage in different parts of the parameter space (though one must emphasize that a uniform prior does not guarantee constant coverage; it’s just the averaging weight).

This understanding of credible intervals also provides a good (but expensive) way to verify a Bayesian code—run it on lots of simulated cases (simulating *both* parameters and data) and make sure the credible intervals are calibrated. It’s worth emphasizing that this can be prohibitively expensive for nontrivial problems!

• Fair enough: I guess I left myself some wiggle room in saying ” *something along the lines of* an exact Frequentist CI” … I certainly agree that the precise Frquentist definition of a confident interval is unlikely to be arrived at by any kind of chance or intuition(!), but I stand by the proposal that they would assume Bayesian CrIs have Frequentist coverage on replicates of data at *any* fixed set of parameters within the prior support. And certainly wouldn’t imagine they could have systematic under-coverage anywhere.

Interestingly, Bayarri & Berger (2004) suggest that in reality most Frequentists think about the performance of their intervals in the sense of the test performance when applied to lots of different experiments of the same type on different problems (so, presumably different parameters), rather than the hypothetical replication of data from a single experimental system!

2. Tom Loredo says:

You and I are really on the same page here (and I concede your wiggle room). But your (important!) post hits one of my triggers, so I’ll continue to pick at a gnat of disagreement. In my experience, many astronomers not only don’t understand the difference between coverage and confidence level, they don’t even properly understand the notion of coverage of a confidence region. They think of a confidence region in terms of variability of point estimates—something like the region spanned by the central 68.3% (say) of the estimates found with simulated data (with parameters fixed). And who can blame them, when someone with the stature of Bill Press teaches exactly this (incorrect!) notion in the influential *Numerical Recipes* books. See Fig. 15.6.3 and the surrounding text (the chapter number can vary between editions). This section even has a subsection titled “Probability distribution of parameters in the normal case”—frequentist statistics provides no such thing! (This section is a relic of Press’s pre-Bayesian past, and I suppose reveals that his intuition was Bayesian all along.) What’s particularly frustrating about this section is that it’s Press’s rehashing of the lovely and important paper of Lampton, Margon, and Bowyer (1976), which at least he cites. That paper *correctly* describes confidence intervals, and even has a nice figure displaying the notion of coverage for a two-dimensional interval. Press misleadingly recast it as describing the size of a cloud of point estimates. His construction happens to give the right answer in the normal case, but the reasoning is incorrect and gives the wrong answer when likelihood contours are asymmetrical (I discuss this confusion between variability and uncertainty/confidence further here: “Bayesian Astrostatistics: A Backward Look to the Future,” https://doi.org/10.1007/978-1-4614-3508-2_2).

Incidentally, this is an example of a frustrating aspect of statistics: often the wrong line of reasoning gives a final answer that may be close to or even identical to the answer you get from the correct line of reasoning, coincidentally, for the specific case at hand. This can appear to validate the incorrect line of reasoning, motivating its use in cases where it fails. I suppose this happens in many areas, but it seems to me to be especially common in statistics (perhaps due to asymptotics). Argh!

“And certainly wouldn’t imagine they could have systematic under-coverage anywhere.” Right. The problems I’ve seen where there is *serious* under/overcoverage are rather strange, even contrived. I’ve wondered if there’s any work characterizing problems where this happens (maybe in terms of information geometry, or exploring settings where Bernstein-von Mises failes). Do you know of anything along these lines?

As a last comment, let me provide a pointer to the Bayarri and Berger (2004) article you mentioned, which is a personal favorite that I’ve reread several times: “The Interplay of Bayesian and Frequentist Analysis,” https://doi.org/10.1214/088342304000000116. A fantastic paper and a must-read for those who want to understand these issues.

• I hadn’t seen Lampton, Margon, and Bowyer but it’s now on my reading list!

I think the most common scenario for bad coverage in applied problems must be coverage of the range and/or scale of a Gaussian random field which is hard enough when well-specified (e.g. De Oliveira and Sansó 2001) but a disaster when mis-specified. But that’s estimating one or two numbers that describe the global shape of the power spectrum and the scale; which is must much easier than estimating certain cosmological parameters which depend on fine details of the power spectrum. E.g. h controls the turn over at low wave numbers of the matter power spectrum. AND the cosmological likelihood functions seem to be rather pseudo-likelihoods made after some pre-processing and simplifying assumptions.

For these reasons I’m amazed that I can’t find any references where people have checked coverage with a suite of mock datasets.

• Regarding point estimates under hypothetical data replications: I’m very interested in point estimates under data perturbations (weighting or bootstrap resampling) at the moment thanks to their attractive properties for mis-specified or otherwise odd model+data combinations: https://academic.oup.com/biomet/article/106/2/465/5385582