It seems there’s quite a cottage industry for GP-based non-parametric modelling of H(z) using the cosmic chronometers supernova (SN) dataset: today on astro-ph we had a contribution by Verde et al., following on from Busti et al.’s effort last week. Verde et al. use their GP modelling to do a couple of things with the SN data, none of which impressed me much. First, in their Section 2, they attempt to answer the question, “is the expansion history smooth?”, by marginalising the likelihood over the hyper parameters, ell and ess, of the squared exponential covariance function (with the usual metric distance in z). In particular, they show in their Fig 1 some posterior contours of log ell and log ess, along with the marginal of log ell with the lower bound containing 5% of the posterior mass indicated in red. The fact that this ell boundary corresponds to a rather smooth form for H(z) given their data is taken as conclusive ‘non-parametric’ proof that H(z) is indeed smooth. However, a closer visual inspection of their Fig 1 reveals that the marginal posterior in log ell is in fact very broad and does not at all rule out log ell values much lower than the 5% bound—the latter being entirely prior-dominated. To this end it seems as if the authors use either an improper uniform prior in ell or an improper uniform prior in log ell truncated to log ell > 0. In either case, if one would choose another common improper prior like uniform (with no truncation) in log ell then voila, a completely different 5% bound!

Another unusual aspect of their analysis appears in their Section 5 where the authors apply their “new statistical tool to determine the odds of two distributions being at tension” (Verde et al. 2013) to the task of reconciling the Planck and SN estimates of H_0. This was my first introduction to Verde et al.’s work in this area and there’s so many problems with it that I don’t even know where to begin. The scenario considered is that we have two sets of data from two different experiments ostensibly attempting to constrain the same set of model parameters, but their posteriors don’t look identical, so the authors propose a test to determine whether the disagreement reflects substantial “tension” arising from, e.g., some unmodelled source of systematic errors: “if the null hypothesis is that the two measurements are sampled from the base model adopted , when should the null hypothesis be abandoned?”.

The solution they come up with is a bit like a weird version of Bayesian model selection for comparing the null hypothesis that two systems share the same model parameters (theta_a = theta_b) versus the alternative hypothesis that they do not (theta_a != theta_b); with the additional stipulation that the priors be uniform (on the unit hypercube, it would seem since they specify pi(theta)=1 or 0). This sort of model selection scenario comes up in, e.g., deciding whether two (possibly biased) coins share the same underlying probability of coming up heads versus the case that they do not, and the correct Bayes factor is given by the ratio of marginal likelihood, Z_ab/Z_a*Z_b, where Z_ab represents the model with shared parameter. And indeed this is almost the statistic that the authors come up with (their Eqn 1), except that for some bizarre reason they propose that an additional normalisation is needed by the integral of the two separate model posteriors *shifted* to have common posterior modes! It’s clearly wackadoodle, as the authors would have discovered if they’d at least conducted some trials of this statistic under toy models with known input parameters to show how it performs.

Even worse though is that the proposed method doesn’t really say anything Bayesian about the likelihood of systematic errors: we know a priori that if the model were correct the true parameters targeted by experiment should be identical (they’re measuring the same thing). Indeed the default error model could be correct but have a large “tension score” simply because the two experiments have different designs and constrain the posterior in different regions of the parameter space. Or there could be systematic errors that precisely lead to a low tension score: as in the case of the fine structure constant dipole where the lack of tension in King et al.’s (2012) comparison of the posteriors for the dipole direction was taken as evidence that systematic errors could not be to blame (when in fact I have shown that the systematic error model with reasonable priors gives a very competitive Bayes factor; Cameron & Pettitt 2012,2013).

Verde et al. are correct though in noting that the problem considered lies within the realm of (esp. medical) meta-analysis; and in this context the Bayesian approach is to introduce candidate models for the suspected sources of systematic errors and thereby test these with/without systematics models via ordinary Bayes factors. The semi-parametric case is particularly powerful and I give a lot of details for this method (and a review of the literature) in my second fine structure paper.