I noticed this paper by Dunstall et al. on astro-ph yesterday, continuing the ad hoc version of ABC methodology, established in Sana et al.’s Science paper, for estimating the binary fraction of a progenitor population that is subsequently evolved and observed noisily. I’m a bit surprised that they’re continuing to resist using real ABC, as there are a number of insights the formal theory could give to improve their analysis. [*I say ‘resist’ somewhat tentatively here because I’m 99% sure that I’ve emailed Hugues Sana about ABC before (this is, after all, the third paper in the series to use the same approach), but since my old QUT email account became inactive I can’t search my email history to be sure.*]

Anyway, the ad hoc version is to simulate mock data and take, as a combined summary statistic and pseudo-discrepancy distance, the product of two K-S test p-values on the marginals of the mock and real data, combined with a binomial likelihood formed by assuming the empirical mock-data binary fraction sets the proportion for sampling the observed binary fraction. Inference proceeds by repeatedly drawing parameter vectors from the prior, simulating mock data, summarizing, and computing this pseudo-distance; the parameter yielding minimal discrepancy is accepted as the maximum a posteriori estimate. Uncertainty estimation is then via tests with synthetic datasets simulated under parameters near this maximum a posteriori estimate. It’s not a terrible methodology, but it could easily be improved by taking the full ABC approach and using tools such as Fearnhead & Prangle’s (2012) semi-automatic summary statistic construction or Nunes & Balding’s optimality criteria to identify optimal summary statistics. The uncertainty reporting could then also be made via fully (ABC-approximate) Bayesian credible intervals.
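To make the recipe concrete, here’s a minimal Python sketch of the ad hoc scheme under an entirely made-up toy model. Everything model-specific here is hypothetical: the lognormal/exponential observable, the sample size, and the ‘true’ binary fraction of 0.6 are my inventions, and I use a single observable (one K-S test) rather than the two marginals the papers use.

```python
import numpy as np
from scipy.stats import ks_2samp, binom

rng = np.random.default_rng(0)
N = 300  # hypothetical sample size

def simulate(fbin, n, rng):
    """Toy stand-in for the population-synthesis model: each system is a
    binary with probability fbin; the 'observed' quantity is drawn from a
    different distribution for binaries vs. singles (all hypothetical)."""
    is_binary = rng.random(n) < fbin
    obs = np.where(is_binary,
                   rng.lognormal(3.0, 1.0, n),
                   rng.exponential(1.0, n))
    return obs, int(is_binary.sum())

# 'Observed' data generated under a known true binary fraction of 0.6
obs_data, n_obs_bin = simulate(0.6, N, rng)

best_fbin, best_score = None, -np.inf
for _ in range(2000):
    fbin = rng.uniform(0, 1)                  # draw from the (uniform) prior
    mock_data, n_mock_bin = simulate(fbin, N, rng)
    # Pseudo-'likelihood': a K-S p-value on the marginal distribution times a
    # binomial likelihood for the observed binary count, with the mock
    # fraction setting the binomial proportion
    score = (ks_2samp(mock_data, obs_data).pvalue
             * binom.pmf(n_obs_bin, N, n_mock_bin / N))
    if score > best_score:
        best_fbin, best_score = fbin, score

print("pseudo-MAP binary fraction:", best_fbin)
```

The point of the sketch is how blunt the accept-best-draw step is: a single ‘winning’ parameter is reported, with no retained reference table of accepted draws from which ABC posterior summaries could be formed.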

The following comment may set the new world record for ‘most pedantic observation ever’; nevertheless, I can’t help but observe the following. In the recent paper by Chiaberge et al. (in which they estimate and compare the proportions of merging systems in various samples of radio-loud AGN), the authors write, in explaining their use (which I approve of*) of the Bayes factor: “In this simplified version of the Bayesian tests, the priors are uninformative, i.e. a uniform distribution. This is suitable for our purposes, since we have no a priori knowledge of the distribution of mergers in each sample.” That is, they suppose the uniform prior, equivalent to Beta(1,1). But then, shockingly 🙂 , when they quote credible intervals on the underlying population proportion, a forensic reconstruction reveals they’ve used an improper prior, equivalent to Beta(0,0): “89 out of 101 objects fall into at least one of these categories. This corresponds to a merger fraction of 0.88, with a Bayesian 95% credible interval [0.81–0.94].” So which of these priors really represents their a priori beliefs? Hmmm … [Remember that a proper prior is necessary for obtaining the Bayes factor.]

Forensic reconstruction:

```r
# 95% interval under the improper Beta(0,0) prior: posterior Beta(89, 12)
signif(qbeta(c((1 - 0.95)/2, 1 - (1 - 0.95)/2), 89, 101 - 89), 2)
## 0.81 0.94  -- matches the quoted interval

# 95% interval under the uniform Beta(1,1) prior: posterior Beta(90, 13)
signif(qbeta(c((1 - 0.95)/2, 1 - (1 - 0.95)/2), 89 + 1, 101 - 89 + 1), 2)
## 0.80 0.93  -- does not match
```

* In fact, Dunstall et al. could have used this Bayes-factor approach to establish ‘significance’ for the comparisons of binary fractions in their Table 4.
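For what it’s worth, with proper Beta priors the Bayes factor for comparing two binomial fractions is available in closed form via beta-binomial marginal likelihoods. A minimal Python sketch, assuming Beta(1,1) priors throughout (the sample counts in the usage lines are invented for illustration, not taken from either paper):

```python
import math

def log_betafn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_bf_different_vs_common(k1, n1, k2, n2, a=1.0, b=1.0):
    """Log Bayes factor for H1 (each sample has its own proportion) versus
    H0 (a single common proportion), with Beta(a, b) priors on each
    proportion. The binomial coefficients are identical under both
    hypotheses and so cancel from the ratio."""
    log_m1 = (log_betafn(k1 + a, n1 - k1 + b) - log_betafn(a, b)
              + log_betafn(k2 + a, n2 - k2 + b) - log_betafn(a, b))
    log_m0 = log_betafn(k1 + k2 + a, (n1 - k1) + (n2 - k2) + b) - log_betafn(a, b)
    return log_m1 - log_m0

# Invented counts: 89/101 mergers in one sample vs 20/50 in another
print(log_bf_different_vs_common(89, 101, 20, 50))   # positive: different fractions favoured
print(log_bf_different_vs_common(50, 100, 50, 100))  # negative: common fraction favoured
```

Note that the improper Beta(0,0) prior cannot be used here: its normalizing constant diverges, so the marginal likelihoods, and hence the Bayes factor, are undefined, which is exactly the point of the bracketed reminder above.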