As one comes to know more colleagues within astrostatistics it naturally becomes more difficult to criticise their work without hesitation. I do, however, feel that the latest contribution from Stein (whom I don’t know), van Dyk (whom I do know), Kashyap (whom I’ve emailed) & Siemiginowska (ditto) deserves some bashing up here 🙂

In the manuscript, “Detecting Unspecified Structure in Low-Count Images”, the authors propose a framework for ‘testing whether inferred structure in an image with Poisson noise represents a significant departure from a baseline (null) model of the image’. The chosen test statistic is a little unusual—namely, the tail probability, under the Bayesian posterior, for a parameter describing the relative strength of the unspecified component—although not too unusual for astronomers used to thinking of (e.g.) Bayes factors as test statistics (see Jenkins & Peacock 2011). The particular context in mind is that of detecting (faint) X-ray jets beside (bright) AGN in Chandra X-ray Observatory data.
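To make the flavour of that statistic concrete, here is a toy conjugate sketch of a posterior tail probability used as a test statistic. This is emphatically my own simplification (a single flat-rate Gamma–Poisson model with an invented baseline and data), not the authors' multiscale image model:

```python
import numpy as np

rng = np.random.default_rng(2)

def posterior_tail_prob(counts, baseline, a=1.0, r=1.0, n_draws=4000):
    """Toy stand-in for the paper's statistic (my construction, not theirs):
    counts ~ Poisson(lam) with a Gamma(a, rate=r) prior gives a
    Gamma(a + sum(counts), rate=r + len(counts)) posterior.  The statistic
    is the posterior probability that lam exceeds the baseline by 10%."""
    draws = rng.gamma(a + counts.sum(), 1.0 / (r + counts.size), n_draws)
    return float(np.mean(draws > 1.1 * baseline))

counts = rng.poisson(2.6, 50)        # image region with a mild excess over baseline 2.0
T = posterior_tail_prob(counts, baseline=2.0)
# T is not itself a p-value: its null distribution must be found by
# re-fitting simulated baseline images, which is the expensive step.
print(T)
```

The last comment is the crux of the computational complaint below: each null calibration requires a fresh posterior computation.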

For me there are two major problems with this paper. First, I simply don’t believe it is worthwhile for such applications to choose a test statistic that requires an MCMC procedure acknowledged to be ‘computationally expensive’. Although the authors adopt a lower-bound estimator for the p-value to limit the damage of only being able to run a limited number of simulations under the null, I don’t believe this saves the procedure in practical terms. Far better, in my opinion, would be to devise a cheap-to-compute test statistic for which the p-values can be computed precisely.

The authors note that they use a Bayesian model to infer image structure because ‘it provides a principled way to account for a PSF, varying probability of photon detection, and varying amounts of smoothing at different image resolutions’. Acknowledging this motivation, I would point towards analogous deconvolution problems for Gaussian-distributed data in which the Wiener filter provides a fast, exact solution for the posterior mean (e.g. Ensslin et al. 2009); that is, I’m suggesting one’s efforts would be better directed towards a fast estimator of some particular posterior functional under the Poisson likelihood *if* a high-level Bayesian approach is sought. Otherwise, I would advocate abandoning the Bayesian model altogether and constructing a test statistic from a faster, direct transform of the observational data; although described from the Bayesian point of view, the example of Lochner et al., using the angle between the AGN and apparent twin jets, provides direction for formulating an appropriate statistic. Indeed, disentangling the p-value from the Bayesian image-reconstruction procedure would seem to dovetail well with the authors’ aim to provide ‘an additional check to prevent over-interpreting apparent structure in the noise as discovery of real structure in the signal’.
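For the Gaussian analogue mentioned above, a minimal 1-D Wiener-filter deconvolution sketch. Flat signal and noise power spectra are assumed, and the PSF width, noise level, and toy data are all my own invention:

```python
import numpy as np

def wiener_deconvolve(data, psf, signal_power, noise_power):
    """Posterior-mean (Wiener) deconvolution under a Gaussian model:
    data = psf * signal + noise, with assumed flat power spectra."""
    D = np.fft.fft(data)
    H = np.fft.fft(np.fft.ifftshift(psf))
    # Wiener filter conj(H) S / (|H|^2 S + N), applied in Fourier space
    W = np.conj(H) * signal_power / (np.abs(H) ** 2 * signal_power + noise_power)
    return np.real(np.fft.ifft(W * D))

# toy example: a point source at pixel 40, blurred by a Gaussian PSF, plus noise
rng = np.random.default_rng(0)
n = 128
truth = np.zeros(n)
truth[40] = 10.0
x = np.arange(n) - n // 2
psf = np.exp(-0.5 * (x / 3.0) ** 2)
psf /= psf.sum()
blurred = np.real(np.fft.ifft(np.fft.fft(truth) * np.fft.fft(np.fft.ifftshift(psf))))
data = blurred + 0.05 * rng.standard_normal(n)

estimate = wiener_deconvolve(data, psf, signal_power=1.0, noise_power=0.05 ** 2)
print(int(np.argmax(estimate)))  # the peak of the estimate should land near pixel 40
```

The point is the cost profile: two FFTs and an element-wise multiply, with no sampling anywhere.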

My second (major) problem is that I don’t believe enough (any?!) attention has been given to the potential for misuse of p-values in this problem: for instance, that due to multiple hypothesis testing, since astronomers would no doubt like to roll this approach out to many hundreds, thousands, or even tens of thousands of images. Or, perhaps more likely, they will run another algorithmic search procedure to mine data for images with departures, then seek the p-value from this Bayesian test statistic as ‘proof’ of a real companion. I particularly worry that such behaviour will be encouraged by the authors’ suggestion in Section 3.2 of considering multiple versions of the test statistic according to different cut-outs of the target region. One can foresee astronomers simply playing with the image cut-out until they get the p-value they desire. To my mind, the authors’ one caveat disclaimer—“Although very popular in practice, p-values are criticized on theoretical grounds from both frequentist and Bayesian perspectives (e.g., Berger and Delampady 1987; Zhao 2014).”—is not sufficient to discharge their responsibility to steer the audience towards better practice in the use and interpretation of p-values.

A couple of minor concerns of mine. (1) The multiscale smoothing model based on quadrants of the image seems a bit clunky in the era of machine learning and fast Bayesian nonparametrics; and indeed the use of a Dirichlet prior, with a hyper-prior weighting its hyper-parameters towards zero (i.e. highly un-smooth), places responsibility for the smoothing on the number of layers allowed in the multiscale subdivision. (2) The estimation of tail probabilities from an ordinary MCMC run is a bit uninspired, given that astronomers were early adopters of nested sampling, which has a close connection to equivalent procedures in rare-event estimation (e.g. Guyader’s PhD thesis).

That said, it’s worth remembering that I try to hold astrostatistical contributions by statisticians (and more ~~competent~~ data-literate astronomers) to a higher standard than ordinary astro-stats papers.

My co-authors and I thank Ewan Cameron for his close read of our paper, but in this case we think his criticisms mostly miss the mark.

No one would disagree that, all things being equal, an easy-to-compute statistic with a known null distribution is preferable. But the assertion that a procedure requiring MCMC is not worthwhile “for such applications” presumes either the existence of a faster, equally powerful alternative or that the applications themselves are uninteresting. Cameron points to methods for Gaussian images and ad-hoc statistics as possible alternatives. The former are simply inappropriate for sparse Poisson images; there is no justification for their use in this setting. Ad-hoc statistics, on the other hand, trade statistical power for computational speed. We based our test statistic on the likelihood function under a principled statistical model in order to tune the statistic towards the detection of discrepancies from the null in the direction of the alternative. In this way it conserves statistical power for the particular signals that are of most interest and gives us a better chance at discovery. Giving up statistical power for computational speed only makes sense if the data are cheaper than computing time. (In our applications they are not!)

High-energy astronomy is rife with large images composed of sparse counts. Standard practice for detecting structure is simply to look for features in the images by eye. Without a reliable method to objectively measure the significance of features, the science simply stops. Thus, the measure of success is identifying a method that works. While computational speed is a concern, it is far less important than statistical objectivity and power.

There are three main concerns with p-values here: (i) p-values as a dubious measure of evidence against a null hypothesis; (ii) multiple testing; and (iii) cheating (e.g., trying multiple cut-outs and only reporting the one with the smallest p-value). Although (i) is by far the most serious, and is the issue we address in Cameron’s quotation from our paper, he does not mention it explicitly in his critique. We acknowledge this is an important concern, but it is a general concern with p-values and is beyond the scope of our paper. Multiple testing, on the other hand, is the easiest to address: there are many methods available for combining p-values into valid multiple-testing procedures (e.g., Benjamini–Hochberg). Choosing the cut-out region or the threshold of the statistic to optimize the p-value is obviously cheating. While it never hurts to say “don’t cheat,” remember that any statistical procedure can be misused. Some unwitting astronomer, for example, might try to apply Wiener filters to low-count Poisson images, but that is not a reason to ban the Wiener filter!
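As a concrete illustration of the multiple-testing correction mentioned above, a minimal Benjamini–Hochberg sketch (the toy p-values below are invented, e.g. one per candidate source):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejections controlling the FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m     # alpha * i/m for sorted p_(i)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest i with p_(i) <= alpha*i/m
        reject[order[: k + 1]] = True                # reject everything up to rank k
    return reject

# ten candidate sources: three small p-values among nulls
pvals = [0.0002, 0.8, 0.01, 0.4, 0.003, 0.9, 0.6, 0.2, 0.7, 0.5]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Note that rejection is decided jointly: whether p = 0.01 survives depends on how many smaller p-values accompany it.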

We agree with Cameron that there are other possible models and other computational methods, but it is difficult to weigh the pros and cons of alternatives without a specific proposal. All our images are pixelated and our model takes advantage of the available resolution of the data. We see little advantage of models that are higher resolution than the data, at least in this case.

We also agree that additional cautionary statements would be helpful for non-statisticians. Follow-up work will target multiple-hypothesis testing and explicit mechanisms to avoid cheating. Of course, posts that discuss these issues and the many subtleties of p-values are always helpful!

– David van Dyk, Nathan Stein, and Vinay Kashyap

Thanks for the detailed reply, David (et al.)! I guess I’m still somewhat confused as to the relative importance of computational speed in this problem: since it motivates the conservative estimator designed to require only a limited number of simulations, I assumed it would not be possible to use this technique on many of the interesting sources found in large images (hence my interest in faster, if ad hoc, solutions). For sure, one could look at something like the posterior mode, or another target not requiring MCMC sampling of the posterior, without throwing away the Poisson likelihood. For instance, one might combine the parametric model with a sparse GMRF approximation of a GP as the second component (a log-Gaussian Cox process) and look at, e.g., the maximum height of the posterior mean random field (quickly found with INLA), or some other such statistic computed from it.
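To sketch the kind of cheap alternative I have in mind: a matched-filter-style maximum statistic computed directly from the counts, with its p-value estimated by straightforward simulation under the null. The background level, smoothing kernel, and injected excess are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_filtered(counts, kernel):
    """Cheap test statistic: maximum of the kernel-smoothed count image."""
    return np.max(np.convolve(counts, kernel, mode="same"))

# null model: flat Poisson background; alternative: a jet-like local excess
n, background = 200, 2.0
kernel = np.ones(5) / 5.0                      # crude top-hat 'matched' filter

observed = rng.poisson(background, n)
observed[100:105] += rng.poisson(5.0, 5)       # injected jet-like excess
t_obs = max_filtered(observed, kernel)

# Monte Carlo p-value under the null; no MCMC required, and the
# estimate is exact in the limit of many simulations
n_sims = 2000
sims = np.array([max_filtered(rng.poisson(background, n), kernel)
                 for _ in range(n_sims)])
p_value = (1 + np.sum(sims >= t_obs)) / (n_sims + 1)
print(p_value)
```

Each null simulation here costs one Poisson draw and one convolution, so tens of thousands of simulations per image are feasible: exactly the regime where a posterior-based statistic needing a full refit per simulation becomes prohibitive.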

[It’s also worth noting that my impression of conservative estimators from, e.g., my former colleague Rémi Bardenet’s 2014 paper, ‘Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach’, is that they are not worth the effort: that is, the gain in being able to use fewer simulations is not worth the loss of power.]

For an interesting example of multiple hypothesis testing in astronomy see my blog post above:

https://astrostatistics.wordpress.com/2015/10/23/another-mnras-stats-disaster

cheers,

Ewan.