It’s been a long time between posts here at Another Astrostatistics Blog; there’s no single reason for my hiatus, more a diverse series of distractions, including travel to Seattle for the IDM Disease Modelling Symposium and to Pula (Sardinia) for the ISBA World Meeting. Both were excellent conferences, and I hope to find the time to blog about some of the more interesting presentations as I remember them. A more recent distraction has been my referee duties for NIPS 2016; of the seven manuscripts I reviewed from the fields of ‘ABC’, ‘non-parametric models’ and ‘Gaussian processes’, at least two could become important tools for astronomical data analyses: the ε-free ABC approach of Papamakarios & Murray and the non-parametric (GP/spline-based) model for clustering disease trajectory maps (i.e., irregularly-sampled time series) of Schulam & Arora. The former has already had an unpublished application to the fitting of cosmological models, described by Iain Murray at the MaxEnt 2013 meeting, while the latter could well prove useful in the study of X-ray or γ-ray bursts (for instance).

As a show of good faith for loyal readers wondering whether I really have resumed blogging duties, I give below some thoughts upon reading the new paper by Licquia & Newman, in which the authors seek an ensemble estimate for the radial scale length of the Milky Way disk through a meta-analysis of previously published estimates. Methodologically this new paper is quite similar to their earlier meta-analysis of the Milky Way’s stellar mass and SFR, albeit with an added component of Bayesian model averaging to allow for uncertainty in the optimum choice of model for their ‘bad measurement’ category. By way of reminder, Licquia & Newman’s meta-analysis model is a mixture of two likelihoods: ‘good’ studies contribute to the likelihood exactly as per their quoted uncertainties on the disk scale length, in what amounts to a fixed-effect meta-analysis model, while ‘bad’ studies either don’t contribute at all or contribute in a weakened sense through a re-scaling of their error bars (depending on the model choice).
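To make the structure of that mixture concrete, here is a minimal Python sketch of the likelihood for the error-rescaling variant of the ‘bad’ component, with the good/bad labels summed out. It is only a sketch under stated assumptions: each study's contribution is taken to be Gaussian, and the names `mixture_log_likelihood`, `mu`, `f_good` and `n` are illustrative choices, not the authors' notation or code.

```python
import numpy as np

def log_normal_pdf(x, mean, sigma):
    """Log density of a normal distribution, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x - mean) / sigma) ** 2

def mixture_log_likelihood(mu, f_good, n, x, sigma):
    """Log-likelihood of a good/bad mixture with the labels summed out.

    mu      : common disk scale length targeted by every study (assumed)
    f_good  : prior probability that any given study is 'good', in (0, 1)
    n       : factor (>= 1) inflating the error bars of 'bad' studies
    x, sigma: published estimates and their quoted 1-sigma uncertainties
    """
    # 'Good' studies enter with their quoted uncertainties (fixed-effect term).
    log_good = np.log(f_good) + log_normal_pdf(x, mu, sigma)
    # 'Bad' studies enter with uncertainties inflated by the factor n.
    log_bad = np.log1p(-f_good) + log_normal_pdf(x, mu, n * sigma)
    # Sum over the two labels per study, then multiply across studies.
    return np.sum(np.logaddexp(log_good, log_bad))
```

With `n = 1` the two components coincide and the expression collapses to the plain fixed-effect likelihood, which is a convenient sanity check on any implementation.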

As with their earlier paper I have a positive impression of the analysis and implementation, and I think there remains great potential for meta-analysis studies to forge more precise estimates of key astronomical quantities from the many conflicting results reported in the literature. However, while reading this manuscript I began to question whether the adopted meta-analysis model (the mixture of a fixed-effect component with an over-dispersed component) is philosophically the ‘right’ model for this purpose. The fixed-effect model is typically used in (e.g.) clinical trial meta-analyses when it is believed that each separate study is targeting the same underlying parameter, with disagreements between estimates driven primarily by the sampling function rather than by systematics like structural differences in the study cohorts. In the case of the Milky Way there is ultimately only one galaxy with a fixed population of stars to observe, so the variation in estimated disk scale lengths is driven primarily by differences in the region of the disk included for study, the treatment of dust extinction, the adopted galaxy model and the method of fitting it. From this perspective it seems the answer sought requires a model for the variation introduced by different astronomers’ decisions regarding these analysis choices, for which a random-effects meta-analysis would seem more appropriate. Or at least, this seems a question which deserves some further investigation if the community is to be convinced that meta-analyses are indeed a good thing for astronomy.
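For readers less familiar with the distinction, the following Python sketch contrasts the two approaches in their simplest frequentist forms: an inverse-variance weighted fixed-effect estimate, and a random-effects estimate using the DerSimonian-Laird moment estimator for the between-study variance. This is a generic textbook construction, not anything taken from Licquia & Newman, and the numbers at the bottom are made up purely for illustration.

```python
import numpy as np

def fixed_effect(x, sigma):
    """Inverse-variance weighted mean and its standard error."""
    w = 1.0 / sigma**2
    mean = np.sum(w * x) / np.sum(w)
    return mean, np.sqrt(1.0 / np.sum(w))

def random_effects_dl(x, sigma):
    """DerSimonian-Laird random-effects estimate.

    Returns (mean, standard error, tau2), where tau2 is the estimated
    between-study variance (truncated at zero).
    """
    w = 1.0 / sigma**2
    mean_fe, _ = fixed_effect(x, sigma)
    q = np.sum(w * (x - mean_fe) ** 2)            # Cochran's Q statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(x) - 1)) / c)       # moment estimate of tau^2
    w_re = 1.0 / (sigma**2 + tau2)                # weights now include tau^2
    mean_re = np.sum(w_re * x) / np.sum(w_re)
    return mean_re, np.sqrt(1.0 / np.sum(w_re)), tau2

# Made-up scale-length estimates (kpc) for illustration only.
x = np.array([2.1, 2.6, 3.4, 2.3, 3.9])
sigma = np.array([0.2, 0.3, 0.3, 0.1, 0.5])

mean_fe, se_fe = fixed_effect(x, sigma)
mean_re, se_re, tau2 = random_effects_dl(x, sigma)
print(f"fixed effect  : {mean_fe:.2f} +/- {se_fe:.2f}")
print(f"random effects: {mean_re:.2f} +/- {se_re:.2f} (tau^2 = {tau2:.2f})")
```

When the studies scatter by more than their quoted errors allow, the random-effects weights flatten out and the uncertainty on the ensemble mean grows, which is exactly the behaviour one would want if the scatter reflects analysis choices rather than noise.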

One side note: although it’s not discussed explicitly, it seems to me that the authors’ computation of the marginal likelihood for each model, and of the posteriors they report, is based on summation of the mixture likelihood, such that the ‘labels’ (i.e., ‘good’ or ‘bad’) are not sampled in the posterior representation, and hence each study cannot directly be assigned a posterior probability of being either ‘good’ or ‘bad’. It’s natural to perform the computation in this manner to avoid having to perform Monte Carlo over a high-dimensional discrete space (the labels), but it seems a shame not to be able to report the inferred ‘quality’ of each study in the results. I wonder (not rhetorically) if there’s a simple *a posteriori* ‘trick’ that one can use to compute or approximate these scores?
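One candidate worth checking: conditional on the continuous parameters the labels are independent across studies, so each study's probability of being ‘good’ given those parameters is just its mixture ‘responsibility’, and averaging that quantity over posterior samples of the parameters gives a Monte Carlo estimate of the marginal posterior label probability. The Python sketch below illustrates the idea under the assumption of Gaussian components with the error-rescaling ‘bad’ model; the function name and the arrays `mu_s`, `f_good_s` and `n_s` (posterior samples from whatever sampler was used) are hypothetical, and I have not verified how this would interact with the authors' model averaging.

```python
import numpy as np

def log_normal_pdf(x, mean, sigma):
    """Log density of a normal distribution, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x - mean) / sigma) ** 2

def posterior_good_probability(x, sigma, mu_s, f_good_s, n_s):
    """Estimate P(study i is 'good' | all data) from marginalised samples.

    x, sigma           : arrays of shape (N,), estimates and quoted errors
    mu_s, f_good_s, n_s: arrays of shape (S,), posterior samples of the
                         common mean, good fraction and error-inflation factor
    Returns an array of shape (N,).
    """
    # Reshape samples to (S, 1) so they broadcast against the (N,) data.
    mu = mu_s[:, None]
    f = f_good_s[:, None]
    n = n_s[:, None]
    log_good = np.log(f) + log_normal_pdf(x[None, :], mu, sigma[None, :])
    log_bad = np.log1p(-f) + log_normal_pdf(x[None, :], mu, n * sigma[None, :])
    # Responsibility of the 'good' component for each (sample, study) pair.
    resp = np.exp(log_good - np.logaddexp(log_good, log_bad))
    # Average over posterior samples to marginalise the parameters.
    return resp.mean(axis=0)
```

Because this only re-uses samples already drawn from the label-marginalised posterior, it needs no additional Monte Carlo over the discrete label space.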