Some quick comments on a number of recent astro-ph submissions I’ve had open as tabs but had not read until today. The first (in chronological order) is on ‘exploding satellites’ (satellite galaxies, not man-made space junk) by Kuepper et al., in which the authors present a Bayesian analysis underpinned by ‘simulations’ of the orbital paths of test particles in a model potential. At first I thought these were stochastic simulations suggestive of an ABC-like case study, but in fact it turns out these are deterministic simulations for a given number of test particles. In order to form a(n approximate) likelihood the final distribution of the test particles needs to be turned into a pdf via kernel smoothing. Naturally one wonders as to the relationship between the posterior approximation and the number of test particles used. This may well be a potential application area for the multi-fidelity Bayesian optimisation—where optimisation does not have to mean finding the posterior mode, but could mean optimisation of the approximation to the true posterior in e.g. a KL divergence sense—that I’m currently playing with as a tool for fitting malaria simulations. Basically, the idea expounded in papers by Swersky et al. and Kandasamy et al. (with a supernova data example) is to use many cheap, noisy simulations for exploration and fewer expensive, less-noisy simulations for optimisation. It’s an elegant approach but requires some fine-tuning to get the acquisition function right and to implement an efficient Gaussian process-based inference procedure.
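The kernel-smoothing step is easy to sketch. Below is a minimal illustration (my own construction, not the authors' code; the function and variable names are invented) of forming an approximate log-likelihood from the final test-particle positions via a Gaussian KDE; the sensitivity of this quantity to the number of particles is exactly the worry raised above.

```python
import numpy as np
from scipy.stats import gaussian_kde

def approximate_log_likelihood(particles, observed):
    """Kernel-smooth the final test-particle positions (n, d) into a pdf
    and score the observed data (m, d) against it."""
    kde = gaussian_kde(particles.T)  # scipy expects shape (d, n)
    return np.sum(kde.logpdf(observed.T))

# The approximation sharpens as the number of test particles grows:
rng = np.random.default_rng(0)
obs = rng.normal(size=(50, 2))
for n_particles in (200, 20000):
    particles = rng.normal(size=(n_particles, 2))
    print(n_particles, approximate_log_likelihood(particles, obs))
```

One could then study the scatter of this estimate over repeated simulation runs as a function of particle number before trusting the resulting posterior.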
The next arXival I read today was a conference proceedings by Knezevic et al. in which the authors describe a Bayesian model for identifying Balmer-dominated shocks in spectroscopic ‘images’ of a supernova remnant. The modelling left me with some odd impressions: in part because the data is first binned using Voronoi tessellations (why bin when you can impose a spatial prior which both finds the appropriate scale for sharing information and brings Bayesian shrinkage to the inference?), and in part because the authors use a leave-one-out cross-validation (LOO-CV) estimate of the model evidence. The latter is attributed to Bailer-Jones (2012), in which Coryn suggests an idea equivalent to O’Hagan’s (circa 1991) partial Bayes factors; namely, to use some proportion of the data to first constrain one’s (typically) uninformative priors and then use the remainder of the data to evaluate the marginal likelihood. As discussed in O’Hagan’s 1995 read paper (and, in particular, in the contributed discussion appendix therein) there are a number of practical and theoretical concerns with this proposal. The consensus seems to be that if one is going to use this approach one should use only a minimal amount of data for the initial constraint and most of the data for the marginal likelihood evaluation. By contrast, the Bailer-Jones approach is to perform 10-fold cross-validation (i.e., using 90% of the sample to update the prior, and just 10% for marginal likelihood estimation), which is much closer to Murray Aitkin’s much-maligned posterior Bayes factors idea.
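The partial Bayes factor idea is easiest to see in a conjugate toy model. Here is a sketch (my construction, not Bailer-Jones') for inferring a Normal mean with known noise variance: a chosen fraction of the data first updates the vague prior, and the marginal likelihood is then evaluated on the remainder; train_frac near zero corresponds to the minimal-training-sample recommendation, while train_frac = 0.9 corresponds to one split of the 10-fold scheme discussed above.

```python
import numpy as np
from scipy.stats import norm

def posterior_update(m0, v0, y, sigma2):
    """Conjugate update of a N(m0, v0) prior on a Normal mean with known
    noise variance sigma2."""
    v1 = 1.0 / (1.0 / v0 + len(y) / sigma2)
    m1 = v1 * (m0 / v0 + np.sum(y) / sigma2)
    return m1, v1

def log_marginal(m0, v0, y, sigma2):
    """log p(y | prior), accumulated one point at a time via the chain rule."""
    lml, m, v = 0.0, m0, v0
    for yi in y:
        lml += norm.logpdf(yi, loc=m, scale=np.sqrt(v + sigma2))
        m, v = posterior_update(m, v, [yi], sigma2)
    return lml

def partial_log_ml(y, train_frac, m0=0.0, v0=100.0, sigma2=1.0):
    """Constrain the vague prior on the first train_frac of the data, then
    evaluate the marginal likelihood of the remainder."""
    k = int(len(y) * train_frac)
    m1, v1 = posterior_update(m0, v0, y[:k], sigma2)
    return log_marginal(m1, v1, y[k:], sigma2)
```

Note the chain-rule identity: the full-data marginal likelihood always factorises as (marginal likelihood of the training chunk) + (partial marginal likelihood of the rest), which is precisely why the partial version alone discards information.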
The final arXival I read this morning was one by Gratier et al. looking at the molecular gas mass of M33. The statistical model here is a Bayesian errors-in-variables regression, following in part the guidance of Kelly and Hogg et al., which seems to be properly implemented. My only question concerns the suitability of the (unimodal) Normal forms supposed for the intrinsic distributions of the observed covariates; histograms of the observed and inferred distributions would be useful here. If the Normal is insufficiently flexible there are of course both simple and less simple alternatives: one of which is to use an infinite mixture of Normals, which can be quite easy to code up as a Gibbs sampler by cobbling together existing R packages, as we found in Bonnie’s paper on calibration of RDT and microscopy-based prevalence estimates.
I noticed the recent paper by Vallisneri & van Haasteren on the arXiv entitled “Taming outliers in pulsar-timing datasets with hierarchical likelihoods and Hamiltonian sampling” in which the authors point out that Bayesian hierarchical models can, and perhaps should, be used to account for outliers when modelling pulsar-timing datasets with Gaussian processes. I wasn’t convinced that this point justifies an entire paper, since the idea of using such a model with a mixture of likelihoods for good and bad data points is already well known in astronomy as per Hogg, Bovy & Lang (as pointed out by V&H) and others (e.g. I would point to Taylor et al.); the only novelty here being that it’s described and applied to a single mock pulsar dataset. I was also not convinced by the authors’ arguments to show that the principled approach would surely beat the ad hoc alternative of 3σ-clipping, which was argued on the grounds that since the pulsar model includes estimation of the error term there is no single 3σ threshold to work with. One would think that a simple marginalised estimator of model ‘discrepancy’ (marginalising over the unknown and underlying GP) for each data point would provide a tractable statistic with which to threshold possible outliers. It would then be meaningful to compare the performance of this approach against the hierarchical model in terms of bias and runtime under a variety of bad data scenarios.
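For concreteness, the good/bad mixture likelihood in the style of Hogg, Bovy & Lang takes only a few lines once the labels are marginalised analytically, and the per-point posterior 'bad' probability below is one tractable version of the marginalised discrepancy estimator suggested above (the variable names and the inflated-variance form of the bad component are my illustrative assumptions, not the paper's).

```python
import numpy as np
from scipy.stats import norm

def mixture_loglike(resid, sigma, p_bad, var_bad):
    """Marginalised good/bad mixture log-likelihood: each residual is
    'good' (N(0, sigma^2)) with probability 1 - p_bad, else 'bad'
    (variance inflated by var_bad); the binary labels are summed out."""
    log_good = np.log1p(-p_bad) + norm.logpdf(resid, scale=sigma)
    log_bad = np.log(p_bad) + norm.logpdf(resid, scale=np.sqrt(sigma**2 + var_bad))
    return np.sum(np.logaddexp(log_good, log_bad))

def outlier_prob(resid, sigma, p_bad, var_bad):
    """Posterior probability that each point belongs to the 'bad' component."""
    log_good = np.log1p(-p_bad) + norm.logpdf(resid, scale=sigma)
    log_bad = np.log(p_bad) + norm.logpdf(resid, scale=np.sqrt(sigma**2 + var_bad))
    return np.exp(log_bad - np.logaddexp(log_good, log_bad))
```

In a head-to-head comparison the residuals would of course be computed after marginalising over the GP, but the thresholding logic is unchanged.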
Having said that I was very interested to see that random Fourier features are being embraced by the pulsar timing community, as described in greater detail in an earlier paper by the same authors and another by Lentati et al. I was recently talking to Seth Flaxman about how odd it is that random Fourier bases aren’t more popular, with one possibility being that people find they need a huge number of components to get a close approximation to their targeted GP. But of course this is going to depend on the application, or whether you even care: like with the INLA/SPDE approximation we use a lot at MAP, in practical applications it may be sufficient to say ‘we adopt the approximate model as our model’ and then just make sure you do a damn good job testing its performance on mock data and out-of-sample cross-validation (etc.).
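For readers who haven't met them, random Fourier features for a squared-exponential kernel take only a few lines (a generic Rahimi & Recht style sketch, not tied to the pulsar implementations); how large n_features must be for an acceptable approximation is precisely the question of whether you 'need a huge number of components'.

```python
import numpy as np

def rff(X, n_features, lengthscale, rng):
    """Random Fourier features: z(x) such that z(x) @ z(y) approximates
    the squared-exponential kernel exp(-||x - y||^2 / (2 lengthscale^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff(X, 2000, 1.0, rng)
K_approx = Z @ Z.T  # compare against the exact kernel matrix
```

The Monte Carlo error in each kernel entry decays like one over the square root of n_features, which is exactly why 'adopt the approximate model as our model' plus careful mock-data testing is often the pragmatic stance.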
Earlier this week I popped up to the 2016 RSS Conference in Manchester to give a talk as part of the Geostatistical Models for Tropical Medicine session organised by Michelle Stanton (and featuring two other speakers: Victor Alegana from Southampton and Emanuele Giorgi from Lancaster). Given the rather steep conference fees (!) I decided to only register for one day, but nevertheless from the few talks I saw there were a couple with obvious relevance to astronomical statistics. First, Simon Preston described the ‘ESAG’ (Elliptically Symmetric Angular Gaussian) distribution on the sphere, which is constructed by projection/marginalisation of a three-dimensional Gaussian on to the unit sphere, S². Two additional conditions on the mean, μ, and covariance matrix, V, of the three-dimensional Gaussian complete the definition of the ESAG and reduce the size of its parameter space to 5: namely, Vμ = μ and |V| = 1. One could well imagine using a mixture model in which the ESAG is the component distribution to represent something like the distribution of gamma-ray bursts on the sky. Second, Timothy Cannings described the methodology behind his R package, RPEnsemble, for learning binary classifications via an ensemble of random projections from the input space, ℝ^p, to a lower-dimensional space, ℝ^d (with d ≪ p). Given the prevalence of classification tasks in astronomical data analysis (e.g. distinguishing quasars from other bright sources in a wide-field survey catalogue) I would expect this one also to be a neat addition to the astronomers’ toolkit.
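As a toy version of the random-projection idea (a crude stand-in only: RPEnsemble selects good projections via a data-driven voting threshold, whereas this sketch naively majority-votes over all of them with a nearest-centroid base classifier):

```python
import numpy as np

def rp_ensemble_predict(X_train, y_train, X_test, n_proj=50, d=2, rng=None):
    """Classify in many random d-dimensional projections with a simple
    base classifier (nearest class centroid) and majority-vote."""
    rng = rng or np.random.default_rng(0)
    p = X_train.shape[1]
    votes = np.zeros(len(X_test))
    for _ in range(n_proj):
        A = rng.normal(size=(p, d)) / np.sqrt(d)  # random projection matrix
        Z_tr, Z_te = X_train @ A, X_test @ A
        c0 = Z_tr[y_train == 0].mean(axis=0)
        c1 = Z_tr[y_train == 1].mean(axis=0)
        votes += (np.linalg.norm(Z_te - c1, axis=1) <
                  np.linalg.norm(Z_te - c0, axis=1))
    return (votes > n_proj / 2).astype(int)
```

Any base classifier can be dropped in per projection; the appeal for high-dimensional survey catalogues is that each base fit is cheap and low-dimensional.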
Apologies for the off-topic post, but as an Australian blogger I thought it important to give emphasis to the latest evidence of sadistic abuse going on under the authority of the Australian people at our offshore detention centres: The Nauru Files. The non-Aboriginal peoples of our country have long had a difficult relationship with other new arrivals, from the Lambing Flat riots to the White Australia policy, and the widespread racism experienced by the ‘wogs’ who settled here since the second world war and the ‘gooks’ who arrived after the Vietnam war. In recent times (that is, for at least the past fifteen years) our politicians have had great success in harnessing racist attitudes amongst the population by whipping up fear of boat people and promising ever tougher measures to deter the huddled masses from attempting passage to Australia by boat. A case in point being the Tampa affair, which saw John Howard win huge support for refusing entry to Australian waters for a Norwegian commercial vessel that had rescued 438 refugees (mostly Afghani) from a sinking ship in international waters between Australia and Indonesia.
The song “77%” by The Herd reflects on the Tampa Affair.
Since the Tampa Affair both sides of politics—namely, the Australian Labor Party and the (ironically named) Liberal Party of Australia—have taken a hardline attitude to ‘illegal’ arrivals of refugees by boat, building up the myth that we need a strong deterrent against those seeking to ‘jump the queue’ of ‘legal’ refugees. The prime focus of this deterrent has been the consignment of such ‘illegal’ arrivals by boat to detention centres on Nauru (a tiny Micronesian republic) and Manus Island (part of Papua New Guinea). Here the asylum seekers (including women and children) are held for years at a time while their cases are ‘investigated’ to determine whether or not the government will grant them refugee status. The living conditions in these internment camps are squalid, the guards are known to harass the women and kids for sexual favours, and rates of mental illness are consequently staggeringly high amongst detainees. In PNG homosexuality is illegal and gay refugees on Manus Island live in fear for their safety. The latest leak of documents from the Nauru detention centre paints a shocking picture which, it has been argued, amounts to raw evidence of torture.
As much as I feel pride in my fellow Aussies when I see them doing well at the Olympics or excelling in science and the arts, I feel just as much shame at our dickhead attitude towards refugees.
I’ll finish here with some thought-provoking words from “The Phillip Ruddock Blues” by a great Aussie band called TISM:
“Why should we let towel-heads in
’cause their ships won’t float?
what other race has ever come to Australia on a boat?
and if self-interest should rule
five miles out from shore
why the hell don’t it apply to those who live next door?”
Another astro-ph submission that I’m reviewing a few weeks late here is this one by Garnett et al., entitled “Detecting Damped Lyman-α Absorbers with Gaussian Processes”. The application here is pretty much what it says in the title; that is, probabilistic classification of QSO spectra into those containing DLAs and those without (plus, of course, an estimation of the redshift and column density of any candidate absorption systems). Underneath the hood this is a straightforward Bayesian model selection problem in which the underlying quasar spectrum is modelled as the realisation of a Gaussian process and any DLAs as Voigt profiles acting multiplicatively on top.
The strength of the analysis here is that a series of sensible decisions are made to yield a fast and effective implementation. In particular, (i) a large training dataset of labelled SDSS quasar spectra is used to learn a reasonable mean and covariance function for quasars without DLAs, with the covariance represented via a low-rank approximation; (ii) the Woodbury identity and Sylvester determinant theorem are used to enable efficient matrix operations on the Gaussian; and (iii) the two parameters of each absorber are integrated out numerically via a quasi-Monte Carlo ‘grid’ (FYI: Gerber & Chopin give some useful notes on the relative performance of MC and QMC methods at given dimension in the introduction to their SQMC read paper). While it may not be common in purely observational data analysis problems to have a pre-labelled catalogue against which to learn a covariance function, it is often the case in astronomy that one has a reference set of simulated mock datasets against which a covariance can be learnt, so this approach could well be useful in a number of other settings. For example, to learn the (presumably non-stationary) spatial covariance function underlying a log-Gaussian Cox process model of the galaxy distribution in a magnitude-limited survey.
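Point (ii) deserves a quick sketch. With a rank-M covariance K = sigma2 * I + U @ U.T, the Woodbury identity handles the quadratic form and Sylvester's determinant theorem the log-determinant, reducing the Gaussian log-density from O(n^3) to O(n M^2) operations; the following generic version (my own code, not the authors') makes the bookkeeping explicit.

```python
import numpy as np

def lowrank_gaussian_logpdf(y, mu, U, sigma2):
    """log N(y | mu, sigma2*I + U @ U.T) in O(n M^2) operations:
    Woodbury for the quadratic form, Sylvester for the determinant."""
    n, M = U.shape
    r = y - mu
    A = np.eye(M) + (U.T @ U) / sigma2  # M x M 'capacitance' matrix
    logdet = n * np.log(sigma2) + np.linalg.slogdet(A)[1]  # Sylvester
    Ur = U.T @ r
    quad = r @ r / sigma2 - Ur @ np.linalg.solve(A, Ur) / sigma2**2
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

The only matrices ever factorised are M x M, which is what makes per-spectrum model comparison over a large survey feasible.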
Looking forward, I wonder as to the validity of the assumed Gaussian likelihood (as a proxy for Poisson photon counting statistics) as the method is rolled out to noisier spectra; of course, removing Gaussianity from the top layer of the likelihood function would mean analytic solutions are no longer available for key marginalisation steps, with presumably a massive increase in the computational cost of the algorithm.
– I was surprised to see an unweighted mean in Eqn 22, since I would have thought each quasar might have a different observational noise on its j-th wavelength bin, for which one would usually use inverse-variance weighting.
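For reference, the inverse-variance weighted estimator I have in mind is the standard one:

```python
import numpy as np

def ivw_mean(values, variances):
    """Inverse-variance weighted mean and its variance: downweights the
    noisier measurements relative to an unweighted mean."""
    values, variances = np.asarray(values), np.asarray(variances)
    w = 1.0 / variances
    return np.sum(w * values) / np.sum(w), 1.0 / np.sum(w)
```

The cost, of course, is that it requires trustworthy per-quasar variance estimates in each bin.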
– The term ‘unbiased’ is used improperly (from a statistical point of view) in describing the QMC integral of Eqn 57.
I noticed the latest IFT group paper on astro-ph (as a replacement) this morning—“SOMBI*: Bayesian identification of parameter relations in unstructured cosmological data” by Frank et al. [* which I imagine to be pronounced thusly]—and, since I have some free time while waiting for simulations to finish, I thought I would comment on this one for today’s blog entry.
The idea of the SOMBI code/model is to first ‘automatically identify data clusters in high-dimensional datasets via the Self Organizing Map’, and then second to recover parameter relations ‘by means of a Bayesian inference within respective identified data clusters’. Notionally we are to imagine that ‘each sub-sample is drawn from one generation process and that the correlations corresponding to one process can be modeled by unique correlation functions.’ In effect, the authors cut their model such that the identification of clusters is not informed by the nature of any inferred correlation functions; nor is there any shrinkage achieved by a hierarchical structure of shared hyperparameters during the regression on each subset. This seems to me to be an unfortunate choice because it does turn out in the application given (and one would expect more generally) that some parameter relationships are not entirely dissimilar between groups (for which shrinkage would be sensible); likewise, two groups turning out to have similar relationships might actually be sensibly pooled back into one group in a model without cuts (and so forth).
For those interested, the model for regression in each group is simply linear,

y = X β + ε,

with X being a design matrix consisting of the 1st to m-th powers of the independent variable, x. The normal-uniform likelihood-prior choice is used to allow explicit marginalisation, and the empirical Bayes solution for the noise amplitude is adopted, with the polynomial order m chosen via the BIC.
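A generic version of this per-cluster fit can be sketched as follows (my own code, using ordinary least squares for the marginalised fit and including a constant term alongside the powers; SOMBI's actual implementation may differ in detail):

```python
import numpy as np

def fit_polynomial_bic(x, y, max_order=6):
    """Fit polynomials of increasing order by ordinary least squares and
    pick the order minimising the BIC (coefficients plus the noise
    variance counted as free parameters)."""
    n = len(x)
    best = None
    for m in range(1, max_order + 1):
        X = np.vander(x, m + 1, increasing=True)  # columns: x^0 .. x^m
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        rss = np.sum((y - X @ beta) ** 2)
        bic = n * np.log(rss / n) + (m + 2) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, m, beta)
    return best[1], best[2]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x**3 - x + 0.1 * rng.normal(size=200)
order, beta = fit_polynomial_bic(x, y)
```

Standardising x before constructing the powers (per the Gelman et al. recommendation mentioned below) noticeably improves the conditioning of the Vandermonde matrix.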
One criticism I would make is that this work is presented in vacuo without reference to the myriad other approaches to automatic data clustering and/or variable selection/regression already existing in the stats and ML literature. Certainly the BIC has a long history of use for selection of regression coefficients in linear (and GLM) models and its properties are well understood; but these are typically in the context of Bayesian models with priors that inform the expected scale of power terms and without the free parameter of unknown . (For reference, Gelman et al. recommend a standardisation of inputs before construction of power terms under their Cauchy prior structure for logistic regression). Alternative penalties for automatic covariate selection are (e.g.) the LASSO and ridge regression, and Bayesian model selection with Zellner’s g-prior. Likewise, there are numerous Bayesian schemes which aim to achieve clustering and regression inference without cuts to the model, typically involving a clustering over some latent space (with dimensional reduction; e.g. Titsias & Lawrence). In particular, it would have been useful to see such a comparison performed over a common dataset (e.g. one from the default R package) as is common in NIPS submissions.
It’s been a long time between posts here at Another Astrostatistics Blog; there’s no single reason for my hiatus, more like a diverse series of distractions, including travel to Seattle for the IDM Disease Modelling Symposium and to Pula (Sardinia) for the ISBA World Meeting. Both of these were excellent conferences and I will hopefully find the time to blog about some of the more interesting presentations as I remember them. A more recent distraction has been my referee duties for NIPS 2016; of the seven manuscripts I reviewed from the fields of ‘ABC’, ‘non-parametric models’ and ‘Gaussian processes’ at least two could be important tools for astronomical data analyses—the ε-free ABC approach of Papamakarios & Murray and the non-parametric (GP/spline-based) model for clustering disease trajectory maps (i.e., irregularly-sampled time series) of Schulam & Arora. The former has already had an unpublished application to the fitting of cosmological models described by Iain Murray at the MaxEnt 2013 meeting, while the latter could well prove useful in the study of x-ray or γ-ray bursts (for instance).
As a show of good faith for loyal readers wondering if indeed I really have resumed blogging duties I give below some thoughts upon reading the new paper by Licquia & Newman, in which the authors seek an ensemble estimate for the radial scale-length of the Milky Way disk through a meta-analysis of previously published estimates. Methodologically this new paper is quite similar to their earlier meta-analysis study of the Milky Way’s stellar mass and SFR, albeit with an added component of Bayesian model averaging to allow for uncertainty in the optimum choice of model for their ‘bad measurement’ category. By way of reminder, Licquia & Newman’s meta-analysis model is a mixture of two likelihoods—‘good’ studies contribute to the likelihood exactly as per their quoted uncertainties on the disk scale length in what amounts to a fixed-effect meta-analysis model, while ‘bad’ studies either don’t contribute at all or contribute in a weakened sense via a re-scaling of their error bars (depending on the model choice).
As with their earlier paper I have a positive impression of the analysis and implementation, and I think there remains great potential for meta-analysis studies to forge more precise estimates of key astronomical quantities from the many conflicting results reported in the literature. However, while reading this manuscript I began to question whether the adopted meta-analysis model—the mixture of a fixed-effect component with an over-dispersed component—is philosophically the ‘right’ model for this purpose. The fixed-effect model is typically used in (e.g.) clinical trial meta-analyses when it is believed that each separate study is targeting the same underlying parameter, with disagreements between estimates driven primarily by the sampling function rather than systematics like structural differences in the study cohorts. In the case of the Milky Way there is ultimately only one galaxy with a fixed population of stars to observe, so the variation in estimated disk scale lengths is driven primarily by differences in the region of the disk included for study, treatment of dust extinction, the adopted galaxy model and the method of fitting it. From this perspective it seems the answer sought requires a model for the variation introduced by different astronomers’ decisions regarding these analysis choices, for which a random-effects meta-analysis would seem more appropriate. Or at least, this seems a question which deserves some further investigation if the community is to be convinced that meta-analyses are indeed a good thing for astronomy.
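To make the fixed- vs random-effects distinction concrete, here is a textbook DerSimonian-Laird sketch (not Licquia & Newman's Bayesian model): the random-effects version inflates each study's variance by an estimated between-study variance tau^2, which is exactly the component that would absorb the variation introduced by different astronomers' analysis choices.

```python
import numpy as np

def meta_analysis(est, var):
    """Fixed-effect pooling alongside a DerSimonian-Laird random-effects
    estimate with method-of-moments between-study variance tau^2."""
    est, var = np.asarray(est), np.asarray(var)
    w = 1.0 / var
    mu_fe = np.sum(w * est) / np.sum(w)  # fixed-effect pooled mean
    Q = np.sum(w * (est - mu_fe) ** 2)   # heterogeneity statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(est) - 1)) / c)
    w_re = 1.0 / (var + tau2)
    mu_re = np.sum(w_re * est) / np.sum(w_re)
    return mu_fe, mu_re, tau2
```

When the studies agree to within their quoted errors tau^2 collapses to zero and the two estimates coincide; when they do not, the random-effects version widens the pooled uncertainty rather than forcing consensus.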
One side note: Although it’s not discussed explicitly it seems to me that the authors’ computation of the marginal likelihood for each model and the posteriors in the remaining parameters are based on summation of the mixture likelihood such that the ‘labels’ (i.e., ‘good’ or ‘bad’) are not sampled in the posterior representation, and hence each study cannot be assigned a posterior probability of being either ‘good’ or ‘bad’. It’s natural to perform the computation in this manner to avoid having to perform Monte Carlo over a high dimensional discrete space (the labels), but it seems a shame not to be able to report the inferred ‘quality’ of each study in the results. I wonder (not rhetorically) if there’s a simple a posteriori ‘trick’ that one can use to compute or approximate these scores?
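One candidate trick, sketched below with an illustrative parameterisation (a scale length L, bad fraction f and error-inflation factor c for the bad component; these names are my assumptions, not the paper's): for each retained posterior draw, compute every study's responsibility for the 'good' mixture component, then average over draws.

```python
import numpy as np
from scipy.stats import norm

def posterior_good_prob(samples, est, err):
    """Average, over posterior draws (L, f, c), of each study's
    responsibility for the 'good' mixture component."""
    probs = np.zeros(len(est))
    for L, f, c in samples:
        lg = np.log1p(-f) + norm.logpdf(est, loc=L, scale=err)
        lb = np.log(f) + norm.logpdf(est, loc=L, scale=c * err)
        probs += np.exp(lg - np.logaddexp(lg, lb))
    return probs / len(samples)
```

Since these are exactly the pointwise likelihood terms already evaluated inside the mixture sum, the per-study quality scores should come almost for free.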