## Bayesian orbital elements from image streaks …

I enjoyed this recent paper by Schneider & Dawson concerning a Bayesian algorithm for estimating orbital elements of Geosynchronous Earth Orbit objects (GEOs) from the direction, length, and width of their streaks in ground-based telescope images.  The authors detail a number of pragmatic modelling assumptions made to allow for fast computational approximation of the likelihood function contributed by each single image from amongst a set of images containing the same object.  The trick then is to make use of their partial posterior samples of orbital elements constrained by each single image in order to efficiently sample the full joint posterior.  Their solution here is to use a multiple importance sampling approach (Elvira et al. 2015; a la Veach & Guibas 1995).  This requires new evaluations of the approximate likelihood function for each partial posterior—although a KDE-based version is also noted to have been effective which leads me to point out a connection with the wider statistical literature on combining partial posteriors from data partitioning, especially Neiswanger et al.‘s ’embarrassingly parallel MCMC’.

Of course, KDE approaches suffer problems due to poor scaling statistically with the dimension of the parameter space and poor scaling computationally with the number of particles in the estimator, and with limitations when inferences are made far from the positions of the available particles.  The other common strategy (‘consensus MCMC‘) for combing partial posteriors is to use a Normal approximation to each (for which the product is also Normal) but other course this is limited by the suitability of the Normal for representing non-Normal (perhaps even multimodal) partial posteriors.  Ignoring multi-modality for now, I would speculate that astronomers might be well placed to devise a neat higher order approximation scheme to improve on the consensus MCMC approach given our familiarity with expansions about Gaussian potentials (e.g. Blinnikov & Moessner 1997; although I seem to remember their naming of this as an Edgeworth expansion is non-canonical?).

## Astro ph round up …

Some quick comments on a number of recent astro-ph submissions I’ve had open as tabs but had not read until today.  The first (in chronological order) is on ‘exploding satellites’ (satellite galaxies, not man-made space junk) by Kuepper et al., in which the authors present a Bayesian analysis underpinned by ‘simulations’ of the orbital paths of test particles in a model potential.  At first I thought these were stochastic simulations suggestive of an ABC-like case study, but in fact it turns out these are deterministic simulations for a given number of test particles.  In order to form a(n approximate) likelihood the final distribution of the test particles needs to be turned into a pdf via kernel smoothing.  Naturally one wonders as to the relationship between the posterior approximation and the number of test particles used.  This may well be a potential application area for the multi-objective Bayesian optimisation—where optimisation does not have to mean finding the posterior mode, but could mean optimisation of the approximation to the true posterior in e.g. a KL divergence sense—that I’m currently playing with as a tool for fitting malaria simulations.  Basically, the idea expounded in papers by Swersky et al. and Kandasamy et al. (with a supernova data example) is to use many cheap, noisy simulations for exploration and fewer expensive, less-noisy simulations for optimisation.  It’s an elegant approach but requires some fine tuning to get the acquisition function right and to implement an efficient Gaussian process-based inference procedure.

The next arXival I read today was a conference proceedings by Knezevic et al. in which the authors describe a Bayesian model for identifying Balmer-dominated shocks in spectroscopic ‘images’ of a supernova remnant.  The modelling left me with some odd impressions: in part because the data is first binned using Voronoi tessellations (why bin when you can impose a spatial prior which both finds the appropriate scale for sharing information and brings Bayesian shrinkage to the inference), and in part because the authors use a leave-one-out cross-validation (LOO-CV) estimate of the model evidence.  The latter is attributed to Bailer-Jones (2012) in which Coryn suggests an idea equivalent to O’Hagan’s (circa 1991) partial Bayes factors; namely, to use some proportion of the data to first constrain one’s (typically) uninformative priors and then use the remainder of the data to evaluate the marginal likelihood.  As discussed in O’Hagan’s 1995 read paper (and, in particular, in the contributed discussion appendix therein) there are a number of practical and theoretical concerns with this proposal.  The consensus seems to be that if one is going to use this approach one should use only a minimal amount of data for the initial constraint and most of the data for the marginal likelihood evaluation.  By contrast, the Bailer-Jones approach is to perform 10 fold cross-validation (i.e., using 90% of the sample to update the prior, and just 10% for marginal likelihood estimation), which is much closer to Murray Aitkin’s much maligned posterior Bayes factors idea.

The final arXival I read this morning was one by Gratier et al. looking at the molecular gass mass of M33.  The statistical model here is a Bayesian errors-in-variables regression, following in part the guidance of Kelly and Hogg et al., which seems to be properly implemented.  My only question is to the suitability of the (unimodal) Normal forms supposed for the intrinsic distributions of the observed covariates; histograms of the observed and inferred distributions would be useful here.  If the Normal is insufficiently flexible there are of course both simple and less simple alternatives: one of which is to use an infinite mixture of Normals which can be quite easy to code up as a Gibbs sampler by cobbling together existing R packages as we found in Bonnie’s paper on calibration of RDT and microscopy-based prevalence estimates.

## Fourier features for pulsar timing models …

I noticed the recent paper by Vallisneri & Haasteren on the arXiv entitled “Taming outliers in pulsar-timing datasets with hierarchical likelihoods and Hamiltonian sampling” in which the authors point out that Bayesian hierarchical models can, and perhaps should, be used to account for outliers when modelling pulsar-timing datasets with Gaussian processes.  I wasn’t convinced that this point justifies an entire paper, since the idea of using such a model with a mixture of likelihoods for good and bad data points is well known already in astronomy as per Hogg, Bovy & Lang (as pointed out by V&H) and others (e.g. I would point to Taylor et al.); the only novelty here being that it’s described and applied to a single mock pulsar dataset.  I was also not convinced by the authors’ arguments to show that the principled approach would surely beat the ad hoc alternative of 3$\sigma$-clipping, which was argued on the grounds that since the pulsar model includes estimation of the error term there is no single 3$\sigma$ threshold to work with.  One would think that an simple marginalised estimator of model ‘discrepancy’ (marginalising over the unknown $\sigma$ and underlying GP) for each data point would provide a tractable equivalent to threshold possible outliers.  It would then be meaningful to compare the performance of this approach against the hierarchical model in terms of bias and runtime under a variety of bad data scenarios.

Having said that I was very interested to see that random Fourier features are being embraced by the pulsar timing community, as described in greater detail in an earlier paper by the same authors and another by Lentati et al.  I was recently talking to Seth Flaxman about how odd it is that random Fourier bases aren’t more popular, with one possibility being that people find they need a huge number of components to get a close approximation to their targeted GP.  But of course this is going to depend on the application, or whether you even care: like with the INLA/SPDE approximation we use a lot at MAP, in practical applications it may be sufficient to say ‘we adopt the approximate model as our model’ and then just make sure you do a damn good job testing its performance on mock data and out-of-sample cross-validation (etc.).

Posted in Uncategorized | 3 Comments

## Highlights from two sessions at the RSS conference 2016 …

Earlier this week I popped up to the 2016 RSS Conference in Manchester to give a talk as part of the Geostatistical Models for Tropical Medicine session organised by Michelle Stanton (and featuring two other speakers: Victor Alegana from Southampton and Emanuele Giogi from Lancaster).  Given the rather steep conference fees (!) I decided to only register for one day, but nevertheless from the few talks I saw there were a couple of obvious relevance to astronomical statistics.  First, Simon Preston described the ‘ESAG’ (Elliptically Symmetric Angular Gaussian) distribution on the sphere which is constructed by projection/marginalisation of a three-dimensional Gaussian in $\mathcal{R}^3$ to the space $\mathcal{S}^2$.  Two additional conditions on the mean, $\mu$, and covariance matrix, $\Sigma$, of the three-dimensional Gaussian complete the definition of the ESAG and reduce the size of its parameter space to 5: namely, $\Sigma\mu=\mu$ and $|\Sigma|=1$.  One could well imagine using a mixture model in which the ESAG is the component distribution to represent something like the distribution of gamma-ray bursts on the sky.  Second, Timothy Cannings described the methodology behind his R package, RPEnsemble, for learning binary classifications via an ensemble of random projections from the input space, $\mathcal{R}^n$, to a lower-dimensional space, $\mathcal{R}^p$.  Given the prevalence of classification tasks in astronomical data analysis (e.g. distinguishing quasars from other bright sources in a wide-field survey catalogue) I would expect this one also to be a neat addition to the astronomers’ toolkit.

## [Non stats post] I love Australia, but seriously, f*£k Australia!

Apologies for the off topic post, but as an Australian blogger I thought it important to give emphasis to the latest evidence of sadistic abuse going on under the authority of the Australian people at our off shore detention centres: The Nauru Files.  The non-Aboriginal peoples of our country have a long had a difficult relationship with other new arrivals, from the Lambing flat riots to the White Australia policy, and the widespread racism experienced by the ‘wogs’ who settled here since the second world war and the ‘gooks’ who arrived after the Vietnam war.  In recent times (that is, for at least the past fifteen years) our politicians have had great success in harnessing racist attitudes amongst the population by whipping up fear of boat people and promising ever tougher measures to deter the huddled masses from attempting passage to Australia by boat.  A case in point being the Tampa affair which saw John Howard win huge support for refusing entry to Australian waters for a Norwegian commercial vessel that rescued 438 refugees (mostly Afghani) from a sinking ship in international waters between Australia and Indonesia.

The song “77%” by The Herd reflects on the Tampa Affair.

Since the Tampa Affair both sides of politics—namely, the Australian Labor Party and the (ironically named) Liberal Party of Australia—have taken a hardline attitude to ‘illegal’ arrivals of refugees by boat, building up the myth that we need a strong deterrent against those seeking to ‘jump the queue‘ of ‘legal’ refugees.  The prime focus of which has been the consignment of such ‘illegal’ arrivals by boat to detention centres on Nauru (a tiny Micronesian republic) and Manus Island (part of Papua New Guinea).  Here the asylum seekers (including women and children) are held for years at a time while their cases are ‘investigated’ to determine whether or not the government will grant them refugee status.  The living conditions are these internment camps are squalid, the guards are known to harass the women and kids for sexual favours, and rates of mental illness are consequently staggeringly high amongst detainees.  In PNG homosexuality is illegal and gay refugees on Manus Island live in fear for their safety.  The latest leak of documents from the Nauru detection centre paint a shocking picture which it has been argued amounts to raw evidence of torture.

As much as I feel pride in my fellow Aussies when I see them doing well at the Olympics or excelling in science and the arts, I feel just as much shame at our dickhead attitude towards refugees.

I’ll finish here with some thought-provoking words from “The Phillip Ruddock Blues” by a great Aussie band called TISM:

Why should we let towel-heads in
’cause their ships won’t float?
what other race has ever come to Australia on a boat?
and if self-interest should rule
five miles out from shore
why the hell don’t it apply to those who live next door?

Posted in Uncategorized | 2 Comments

## Learning GP covariance matrices by example …

Another astro ph submission that I’m reviewing a few weeks late here is this one by Garnett et al., entitled “Detecting Damped Lyman-$\alpha$ Absorbers with Gaussian Processes”.  The application here is pretty much what it says in the title; that is, probabilistic classification of QSO spectra into those containing DLAs and those without (plus, of course, an estimation of the redshift and column density of any candidate absorption systems).  Underneath the hood this is a straightforwards Bayesian model selection problem in which the underlying quasar spectrum is modelled as the realisation of a Gaussian process and any DLAs as Voigt profiles acting multiplicatively on top.

The strength of the analysis here is that a series of sensible decisions are made to yield a fast and effective implementation.  In particular, (i) a large training dataset of labelled SDSS quasar spectra is used to learn a reasonable mean and covariance function for quasars without DLAs, with the covariance represented via a low-rank approximation; (ii) the Woodbury identity and Sylvester determinant theorem are used to make efficient matrix operations on the Gaussian; and (iii) the two parameters of each absorber are integrated out numerically via a quasi-Monte Carlo ‘grid’ (FYI: Gerber & Chopin give some useful notes on the relative performance of MC and QMC methods at given dimension in the introduction to their QSMC read paper).  While it may not be common in purely observational data analysis problems to have a pre-labelled catalogue against which to learn a covariance function, it is often the case in astronomy that one has a reference set of simulated mock datasets against which a covariance can be learnt so this approach could well be useful in a number of other settings.  For example, to learn the (presumably non-stationary) spatial covariance function underlying a log-Gaussian Cox process of the galaxy distribution in a magnitude-limited survey.

Looking forwards, I wonder as to the validity of the assumed Gaussian likelihood (as proxy for Poisson photon counting statistics) as the method is rolled out to noisier spectra; of course, removing Gaussianity from the top layer of the likelihood function would mean analytic solutions are no longer available for key marginalisation steps, with presumably a massive increase in computational cost to the algorithm.

Minor notes:
– I was surprised to see an unweighted mean in Eqn 22 since I would have thought each quasar might have a different observational noise on its j-th wavelength bin for which usually one would use inverse variance weighting
– The term ‘unbiased’ is used improperly (from a statistical point of view) in describing the QMC integral of Eqn 57.

## Some Bayesian unsupervised learning …

I noticed the latest IFT group paper on astro-ph (as a replacement) this morning—“SOMBI*: Bayesian identification of parameter relations in unstructured cosmological data” by Frank et al.  [* which I imagine to be pronounced thusly]—and, since I have some free time while waiting for simulations to finish, I thought I would comment on this one for today’s blog entry.

The idea of the SOMBI code/model is to first ‘automatically identify data clusters in high-dimensional datasets via the Self Organizing Map’, and then second to recover parameter relations ‘by means of a Bayesian inference within respective identified data clusters’.  Notionally we are to imagine that ‘each sub-sample is drawn from one generation process and that the correlations corresponding to one process can be modeled by unique correlation functions.’  In effect, the authors cut their model such that the identification of clusters is not informed by the nature of any inferred correlation functions; nor is there any shrinkage achieved by a hierarchical structure of shared hyperparameters during the regression on each subset.  This seems to me to be an unfortunate choice because it does turn out in the application given (and one would expect more generally) that some parameter relationships are not entirely dissimilar between groups (for which shrinkage would be sensible); likewise, two groups turning out to have similar relationships might actually be sensibly pooled back into one group in a model without cuts (and so forth).

For those interested the model for regression in each group is simply:
$y \sim N(X'B,\sigma^2)$,
$B_{1:M} \sim U(-\infty,\infty)$,
$\sigma \sim U(0,\infty)$
with $X$ being a design matrix consisting of the 1 to $m$th powers of the independent variable, $x$.  The normal-uniform likelihood-prior choice is used to allow explicit marginalisation and the empirical Bayes solution for $\sigma|m$  is adopted under $m$ chosen via the BIC.

One criticism I would make is that this work is presented in vacuo without reference to the myriad other approaches to automatic data clustering and/or variable selection/regression already existing in the stats and ML literature.  Certainly the BIC has a long history of use for selection of regression coefficients in linear (and GLM) models and its properties are well understood; but these are typically in the context of Bayesian models with priors that inform the expected scale of power terms and without the free parameter of unknown $\sigma$.  (For reference, Gelman et al. recommend a standardisation of inputs before construction of power terms under their Cauchy prior structure for logistic regression).  Alternative penalties for automatic covariate selection are (e.g.) the LASSO and ridge regression, and Bayesian model selection with Zellner’s g-prior.  Likewise, there are numerous Bayesian schemes which aim to achieve clustering and regression inference without cuts to the model, typically involving a clustering over some latent space (with dimensional reduction; e.g. Titsias & Lawrence).  In particular, it would have been useful to see such a comparison performed over a common dataset (e.g. one from the default R package) as is common in NIPS submissions.