## Podcast …

I recently gave in to the conceit of trying to record a podcast; you can see/hear the results here as the first instalment of the Malaria Atlas Project podcast series.

Otherwise I’ve been travelling a lot these past three weeks: one week in Seattle at the IDM Symposium, one week of holidays in San Francisco, and three days in Geneva at the Global Fund.  Back in Oxford next week!

## A convexified mixture model …

Just a brief note to point out this recent arXival by Leistedt & Hogg in which the color-magnitude diagram is used to help refine (parallax) distance estimates from Gaia via a Bayesian hierarchical model.  The idea is that although a star’s apparent magnitude and colour tell us very little about its distance in isolation they can help to provide a substantial amount of Bayesian shrinkage (via the absolute magnitude-colour diagram) given a large enough collection of stars with additional noisy distance measurements (that one may hope to refine).  Bayesian shrinkage (see David van Dyk’s talk from IAUS306) is a powerful idea in hierarchical modelling and, although now used fairly widely in photometric redshift studies and galaxy classifications, its potential certainly has not been exhausted in any sense.

What I found particularly interesting here was the form chosen to represent the density of (true/latent) stellar positions on the (absolute) colour-magnitude diagram, which was a Normal/Gaussian mixture model in which the means of each component are placed in a grid over the 2D parameter space, the standard deviations of the components are all held fixed, and only the vector of component weights is to be inferred.  The motivation given for this modelling decision is that it renders the inference process easier, in particular because it makes it easier to marginalise out the latent variables in the Gibbs sampling step.  In fact the benefits of such a model run even deeper by way of conferring convexity to the mixture model construction as known from studies of this model from the perspective non-parametric maximum likelihood estimation (e.g. Feng & Dicker 2016).

Edit: While cleaning up my hundreds of open browser tabs I realised I should also have pointed to this recent arXival by Si et al. as another example of hierarchical Bayesian shrinkage for Gaia data.

Posted in Uncategorized | 3 Comments

## ABC for Lyman forest studies …

I enjoyed reading this recent arXival by Davies et al. in which the authors present an ABC  (approximate Bayesian computation) analysis of the cosmological reionization history.  In particular, the idea is to use ABC to overcome intractability of the joint likelihood function for Lyman-$alpha$ and Lyman-$beta$ absorption along the line (or lines) of sight to distant ($z>5$) quasars under realistic models for the generating process.  The reason I enjoyed this paper was because the steps of the ABC analysis were comprehensively described and seem to have been carefully thought through: the origin of the intractability in the likelihood and the choice of summary statistics are thoroughly discussed, and mock data simulations under known parameters are used to test the pipeline.  Perhaps not surprising because I see in the acknowledgements both Foreman-Mackey and Hogg, and also Fouesneau who I might one day hope to see apply ABC to his stochastic IMF models.  Also, of interest is that one of the summary statistics explored is a pseudo-likelihood statistic motivated by the connection between ABC and Indirect Inference discussed in (e.g.) Drovandi et al. (2015).

Some outstanding questions I was left with are as follows.  In their demonstration of the inadequacy of a multivariate Gaussian likelihood (e.g. Fig 5) for the sampling distribution of transmission it looks as if the only problem is the approximation of the Poisson by a Gaussian rather than necessarily any non-Gaussianity in the underlying rate parameters of the Poisson.  In that case it looks like a pure indirect inference model in which the rate parameters of the Poisson in each bin are estimated from mock data and then a Poisson likelihood is supplied might be a good option; ignoring the correlation between bins (is this even important) but making the sampling distribution more realistic?  With possible extension to estimating the full mean and covariance matrix of a (multivariate) Gamma-Poisson auxiliary model.

The use of only a PIT (probability integral transform) style test of Bayesian parameter coverage also puzzled me, since this test is only confirming that the ABC likelihood function behaves like a likelihood function (in terms of introducing some new information to the prior) but doesn’t speak to the length of the confidence intervals being generated.  That is, such a test is a good sanity check on the numerical implementation of the ABC algorithm but can’t satisfy the usual role of the PIT run out-of-sample for model checking, nor can it be used to rank summary statistics a la the RMSSE framework of Nunes & Balding (2010).

An interesting side bar is that David Balding is well known for his work on statistics for forensic science.  Unrelated brain dump: the other night I found myself watching a forensic files episode in which an astronomer helps solve the case.

## Bayesian inference from a fixed set of simulations …

I noticed this recent paper on the arXiv by Patel et al. entitled, “Orbits of massive satellite galaxies – II. Bayesian Estimates of the Milky Way and Andromeda masses using high precision astrometry and cosmological simulations”, in which the authors consider an inference problem on the periphery of current ABC & pseudo-marginal methods.  In this case the authors are interested in estimating the virial mass of the Milky Way given observational estimates of some of the orbital elements (proper motion, radial velocity, etc.) with respect to Andromeda.  These data enter their model as (independent) Normal likelihoods for the observations given the latent (‘true’) values of those observables; the latter being imagined drawn from a prior which is represented by the drawing of Milky Way-like galaxy groups from a cosmological N-body simulation.  The catch here is that the simulation is expensive and is not able to be run directly by the authors, so all inferences must be made from a pre-compiled catalog containing a finite number of mock galaxy groups drawn from this prior.

This looks at face values like an ABC or pseudo-marginal problem in the sense that there is a reliance on mock datasets, but in fact there are (ignoring cosmological parameters, which are held fixed) no free parameters in the simulations: all the simulations are there for is to represent the prior on the latent variables of Milky Way-like galaxy groups: virial masses and orbital elements.  For the present paper the authors approximate the posterior for the Milky Way virial mass using an importance sample re-weighting scheme formed by weighing each mock galaxy by the likelihood of observing a subset of its orbital elements.   In one set of experiments these are the the satellite’s maximum circular velocity, its separation from the host, and its total velocity today relative to the host galaxy; and in another set of experiments these are the satellite’s maximum circular velocity and the magnitude of its orbital angular momentum.

In this scheme the posterior is behaving much like a kernel-density estimate in which the kernel acts on distance between the observed and latent orbital elements, using these to predict the virial mass.  As one adds (informative) observations the number of mock galaxy groups contributing non-trivially to the estimate decreases such that although the posterior estimate will typically contract it be also be noisier.  Indeed the authors recognise this effect and consider some of their decisions with respect to the effective sample size, although primarily their investigation here is in terms of sensitivity to the definition of a ‘Milky Way-like’ group (which is a hard threshold on the mock galaxy group catalog I’ve glossed over in the above).

Imaging the extension of this inference procedure to further applications, it would be helpful to have a general framework for deciding which orbital elements and other observational data to use and which to leave out.  Such a framework would need to balance the information gain of each possible orbital element against the reduction in effective sample size over the available simulations; and would presumably be based on cross-validation estimates of bias and coverage for mock galaxy groups in the simulation sample.  This is of course a very similar problem to summary statistic selection in ABC analyses and I’m particularly reminded of the paper by Nunes & Balding (2010) in which those authors run a preliminary ABC analysis with unrefined summary statistics to identify the important region of parameter space over which the summary statistics should be optimised.

Another way to attack this problem would be via a flexible parametric model for the prior as in the RNADE-based scheme of Papamakarios & Murray (2016).  Indeed this looks very similar to the example problem described (no slides available as far as I can see) by Iain Murray at the MaxEnt 2013 conference in Canberra.

## Vale Joe Hilbe (1944-2017) …

I had the sad news this week of the passing of Joseph Hilbe, an important statistician and educator in the field of regression models for count data and an enthusiastic advocate of astro-statistics as an emerging discipline.  Emilie (my wife) & I met Joe three years ago at the IAUS 306 (Statistical Challenges in 21st Century Cosmology) in Lisbon where he helped to establish the COIN (Cosmostatistic Initiative) project which in turn led to a series of three papers on the Overlooked Potential of Generalised Linear Models for Astronomical Data Analysis (1 de Souza et al., 2 Elliott et al., 3 de Souza et al.).  It was actually a lucky coincidence that Joe’s interests in GLMs aligned with the concerns a number of us had been having about how often we’d seen count data mangled into bins for astronomical regression studies, but I suspect he would have found a way to steer us on to his favourite topic in any case!  On a personal level I found Joe to be a good conversationalist, although I only learnt of his track and field exploits through wikipedia (so it seems he was a modest guy too).  Even in his 70s Joe was in high demand from various groups and companies looking for statistical analysis help, and one regret I have is that I never had the time to take on some of the jobs he was hoping to offload on us younger statisticians (I suspect they would have be quite lucrative relative to my academic salary).  All that to say, Joe will be sorely missed and for those who haven’t read his books yet a good place to start is his new one coming out in June this year: Bayesian Models for Astrophysical Data.

## Intrigue in the neutrino kingdom …

I noticed this brief note on the arXiv today by Schwetz et al. highlighting some of those authors’ concerns with a recent arXival by Simpson et al. regarding Bayesian evidence for the normal neutrino hierarchy.  The question examined by Simpson et al.—and in a recent paper by Hannestad & Schwetz—is to what extent the available data from neutrino oscillation experiments and cosmological modelling should lead us to favour the normal hierarchy (NH) over the inverted hierarchy (IH) of neutrino masses: the former being $m_1 \leq m_2 \leq m_3$ and the later $m_3 \leq m_1 \leq m_2$.  The available neutrino oscillation experiments place constraints on the squared differences between two of the three possible mass pairings; these constraints being slightly different for the NH and IH models.  The cosmological data, on the other hand, place an upper bound on the sum of masses ($m_1+m_2+m_3 \leq 0.13$ eV) which is close to a lower limit on the sum of masses under the IH scenario of $0.0982$ eV.  Hence, it is felt that future cosmological experiments may well rule out the IH scenario; notwithstanding the authors’ present interest in precisely how likely should we consider the IH scenario for the time being.

The approach taken by Simpson et al. is to (1) set equal a priori odds on the NH and IH scenarios, (2) specify a generative model for the neutrino masses under each scenario, and then (3) form posterior Bayes odds using the marginal likelihoods with respect to the observed data from both the neutrino oscillation and cosmology experiments together.  Their models are given hierarchical priors such that for the NH scenario:

$\{\log m_1,\log m_2,\log m_3\} = {X_{(1)},X_{(2)},X_{(3)}}|{X_1,X_2,X_3}$ (order statistics)
for $i=1,\dots,3,\ X_i \sim \mathrm{Normal}(\mu,\sigma)$
$\log \mu \sim U(\log 0.001,\log 0.5)$ (eV)
$\log \sigma \sim U(-4,3)$ (reading numerical values off their figure labels).

Their prior for the IH scenario is the same except for the assignment of order statistics in the first line.  Because the sample space is small and likelihood evaluations are cheap they are able to compute posterior Bayes factors and the marginal posteriors in the hyper-parameters $\{\mu,\sigma\}$ via brute force (the arithmetic mean estimator in the mass space over a grid of hyperparameters).

All this seems pretty straightforwards, and indeed when I skimmed this paper (the Simpson et al. one) when it first appeared on the arXiv I didn’t think anything particular other than, “seems sensible, not relevant for blogging because the marginal likelihood was so easy to compute”.  If I understand Schwetz et al.‘s note correctly, their complaint translates (in my interpretation) to be that Simpson et al. set their prior odds on the models before considering the neutrino oscillation data.  By contrast, in their earlier (but still quite recent paper) Hannestad & Schwetz decided to begin from the neutrino oscillation experiment likelihoods, letting the posteriors from non-informative priors updated by those constraints be their priors for examining the cosmological data with equal prior odds of each scenario assigned after the neutrino oscillation experiment.  By the nature of their analysis, Simpson et al. must go further in specifying a prior on their neutrino masses prior to observing any data.  Schwetz et al.‘s objection is that this introduces an artificial weighting of the evidence from the neutrino experiment.

For sure the Simpson et al. approach will be sensitive to the priors adopted (like all Bayesian model selection problems) but it doesn’t seem to me to be a silly thing to do in the context of forming betting odds on the hypotheses from the totality of available data.  Further prior sensitivity analyses might be warranted and the nuances of the analysis could be more clearly explained but I personally would conclude that the Simpson et al. analysis is a worthwhile contribution to the debate.

## Pseudo-marginal MCMC for cosmological statistics …

Having been (re-)inspired in pseudo-marginal methods by Patrick Muchmore’s talk at the recent ABC in Banff conference I’ve started evangelising to cosmologists who might be interested in fits based on multivariate Normal (Gaussian) distributions with estimated covariance matrices.  For example, in the case of estimating cosmological parameters from the (empirical) two point correlation function of a galaxy survey.  Here, although the power spectrum of the cosmological field is given by the model parameters, the covariance matrix of the observable two point correlation function (in a given set of radial bins) cannot be directly computed due to (e.g.) the irregular survey window, variations of imaging depth across the field, missing data due to bright star cut-outs, redshift fibre collisions, and so forth.

In such circumstances a sensible solution is to use mock datasets to side step the intractable calculation of exact likelihoods.  Sometimes this will be done with full cosmological N-body simulations [Option A] which will give the most faithful representation of the random field but is extremely expensive.  Typically one will only have the means to produce $\lesssim 100$ independent N-body realisations; these can be subsampled and bootstrapped to generate estimates of the observational covariance matrix.  Other times the mock datasets will be constructed by directly bootstrapping the original dataset [Option B], though of course this approach cannot capture variability (cosmic variance) on scales beyond the survey footprint.  Yet another approach employing mock datasets is that of approximating the cosmological field as a log-Normal process [Option C] which brings fast Fourier methods for field simulation into play.  For the following discussion I am interested in Option C as perhaps the only computationally feasible method of these three with which to address the dependence of the observational covariance matrix on the true (unknown) cosmological parameters.

It is well known (thanks to Hartlap et al.) that inversion of the sample covariance matrix does not provide an unbiased estimator for the precision matrix, though a small correction factor allows for simple construction of an unbiased estimate of the precision matrix.  This in turn provides an unbiased estimate of the log-likelihood (of a multivariate Normal), but not of the likelihood itself.  Unfortunately, it can be an impossible job to convert an unbiased estimator taking values on the entire real line into an unbiased estimator for a non-negative statistic.  That said, there is a different set of estimators—mostly developed in the 1950s, 60s, and 70s; sometimes for the purpose of studying bombing accuracy—that target directly the multivariate Normal likelihood.  These have been identified by Price et al. and Muchmore & Marjoram as key ingredients for constructing pseudo-marginal MCMC algorithms on multivariate Normal targets.

The achievement of the pseudo-marginal MCMC algorithm is that in the unlucky case that one cannot exactly compute the likelihood function at a given set of parameters one can still construct an MCMC chain targeting the same Bayesian posterior as in the exact case as long as one has access to an unbiased estimate of the likelihood.  Just plug the unbiased likelihood estimate for each new proposal into the Metropolis-Hastings accept/reject step (and, importantly, save and re-use the likelihood estimate attached to each accepted state).  While the pseudo-marginal MCMC chain will always be noisier and stickier (and hence, demand running of a longer chain) than the equivalent MCMC algorithm using exact (non-noisy) likelihoods it will—quite amazingly!—nevertheless represent the same posterior.  This is not guaranteed to be the case for, e.g., a noisy MCMC algorithm built by plugging in the likelihood estimate corresponding to the Hartlap et al estimate for the unbiased precision matrix.  Likewise, for plugging in the marginalised t-distribution likelihood given by Selentin & Heavens or the likelihood corresponding to the shrinkage estimator of Joachimi.  (These latter two should rather be considered as methods relating to Options A and B mentioned earlier.)

The reason the cost of producing mock datasets is an important consideration for this approach is that at each proposal of the MCMC chain a new unbiased estimate has to be computed, and for a multivariate Normal likelihood this will usually mean generating more than one mock dataset.  For instance, the estimator used by Price et al requires at least $d+3$ datasets where $d$ is the dimension of the observed summary statistic (typically ~20-50 for an empirical two point correlation function).  The Muchmore & Marjoram estimator can deal with just one mock dataset but will be very noisy in that regime.  Tuning of the pseudo-marginal algorithm should focus on adjusting this number of mock datasets used, along with the proposal distribution, to target a sensible acceptance rate (see Andreiu & Vihola and Sherlock et al.).  Efficient use of parallel computing architectures would be worthwhile when using this algorithm in anger; even brute force approaches such as batching the $n$ mock datasets to be produced over the available cores at each proposal.