## Daleks vs MultiNest …

A recent paper on the arXiv that caught my attention was this one by entitled Accelerated Parameter Estimation with DALE$\chi$ by Daniel & Linder (2017), in which the authors present a code for determining confidence contours on model parameters for which the likelihood is computationally expensive to evaluate.  Important to note is that the authors are focused here on Frequentist confidence contours, rather than Bayesian credible intervals.  Hence, their objective is to trace out an n-1 dimensional shell (or set of shells) inside the n dimensional parameter space of a given model corresponding to some specified proportion of the maximum likelihood.  This is a fundamentally different objective to that of identifying credible intervals in Bayesian inference in which there is (typically) no simple prescription for the likelihood (or posterior density) that defines a credible interval given the maximum likelihood (or density at the posterior mode).  Though the authors suggest an algorithm in an earlier paper that purports to produce approximate credible intervals from a Frequentist driven exploration of the parameter space it is also acknowledged therein that this is a very crude approximation. For this reason I find it rather surprising that much of this recent paper is focused on a comparison against MultiNest which is designed specifically as a Bayesian tool with completely different objectives: evaluation of the model evidence (marginal likelihood) and posterior sampling.  This confusion of objectives is compounded by numerous incorrect characterisations of methods like MCMC throughout the article, such as the idea that MCMC is focused “on finding the maximum of the posterior probability”.

All that aside, what is the DALE$\chi$ algorithm? Well, basically it’s a guided stochastic search algorithm designed to efficiently map out a given likelihood contour through a mix of directed and random moves.  The direction is through an objective function based on a distance from the target likelihood value and an objective function to maximise distance from previously sampled points.  This is achieved through a Nelder-Mead procedure instead of the Gaussian process-based search strategy used in earlier papers by the same authors and collaborators, including Daniel et al. (2014) and Bryan et al. (2007).  Interestingly the full author list on those papers contains Jeff Schneider who’s more recently been involved in a lot of progressive work on Bayesian optimisation with Gaussian processes, but in Bryan et al. (2007) they spend a long time dumping on Bayesian methods, both the use of MCMC for posterior sampling and the use of Bayesian credible intervals in general! Despite their dogmatic stance in that paper the authors do actually do something interesting which is to look at a non-parameteric confidence ball around the observed CMB power spectrum (based on even earlier work by Miller and Genovese).  In today’s paper though the focus is simply on Neyman-Pearson generalised likelihood ratio contours.

If I was reviewing this paper I would ask for a comparison of the new code not against MultiNest but against the authors’ earlier Gaussian process-based codes.  The interest then is under what types of likelihood function costs does one outperform the other.  At face value one would imagine that the Gaussian process-based code is more efficient for very expensive likelihoods (say, 1 min or more) while the new version would be more efficient for very cheap likelihoods (say <0.1 sec), but their relative performance in between is more difficult to speculate and likely depends significantly on the dimensionality of the problem.  If a comparison against Bayesian methods is desired then I think it would make more sense to also compare against a Bayesian method also designed for genuinely expensive likelihoods such as the Bayesian optimisation codes used for semi-analytic model fitting by Bower et al. (2010); and I would in that case set the $\chi^2$ target to the level set credible interval identified by first running the Bayesian analysis.

## Podcast …

I recently gave in to the conceit of trying to record a podcast; you can see/hear the results here as the first instalment of the Malaria Atlas Project podcast series.

Otherwise I’ve been travelling a lot these past three weeks: one week in Seattle at the IDM Symposium, one week of holidays in San Francisco, and three days in Geneva at the Global Fund.  Back in Oxford next week!

## A convexified mixture model …

Just a brief note to point out this recent arXival by Leistedt & Hogg in which the color-magnitude diagram is used to help refine (parallax) distance estimates from Gaia via a Bayesian hierarchical model.  The idea is that although a star’s apparent magnitude and colour tell us very little about its distance in isolation they can help to provide a substantial amount of Bayesian shrinkage (via the absolute magnitude-colour diagram) given a large enough collection of stars with additional noisy distance measurements (that one may hope to refine).  Bayesian shrinkage (see David van Dyk’s talk from IAUS306) is a powerful idea in hierarchical modelling and, although now used fairly widely in photometric redshift studies and galaxy classifications, its potential certainly has not been exhausted in any sense.

What I found particularly interesting here was the form chosen to represent the density of (true/latent) stellar positions on the (absolute) colour-magnitude diagram, which was a Normal/Gaussian mixture model in which the means of each component are placed in a grid over the 2D parameter space, the standard deviations of the components are all held fixed, and only the vector of component weights is to be inferred.  The motivation given for this modelling decision is that it renders the inference process easier, in particular because it makes it easier to marginalise out the latent variables in the Gibbs sampling step.  In fact the benefits of such a model run even deeper by way of conferring convexity to the mixture model construction as known from studies of this model from the perspective non-parametric maximum likelihood estimation (e.g. Feng & Dicker 2016).

Edit: While cleaning up my hundreds of open browser tabs I realised I should also have pointed to this recent arXival by Si et al. as another example of hierarchical Bayesian shrinkage for Gaia data.

Posted in Uncategorized | 3 Comments

## ABC for Lyman forest studies …

I enjoyed reading this recent arXival by Davies et al. in which the authors present an ABC  (approximate Bayesian computation) analysis of the cosmological reionization history.  In particular, the idea is to use ABC to overcome intractability of the joint likelihood function for Lyman-$alpha$ and Lyman-$beta$ absorption along the line (or lines) of sight to distant ($z>5$) quasars under realistic models for the generating process.  The reason I enjoyed this paper was because the steps of the ABC analysis were comprehensively described and seem to have been carefully thought through: the origin of the intractability in the likelihood and the choice of summary statistics are thoroughly discussed, and mock data simulations under known parameters are used to test the pipeline.  Perhaps not surprising because I see in the acknowledgements both Foreman-Mackey and Hogg, and also Fouesneau who I might one day hope to see apply ABC to his stochastic IMF models.  Also, of interest is that one of the summary statistics explored is a pseudo-likelihood statistic motivated by the connection between ABC and Indirect Inference discussed in (e.g.) Drovandi et al. (2015).

Some outstanding questions I was left with are as follows.  In their demonstration of the inadequacy of a multivariate Gaussian likelihood (e.g. Fig 5) for the sampling distribution of transmission it looks as if the only problem is the approximation of the Poisson by a Gaussian rather than necessarily any non-Gaussianity in the underlying rate parameters of the Poisson.  In that case it looks like a pure indirect inference model in which the rate parameters of the Poisson in each bin are estimated from mock data and then a Poisson likelihood is supplied might be a good option; ignoring the correlation between bins (is this even important) but making the sampling distribution more realistic?  With possible extension to estimating the full mean and covariance matrix of a (multivariate) Gamma-Poisson auxiliary model.

The use of only a PIT (probability integral transform) style test of Bayesian parameter coverage also puzzled me, since this test is only confirming that the ABC likelihood function behaves like a likelihood function (in terms of introducing some new information to the prior) but doesn’t speak to the length of the confidence intervals being generated.  That is, such a test is a good sanity check on the numerical implementation of the ABC algorithm but can’t satisfy the usual role of the PIT run out-of-sample for model checking, nor can it be used to rank summary statistics a la the RMSSE framework of Nunes & Balding (2010).

An interesting side bar is that David Balding is well known for his work on statistics for forensic science.  Unrelated brain dump: the other night I found myself watching a forensic files episode in which an astronomer helps solve the case.

## Bayesian inference from a fixed set of simulations …

I noticed this recent paper on the arXiv by Patel et al. entitled, “Orbits of massive satellite galaxies – II. Bayesian Estimates of the Milky Way and Andromeda masses using high precision astrometry and cosmological simulations”, in which the authors consider an inference problem on the periphery of current ABC & pseudo-marginal methods.  In this case the authors are interested in estimating the virial mass of the Milky Way given observational estimates of some of the orbital elements (proper motion, radial velocity, etc.) with respect to Andromeda.  These data enter their model as (independent) Normal likelihoods for the observations given the latent (‘true’) values of those observables; the latter being imagined drawn from a prior which is represented by the drawing of Milky Way-like galaxy groups from a cosmological N-body simulation.  The catch here is that the simulation is expensive and is not able to be run directly by the authors, so all inferences must be made from a pre-compiled catalog containing a finite number of mock galaxy groups drawn from this prior.

This looks at face values like an ABC or pseudo-marginal problem in the sense that there is a reliance on mock datasets, but in fact there are (ignoring cosmological parameters, which are held fixed) no free parameters in the simulations: all the simulations are there for is to represent the prior on the latent variables of Milky Way-like galaxy groups: virial masses and orbital elements.  For the present paper the authors approximate the posterior for the Milky Way virial mass using an importance sample re-weighting scheme formed by weighing each mock galaxy by the likelihood of observing a subset of its orbital elements.   In one set of experiments these are the the satellite’s maximum circular velocity, its separation from the host, and its total velocity today relative to the host galaxy; and in another set of experiments these are the satellite’s maximum circular velocity and the magnitude of its orbital angular momentum.

In this scheme the posterior is behaving much like a kernel-density estimate in which the kernel acts on distance between the observed and latent orbital elements, using these to predict the virial mass.  As one adds (informative) observations the number of mock galaxy groups contributing non-trivially to the estimate decreases such that although the posterior estimate will typically contract it be also be noisier.  Indeed the authors recognise this effect and consider some of their decisions with respect to the effective sample size, although primarily their investigation here is in terms of sensitivity to the definition of a ‘Milky Way-like’ group (which is a hard threshold on the mock galaxy group catalog I’ve glossed over in the above).

Imaging the extension of this inference procedure to further applications, it would be helpful to have a general framework for deciding which orbital elements and other observational data to use and which to leave out.  Such a framework would need to balance the information gain of each possible orbital element against the reduction in effective sample size over the available simulations; and would presumably be based on cross-validation estimates of bias and coverage for mock galaxy groups in the simulation sample.  This is of course a very similar problem to summary statistic selection in ABC analyses and I’m particularly reminded of the paper by Nunes & Balding (2010) in which those authors run a preliminary ABC analysis with unrefined summary statistics to identify the important region of parameter space over which the summary statistics should be optimised.

Another way to attack this problem would be via a flexible parametric model for the prior as in the RNADE-based scheme of Papamakarios & Murray (2016).  Indeed this looks very similar to the example problem described (no slides available as far as I can see) by Iain Murray at the MaxEnt 2013 conference in Canberra.

## Vale Joe Hilbe (1944-2017) …

I had the sad news this week of the passing of Joseph Hilbe, an important statistician and educator in the field of regression models for count data and an enthusiastic advocate of astro-statistics as an emerging discipline.  Emilie (my wife) & I met Joe three years ago at the IAUS 306 (Statistical Challenges in 21st Century Cosmology) in Lisbon where he helped to establish the COIN (Cosmostatistic Initiative) project which in turn led to a series of three papers on the Overlooked Potential of Generalised Linear Models for Astronomical Data Analysis (1 de Souza et al., 2 Elliott et al., 3 de Souza et al.).  It was actually a lucky coincidence that Joe’s interests in GLMs aligned with the concerns a number of us had been having about how often we’d seen count data mangled into bins for astronomical regression studies, but I suspect he would have found a way to steer us on to his favourite topic in any case!  On a personal level I found Joe to be a good conversationalist, although I only learnt of his track and field exploits through wikipedia (so it seems he was a modest guy too).  Even in his 70s Joe was in high demand from various groups and companies looking for statistical analysis help, and one regret I have is that I never had the time to take on some of the jobs he was hoping to offload on us younger statisticians (I suspect they would have be quite lucrative relative to my academic salary).  All that to say, Joe will be sorely missed and for those who haven’t read his books yet a good place to start is his new one coming out in June this year: Bayesian Models for Astrophysical Data.

## Intrigue in the neutrino kingdom …

I noticed this brief note on the arXiv today by Schwetz et al. highlighting some of those authors’ concerns with a recent arXival by Simpson et al. regarding Bayesian evidence for the normal neutrino hierarchy.  The question examined by Simpson et al.—and in a recent paper by Hannestad & Schwetz—is to what extent the available data from neutrino oscillation experiments and cosmological modelling should lead us to favour the normal hierarchy (NH) over the inverted hierarchy (IH) of neutrino masses: the former being $m_1 \leq m_2 \leq m_3$ and the later $m_3 \leq m_1 \leq m_2$.  The available neutrino oscillation experiments place constraints on the squared differences between two of the three possible mass pairings; these constraints being slightly different for the NH and IH models.  The cosmological data, on the other hand, place an upper bound on the sum of masses ($m_1+m_2+m_3 \leq 0.13$ eV) which is close to a lower limit on the sum of masses under the IH scenario of $0.0982$ eV.  Hence, it is felt that future cosmological experiments may well rule out the IH scenario; notwithstanding the authors’ present interest in precisely how likely should we consider the IH scenario for the time being.

The approach taken by Simpson et al. is to (1) set equal a priori odds on the NH and IH scenarios, (2) specify a generative model for the neutrino masses under each scenario, and then (3) form posterior Bayes odds using the marginal likelihoods with respect to the observed data from both the neutrino oscillation and cosmology experiments together.  Their models are given hierarchical priors such that for the NH scenario:

$\{\log m_1,\log m_2,\log m_3\} = {X_{(1)},X_{(2)},X_{(3)}}|{X_1,X_2,X_3}$ (order statistics)
for $i=1,\dots,3,\ X_i \sim \mathrm{Normal}(\mu,\sigma)$
$\log \mu \sim U(\log 0.001,\log 0.5)$ (eV)
$\log \sigma \sim U(-4,3)$ (reading numerical values off their figure labels).

Their prior for the IH scenario is the same except for the assignment of order statistics in the first line.  Because the sample space is small and likelihood evaluations are cheap they are able to compute posterior Bayes factors and the marginal posteriors in the hyper-parameters $\{\mu,\sigma\}$ via brute force (the arithmetic mean estimator in the mass space over a grid of hyperparameters).

All this seems pretty straightforwards, and indeed when I skimmed this paper (the Simpson et al. one) when it first appeared on the arXiv I didn’t think anything particular other than, “seems sensible, not relevant for blogging because the marginal likelihood was so easy to compute”.  If I understand Schwetz et al.‘s note correctly, their complaint translates (in my interpretation) to be that Simpson et al. set their prior odds on the models before considering the neutrino oscillation data.  By contrast, in their earlier (but still quite recent paper) Hannestad & Schwetz decided to begin from the neutrino oscillation experiment likelihoods, letting the posteriors from non-informative priors updated by those constraints be their priors for examining the cosmological data with equal prior odds of each scenario assigned after the neutrino oscillation experiment.  By the nature of their analysis, Simpson et al. must go further in specifying a prior on their neutrino masses prior to observing any data.  Schwetz et al.‘s objection is that this introduces an artificial weighting of the evidence from the neutrino experiment.

For sure the Simpson et al. approach will be sensitive to the priors adopted (like all Bayesian model selection problems) but it doesn’t seem to me to be a silly thing to do in the context of forming betting odds on the hypotheses from the totality of available data.  Further prior sensitivity analyses might be warranted and the nuances of the analysis could be more clearly explained but I personally would conclude that the Simpson et al. analysis is a worthwhile contribution to the debate.