## Emulators and process convolutions …

I read two interesting papers recently: the first from astro ph was by Kern et al., entitled “Emulating Simulations of Cosmic Dawn for 21cm Power Spectrum Constraints on Cosmology, Reionization, and X-Ray Heating“.  The type of emulator approach described in this paper is similar to the one I adopted for my prevalence-incidence paper in that it uses a training library of simulations to predict the simulator output in latent variable space (here the bispectrum).  This is in contrast to emulators in the Bayesian optimisation family (such as the Bower et al. example for galaxy SAMs) in which the emulator aims to predict in log-likelihood space.  For my purposes in the prevalence-incidence paper the motivation for choosing the former approach was efficiency, since the same library was to be used for a number of different sites so it seemed ‘obviously’ to be better to focus on predicting the latent variables than the log-likelihood.  On the other hand, for this bispectrum case there’s only one dataset to be compared against so the relative merits of one scheme versus the other are not obvious to me.  At face value it seems that the decision is principally shaped by the authors’ preference to do MCMC with the emulator rather than using a GP approximation to the posterior which means foregoing the sequential design efficiency of the latter.

Emulator choice aside, the authors seem to have devised a clever implementation for constructing the training library and output model: using a two-step procedure to refine the library design in a region of non-trivial posterior mass, and reducing the scale of the latent variable emulation problem via a reduced PCA projection of the latent data space.  The coefficients of the PCA projection are interpolated between training points with a Gaussian process operating on the 11 dimensional parameter space.  The authors note that modelling the mean function of the GP is unnecessary because the data are centred about zero before constructing the principal component weights; though I would suggest that substantial improvements could actually still be possible with some sensible parametric terms in the mean function.  Past experience with emulators suggests that in high dimensions the use of mean predictors can help take pressure off learning the GP kernel structure; and indeed in the history matching emulator these terms take on a great importance.  (In my work on openMalaria I’m having a lot of success applying stacked generalisation with GPs.)

The second interesting paper I read this week was “Space & space-time modeling using process convolutions” by Dave Higdon (suggested by Tom Loredo in the comments on a earlier post).  This is an older paper on an alternative approach to constructing Gaussian process models as the convolution of a white noise process by a kernel; with the idea being to approximate the process by sampling the white noise process only at a fixed set of finite points.  The advantage of this method over other GP constructions is that it’s easy to produce a very flexible field by allowing the smoothing kernel to be freely parameterised without the PD constraints on a kernel defined in covariance space.   Extensions to the model to produce non-Gaussian fields or non-stationary Gaussian field are straightforwards.  More interestingly for me is the potential to produce meaningfully correlated fields such as might describe the incidence rates of two infectious diseases sharing a common host vector.  There’s a paper on this by Boyle & Frean which suggests a sensible implementation of this idea.

## Daleks vs MultiNest …

A recent paper on the arXiv that caught my attention was this one by entitled Accelerated Parameter Estimation with DALE$\chi$ by Daniel & Linder (2017), in which the authors present a code for determining confidence contours on model parameters for which the likelihood is computationally expensive to evaluate.  Important to note is that the authors are focused here on Frequentist confidence contours, rather than Bayesian credible intervals.  Hence, their objective is to trace out an n-1 dimensional shell (or set of shells) inside the n dimensional parameter space of a given model corresponding to some specified proportion of the maximum likelihood.  This is a fundamentally different objective to that of identifying credible intervals in Bayesian inference in which there is (typically) no simple prescription for the likelihood (or posterior density) that defines a credible interval given the maximum likelihood (or density at the posterior mode).  Though the authors suggest an algorithm in an earlier paper that purports to produce approximate credible intervals from a Frequentist driven exploration of the parameter space it is also acknowledged therein that this is a very crude approximation. For this reason I find it rather surprising that much of this recent paper is focused on a comparison against MultiNest which is designed specifically as a Bayesian tool with completely different objectives: evaluation of the model evidence (marginal likelihood) and posterior sampling.  This confusion of objectives is compounded by numerous incorrect characterisations of methods like MCMC throughout the article, such as the idea that MCMC is focused “on finding the maximum of the posterior probability”.

All that aside, what is the DALE$\chi$ algorithm? Well, basically it’s a guided stochastic search algorithm designed to efficiently map out a given likelihood contour through a mix of directed and random moves.  The direction is through an objective function based on a distance from the target likelihood value and an objective function to maximise distance from previously sampled points.  This is achieved through a Nelder-Mead procedure instead of the Gaussian process-based search strategy used in earlier papers by the same authors and collaborators, including Daniel et al. (2014) and Bryan et al. (2007).  Interestingly the full author list on those papers contains Jeff Schneider who’s more recently been involved in a lot of progressive work on Bayesian optimisation with Gaussian processes, but in Bryan et al. (2007) they spend a long time dumping on Bayesian methods, both the use of MCMC for posterior sampling and the use of Bayesian credible intervals in general! Despite their dogmatic stance in that paper the authors do actually do something interesting which is to look at a non-parameteric confidence ball around the observed CMB power spectrum (based on even earlier work by Miller and Genovese).  In today’s paper though the focus is simply on Neyman-Pearson generalised likelihood ratio contours.

If I was reviewing this paper I would ask for a comparison of the new code not against MultiNest but against the authors’ earlier Gaussian process-based codes.  The interest then is under what types of likelihood function costs does one outperform the other.  At face value one would imagine that the Gaussian process-based code is more efficient for very expensive likelihoods (say, 1 min or more) while the new version would be more efficient for very cheap likelihoods (say <0.1 sec), but their relative performance in between is more difficult to speculate and likely depends significantly on the dimensionality of the problem.  If a comparison against Bayesian methods is desired then I think it would make more sense to also compare against a Bayesian method also designed for genuinely expensive likelihoods such as the Bayesian optimisation codes used for semi-analytic model fitting by Bower et al. (2010); and I would in that case set the $\chi^2$ target to the level set credible interval identified by first running the Bayesian analysis.

## Podcast …

I recently gave in to the conceit of trying to record a podcast; you can see/hear the results here as the first instalment of the Malaria Atlas Project podcast series.

Otherwise I’ve been travelling a lot these past three weeks: one week in Seattle at the IDM Symposium, one week of holidays in San Francisco, and three days in Geneva at the Global Fund.  Back in Oxford next week!

## A convexified mixture model …

Just a brief note to point out this recent arXival by Leistedt & Hogg in which the color-magnitude diagram is used to help refine (parallax) distance estimates from Gaia via a Bayesian hierarchical model.  The idea is that although a star’s apparent magnitude and colour tell us very little about its distance in isolation they can help to provide a substantial amount of Bayesian shrinkage (via the absolute magnitude-colour diagram) given a large enough collection of stars with additional noisy distance measurements (that one may hope to refine).  Bayesian shrinkage (see David van Dyk’s talk from IAUS306) is a powerful idea in hierarchical modelling and, although now used fairly widely in photometric redshift studies and galaxy classifications, its potential certainly has not been exhausted in any sense.

What I found particularly interesting here was the form chosen to represent the density of (true/latent) stellar positions on the (absolute) colour-magnitude diagram, which was a Normal/Gaussian mixture model in which the means of each component are placed in a grid over the 2D parameter space, the standard deviations of the components are all held fixed, and only the vector of component weights is to be inferred.  The motivation given for this modelling decision is that it renders the inference process easier, in particular because it makes it easier to marginalise out the latent variables in the Gibbs sampling step.  In fact the benefits of such a model run even deeper by way of conferring convexity to the mixture model construction as known from studies of this model from the perspective non-parametric maximum likelihood estimation (e.g. Feng & Dicker 2016).

Edit: While cleaning up my hundreds of open browser tabs I realised I should also have pointed to this recent arXival by Si et al. as another example of hierarchical Bayesian shrinkage for Gaia data.

Posted in Uncategorized | 3 Comments

## ABC for Lyman forest studies …

I enjoyed reading this recent arXival by Davies et al. in which the authors present an ABC  (approximate Bayesian computation) analysis of the cosmological reionization history.  In particular, the idea is to use ABC to overcome intractability of the joint likelihood function for Lyman-$alpha$ and Lyman-$beta$ absorption along the line (or lines) of sight to distant ($z>5$) quasars under realistic models for the generating process.  The reason I enjoyed this paper was because the steps of the ABC analysis were comprehensively described and seem to have been carefully thought through: the origin of the intractability in the likelihood and the choice of summary statistics are thoroughly discussed, and mock data simulations under known parameters are used to test the pipeline.  Perhaps not surprising because I see in the acknowledgements both Foreman-Mackey and Hogg, and also Fouesneau who I might one day hope to see apply ABC to his stochastic IMF models.  Also, of interest is that one of the summary statistics explored is a pseudo-likelihood statistic motivated by the connection between ABC and Indirect Inference discussed in (e.g.) Drovandi et al. (2015).

Some outstanding questions I was left with are as follows.  In their demonstration of the inadequacy of a multivariate Gaussian likelihood (e.g. Fig 5) for the sampling distribution of transmission it looks as if the only problem is the approximation of the Poisson by a Gaussian rather than necessarily any non-Gaussianity in the underlying rate parameters of the Poisson.  In that case it looks like a pure indirect inference model in which the rate parameters of the Poisson in each bin are estimated from mock data and then a Poisson likelihood is supplied might be a good option; ignoring the correlation between bins (is this even important) but making the sampling distribution more realistic?  With possible extension to estimating the full mean and covariance matrix of a (multivariate) Gamma-Poisson auxiliary model.

The use of only a PIT (probability integral transform) style test of Bayesian parameter coverage also puzzled me, since this test is only confirming that the ABC likelihood function behaves like a likelihood function (in terms of introducing some new information to the prior) but doesn’t speak to the length of the confidence intervals being generated.  That is, such a test is a good sanity check on the numerical implementation of the ABC algorithm but can’t satisfy the usual role of the PIT run out-of-sample for model checking, nor can it be used to rank summary statistics a la the RMSSE framework of Nunes & Balding (2010).

An interesting side bar is that David Balding is well known for his work on statistics for forensic science.  Unrelated brain dump: the other night I found myself watching a forensic files episode in which an astronomer helps solve the case.

## Bayesian inference from a fixed set of simulations …

I noticed this recent paper on the arXiv by Patel et al. entitled, “Orbits of massive satellite galaxies – II. Bayesian Estimates of the Milky Way and Andromeda masses using high precision astrometry and cosmological simulations”, in which the authors consider an inference problem on the periphery of current ABC & pseudo-marginal methods.  In this case the authors are interested in estimating the virial mass of the Milky Way given observational estimates of some of the orbital elements (proper motion, radial velocity, etc.) with respect to Andromeda.  These data enter their model as (independent) Normal likelihoods for the observations given the latent (‘true’) values of those observables; the latter being imagined drawn from a prior which is represented by the drawing of Milky Way-like galaxy groups from a cosmological N-body simulation.  The catch here is that the simulation is expensive and is not able to be run directly by the authors, so all inferences must be made from a pre-compiled catalog containing a finite number of mock galaxy groups drawn from this prior.

This looks at face values like an ABC or pseudo-marginal problem in the sense that there is a reliance on mock datasets, but in fact there are (ignoring cosmological parameters, which are held fixed) no free parameters in the simulations: all the simulations are there for is to represent the prior on the latent variables of Milky Way-like galaxy groups: virial masses and orbital elements.  For the present paper the authors approximate the posterior for the Milky Way virial mass using an importance sample re-weighting scheme formed by weighing each mock galaxy by the likelihood of observing a subset of its orbital elements.   In one set of experiments these are the the satellite’s maximum circular velocity, its separation from the host, and its total velocity today relative to the host galaxy; and in another set of experiments these are the satellite’s maximum circular velocity and the magnitude of its orbital angular momentum.

In this scheme the posterior is behaving much like a kernel-density estimate in which the kernel acts on distance between the observed and latent orbital elements, using these to predict the virial mass.  As one adds (informative) observations the number of mock galaxy groups contributing non-trivially to the estimate decreases such that although the posterior estimate will typically contract it be also be noisier.  Indeed the authors recognise this effect and consider some of their decisions with respect to the effective sample size, although primarily their investigation here is in terms of sensitivity to the definition of a ‘Milky Way-like’ group (which is a hard threshold on the mock galaxy group catalog I’ve glossed over in the above).

Imaging the extension of this inference procedure to further applications, it would be helpful to have a general framework for deciding which orbital elements and other observational data to use and which to leave out.  Such a framework would need to balance the information gain of each possible orbital element against the reduction in effective sample size over the available simulations; and would presumably be based on cross-validation estimates of bias and coverage for mock galaxy groups in the simulation sample.  This is of course a very similar problem to summary statistic selection in ABC analyses and I’m particularly reminded of the paper by Nunes & Balding (2010) in which those authors run a preliminary ABC analysis with unrefined summary statistics to identify the important region of parameter space over which the summary statistics should be optimised.

Another way to attack this problem would be via a flexible parametric model for the prior as in the RNADE-based scheme of Papamakarios & Murray (2016).  Indeed this looks very similar to the example problem described (no slides available as far as I can see) by Iain Murray at the MaxEnt 2013 conference in Canberra.