Two recent arXivals to catch up on here, aided in part by a more-than-usually-tedious bus journey from Witney to Oxford this morning (the culprits being ongoing road works in Witney and excess traffic attracted by filming of the Countryfile tv show at Blenheim). The model emulation paper is by Schmit & Pritchard and concerns estimation of cosmological model parameters given observations of the 21cm signal associated with the Epoch of Reionisation. The challenge is that to compute the likelihood one needs to be able to compute the power spectrum for any given set of model parameters, requiring either a very expensive ‘numerical simulation’ or a somewhat less expensive ‘semi-numerical simulation’. Meaning that in either case direct MCMC is rather costly because at each proposal a new simulation must be run. (Statisticians, note that both the numerical and semi-numerical simulations give deterministic outputs, so we’re not in the realm of pseudo-marginal MCMC or ABC.) Naturally then, there is interest in application of model emulation techniques to make efficient inference on this problem.
The specific approach to emulation taken by Schmit & Pritchard is to emulate at the level of the latent (deterministic) variable: namely building an artificial neural network (ANN) model for the power spectrum given a training set of power spectra evaluated on a Latin hypercube of parameter values. This is the same direction as explored by Kern et al. (2017) [see my discussion here]—but is in contrast to the version of model emulation most common in statistics (see, e.g., the review by Levy & Steinberg 2011) in which the emulation is of the objective surface (e.g. log likelihood function) itself. Not surprisingly (since ANNs and GPs can theoretically represent a similar class of functions to a similar level of accuracy), Schmit & Pritchard find that like Kern et al. they too are able to produce an accurate emulator for the ‘semi-numerical’ model and can use it to form an approximate posterior matching reasonably well the full MCMC posterior.
I think this is an interesting direction of research but would be surprised if it can outperform the more conventional type of model emulation (which goes under the term “Bayesian optimisation” in machine learning studies) for posterior sampling problems (use case ii and effectively iii of Schmit & Pritchard) owing to the latter’s advantage of having a continually updated estimate of the posterior along with its uncertainties. Thereby allowing for efficient design-based addition of new training points one-by-one against a utility function capturing chosen aspects of posterior fidelity. Moreover, the ability of multi-task (or multi-fidelty) Bayesian optimisation to learn across similar objectives would naturally fit into a scheme whereby both ‘semi-numerical’ and ‘numerical’ simulations are used judiciously. Where I do agree that these emulations in the latent variable space may have a competitive edge is in their use case i: experimental design for 21cm studies (i.e. determining optimal survey layouts and depths to facilitate inference).
The quantile regression paper is by Kacprzak et al. (2017) who present a strategy for improving the efficiency of ABC posterior approximation relative to simple rejection sampling. The idea being to use successive waves of quantile regression to identify nested regions of parameter space outside of which mock datasets with acceptable epsilons are extremely unlikely to be produced; an approach which they note draws some inspiration from history matching (coincidentally, part of a model emulation strategy previously used for training expensive semi-analytical models of galaxy formation). Since it relies on quantile regression and the estimation of confidence bounds for the regression this is a technique appropriate in the intermediate regime of ABC in which rejection ABC is taking longer than one would like yet we can still generate a decent number (10s of 1000s) of mock datasets during the course of our posterior approximation. In this context I don’t see the appeal of this strategy over sequential Monte Carlo samplers for ABC which can take on board simple parameter distributions within their refreshment kernels such as would have similar flexibility to a quantile regression with first and second order polynomial terms (for a single component Normal) or higher (for a Normal mixture). (Note that the particular quantile regression scheme used by Kacprzak et al. is a SVM-based method for which the required sample size for accurate quantile regression confidence intervals is not readily apparent; but the sizes used here in each round would only just be appropriate for low order polynomial fits at the targeted quantiles; see e.g. Kocherginsky et al.). The advantage of the SMC algorithm though is that it’s targeting the full ABC posterior with known convergence behaviour. Certainly it would be easy for the authors to compare against an off the shelf SMC algorithm thanks to (e.g.) EasyABC, or Ishida et al. (2015) or Jennings & Madigan (2016).
I’ve been thinking a lot recently about non-Bayesian hypothesis testing, in part because my reading has taken me to a number of particle physics papers in which the likelihood ratio test (LRT) is very often the tool of choice for exploring new physics. Having digested Geyer’s (1994) formulation of Chernoff’s version of the LRT—the one (Chernoff’s) used (correctly) by particle physicists when the null model lies at a boundary parameterisation of the alternative model (e.g. Chianese et al. 2017; and see Cowan et al. 2010 for a detailed exposition)—I’ve been wondering whether there are any non-trivial cases in astronomy where the asymptotic distribution can be found via the shape of the tangent cone. One application area might be to the detection of spectral lines. In that context Protassov et al. (2002) highlight the problem and offer a Bayesian solution, but also give the warning: “Although some theoretical progress on the asymptotic distribution of [the test statistic] when [the parameter] is on the boundary of [the parameter space] has been made (e.g., by Chernoff (1954) and specifically for finite mixtures by Lindsay (1995)), extending such results to a realistic highly structured spectral model would require sophisticated mathematical analysis“. That said, my gut instinct is that since they don’t mention many of the papers (like Geyer’s one) giving methods to do the calculation that this comment is not intended to be the final word on the subject, more so a segue to their presentation of the Bayesian solutions.
Thinking about detection of spectral features, I noticed some bizarre mis-use of the chi square asymptotics by Rivera-Thorsen et al. (2015) who have invented a way to shoe-horn the test on to their fitting when there are only as many data points as the number of parameters in the model: “The minimum number of contributing lines required for our fitting procedure is two. In this case, with two data points and two free parameters, there are zero degrees of freedom and thus the usual reduced χ2 is not defined. Instead, we report a pseudo-reduced χ2 , which is defined as χ2 / ([DOF] + 1).” Somehow the warnings on abuse of chi squared in astronomy by Andrae et al. (2010) still aren’t getting through.
An interesting aside from my reading in particle physics is that it turns out that when people report a result as “3 sigma significance” (for instance) it doesn’t necessarily mean what I thought it meant. In particle physics it seems to most often mean the normal quantile of the p-value corresponding to a one sided test (Cowan et al. 2010) whereas I had always thought it meant the p-value corresponding to a two-sided test (as in cases I’m familiar with from astronomy; e.g. King et al. 2012). When I mentioned this on the astrostatistics facebook group some guy thought I was having a go at particle physics but that really wasn’t the case. I was surprised: until last week I thought it was one way and not some other way. If the purpose of converting a p-value to a sigma is to ease interpretation it seems to me that having competing definitions or standard defeats that purpose. Also, there is not entirely consistency within fields; for a particle physics example of the second type see Aaij et al. (2012).
Earlier this week I was up in London for the Statistical Data Science 2017 conference organised by Imperial College London and hosted by Winton Capital. Naturally quite a few of the talks focussed on machine learning methods and Gaussian processes. My talk was on the use of both: that is, stacked generalisation with GPs (e.g. Bhatt et al. 2017) for Bayesian optimisation (one type of model emulation; see also ‘history matching’ for semi-analytical galaxy sims). One idea that got me thinking was Joe Meagher‘s use of Gaussian processes to reconstruct the echolocation call of the ancestral bat species: in this case echolocation calls of different species were represented as basis functions with weights distributed according to a Gaussian process with covariance determined by a distance over the space of phylogenetic trees relating these bats. Apparently something similar has been done for reconstructing ancient speech patterns. Another interesting talk concerned an investigation by Francisco Rodriguez-Algarra of the rules learnt by deep classifiers to distinguish between different music genres: in one application it turned out that the classifier was predominantly based on sub-audible frequencies below 20 Hz! So although it was doing a good job on the training dataset it clearly wasn’t learning the type of difference notable to a human ear, and hence its potential to generalise over different qualities of recordings that may not have such accurate sub-bass reproduction was expected to be poor.
Of relevant to today’s discussion was the talk by Lukas Mosser on the use of generative adversarial networks (GANs) to model the distribution of structures seen in various types of porous media (a beadpack, a Berea sandstone, and a Ketton limestone). Although a flexible Gaussian process model fit to 3D scans of these rocks was able to learn the correlation function of filled/unfilled voxels, it was observed that mock datasets drawn from the GP posterior simply didn’t look convincingly like real porous media—evidently there was information contained in higher order structure functions not captured by the GP. Hence, Lukas and his team turned to GANs as a tool for generating more realistic mock images of the porous media. Reassuringly, they were able to show that the mock images from the GAN also reproduced the correlation function found by the GP, but also key higher order functions.
After his talk Lukas was able to point me towards a recent astronomy paper that I’d missed on the arXiv by Mustafa Mustafa et al. called “Creating Virtual Universes Using Generative Adversarial Networks“. In this arXival the authors train a GAN to mock weak lensing maps produced by an expensive cosmological simulation code. They show that they are successfully able to do this, such that the fitted GAN can then be used to easily draw additional mock weak lensing maps statistically indistinguishable for their original training set. The authors motivate the use of GANs in astronomy with reference to the use of cosmological simulations to learn the covariance matrix of key observational summary statistics under different parameter settings. As has been discussed previously on this blog, there are various ways this can be done with different biased and unbiased estimators for the covariance, although when used as plug-ins all are still producing biased likelihood estimates (Price, Drovandi & Lee 2017; see also Selentin & Heavens 2015). The utility of GANs in this scenario is not readily apparent to me, however. If one imagines using a large number of mock images drawn from the fitted GAN to estimate the covariance matrix then I’m sure you will effectively get a point estimate but it won’t capture any more information than available in the original set of training simulations; rather it’ll just have a somewhat opaque regularisation to it (by the structure of the GAN) and will have been extra-expensive to obtain (because of the cost of fitting the GAN).
I suspect the authors recognise this as in their penultimate paragraph they point towards future work on a parametrised GAN to learn the generative model over a suite of mock datasets from different parameterisations of the base model. This makes sense to me but brings the problem back from the specific power of GANs to the regime of other deep generative models which have been used for this purpose, such as the example by Papamakarios & Murray (2016) and the variation Bayes + GAN version by Tran et al. (2017). I think there certainly are use cases for such approaches but I don’t have faith that precision cosmological modelling is one of them because of the sheer precision required for knowledge of the cosmological covariance matrices (or, if you like, their precision matrices) in order to make progress on the current understanding. I only hope that the cosmologists planning the next generation of precision observations are budgeting for some serious computational simulation time and not just pinning their hopes on some new algorithmic or statistical solution. I know I don’t have any easy answers.
As a minor comment I would note that I’m not a fan of (mis-)using p-values as evidence for the null, even if it’s in a somewhat tautological case (here the network is trained until it passes a p-value threshold of the K-S test).
A recent arXival by Spencer et al. presents a Bayesian analysis of radial velocity data for the binary fraction in (the ultra faint dwarf galaxy) Leo II in which the inference proceeds by comparison of mock datasets against the observed data. In this particular case the methodology ends up corresponding to a Bayesian indirect inference algorithm (e.g. Drovandi et al., Creel et al.) in which the summary statistic is the counts in bins of the beta statistic (observational error weighted squared pairwise difference between radial velocities in consecutive observations) and its auxiliary model is the skewed Normal (independent across bins). In principle one could perform an ordinary Bayesian analysis of this data under a hierarchical model (which would also yield predictions for the binarity or otherwise of each star in the sample) but fitting might (? relative to the potential for contraction of the posterior) be overly time consuming due to the dimensionality and shape of the joint posterior. Instead the indirect inference approach here is straightforwards to implement targeting the binary fraction directly, and a further set of input-output simulations reassure that the posterior median is effectively an unbiased estimator for the true binary fraction.
I spent last week in Paris visiting the Institut Pasteur for a Gates Foundation workshop organised to focus thinking around the problem of designing a Target Product Profile for a ‘commercial’ malaria serology test. On my journey back I read a recent arXival by Olamaie et al. on the topic of cluster modelling with electron pressure profiles. This one caught my eye because the authors describe the use of Bayesian model selection to determine the number of knots (in a simple interpolating function for the radial profile shape) necessary to support the complexity of the available data. In this case it is imagined that the positions of the knots and the amplitude of the cluster profile at each knot location are free parameters and that the cluster profile is linearly interpolated between knots. While this produces a model with discontinuous gradient I actually don’t mind that aspect since these types of models tend to nevertheless produce smooth posterior aggregates (e.g. posterior median curves; see, e.g., the voronoi tesselation GP model of Kim et al.).
Where I see some issues with the method is in the choice of priors for the knot amplitudes which are uniform and independent of knot position or ordering. An ordering is discussed for the radial location of each knot but as far as I can see this doesn’t affect the prior on each knot’s amplitude, and indeed is there simply to avoid an identifiability ‘problem’ (similar to the labelling in mixture models; i.e., this is only a ‘problem’ for some computational methods of marginal likelihood estimation rather than being an inherent flaw in the Bayesian construction). My concern is that the use of these vague uniform priors is inconsistent with our expectations of a broadly decreasing profile; applying these priors to real data will ultimately skew the posterior in model space towards small numbers of knots where signal-to-noise is low. In particular, I would suggest that this construction will produce an overly simplified representation of profiles in both the cluster core and cluster outskirts. To play “Devil’s Advocate” (& based on my experience with Sersic profile fitting) I would suggest that the ordinary parametric forms may well perform better in this type of application—or at least there are no examples in the manuscript to even attempt to convince me the other way.
If I were conducting the analysis I would be interested in models taking some prior information from the available theoretical parametric forms; for example, to build a model from GP deviations about a canonical profile. If such a model was not considered fast enough to fit generally then perhaps at least to use such a model to learn some more refined priors on the knot construction.
I read two interesting papers recently: the first from astro ph was by Kern et al., entitled “Emulating Simulations of Cosmic Dawn for 21cm Power Spectrum Constraints on Cosmology, Reionization, and X-Ray Heating“. The type of emulator approach described in this paper is similar to the one I adopted for my prevalence-incidence paper in that it uses a training library of simulations to predict the simulator output in latent variable space (here the bispectrum). This is in contrast to emulators in the Bayesian optimisation family (such as the Bower et al. example for galaxy SAMs) in which the emulator aims to predict in log-likelihood space. For my purposes in the prevalence-incidence paper the motivation for choosing the former approach was efficiency, since the same library was to be used for a number of different sites so it seemed ‘obviously’ to be better to focus on predicting the latent variables than the log-likelihood. On the other hand, for this bispectrum case there’s only one dataset to be compared against so the relative merits of one scheme versus the other are not obvious to me. At face value it seems that the decision is principally shaped by the authors’ preference to do MCMC with the emulator rather than using a GP approximation to the posterior which means foregoing the sequential design efficiency of the latter.
Emulator choice aside, the authors seem to have devised a clever implementation for constructing the training library and output model: using a two-step procedure to refine the library design in a region of non-trivial posterior mass, and reducing the scale of the latent variable emulation problem via a reduced PCA projection of the latent data space. The coefficients of the PCA projection are interpolated between training points with a Gaussian process operating on the 11 dimensional parameter space. The authors note that modelling the mean function of the GP is unnecessary because the data are centred about zero before constructing the principal component weights; though I would suggest that substantial improvements could actually still be possible with some sensible parametric terms in the mean function. Past experience with emulators suggests that in high dimensions the use of mean predictors can help take pressure off learning the GP kernel structure; and indeed in the history matching emulator these terms take on a great importance. (In my work on openMalaria I’m having a lot of success applying stacked generalisation with GPs.)
The second interesting paper I read this week was “Space & space-time modeling using process convolutions” by Dave Higdon (suggested by Tom Loredo in the comments on a earlier post). This is an older paper on an alternative approach to constructing Gaussian process models as the convolution of a white noise process by a kernel; with the idea being to approximate the process by sampling the white noise process only at a fixed set of finite points. The advantage of this method over other GP constructions is that it’s easy to produce a very flexible field by allowing the smoothing kernel to be freely parameterised without the PD constraints on a kernel defined in covariance space. Extensions to the model to produce non-Gaussian fields or non-stationary Gaussian field are straightforwards. More interestingly for me is the potential to produce meaningfully correlated fields such as might describe the incidence rates of two infectious diseases sharing a common host vector. There’s a paper on this by Boyle & Frean which suggests a sensible implementation of this idea.
A recent paper on the arXiv that caught my attention was this one by entitled Accelerated Parameter Estimation with DALE by Daniel & Linder (2017), in which the authors present a code for determining confidence contours on model parameters for which the likelihood is computationally expensive to evaluate. Important to note is that the authors are focused here on Frequentist confidence contours, rather than Bayesian credible intervals. Hence, their objective is to trace out an n-1 dimensional shell (or set of shells) inside the n dimensional parameter space of a given model corresponding to some specified proportion of the maximum likelihood. This is a fundamentally different objective to that of identifying credible intervals in Bayesian inference in which there is (typically) no simple prescription for the likelihood (or posterior density) that defines a credible interval given the maximum likelihood (or density at the posterior mode). Though the authors suggest an algorithm in an earlier paper that purports to produce approximate credible intervals from a Frequentist driven exploration of the parameter space it is also acknowledged therein that this is a very crude approximation. For this reason I find it rather surprising that much of this recent paper is focused on a comparison against MultiNest which is designed specifically as a Bayesian tool with completely different objectives: evaluation of the model evidence (marginal likelihood) and posterior sampling. This confusion of objectives is compounded by numerous incorrect characterisations of methods like MCMC throughout the article, such as the idea that MCMC is focused “on finding the maximum of the posterior probability”.
All that aside, what is the DALE algorithm? Well, basically it’s a guided stochastic search algorithm designed to efficiently map out a given likelihood contour through a mix of directed and random moves. The direction is through an objective function based on a distance from the target likelihood value and an objective function to maximise distance from previously sampled points. This is achieved through a Nelder-Mead procedure instead of the Gaussian process-based search strategy used in earlier papers by the same authors and collaborators, including Daniel et al. (2014) and Bryan et al. (2007). Interestingly the full author list on those papers contains Jeff Schneider who’s more recently been involved in a lot of progressive work on Bayesian optimisation with Gaussian processes, but in Bryan et al. (2007) they spend a long time dumping on Bayesian methods, both the use of MCMC for posterior sampling and the use of Bayesian credible intervals in general! Despite their dogmatic stance in that paper the authors do actually do something interesting which is to look at a non-parameteric confidence ball around the observed CMB power spectrum (based on even earlier work by Miller and Genovese). In today’s paper though the focus is simply on Neyman-Pearson generalised likelihood ratio contours.
If I was reviewing this paper I would ask for a comparison of the new code not against MultiNest but against the authors’ earlier Gaussian process-based codes. The interest then is under what types of likelihood function costs does one outperform the other. At face value one would imagine that the Gaussian process-based code is more efficient for very expensive likelihoods (say, 1 min or more) while the new version would be more efficient for very cheap likelihoods (say <0.1 sec), but their relative performance in between is more difficult to speculate and likely depends significantly on the dimensionality of the problem. If a comparison against Bayesian methods is desired then I think it would make more sense to also compare against a Bayesian method also designed for genuinely expensive likelihoods such as the Bayesian optimisation codes used for semi-analytic model fitting by Bower et al. (2010); and I would in that case set the target to the level set credible interval identified by first running the Bayesian analysis.