I read two interesting papers recently: the first, from astro-ph, was by Kern et al., entitled “Emulating Simulations of Cosmic Dawn for 21cm Power Spectrum Constraints on Cosmology, Reionization, and X-Ray Heating”. The emulator approach described in this paper is similar to the one I adopted for my prevalence-incidence paper in that it uses a training library of simulations to predict the simulator output in latent-variable space (here the 21cm power spectrum). This is in contrast to emulators in the Bayesian optimisation family (such as the Bower et al. example for galaxy SAMs), in which the emulator aims to predict in log-likelihood space. For my purposes in the prevalence-incidence paper the motivation for choosing the former approach was efficiency: since the same library was to be used for a number of different sites, it seemed ‘obviously’ better to focus on predicting the latent variables rather than the log-likelihood. On the other hand, for this power-spectrum case there’s only one dataset to be compared against, so the relative merits of one scheme versus the other are not obvious to me. At face value it seems the decision is principally shaped by the authors’ preference to run MCMC with the emulator rather than use a GP approximation to the posterior, which means forgoing the sequential design efficiency of the latter.
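To make the contrast concrete, here’s a toy numpy sketch of the two emulation targets; the one-dimensional ‘simulator’, the hand-rolled GP, and all names here are invented for illustration, not taken from either paper.

```python
import numpy as np

# Toy one-dimensional "simulator" standing in for an expensive code;
# everything here is invented for illustration.
def simulator(theta):
    return np.sin(3.0 * theta) + 0.5 * theta

theta_train = np.linspace(0.0, 2.0, 8)[:, None]    # small training library
y_train = simulator(theta_train[:, 0])             # latent (summary) outputs

y_obs, sigma = simulator(np.array([1.2]))[0], 0.1  # the single observed dataset

def loglike(y_model):
    return -0.5 * ((y_model - y_obs) / sigma) ** 2

# Minimal GP interpolator (RBF kernel, fixed hyperparameters).
def gp_predict(X, y, Xs, ell=0.5, jitter=1e-8):
    def k(A, B):
        return np.exp(-0.5 * (A[:, None, 0] - B[None, :, 0]) ** 2 / ell ** 2)
    alpha = np.linalg.solve(k(X, X) + jitter * np.eye(len(X)), y)
    return k(Xs, X) @ alpha

theta_test = np.array([[1.15]])

# Scheme A (latent-variable style): emulate the summary output,
# then compute the likelihood from the emulated summary.
y_emu = gp_predict(theta_train, y_train, theta_test)
ll_via_latent = loglike(y_emu)

# Scheme B (Bayesian-optimisation family): emulate the log-likelihood directly.
ll_emu = gp_predict(theta_train, loglike(y_train), theta_test)
```

Scheme A pays off when the same training library serves many likelihoods (e.g. many sites); Scheme B ties the library to a single dataset but targets exactly the quantity the sampler needs.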
Emulator choice aside, the authors have devised a clever implementation for constructing the training library and output model: using a two-step procedure to refine the library design in the region of non-trivial posterior mass, and reducing the scale of the latent-variable emulation problem via a truncated PCA projection of the latent data space. The coefficients of the PCA projection are interpolated between training points with a Gaussian process operating on the 11-dimensional parameter space. The authors note that modelling the mean function of the GP is unnecessary because the data are centred about zero before the principal component weights are constructed; though I would suggest that substantial improvements could still be possible with some sensible parametric terms in the mean function. Past experience with emulators suggests that in high dimensions the use of mean predictors can help take pressure off learning the GP kernel structure; indeed, in the history matching emulator these terms take on great importance. (In my work on OpenMalaria I’m having a lot of success applying stacked generalisation with GPs.)
The second interesting paper I read this week was “Space & space-time modeling using process convolutions” by Dave Higdon (suggested by Tom Loredo in the comments on an earlier post). This is an older paper on an alternative approach to constructing Gaussian process models: as the convolution of a white noise process with a smoothing kernel, the idea being to approximate the process by sampling the white noise process only at a finite set of fixed points. The advantage of this method over other GP constructions is that it’s easy to produce a very flexible field by allowing the smoothing kernel to be freely parameterised, without the positive-definiteness constraints on a kernel defined in covariance space. Extensions of the model to produce non-Gaussian fields or non-stationary Gaussian fields are straightforward. More interesting to me is the potential to produce meaningfully correlated fields, such as might describe the incidence rates of two infectious diseases sharing a common host vector; there’s a paper on this by Boyle & Frean which suggests a sensible implementation of this idea.
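The basic construction is simple enough to sketch in a few lines of numpy; the knot locations, kernel choice, and variable names below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Discretised white-noise process: one draw at a fixed lattice of knots.
knots = np.linspace(0.0, 10.0, 60)
z = rng.standard_normal(knots.size)
x = np.linspace(0.0, 10.0, 400)  # locations at which to build the field

# Smoothing kernel; freely parameterisable, since positive-definiteness of
# the induced covariance comes automatically from the convolution.
def smoothing_kernel(dist, ell):
    return np.exp(-0.5 * (dist / ell) ** 2)

# Process-convolution field: y(x) = sum_j K(x - u_j) z_j.
def convolve(x, knots, z, ell):
    return smoothing_kernel(x[:, None] - knots[None, :], ell) @ z

y_smooth = convolve(x, knots, z, ell=1.0)  # long-range, smooth field
y_rough = convolve(x, knots, z, ell=0.2)   # short-range field, same noise

# Two fields driven by the *same* white-noise draw through different kernels
# are meaningfully correlated -- the Boyle & Frean idea in miniature.
corr = np.corrcoef(y_smooth, y_rough)[0, 1]
```

Letting `ell` vary with `x` gives a non-stationary field essentially for free, and replacing the Gaussian draws in `z` with, say, gamma-distributed ones gives a non-Gaussian field.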