Oxford PhD student Rohan Arambepola, from the Malaria Atlas Project, has today arXived a manuscript describing his investigation of the potential for causal feature selection methods to improve the out-of-sample predictive performance of geospatial (or really, “geospatiotemporal”?!) models fitted to malaria case incidence data from Madagascar. Broadly speaking, common feature selection algorithms (such as the LASSO) aim to improve out-of-sample predictive performance by restricting the set of covariates used in modelling to only those identified as the ‘most important’ contributors to the model predictions, hopefully reducing the tendency of ‘less important’ features to inadvertently fit observational noise (i.e., to over-fit). Causal feature selection algorithms, on the other hand, focus on identifying the set of features with the closest direct causal influence on the observational data, the hope being that this feature set will also prove well-tuned for out-of-sample prediction under the chosen model.
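For concreteness, here is a minimal numpy sketch of the kind of penalised selection the LASSO performs (soft-thresholding coordinate descent); the toy data and penalty value are purely illustrative, not anything from the manuscript:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with feature j's contribution added back in
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] ** 2).mean()
    return beta

# toy data: only the first two of eight covariates actually matter
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.standard_normal(500)
beta_hat = lasso_cd(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(beta_hat) > 1e-8)
```

The penalty zeroes out the six noise covariates while shrinking (slightly) the two genuine effects, which is exactly the over-fitting/importance trade-off described above.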
In our malaria mapping application we find that while the causal feature approach performs comparably to standard methods for ‘simple’ interpolation problems—i.e., imputing missing case counts for a particular health facility in some particular month given access to data from nearby health facilities in those months, as well as data from that facility in earlier and later months—it tends to notably out-perform standard methods for pure forwards temporal prediction—i.e., prediction of future data given previous observations at a facility up to a particular stopping month. Post-hoc rationalisation confirms that this is what one expected all along 🙂 but it is revealing to see these lessons play out in practice in the analysis of a real-world dataset! Comparisons of the feature sets selected by the different approaches (see below) are also interesting from an epidemiological perspective.
Now a key step in this project was that Rohan had to develop an efficient computational framework for conducting causal inference from the heterogeneous and spatially correlated malaria data and covariate products (e.g. satellite-based images of land surface temperature and vegetation colour). The solution described in the arxival is based on recent developments in hypothesis testing via kernel embeddings, combined with the PC algorithm for causal graph discovery and the spatio-temporal pre-whitening strategy proposed by our friend/sometime collaborator, Seth Flaxman. This code is now well-tuned and—in my opinion—just waiting for some kind of clever use in astronomical analyses!
So, I would be keen to hear from anyone who has an astronomical problem that might benefit from a causal inference procedure. In terms of causal feature selection, I would imagine that Rohan’s technique could improve the performance of models trying to forecast galaxy/QSO/exoplanet population characteristics in domains (greater redshift, deeper imaging, etc.) just beyond the core coverage of existing data: such as when planning a future survey with a new instrument. Regarding the causal graph discovery, the robustness of Rohan’s technique against unmeasured spatially correlated causal factors suggests it could be used to examine questions related to galaxy formation and evolution in the field versus cluster environments: e.g. (very loosely) what are the directions of influence between environment, galaxy mass, galaxy star formation rate, morphology and bar type, given that environment, mass and SFR are all influenced by object location and history within even larger scale structures? In any case, the code is in good shape for sharing, which we would be happy to do with anyone having an idea in this domain they’d be interested to try it out on.
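For anyone curious what a causal screen looks like in miniature, the sketch below uses a Fisher-z partial-correlation test (a crude stand-in for the kernel-embedding tests in Rohan’s actual framework) to drop any candidate feature that can be rendered conditionally independent of the response by some small conditioning set, PC-algorithm style. The toy data and all names here are illustrative only:

```python
import math
from itertools import combinations
import numpy as np

def partial_corr(a, b, Z):
    """Correlation of a and b after residualising both on [1, Z]."""
    D = np.column_stack([np.ones(len(a)), Z])
    a = a - D @ np.linalg.lstsq(D, a, rcond=None)[0]
    b = b - D @ np.linalg.lstsq(D, b, rcond=None)[0]
    return np.corrcoef(a, b)[0, 1]

def indep(a, b, Z, alpha=0.01):
    """Fisher-z test; True means 'no evidence of dependence given Z'."""
    n = len(a)
    r = np.clip(partial_corr(a, b, Z), -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * math.sqrt(n - Z.shape[1] - 3)
    return math.erfc(abs(z) / math.sqrt(2)) > alpha

def causal_screen(X, y, max_cond=1):
    """Keep features that stay dependent on y under every small conditioning set."""
    p = X.shape[1]
    keep = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        separated = any(
            indep(X[:, j], y, X[:, list(S)])
            for size in range(max_cond + 1)
            for S in combinations(others, size)
        )
        if not separated:
            keep.append(j)
    return keep

# toy data: x2 is a proxy of x0 with no direct effect on y
rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal((2, 2000))
y = x0 + x1 + 0.5 * rng.standard_normal(2000)
x2 = x0 + 0.5 * rng.standard_normal(2000)
X = np.column_stack([x0, x1, x2])
```

Here `causal_screen(X, y)` discards the proxy `x2` (it is independent of `y` once we condition on `x0`) even though its marginal correlation with `y` is strong, which is exactly the distinction between predictive association and direct causal influence.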
I noticed an interesting paper arXived just before Xmas proposing to model stellar spectra as “sparse, data-driven non-Gaussian processes”. The non-Gaussian part is explained to mean that because a prior is placed on the covariance function and various hyper-parameters, when these are marginalised over, the resulting posterior distributions for the latent spectral profiles are highly non-Gaussian. I don’t personally care for the use of the term “non-Gaussian process” to describe this situation, since it can confuse thinking around how the hierarchical GP model is behaving: the hyper-parameters learn a data-adaptive smoothness penalty for the GP, which then determines its characteristic performance under that penalty and the data (see e.g. my favorite paper of last year). However, in this particular stellar spectra fitting scenario I think the proposed model should not even be thought of as having anything to do with a Gaussian process in the first place, the reason being that it is better classified as a high dimensional Gaussian mixture model.
When we think of a stochastic process we typically think of a potentially infinite collection of random variables indexed by either a countable (discrete time) or uncountable (continuous time) set. Even when we map at pixel resolution with a GP, or model a fixed period of financial tick (discrete time) data with one, we can mathematically extend the same model to arbitrary resolution or project it to infinite extent without fundamentally changing the behaviour of the model or its learning rate with respect to the available data. In this paper, the index set is a finite collection of spectral pixels, and the covariance function is learned under a Wishart prior from thousands of observed instances, assuming no wavelength-dependent kernel structure. So, although one can technically call this a model based on the Gaussian process, that would only be true in the sense that any model using a standard Normal distribution could also be labelled as such.
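The point about mixing over the covariance is easy to demonstrate in one dimension: place an inverse-Gamma prior on the variance of a Gaussian (the scalar analogue of a Wishart-type prior on the covariance) and marginalise, and you get a Student-t, whose tails are plainly non-Gaussian. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
nu = 5.0
# sigma2 ~ Inv-Gamma(nu/2, nu/2), i.e. nu / sigma2 ~ chi-squared(nu)
sigma2 = nu / rng.chisquare(nu, size=n)
x = rng.normal(0.0, np.sqrt(sigma2))   # marginally Student-t with nu d.o.f.

def kurtosis(v):
    v = v - v.mean()
    return (v**4).mean() / (v**2).mean() ** 2

gauss_kurt = kurtosis(rng.standard_normal(n))  # should sit near 3
mix_kurt = kurtosis(x)                         # markedly heavier-tailed
```

So the “non-Gaussian” behaviour of the posterior here is just the familiar heavy-tailed character of a Gaussian scale mixture, not some exotic new process class.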
Aside from the enjoyable pedantry, there is actual value in identifying the best description of the model class (here: high dimensional Gaussian mixture model), since it allows one to identify statistical theory and existing methods with which to understand and implement it. I must admit that it’s not a topic I have much experience in, but a quick google search returns papers by a number of known experts in high dimensional covariance estimation, for instance [1, 2].
I realise that the origin of the holiday name is not the sport, but it’s a great chance to catch up on two ‘fight of the year’ contenders, both coming out of the World Boxing Super Series program (and both available for free on YouTube):
1) Taylor vs Prograis
2) Inoue vs Donaire
A quick note to explain why these fights are so compelling to re-watch: in both cases the less fancied fighter (by betting odds and general commentators’ opinions) comes in with a well put together strategy for the fight, which they execute with skill and determination; the favorite is surprised at not having things go immediately their own way, but is then able to make tactical adaptations to get themselves back into the fight.
A common criticism of malaria mapping work that I’ve been involved with runs to the effect that we don’t do the right thing when we correct for the difference between rapid diagnostic test (RDT) estimates of parasite prevalence and microscopic diagnosis estimates of parasite prevalence. For instance, this paper points out that in Mappin et al (2015) (where we examine the relationship between RDT and microscopy prevalence) the fitted model has “a major limitation”, being that we don’t allow for over-dispersion in any unmeasured risk factors, and proposes that the solution is a model in which each of the RDT and microscopy prevalences takes its own geospatial random field as a residual error term. This is silly because we already acknowledge large contributions of over-dispersion due to differences in diagnostic types, treatment history, and fever status; and it’s far from obvious that any further modelling of unexplained over-dispersion would achieve any meaningful improvement to fit for such a small sample size. Furthermore, when the authors of that paper come to demonstrate application of their preferred model to the dual malaria diagnostic problem (in that instance it’s microscopy vs PCR) they end up deciding that actually “estimating components of residual spatial variation that are unique to each diagnostic may be difficult”, and so resort to fitting a model without residual spatial variation about the diagnostic-to-diagnostic relationship.
Another paper claims that our models “did not consider that malaria prevalence from national and sub-national household surveys can be influenced by the uncertainty in the accuracy of diagnostic methods used during the surveys” and suggests that they have a method for estimating “true malaria prevalence”. However, as I asked two years ago in the comments of that paper, the authors do not offer any clear definition of “true prevalence” in their model. The point is that RDT and microscopic diagnosis are measuring two different things: the former shows the presence of an antigenic response associated with current, or recently cleared, malaria parasite presence in the blood, while the latter identifies contemporary parasitaemia only. Hence, we cannot treat a result that is RDT positive but microscopy negative as a failure of microscopic diagnosis.
Having acknowledged that diagnostic standardisation is not a trivial epidemiological or statistical modelling problem, it is still very important to consider, since failure to account for the difference between the two diagnostics may lead to mis-estimation of temporal trends. Especially because the ratio of microscopic prevalence data points to RDT prevalence data points in literature review and national survey datasets has changed radically over the past few years. And the difference in prevalence estimates can be more than a factor of two, as in the recent Tanzanian MIS surveys. Nevertheless, some studies looking at temporal trends don’t bother to attempt any adjustment, which I find rather surprising.
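To illustrate what even a crude adjustment buys you, here is a toy sketch (entirely simulated data; the coefficients and noise level are invented for illustration, not taken from any of the papers above) of fitting the RDT-to-microscopy relationship on the logit scale and mapping RDT prevalences onto a microscopy-equivalent scale:

```python
import numpy as np

rng = np.random.default_rng(1)
logit = lambda p: np.log(p / (1 - p))
inv_logit = lambda x: 1.0 / (1.0 + np.exp(-x))

# hypothetical paired survey clusters with both diagnostics recorded; RDT reads
# systematically higher (antigen persistence after treatment), plus survey noise
n = 400
micro = rng.uniform(0.02, 0.6, size=n)
rdt = inv_logit(0.4 + 1.05 * logit(micro) + rng.normal(0.0, 0.15, n))

# fit the logit-logit relationship by least squares
A = np.column_stack([np.ones(n), logit(rdt)])
beta = np.linalg.lstsq(A, logit(micro), rcond=None)[0]

def rdt_to_micro(p_rdt):
    """Map an RDT prevalence onto the microscopy-equivalent scale."""
    return inv_logit(beta[0] + beta[1] * logit(p_rdt))
```

Even this naive conversion removes the systematic offset between the two diagnostics, which is the minimum one would want before pooling RDT and microscopy data points in a temporal trend analysis.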
Long time readers of this blog will know that I’ve been bigging up the potential of Bayesian optimisation approaches for fitting expensive astronomical simulations for many years now. In particular, when likelihood function computations are computationally expensive, and possibly stochastic due to the need to run some kind of crazy physics simulator at the back end, it makes sense to approximate the posterior in an iterative manner using Bayesian optimisation. And if you’re going to do that you should learn from the wealth of experience in the statistics community (and in other fields doing applied statistics), because, yeah, this problem is not unique to astronomy.
Nevertheless I was not surprised to see a recent arxival (which pointed to another that I’d missed, sharing the same flaws) in which Bayesian optimisation is presented as if it were a newly developed methodology invented by the authors. And since they don’t cite any of the relevant literature it also isn’t a surprise to see them fall into all the traps that could have been avoided. For instance, we know that the squared exponential kernel is a shitty choice for emulating a log-likelihood surface, that homoskedastic noise is a shitty assumption, that it all works a lot better if we throw in a kernel made of linear and quadratic versions of the input parameters, and that if (as we do in cosmology) we have some low-cost-but-not-perfect versions of the physical simulators then we should throw those into the mix with multi-fidelity Bayesian optimisation. I would also suggest that for problems like recovering the CMB posterior there are gains to be made in building separate GPs for each of the log-likelihood contributions of the different observational components (e.g. angular power spectrum, polarization, etc.), since each is most informative about a lower-dimensional subset of the parameters and can be optimally emulated with an automatic relevance determination kernel.
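To make the emulation idea concrete, here is a toy 1-D sketch: a GP whose kernel is a Matérn-5/2 plus explicit linear-and-quadratic features of the input (as advocated above), used inside a simple upper-confidence-bound loop to home in on the peak of a stand-in log-likelihood. Everything here (the toy target, unit length-scale, acquisition constant) is illustrative rather than a recipe:

```python
import numpy as np

def matern52(a, b, ls=1.0):
    d = np.abs(a[:, None] - b[None, :]) / ls
    s = np.sqrt(5.0) * d
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def quad_feats(x):
    # explicit linear and quadratic versions of the input parameter
    return np.column_stack([np.ones_like(x), x, x**2])

def kern(a, b):
    return matern52(a, b) + quad_feats(a) @ quad_feats(b).T

def gp_post(xs, X, y, jitter=1e-5):
    """Posterior mean and sd of a zero-mean GP at test points xs."""
    K = kern(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = kern(xs, X)
    mu = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(kern(xs, xs)) - (v**2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def loglike(theta):
    # stand-in for an expensive simulator-based log-likelihood
    return -0.5 * (theta - 1.3)**2 / 0.2

X = np.array([-3.0, 0.0, 3.0])
y = loglike(X)
grid = np.linspace(-3.0, 3.0, 301)
for _ in range(12):
    mu, sd = gp_post(grid, X, y)
    theta_next = grid[np.argmax(mu + 2.0 * sd)]  # UCB acquisition
    X = np.append(X, theta_next)
    y = np.append(y, loglike(theta_next))
best = X[np.argmax(y)]
```

The quadratic features let the emulator track the global parabolic shape of the log-likelihood while the Matérn term mops up local structure, which is precisely why the plain squared exponential kernel on its own is such a poor choice here.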
To be fair to the authors, one of the problems with our current academic system is that it rewards quantity over quality, so there’s little incentive to make a thorough investigation of a topic if a quick investigation can lead to a quick publication.
A recent arXival presents a method for estimating the marginal likelihood (or ‘evidence’) based on the technique of normalizing flows. Normalising flows (mentioned in a previous post on modelling the colour-magnitude diagram) are a model for density estimation in which a standard Normal distribution of some moderate dimension is mapped through a machine-learning-style invertible transform to give a highly flexible (but properly normalised!) density representation in the target space. The authors propose to estimate the marginal likelihood given an existing set of samples from the posterior (e.g. as recovered via MCMC) in two steps: (i) train a normalizing flow on the posterior samples, and (ii) use the fitted flow as the reference distribution in a bridge sampling estimator of the marginal likelihood. As in the case discussed in my last blog post (on random forests for parameter estimation) the astronomers again confuse the specifics with the general.
First of all, for the normalizing flow to be a useful device we first need a good approximation to the posterior from a prior run of an MCMC-style algorithm: for high dimensional problems one may as well assume this to be HMC. Of course, for a multimodal problem (that is not a toy problem for which we know the number of modes and can check that our HMC run has been successful) it is not guaranteed that HMC will explore all the modes efficiently. Methods for marginal likelihood estimation like nested sampling or sequential Monte Carlo often start from a diffuse reference distribution (like the prior) to give the user a better chance of discovering all the modes, but at the cost of less efficient marginal likelihood estimation than if they had started from a reference distribution closer to the posterior. All these methods can be run from a reference distribution other than the prior, so could easily be used with the normalizing flow as well … if the available posterior samples are trusted to be representative.
The question then is how much the normalizing flow improves on alternative reference densities; e.g. a (mixture of) multivariate t-distribution(s) based on the posterior mode(s) [requiring only maximisation and Hessian construction] or the same mixed with a KDE if there are posterior samples already available. Since normalizing flows are designed specifically for this purpose I’d imagine they do improve things, but it would be interesting to understand by how much. Moreover, one shouldn’t discount the value of samples taken along the nested sampling path (or the thermodynamic path) whenever the reference distribution is fatter tailed than the posterior itself, since these can be valuable when estimating posterior functionals (e.g. moments) and credible intervals.
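For reference, the bridge sampling estimator itself is only a few lines once you have posterior and reference draws in hand; below is the Meng & Wong fixed-point iteration on a toy 1-D problem where the true marginal likelihood is known (the ‘posterior’ is an unnormalised standard Normal kernel, so Z = √(2π); a Gaussian stands in for the fitted flow as reference):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_q(theta):
    # unnormalised 'posterior': a N(0,1) kernel, so the true Z is sqrt(2*pi)
    return -0.5 * theta**2

def log_g(theta, s=1.5):
    # normalised reference density: N(0, s^2); a flow would play this role
    return -0.5 * theta**2 / s**2 - np.log(s * np.sqrt(2.0 * np.pi))

n1 = n2 = 50_000
post = rng.standard_normal(n1)        # draws from the posterior (e.g. MCMC)
ref = 1.5 * rng.standard_normal(n2)   # draws from the reference density

l1 = np.exp(log_q(post) - log_g(post))  # ratios at posterior draws
l2 = np.exp(log_q(ref) - log_g(ref))    # ratios at reference draws
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

Z = 1.0
for _ in range(50):  # Meng & Wong fixed-point iteration for the optimal bridge
    num = np.mean(l2 / (s1 * l2 + s2 * Z))
    den = np.mean(1.0 / (s1 * l1 + s2 * Z))
    Z = num / den
```

The efficiency of this estimator hinges entirely on the overlap between the reference and the posterior, which is why a well-fitted flow (or a mode-based t mixture) can substantially beat a diffuse reference like the prior.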
Finally, I was not impressed by the numerical examples presented, because the rival techniques were all ‘shown to fail’ yet no investigation is made of the mode in which they seem to be failing (is it the software implementation and/or a sub-optimal choice of tuning parameters?), no comparisons are made on problems for which benchmarks have been published elsewhere (i.e., how does the new method compare to an older method on a problem for which the older method has been shown to perform well? all examples are ‘adapted from’ earlier problems but are not replicates of them), and no code is provided for readers to check these kinds of details.