Oxford PhD student, Rohan Arambepola, from the Malaria Atlas Project has today arXived a manuscript describing his investigation of the potential for causal feature selection methods to improve the out-of-sample predictive performance of geospatial (or really, “geospatiotemporal”?!) models fitted to malaria case incidence data from Madagascar. Broadly speaking, common feature selection algorithms (such as the LASSO) aim to improve out-of-sample predictive performance by restricting the set of covariates used in modelling to only those identified to be acting as the ‘most important’ contributors to the model predictions, hopefully reducing the tendency of ‘less important’ features to inadvertently fit observational noise (i.e., over-fitting). Causal feature selection algorithms, on the other hand, focus on identifying the set of features with the closest direct causal influence on the observational data, with the hope being that a bonus will be that this feature set is also well-tuned for out-of-sample prediction under the chosen model.
In our malaria mapping application we find that while the causal feature approach performs comparably to standard methods for ‘simple’ interpolation problems—i.e., imputing missing case counts for a particular health facility in some particular month given access to data from nearby health facilities in those months, as well as data from the particular facility in earlier and later months—it tends to notably out-perform standard methods for pure forwards temporal prediction—i.e., prediction of future data given previous observations at a facility up to a particular stopping month. Post-hoc rationalisation confirms that this is what one expected all along 🙂 but it is revealing to see these lessons play out in practice upon analysis of a real-world dataset! Comparisons of the feature sets selected by the different approaches (see below) is also interesting from an epidemiological perspective.
Now a key step in this project was that Rohan had to develop an efficient computational framework for conducting causal inference from the heterogeneous and spatially correlated malaria data and covariate products (e.g. satellite based images of land surface temperature and vegetation color). The solution described in the arxival is based on recent developments in hypothesis testing via kernel embeddings, combined with the PC algorithm for causal graph discovery, and the spatio-temporal pre-whitening strategy proposed by our friend/sometime collaborator, Seth Flaxman. This code is now well-tuned and—in my opinion—just waiting for some kind of clever use in astronomical analyses!
So, I would be keen to hear from anyone who has an astronomical problem that might benefit from a causal inference procedure. In terms of causal feature selection I would imagine that Rohan’s technique would improve the performance of models trying to forecast galaxy/QSO/exoplanet population characteristics in domains (greater redshift, deeper imaging, etc) just beyond the core coverage of existing data: such as when planning a future survey with a new instrument. Regarding the causal graph discovery, the robustness of Rohan’s technique against unmeasured spatially correlated causal factors suggests it could be used to examine questions related to galaxy formation and evolution in the field versus cluster environments: e.g. (very loosely) what are the directions of influence between environment, galaxy mass, galaxy star formation rate, morpholgy and bar type, given that environment, mass and SFR are all influenced by object location and history within even larger scale structures? In any case, the code is in good shape for sharing which we would be happy to do with anyone having an idea in this domain they’d be interested to try it out on.