Code for causal inference: Interested in astronomical applications

Oxford PhD student, Rohan Arambepola, from the Malaria Atlas Project has today arXived a manuscript describing his investigation of the potential for causal feature selection methods to improve the out-of-sample predictive performance of geospatial (or really, “geospatiotemporal”?!) models fitted to malaria case incidence data from Madagascar.  Broadly speaking, common feature selection algorithms (such as the LASSO) aim to improve out-of-sample predictive performance by restricting the set of covariates used in modelling to only those identified to be acting as the ‘most important’ contributors to the model predictions, hopefully reducing the tendency of ‘less important’ features to inadvertently fit observational noise (i.e., over-fitting).  Causal feature selection algorithms, on the other hand, focus on identifying the set of features with the closest direct causal influence on the observational data, with the hope being that a bonus will be that this feature set is also well-tuned for out-of-sample prediction under the chosen model.

In our malaria mapping application we find that the causal feature approach performs comparably to standard methods for ‘simple’ interpolation problems (i.e., imputing missing case counts for a particular health facility in a particular month, given access to data from nearby health facilities in that month as well as data from the same facility in earlier and later months), but tends to notably out-perform standard methods for pure forwards temporal prediction (i.e., prediction of future data given previous observations at a facility up to a particular stopping month).  Post-hoc rationalisation confirms that this is what one expected all along 🙂 but it is revealing to see these lessons play out in practice upon analysis of a real-world dataset!  Comparison of the feature sets selected by the different approaches (see below) is also interesting from an epidemiological perspective.
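
To make the two evaluation settings concrete, here is a minimal Python sketch of how the hold-out sets might be built in each case. The toy facility labels, the month indexing and the helper names `interpolation_split` and `forecast_split` are all hypothetical illustrations; none of this is taken from Rohan’s manuscript or code.

```python
import random

def interpolation_split(records, holdout_frac=0.2, seed=0):
    """Hold out random (facility, month) cells.

    The training data then surround each held-out cell in both space
    (other facilities) and time (earlier and later months).
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(round(holdout_frac * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def forecast_split(records, stop_month):
    """Train on months up to and including stop_month; test on later months."""
    train = [r for r in records if r[1] <= stop_month]
    test = [r for r in records if r[1] > stop_month]
    return train, test

# Hypothetical toy panel: 3 facilities, each observed for 12 months.
records = [(facility, month) for facility in ("A", "B", "C") for month in range(1, 13)]

interp_train, interp_test = interpolation_split(records)
fc_train, fc_test = forecast_split(records, stop_month=9)
```

In the forecasting split no test month ever precedes a training month, so a model cannot lean on covariate relationships that only hold locally in time, which is one plausible reading of why causally selected features help more in that setting.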


Now a key step in this project was that Rohan had to develop an efficient computational framework for conducting causal inference from the heterogeneous and spatially correlated malaria data and covariate products (e.g. satellite based images of land surface temperature and vegetation color).  The solution described in the arxival is based on recent developments in hypothesis testing via kernel embeddings, combined with the PC algorithm for causal graph discovery, and the spatio-temporal pre-whitening strategy proposed by our friend/sometime collaborator, Seth Flaxman.  This code is now well-tuned and—in my opinion—just waiting for some kind of clever use in astronomical analyses!
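
For readers unfamiliar with kernel-based hypothesis testing, the following is a rough, standard-library-only Python sketch of one common ingredient: an unconditional independence test based on the Hilbert-Schmidt Independence Criterion (HSIC) with a permutation null. This is an illustration under my own assumptions (Gaussian kernels, median-heuristic bandwidths, hypothetical function names, simulated data) and is not Rohan’s code. The PC algorithm additionally needs conditional independence tests, and the spatio-temporal pre-whitening step is not shown here.

```python
import math
import random
import statistics

def median_bandwidth(x):
    """Median heuristic: median pairwise absolute distance (falls back to 1.0)."""
    dists = [abs(a - b) for i, a in enumerate(x) for b in x[i + 1:]]
    return statistics.median(dists) or 1.0

def rbf_gram(x, bandwidth):
    """Gaussian (RBF) kernel Gram matrix for a list of scalars."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in x] for a in x]

def center(K):
    """Double-centre a symmetric Gram matrix: H K H with H = I - 11^T / n."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic_stat(Kc, L, idx):
    """Biased HSIC estimate trace(Kc L) / n^2, with L permuted by idx."""
    n = len(Kc)
    return sum(Kc[i][j] * L[idx[i]][idx[j]]
               for i in range(n) for j in range(n)) / n ** 2

def hsic_permutation_test(x, y, n_perm=200, seed=0):
    """Return (HSIC statistic, permutation p-value) for H0: x independent of y."""
    rng = random.Random(seed)
    Kc = center(rbf_gram(x, median_bandwidth(x)))
    L = rbf_gram(y, median_bandwidth(y))
    identity = list(range(len(x)))
    observed = hsic_stat(Kc, L, identity)
    exceed = 0
    for _ in range(n_perm):
        idx = identity[:]
        rng.shuffle(idx)  # shuffling y breaks any dependence on x
        if hsic_stat(Kc, L, idx) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)

# Simulated example: y_dep depends on x nonlinearly, y_ind does not depend on x.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(50)]
y_dep = [xi ** 2 + 0.1 * rng.gauss(0, 1) for xi in x]
y_ind = [rng.gauss(0, 1) for _ in x]

stat_dep, p_dep = hsic_permutation_test(x, y_dep)
stat_ind, p_ind = hsic_permutation_test(x, y_ind)
```

The quadratic example is chosen because its linear correlation with `x` is close to zero, so a correlation-based test would tend to miss the dependence while a kernel test should pick it up.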

So, I would be keen to hear from anyone who has an astronomical problem that might benefit from a causal inference procedure.  In terms of causal feature selection, I would imagine that Rohan’s technique would improve the performance of models trying to forecast galaxy/QSO/exoplanet population characteristics in domains (greater redshift, deeper imaging, etc.) just beyond the core coverage of existing data, such as when planning a future survey with a new instrument.  Regarding the causal graph discovery, the robustness of Rohan’s technique against unmeasured spatially correlated causal factors suggests it could be used to examine questions related to galaxy formation and evolution in field versus cluster environments: e.g. (very loosely) what are the directions of influence between environment, galaxy mass, galaxy star formation rate (SFR), morphology and bar type, given that environment, mass and SFR are all influenced by object location and history within even larger scale structures?  In any case, the code is in good shape for sharing, which we would be happy to do with anyone who has an idea in this domain that they’d like to try it out on.


1 Response to Code for causal inference: Interested in astronomical applications

  1. ecoquant says:

    I am keenly interested in applying causal inference to scientific questions as well, ever since I took a short course with Dr Beth Ann Griffin courtesy of the Boston Chapter of the American Statistical Association. Dr Griffin is a co-author of the R package twang, which supports such analyses.

    I am interested in a Bayesian approach, and there is literature here and there. But the primary sources seem to be:

    (1) G. W. Imbens, D. B. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge, 2015.

    (2) Section 8.6 of Gelman, Carlin, Rubin, et al., Bayesian Data Analysis, 3rd edition, 2014.

    (3) M. A. Hernán, J. M. Robins, Causal Inference: What If, 2020, available online.

    There is an older (2004) Gelman-Meng book, too.

    I’m working through (1), which is both good and recommended, in particular, by Dr Rebecca Barter. There is no corresponding R package and there’s a lot of Frequentist stuff in I&R. There’s the causaleffect R package by Tikka and Karvanen, but that follows the Pearl school’s approach.

    So, with a couple of exceptions, to use I&R, y’need to roll your own. The exceptions include the CMatching package which, in part, borrows from I&R, and the ipwErrorY package, but these are but pieces. I’m not sure writing your own is such a bad idea. I also think that, since many competing analyses will be from a Frequentist perspective, it’s good to understand how to think about these problems in a Frequentist way.

    All that said, some of the questions which attend causal inference seem surprisingly slippery and philosophical. Just, for instance, look at the literature regarding Lord’s paradox. I’m hoping that I&R are correct, and the potential outcomes approach to causal inference clears away a lot of the cobwebs.

    But, we’ll see, and it’s really neat learning a bunch new!

    Oh, and, yes, my application areas are environmental policy, such as characteristics of hydrology informing policy choices. Was that flood event in that neighborhood a statistical fluke? Or is there a causal relationship between that flooding and getting X amount of rain on a piece of a local watershed?
