So you might have seen me picking fights on twitter with former colleagues at Oxford and probably think that I do this deliberately because I just like to fight. Actually, I would prefer not to have arguments with people, but in this case the analysis was so poor, and gave such a deliberately skewed impression of the uncertainties involved in this type of modelling—peppered with buzzwords like ‘herd immunity’ to make it tasty for the popular press, to whom it was fed directly—that I couldn’t help but lose my shit. If you want to see the dumpster fire for yourself, it’s over here on dropbox. And to see replies like “these are exercises to generate much needed discussion” … sweet Jebus give me strength.
Since I am, by profession, a statistician employed in the field of epidemiology—no longer an astronomer—it makes sense that I should now be devoting my spare research time to projects related to COVID-19 instead of blogging (mostly) about astronomy. (Spare time because I won’t be putting aside any of my malaria modelling commitments: malaria is still very much—in both a relative and an absolute sense—an important disease to control in sub-Saharan Africa & other endemic areas.)
One topic that I’ve worked on in the past that seems to have some relevance these days is the challenge of map making with serological data. It has been proposed that, once a serological test for past exposure to COVID-19 is made widely available, governments should conduct sero surveys to establish population-level estimates of key disease transmission properties. With a single serological survey one can estimate a baseline proportion of exposed individuals in a community, and with a second (or more) survey one can begin to estimate rates of sero-conversion (from the unexposed to the exposed class). If a key objective is to identify spatial patterns (or urban-rural differences, etc.) in this sero-conversion rate then the survey must be adequately powered to achieve a proposed level of accuracy in spatial (or other) stratification. Anticipating a substantial cost in worked hours and resources to conduct the survey, one cannot simply choose an enormous target sample size and be done with it. Instead, survey planners will need to conduct a sophisticated design analysis to achieve an efficient deployment of resources.
In the following days I will give details of some calculations which I expect might lead to the following conclusions: that it is worthwhile to (i) adopt an adaptive sampling design in which the sample size in the second round of studies is optimised by spatial stratification given the results of the first round, and (ii) both in step (i) and in the final analysis of the data, introduce geospatial hierarchical Bayesian models to achieve shrinkage of estimates (rather than simply taking a classical survey end-point estimator).
The initial motivation for both of these points can be made from a quick estimate of the expected information with respect to the log sero-conversion rate that can be gained after conducting two sero surveys in a given community. Immediately we see that the expected information (for fixed sample size of 100 at each survey) is strongly dependent on both the true (latent) prevalence of unexposed individuals at the first survey and the true (latent) log sero-conversion rate. Thus if we can estimate either or both of these parameters to reasonable accuracy after the first survey then it might be possible to adjust the sample size at the second survey to achieve a certain level of accuracy in estimating the log sero-conversion rate from the final dataset. Likewise, if we have a fixed resource budget and many different sero-conversion rates (e.g. by region) to estimate then it might be possible to optimally adjust the resources between regions at the second survey (rather than simply maintaining a fixed design). But, of course, there is a lot to do yet to investigate whether this intuition is correct …
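As a rough sketch of the kind of calculation I have in mind, here’s a minimal Python version under an assumed exponential-waiting-time model for conversion; the function name, parameterisation, and inter-survey interval are my own illustrative choices, not the final calculation:

```python
import numpy as np

def expected_information(log_rate, u, n=100, dt=1.0):
    """Fisher information for the log sero-conversion rate carried by a
    second survey of size n conducted a time dt after the first, when a
    true (latent) proportion u of the community was unexposed at survey 1
    and unexposed individuals convert with probability 1 - exp(-rate*dt)."""
    rate = np.exp(log_rate)
    p = 1.0 - u * np.exp(-rate * dt)         # P(seropositive at survey 2)
    dp = u * rate * dt * np.exp(-rate * dt)  # dp / d(log rate)
    return n * dp ** 2 / (p * (1.0 - p))     # binomial Fisher information

# The expected information varies strongly with both latent quantities:
for u in (0.9, 0.5, 0.1):
    print(u, [round(expected_information(lr, u), 2) for lr in (-2.0, -1.0, 0.0)])
```

Even this toy version shows why a one-size-fits-all second-round sample size is wasteful: a community with high residual unexposed prevalence yields far more information per sample about the conversion rate than one that is already mostly seropositive.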
[Not astro or stats]
Boosting the signal on a story that the cowards in the Australian government and top defence brass would rather see buried. A story that they’d rather lock up journalists and whistleblowers for telling than admit to a failure of military discipline and a failure of justice. There is strong evidence that Australian SAS members were murdering civilians in Afghanistan, and that their actions were either condoned by senior defence force leaders or else they simply turned a blind eye to the situation. Credit to those soldiers who have come forward to tell their stories and break the code of silence that allows this sick culture to fester within our otherwise great institutions.
Yesterday I had a ‘Matters Arising’ published in Nature Astronomy (non-paywalled shareable link here) on the topic given as the title of this blog post, with co-authors Michael Burgess & Garry Angus. Aside from identifying some particular technical issues with a published study on the topic of MOND-inspired radial acceleration relationships in galaxies, we highlight the potential of model misspecification to lead to overly tight parameter constraints or overly confident assessments of model choice in Bayesian analyses—and we remind readers of the consequent value of a model testing and development cycle built around simple diagnostic tools for spotting posterior ‘funkiness’ (e.g. Gabry et al. 2019).
Oxford PhD student Rohan Arambepola, from the Malaria Atlas Project, has today arXived a manuscript describing his investigation of the potential for causal feature selection methods to improve the out-of-sample predictive performance of geospatial (or really, “geospatiotemporal”?!) models fitted to malaria case incidence data from Madagascar. Broadly speaking, common feature selection algorithms (such as the LASSO) aim to improve out-of-sample predictive performance by restricting the set of covariates used in modelling to only those identified as the ‘most important’ contributors to the model predictions, hopefully reducing the tendency of ‘less important’ features to inadvertently fit observational noise (i.e., over-fitting). Causal feature selection algorithms, on the other hand, focus on identifying the set of features with the closest direct causal influence on the observational data, with the hope that, as a bonus, this feature set is also well-tuned for out-of-sample prediction under the chosen model.
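For readers less familiar with the ‘standard’ side of that comparison, here is a toy sklearn sketch of LASSO-based feature selection on simulated data (nothing to do with the actual malaria covariates):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
# Only the first three covariates truly drive the response.
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Cross-validated LASSO shrinks the coefficients of 'unimportant'
# covariates to exactly zero, yielding a sparse feature set that is
# less prone to fitting observational noise.
fit = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_)
print(selected)  # should include features 0, 1 and 2
```

Note that the LASSO selects for predictive association only: in a spatially correlated setting it will happily keep a covariate that merely proxies for an unmeasured confounder, which is exactly the behaviour the causal approach aims to avoid.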
In our malaria mapping application we find that while the causal feature approach performs comparably to standard methods for ‘simple’ interpolation problems—i.e., imputing missing case counts for a particular health facility in a particular month given access to data from nearby health facilities in those months, as well as data from that facility in earlier and later months—it tends to notably out-perform standard methods for pure forwards temporal prediction—i.e., prediction of future data given previous observations at a facility up to a particular stopping month. Post-hoc rationalisation confirms that this is what one expected all along 🙂 but it is revealing to see these lessons play out in practice upon analysis of a real-world dataset! Comparisons of the feature sets selected by the different approaches (see below) are also interesting from an epidemiological perspective.
Now, a key step in this project was that Rohan had to develop an efficient computational framework for conducting causal inference from the heterogeneous and spatially correlated malaria data and covariate products (e.g. satellite-based images of land surface temperature and vegetation colour). The solution described in the arxival is based on recent developments in hypothesis testing via kernel embeddings, combined with the PC algorithm for causal graph discovery and the spatio-temporal pre-whitening strategy proposed by our friend/sometime collaborator, Seth Flaxman. This code is now well-tuned and—in my opinion—just waiting for some kind of clever use in astronomical analyses!
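Since the PC algorithm may be unfamiliar to astronomers, here is a toy numpy sketch of its skeleton-discovery phase; to keep it self-contained I have substituted simple Gaussian partial-correlation tests for the kernel-embedding tests and pre-whitening of the actual implementation:

```python
import itertools, math
import numpy as np

def partial_corr(data, i, j, S):
    """Correlation of columns i and j after partialling out the columns in S."""
    sub = data[:, [i, j] + list(S)]
    prec = np.linalg.pinv(np.corrcoef(sub, rowvar=False))
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def indep(data, i, j, S, alpha=0.01):
    """Fisher-z test of X_i independent of X_j given X_S (Gaussian assumption)."""
    r = float(np.clip(partial_corr(data, i, j, S), -0.9999, 0.9999))
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(len(data) - len(S) - 3)
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return pval > alpha

def pc_skeleton(data, alpha=0.01):
    """Skeleton phase of the PC algorithm: start from a complete graph and
    remove an edge whenever some conditioning set renders its endpoints
    independent, growing the conditioning-set size level by level."""
    p = data.shape[1]
    adj = {i: set(range(p)) - {i} for i in range(p)}
    for level in range(p - 1):
        for i in range(p):
            for j in list(adj[i]):
                others = adj[i] - {j}
                if len(others) < level:
                    continue
                for S in itertools.combinations(others, level):
                    if indep(data, i, j, S, alpha):
                        adj[i].discard(j); adj[j].discard(i)
                        break
    return {(i, j) for i in adj for j in adj[i] if i < j}

# Toy chain X -> Y -> Z: the X-Z edge should be pruned once we condition on Y.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)
z = y + 0.5 * rng.normal(size=2000)
print(pc_skeleton(np.column_stack([x, y, z])))
```

The skeleton is then oriented into a causal graph in a second phase; the point of the kernel-embedding and pre-whitening machinery in Rohan’s version is precisely to make the `indep` step trustworthy for non-Gaussian, spatially correlated data.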
So, I would be keen to hear from anyone who has an astronomical problem that might benefit from a causal inference procedure. In terms of causal feature selection, I would imagine that Rohan’s technique would improve the performance of models trying to forecast galaxy/QSO/exoplanet population characteristics in domains (greater redshift, deeper imaging, etc.) just beyond the core coverage of existing data: such as when planning a future survey with a new instrument. Regarding the causal graph discovery, the robustness of Rohan’s technique against unmeasured spatially correlated causal factors suggests it could be used to examine questions related to galaxy formation and evolution in the field versus cluster environments: e.g. (very loosely) what are the directions of influence between environment, galaxy mass, galaxy star formation rate, morphology and bar type, given that environment, mass and SFR are all influenced by object location and history within even larger scale structures? In any case, the code is in good shape for sharing, which we would be happy to do with anyone who has an idea in this domain they’d be interested to try it out on.
I noticed an interesting paper arXived just before Xmas proposing to model stellar spectra as “sparse, data-driven non-Gaussian processes”. The non-Gaussian part is explained to mean that, because a prior is placed on the covariance function and various hyper-parameters, when these are marginalised over the resulting posterior distributions for the latent spectral profiles are highly non-Gaussian. I don’t personally care for the use of the term “non-Gaussian process” to describe this situation, since it can confuse thinking about how the hierarchical GP model is behaving: the hyper-parameters learn a data-adaptive smoothness penalty for the GP, which then sets its characteristic performance conditional on that penalty and the data (see e.g. my favourite paper of last year). However, in this particular stellar-spectra-fitting instance I think the proposed model should not even be thought of as having anything to do with a Gaussian process in the first place, the reason being that the model is better classified as a high dimensional Gaussian mixture model.
When we think of a stochastic process we typically think of a potentially infinite collection of random variables indexed by either a countable (discrete time) or an uncountable (continuous time) set. Even when we map at pixel resolution with a GP, or model a fixed period of financial tick (discrete time) data with one, we can mathematically extend the same model to an arbitrary resolution, or project it to infinite extent, without fundamentally changing the behaviour of the model or its learning rate with respect to the available data. In this paper, by contrast, the index set is a finite collection of spectral pixels, and their covariance is learned under a Wishart prior from thousands of observed instances (spectra), assuming no wavelength-dependent kernel structure. So, although one can technically call this a model based on a Gaussian process, that would only be true in the sense that any model using a standard Normal distribution could also be labelled as such.
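To see why I’d file this under ‘finite-dimensional Gaussian mixture’ rather than ‘process’, here is a minimal numpy/scipy sketch of the generative structure as I read it; the (tiny) dimension and the Wishart scale matrix are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
npix = 5                      # a (toy-sized) finite index set of spectral pixels
scale = np.eye(npix) / npix   # Wishart scale matrix (an assumption for this toy)

def draw_spectrum(rng):
    # Hierarchical draw: covariance ~ Wishart, then 'spectrum' ~ N(0, covariance).
    cov = stats.wishart.rvs(df=npix, scale=scale, random_state=rng)
    return rng.multivariate_normal(np.zeros(npix), cov)

# Marginally over the covariance this is a continuous mixture of
# finite-dimensional Gaussians, not a stochastic process: there is no
# index set beyond the npix pixels over which to extend the model.
spectra = np.array([draw_spectrum(rng) for _ in range(5000)])
print(spectra.shape)                  # (5000, 5)
print(stats.kurtosis(spectra[:, 0]))  # heavier-than-Gaussian tails at each pixel
```

The excess kurtosis in the per-pixel marginals is the ‘non-Gaussianity’ the paper describes, but it arises exactly as it would in any scale mixture of finite-dimensional Normals.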
Aside from the enjoyable pedantry, there is actual value in identifying the most apt description of the model class (here: a high dimensional Gaussian mixture model), since it allows one to draw on the statistical theory and existing methods developed to understand and implement that model class. I must admit that it’s not a topic I have much experience in, but a quick google search returns papers by a number of known experts in high dimensional covariance estimation, for instance [1, 2].