(5) Big Data Bayes.
A lot of fuss is made about “Big Data” and the need for new statistical analysis tools to help us cope with the challenges it represents, but again the astronomical discussion rarely spells out what this actually means in practice. One particular direction that’s being explored in the statistics literature these days concerns the challenge of Bayesian inference when the sample size is so large that computing the likelihood function becomes a genuinely expensive operation. In this case the intuitive solution is to sub-sample the data when evaluating the acceptance step in MCMC. Although this ultimately leads to a biased algorithm that no longer targets precisely the true posterior, the bias is at least increasingly well understood (e.g. Korattikara et al. 2013, Bardenet et al. 2014), and in the “Big Data” regime pragmatism must sometimes trump perfection! Alternative strategies include methods for data tempering, in which small subsets of the data are first used to focus the MCMC chain or SMC approximation towards the bulk of posterior mass (e.g. Chopin 2002, Manolopoulou et al. 2010), or else extreme parallel schemes exploiting the approximate Normality of the posterior (Scott et al. 2014).
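To make the last of these ideas concrete, here is a minimal sketch of the kind of parallel scheme that exploits approximate posterior Normality: the data are split into shards, each shard produces its own posterior draws, and the draws are merged by a precision-weighted average. Everything specific here is an assumption for illustration and not taken from Scott et al.: the toy model (a Normal mean with known sigma and a flat prior, so each shard's posterior can be sampled exactly), the function name `consensus_draws`, and all of the settings. In a real problem each shard would run its own MCMC, and the merge is only approximate unless the shard posteriors are close to Normal.

```python
import random
import statistics


def consensus_draws(data, n_shards, n_draws, sigma, rng):
    """Merge per-shard posterior draws by a precision-weighted average.

    Toy assumption: Normal data with known sigma and a flat prior on the mean,
    so each shard's posterior is Normal(shard mean, sigma / sqrt(shard size)).
    """
    shards = [data[i::n_shards] for i in range(n_shards)]
    per_shard = []
    for shard in shards:
        post_mean = statistics.fmean(shard)
        post_sd = sigma / len(shard) ** 0.5
        # Stand-in for running an MCMC chain on this shard alone.
        per_shard.append([rng.gauss(post_mean, post_sd) for _ in range(n_draws)])

    # Weight each shard by the inverse sample variance of its own draws.
    weights = [1.0 / statistics.variance(draws) for draws in per_shard]
    total = sum(weights)

    # Combine the g-th draw from every shard into one consensus draw.
    return [
        sum(w * draws[g] for w, draws in zip(weights, per_shard)) / total
        for g in range(n_draws)
    ]


rng = random.Random(0)
sigma = 2.0
data = [rng.gauss(1.5, sigma) for _ in range(20000)]

draws = consensus_draws(data, n_shards=10, n_draws=2000, sigma=sigma, rng=rng)

# Full-data posterior for comparison (available in closed form for this toy model).
full_mean = statistics.fmean(data)
full_sd = sigma / len(data) ** 0.5
print(statistics.fmean(draws), full_mean)
print(statistics.stdev(draws), full_sd)
```

In this toy setting the consensus draws should closely reproduce the full-data posterior mean and width, because the shard posteriors really are Normal; the interesting (and harder) question in practice is how well the merge holds up when they are not.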
Another challenge of “Big Data” can be an overabundance of candidate predictor variables, such that identifying the most influential (and hence perhaps the most interesting for further study) becomes an overwhelming task owing to the sheer number of possible parameter combinations, even when the base statistical model is a simple linear or logistic regression. The most pressing driver of methodologies in this area at present is biomedical analysis, where many thousands of genetic markers could be potential predictors of a given medical condition, but with the richness and wavelength coverage of astronomical datasets constantly increasing it may be that astronomers will soon be facing the same issues. Contemporary statistical solutions to this problem include adaptive strategies for MCMC exploration over model space (e.g. Lamnisos et al. 2013), Bayesian lasso / “Occam’s window” type methods (cf. Bondell & Reich 2012, Lo et al. 2012), and Least Angle Regression (Efron et al. 2003).
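For a feel of how sparsity-inducing penalties pick out a few influential predictors, here is a minimal sketch of the ordinary (non-Bayesian) lasso fitted by coordinate descent on simulated data. To be clear about what this is not: it is neither the Bayesian lasso of the papers cited above nor Least Angle Regression, just the simplest relative of both, whose solution coincides with the posterior mode under independent Laplace priors. The function names, the penalty `lam=0.3`, and the simulated dataset (20 candidate predictors, only the first two of which matter) are all assumptions chosen for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning exactly 0.0 inside it."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes its own term.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, lam) / col_scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


rng = random.Random(1)
n, p = 200, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
true_beta = [3.0, -2.0] + [0.0] * (p - 2)
y = [
    sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.5)
    for row in X
]

beta_hat = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print(selected)
print([round(b, 3) for b in beta_hat])
```

The penalty sets the coefficients of uninformative predictors to exactly zero while shrinking the surviving ones somewhat towards zero; choosing `lam` (or, in the Bayesian versions, putting a prior on it) is where most of the real work lies.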
Notes/References. Steve Scott has an interesting talk on the topic of “Big Data” Bayes available as a video from the website of the recent “Advances in Scaleable Bayesian Computation” conference.