Re: “Robust and scalable Bayes via a median of subset posterior measures” by Minsker et al.

It’s probably a bit lazy to blog (again) about a paper that was discussed at the Computational Stats Reading Group but I think this one is sufficiently important—in terms of exemplifying the direction that research in “Big Data” Bayesian inference is going—that I can get away with it this time. (But full credit to Thibaut Lienart for his able presentation last Friday.)

First of all, it’s important to note the setting of the problem, and how this differs from what we might usually think of as “Big Data” statistics from an astronomical perspective. In the context envisaged by Minsker et al. (and many other recent statistical papers on this topic) the likelihood function is quite standard and would be easily evaluated computationally for a reasonably-sized dataset, but the sheer volume of data available means we are forced to turn to large-scale parallel computing techniques. If we simply farm this problem out in the default way (sending subsets of the dataset to multiple cores, telling each to evaluate the likelihood for its portion, and then multiplying all the returned partial likelihoods together) we end up waiting on the slowest core at each MCMC step, as Steve Scott describes in his video from the Scalable Bayes workshop.
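For concreteness, the default scheme amounts to exploiting the fact that the log-likelihood decomposes as a sum over data shards. A minimal sketch (with a made-up Gaussian likelihood; `shard_loglike` and `full_loglike` are my own illustrative names, not anything from the paper):

```python
import numpy as np

def shard_loglike(theta, data_shard):
    """Log-likelihood contribution of one shard, under a toy
    unit-variance Gaussian model with mean theta."""
    return -0.5 * np.sum((data_shard - theta) ** 2)

def full_loglike(theta, shards):
    """Sum of the shard contributions. In a real parallel MCMC run the
    master must wait for *every* core's result before the next step can
    proceed -- this per-step synchronisation is the bottleneck."""
    return sum(shard_loglike(theta, s) for s in shards)

data = np.random.default_rng(1).normal(size=100)
shards = np.array_split(data, 4)
print(full_loglike(0.5, shards))  # identical to the unsharded log-likelihood
```

Since multiplying partial likelihoods is just adding partial log-likelihoods, the split is exact; the cost is purely in the communication and synchronisation at every MCMC iteration.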

Instead, we would like to divide up the data, let each core run its own MCMC routine on its subset, and then somehow combine these subset posteriors in a sensible way. This is where Minsker et al. step in: they propose to combine the subset posteriors in such a way as to render the “median” posterior our final target. The (highly technical) machinery Minsker et al. use to achieve this involves Reproducing Kernel Hilbert Spaces (much like Functional Regression, e.g. Kadri et al.), which after a lot of math yields a simple ‘weighted sum’ formula for the “median posterior” (again, much like Functional Regression).

It’s a pretty cool ‘trick’, and the authors present a diverse range of intriguing numerical demonstrations, but what I’d ask my readers is: does astronomy need this specific type of “Big Data” Bayes? *I hope so*, but actually I’m struggling to think of an example where this technique could be applied. We do have a lot of large catalogues (star catalogues, galaxy catalogues, and so on), but in most applications our prior knowledge of the problem lets us quite happily throw away >99% of objects before proceeding to inference. A good example would be Dan Mortlock’s quasar studies (cited here at statisticviews.com as a prime example of Big Data astro-statistics), where the vast majority of quasar candidates are *a priori* identified as obvious non-quasars based on simple colour cuts. Likewise, radio astronomy and observational cosmology generate *huge* datasets (bigger than the big in Big Data!), but the key challenges there seem to lie in the computational processing required to produce images and then reduce those to object catalogues.