Comparison of techniques for Bayesian parameter estimation …

Since today is the EKKA holiday in Brisbane & I’m too old to get excited about showbags (but still too young to get excited about fruit cake baking competitions) I have a bit more free time than usual to browse through astro-ph.  So, first, two brief mentions of things that caught my eye:

1) Another paper by Tempel & Libeskind on filament extraction with the Bisous model  ( ) which uses an innovative marked point process idea.  Bisous = kisses … cute name for a model.

2) A paper by Price-Whelan & Johnston ( ) looking in a small way (compared to the recent Lux et al. paper which looks in a big way: ) at constraining the MW potential with kinematic stream data.  In a throw-away line they note that MCMC is a pretty bad optimizer … true, but if you’re going to use it for something like finding the posterior mode (they settle for the median) then you may as well power up the target progressively ([pi L]^beta, with beta -> ~10, 100, 1000?).  [This trick comes in handy for some design problems.]
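For what it's worth, here is a minimal sketch of that power-up trick, assuming a toy one-dimensional Normal posterior (my stand-in, not the stream model from the paper). The function names, the beta ladder, and the step-size rule are all illustrative choices rather than anything from Price-Whelan & Johnston.

```python
import math
import random

def log_post(x):
    # Hypothetical unnormalised log posterior: Normal with mean 3 and sd 2.
    return -0.5 * ((x - 3.0) / 2.0) ** 2

def tempered_mode_search(log_post, x0, betas=(1.0, 10.0, 100.0, 1000.0),
                         steps_per_beta=2000, step0=1.0, seed=0):
    """Random walk Metropolis on [pi L]^beta, raising beta in stages.

    Returns the highest-posterior point visited, used as a mode estimate.
    """
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    best_x, best_lp = x, lp
    for beta in betas:
        # The powered-up target narrows roughly like 1/sqrt(beta),
        # so shrink the proposal scale to match (an illustrative rule).
        step = step0 / math.sqrt(beta)
        for _ in range(steps_per_beta):
            y = x + rng.gauss(0.0, step)
            lp_y = log_post(y)
            # Metropolis acceptance for a target proportional to exp(beta * log_post);
            # 1 - random() lies in (0, 1], so the log is always defined.
            if math.log(1.0 - rng.random()) < beta * (lp_y - lp):
                x, lp = y, lp_y
                if lp > best_lp:
                    best_x, best_lp = x, lp
    return best_x

mode_estimate = tempered_mode_search(log_post, x0=-5.0)
```

At large beta the chain concentrates around the mode, so the best point visited is a much sharper mode estimate than anything read off the beta = 1 chain.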

But now to the meat & bread (for me) of today’s astro-ph, the Allison & Dunkley paper ( ) on comparing the efficiency (w.r.t. number of required likelihood function evaluations) of some techniques for exploring the Bayesian posterior.  This review paper does quite a good job of explaining the basics of three useful techniques: random walk MCMC, (ellipse-based) nested sampling, and affine-invariant ensemble MCMC; which I suspect will be quite useful as an introduction for cosmologists new to this computational statistics part of the field.  On the whole though, I did find the comparison of their relative merits by application to a couple of simple likelihood functions a bit flaky … possibly because the strengths of each method are for completely different applications.  That is, although I certainly agree with the authors’ recommendation that for cosmological problems where the posterior is typically uni-modal and near-Normal [and where the priors are uniform] the [ellipse-based] nested sampling technique will be the most efficient of these three, I think the merits of the other two techniques deserved demonstration on more complex case studies.

An important point that was not stressed (perhaps not understood) by the authors is that the ellipse-based nested sampling algorithm they described is valid only for uniform priors (or where the priors admit a tractable transformation to uniform).  The point is that if the prior is not uniform then, although one might sample the first set of live points from this prior, following the later step of their algorithm (drawing uniformly from within the likelihood-constrained ellipse) will no longer respect it.  I suspect that this limitation of ellipse-based nested sampling is perhaps the biggest turn-off for statisticians, who have yet to embrace (and in many cases, understand) nested sampling.  (The exception being Chopin & Robert, rightly referenced by the authors.  Although their idea of nested importance sampling, which kicks ass for the above type of posterior, is not referenced.)
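To make the "tractable transformation to uniform" concrete, here is a small sketch assuming a one-dimensional Normal prior: the sampler works in a unit-interval coordinate u, where uniform draws are legitimate, and maps to the physical parameter through the prior's inverse CDF. The name prior_transform and the prior parameters are my own illustrative choices, not anything from the Allison & Dunkley paper.

```python
import random
from statistics import NormalDist, mean, stdev

def prior_transform(u, mu=0.0, sigma=1.0):
    """Map u in (0, 1) to theta distributed as Normal(mu, sigma) via the inverse CDF."""
    # Clamp away from the endpoints, where the inverse CDF is undefined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return NormalDist(mu, sigma).inv_cdf(u)

rng = random.Random(1)

# Uniform draws in u-space (what the ellipse step produces) ...
us = [rng.random() for _ in range(20000)]
# ... become prior-distributed draws in theta-space.
thetas = [prior_transform(u, mu=1.0, sigma=2.0) for u in us]

sample_mean = mean(thetas)
sample_sd = stdev(thetas)
```

The ellipse is then fitted to the live points in u-space and the likelihood is evaluated at prior_transform(u), so drawing uniformly inside the ellipse still respects the prior. When no such inverse CDF is available in closed form, this route is closed, which is the limitation described above.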

Another couple of minor points.  It was a bit unfair to say that regular MCMC offers no means to estimate the marginal likelihood (or “evidence”) when in fact we have the notorious harmonic mean estimator and the Gelfand-Dey stabilised version thereof, for instance.  In fact, for thin-tailed near-Normal likelihood functions these can sometimes perform quite acceptably.  (I’d be interested to know how they performed on the test case in this paper.)  Likewise when the posterior is Gibbs sample-able the Chib estimator can do rather well too.  The reference to NS whopping MCMC+thermodynamic integration (TI) for marginal likelihood estimation (Shaw, Bridges, Hobson) is again a bit unfair/naive, since with a recursive estimator (Vardi 1985, Geyer 1994, Kong et al. 2003, Cameron & Pettitt 2013) you can get great evidence estimates with MCMC exploration under only a few temperatures (rather than the many [tens, hundreds] that can be required for TI).
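As a rough illustration of the two-temperature idea, here is a sketch on a conjugate toy problem of my own choosing (prior Normal(0, 1), one datum y with known noise sd s), where the evidence is known in closed form. The draws come from exact samplers standing in for MCMC output, and the function names are hypothetical. The recursive estimator below is the simplest biased-sampling fixed point with just the prior (beta = 0) and the posterior (beta = 1); the harmonic mean is included for comparison.

```python
import math
import random

def likelihood(theta, y=1.0, s=0.5):
    # Normal likelihood for a single datum y with known noise sd s (toy assumption).
    return math.exp(-0.5 * ((y - theta) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def harmonic_mean_evidence(post_draws):
    # Z is estimated by the harmonic mean of the likelihood over posterior draws.
    return len(post_draws) / sum(1.0 / likelihood(t) for t in post_draws)

def recursive_evidence(prior_draws, post_draws, n_iter=200):
    """Fixed-point iteration Z = sum_i L_i / (n0 + n1 * L_i / Z) over pooled draws."""
    n0, n1 = len(prior_draws), len(post_draws)
    like = [likelihood(t) for t in prior_draws + post_draws]
    z = sum(like[:n0]) / n0  # start from the plain prior-average estimate
    for _ in range(n_iter):
        z = sum(l / (n0 + n1 * l / z) for l in like)
    return z

rng = random.Random(42)
y, s = 1.0, 0.5

# Exact draws from the prior and from the conjugate posterior.
prior_draws = [rng.gauss(0.0, 1.0) for _ in range(5000)]
post_mean = y / (1.0 + s ** 2)
post_sd = math.sqrt(s ** 2 / (1.0 + s ** 2))
post_draws = [rng.gauss(post_mean, post_sd) for _ in range(5000)]

# Closed-form evidence: y is marginally Normal(0, 1 + s^2).
z_true = math.exp(-0.5 * y ** 2 / (1.0 + s ** 2)) / math.sqrt(2.0 * math.pi * (1.0 + s ** 2))
z_rec = recursive_evidence(prior_draws, post_draws)
z_hm = harmonic_mean_evidence(post_draws)
```

In this particular toy the likelihood is narrower than the prior (s < 1), which is exactly the regime where the harmonic mean estimator has infinite variance, so z_hm should be treated with suspicion even when it happens to land close to z_true.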

It’s worth noting that there were a number of popular and useful techniques not even mentioned by the authors (which is unfortunate given the review-like format of this paper): in particular population Monte Carlo (which has been used before in the cosmological context), ordinary importance sampling & adaptive [multiple] importance sampling; likewise Langevin / Hamiltonian MCMC methods.  Another point to make might be that the effective efficiency of an MCMC technique will depend on the information that you’re trying to extract from the posterior.  Here it’s the mean and covariance but one might be more interested in finding accurately the 95% credible intervals on certain parameters, for instance, or the posterior predictive for some transformation of the data.  For these objectives one might well find a different ordering of methods … 

Seems like I’ve run out of steam on this topic 🙂

Time to resume my search for useful results on the Radon-Nikodym derivative of stationary but non-zero mean, 2D Gaussian processes (on R^2).  If only I could time travel back to the 1950s/1960s I could find someone to ask about this!  (Capon, Shepp, Hajek [behind the Iron Curtain though.])  


This entry was posted in Astrostatistics.

3 Responses to Comparison of techniques for Bayesian parameter estimation …

  1. “Cameron & Pettitt 2013”

    I can’t find this paper. Could you please provide a link?

    As a Nested Sampling booster (not of the “elliptical” type, though), I need to see what this other method is all about. Getting the marginal likelihood from only a few temperatures seems a bit magical, but I’m sure that’s just because I don’t understand it.

  2. Yeah,
    The original idea goes back to Vardi (1985) [“biased sampling”; Gill et al. 1988] but it’s been rediscovered in other forms: Density of States (Swendsen & Wang 1987; Habeck 2012), and Reverse Logistic Regression (Geyer 1994, Kong et al. 2003); also Nested Sampling, and Importance Nested Sampling, are closely related.

    Apart from only requiring a few (minimum, two) temperatures, I show in an upcoming paper (to be put on astro-ph later this week, I hope) how Vardi’s original scheme can even apply to infinite-dimensional problems (in this case a Dirichlet process prior; see also Doss 2009 for a precursor) not admitting densities with respect to the Lebesgue measure, thanks to the continuous (or not more than countably discontinuous) mapping usually offered by the likelihood function.

  3. Thanks for the link. I think I was just spelling Pettitt incorrectly. Oops.

    I think the Allison & Dunkley paper is something we should see more of. There are lots of methods papers out there that compare themselves vs. one or two alternatives, showing that they’re better (I know because I wrote one. I still think DNest is awesome but there are a couple of known weaknesses). But this depends on the choice of demo problems and the selection of the competitor methods.

    The other thing I feel like saying is that if you can get your posterior distributions to look as smooth as their Figure 6, the problems you’re solving probably aren’t that complex.
