I enjoyed reading two recent papers (one recently revised, the other new) from the Brummy astronomy group: one presenting an adaptive subdivision of the posterior to facilitate efficient moves in RJMCMC (Reversible Jump MCMC), the other presenting an adaptive strategy for tuning the temperature schedule in PT-MCMC (parallel tempering MCMC). Although I enjoyed reading both & I find the increasing frequency of discussion of these methods in astronomy to be encouraging, all the pedantic bones in my typing fingers were ultimately sent into spasms. So the following is a bit of a rambling list of thoughts and technical corrections I need to write down.
RE the RJMCMC paper:
While the method presented is probably quite a useful solution for the sort of RJMCMC problems routinely encountered in astronomy (cf. MultiNest), the appeal to the wider statistical audience may be much less: in part because the authors begin implicitly with the classic astronomy assumption that the only interesting priors lie in R^n as densities with respect to Lebesgue measure (cf. “the model evidence is an integral of the unnormalized posterior probability distribution over the model parameter space”), such that their subsequent division of parameter space into hypercubes is a natural step.
Repeated references to the HME (harmonic mean estimator) as being biased grate (since it is easily shown to be formally unbiased: its expectation equals the true Z); as does the lack of a qualifier on the statement that the HME is an infinite-variance estimator (it's not *always*: e.g., for a proper, bounded prior density on a compact domain the variance is finite, though the estimator remains badly behaved in such cases!). A bunch of errors in the description of MCMC also stand out: e.g., the condition stated after their Eqn 2.5 is (despite what's claimed) neither a description of the equilibrium distribution nor of detailed balance; rather it is the definition of the invariant distribution (Tierney 1994 explains these distinctions).
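To make the HME point concrete, here's a toy conjugate sketch of my own (not from the paper): a single Gaussian datum with a proper uniform prior on a compact interval, so the HME has finite variance and exact posterior draws stand in for a converged chain. Widen the prior support and the variance of 1/L blows up quickly, which is the usual failure mode.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(1)
x = 1.5                        # single datum; model: x | theta ~ N(theta, 1)
lo, hi = 1.0, 2.0              # proper, bounded uniform prior on [lo, hi]

# true evidence: Z = integral of N(x | theta, 1) * 1/(hi - lo) over [lo, hi]
Z_true = (norm.cdf(x - lo) - norm.cdf(x - hi)) / (hi - lo)

# exact posterior draws (a truncated normal) stand in for a converged MCMC chain
post = truncnorm(a=(lo - x), b=(hi - x), loc=x, scale=1.0)
theta = post.rvs(size=50_000, random_state=rng)

# HME: the sample mean of 1/L over posterior draws is unbiased for 1/Z
lik = norm.pdf(x, loc=theta, scale=1.0)
Z_hme = 1.0 / np.mean(1.0 / lik)
print(Z_true, Z_hme)
```

With this bounded prior the estimate sits comfortably close to the truth; the pathologies only appear once 1/L is allowed heavy tails under the posterior.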
RE the adaptive PT paper:
The first thing that struck me was the (in my experience) unusual choice by the authors to target the auto-correlation time (ACT) of their T=1 chain during tuning, rather than specific posterior functional(s) as in, e.g., Miasojedow et al.'s 2012 paper on adaptive PT, and more generally in the literature on simulated tempering with dynamic weights (STDW; a very similar problem to dynamic PT): cf. examples in Liu's (2001) “Monte Carlo Strategies in Scientific Computing”. IMO the dependence of the optimal schedule on the chosen target criterion should perhaps have been more heavily emphasised.
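For readers unfamiliar with the quantity being targeted, here's a minimal initial-window estimator of the integrated ACT (a standard Sokal-style construction of my own, not the authors' implementation), demoed on an AR(1) chain with known ACT:

```python
import numpy as np

def integrated_act(x, c=5.0, max_lag=2000):
    """Initial-window estimate of the integrated autocorrelation time
    (Sokal-style: truncate the ACF sum once the lag exceeds c * tau)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / ((n - k) * var)
        tau += 2.0 * rho
        if k >= c * tau:        # self-consistent truncation window
            break
    return tau

# demo on an AR(1) chain, for which the true ACT is (1 + phi)/(1 - phi) = 9
rng = np.random.default_rng(4)
phi, n = 0.8, 200_000
chain = np.empty(n)
chain[0] = 0.0
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.normal()
tau_hat = integrated_act(chain)
print(tau_hat)
```

The window choice (here c = 5) trades truncation bias against the variance of summing noisy ACF tail estimates, which is itself a tuning decision.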
Like all previous astronomy papers, this one ignores the fact that draws from all temperatures in the parallel chains can be used (appropriately weighted) for estimating posterior functionals: either they can be combined directly, weighting with respect to the known temperature at which each chain was run, or else as a pseudo-mixture using the reweighting-mixtures scheme of Geyer 1994 (cf. Vardi 1985; Cameron & Pettitt 2014, Statistical Science). The same scheme can be used to estimate normalizing constants at every rung of the temperature ladder, and therefore to estimate the KL divergences needed to adaptively implement an alternative tuning scheme from the literature discussed in the paper (thus solving a problem brought up by the authors).
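A toy illustration of the first (known-temperature) option, on a conjugate Gaussian model of my own with exact tempered draws standing in for MCMC output: a single hot chain is carried up to T=1 by self-normalized importance weights proportional to L^(1 - 1/T). Pooling many rungs at once is where Geyer's mixture scheme earns its keep; this is just the one-rung version.

```python
import numpy as np

rng = np.random.default_rng(2)
x, s0 = 1.5, 3.0               # datum x | theta ~ N(theta, 1); prior theta ~ N(0, s0^2)

def tempered_post(beta):
    """Conjugate tempered posterior, proportional to N(x|theta,1)^beta * N(theta|0,s0^2)."""
    prec = beta + 1.0 / s0**2
    return beta * x / prec, np.sqrt(1.0 / prec)

def loglik(theta):
    return -0.5 * (x - theta) ** 2          # up to an additive constant

beta_hot = 0.4                              # inverse temperature 1/T of the hot chain
m_hot, s_hot = tempered_post(beta_hot)
theta = rng.normal(m_hot, s_hot, size=200_000)   # exact draws standing in for the hot chain

# known-temperature importance weights carrying the hot draws up to beta = 1
logw = (1.0 - beta_hot) * loglik(theta)
w = np.exp(logw - logw.max())
est = np.sum(w * theta) / np.sum(w)         # self-normalized estimate of E[theta | x]

m_cold, _ = tempered_post(1.0)              # analytic posterior mean for comparison
print(est, m_cold)
```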
The value of the above-mentioned KL-divergence spacing scheme would presumably be greatest when one wants to use the chains for thermodynamic integration (i.e., for a quite different optimality criterion to the T=1 ACT), where the error is driven primarily by the J-divergence (Lefebvre et al. 2010; see also Friel & Pettitt 2008; and Friel, Hurn & Wyse 2014). The last paper also points to the stepping-stone estimator of the marginal likelihood, which is shared with SMC samplers; and indeed SMC samplers run on the thermodynamic path (e.g. del Moral et al. 2006) carry their own built-in adaptive mechanism, deciding whether or not a refreshment kernel needs to be run based on the effective sample size at each of many finely-spaced temperatures (similar in spirit to the ACT criterion).
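A minimal sketch of the stepping-stone identity on the same sort of toy conjugate model (exact tempered draws in place of chains; not any cited author's implementation): each ratio Z(b1)/Z(b0) is estimated as the mean of L^(b1-b0) under the b0-tempered posterior, and the log-ratios accumulate to the log evidence.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x, s0 = 1.5, 3.0               # datum x | theta ~ N(theta, 1); prior theta ~ N(0, s0^2)
logZ_true = norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + s0**2))   # analytic evidence

def tempered_post(beta):
    """Conjugate tempered posterior, proportional to N(x|theta,1)^beta * N(theta|0,s0^2)."""
    prec = beta + 1.0 / s0**2
    return beta * x / prec, np.sqrt(1.0 / prec)

betas = np.linspace(0.0, 1.0, 21)           # thermodynamic path from prior to posterior
logZ = 0.0
for b0, b1 in zip(betas[:-1], betas[1:]):
    m, s = tempered_post(b0)
    th = rng.normal(m, s, size=100_000)     # exact draws at inverse temperature b0
    ll = (b1 - b0) * norm.logpdf(x, loc=th, scale=1.0)
    # stepping stone: Z(b1)/Z(b0) = E_{b0}[L^(b1-b0)], accumulated in log space
    logZ += np.log(np.mean(np.exp(ll - ll.max()))) + ll.max()

print(logZ, logZ_true)
```

Unlike the trapezoidal thermodynamic-integration rule, each stepping-stone ratio is unbiased whatever the spacing; the spacing only controls the variance, which is exactly where a divergence-based schedule would enter.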
And I couldn’t possibly conclude without reminding readers that one can also *temper on data* for iid observations: e.g., Chopin 2001 for an SMC version; Cameron & Pettitt 2014 for a ‘PT’ version. Here a root-n schedule is generally successful, and tuning is easily achieved since the space of possible data partitions is nicely bounded and discrete.