I enjoyed reading two recent papers (one recently revised, the other new) from the Brummy astronomy group: one presenting an adaptive subdivision of the posterior to facilitate efficient moves in RJMCMC (Reversible Jump MCMC), and the other presenting an adaptive strategy for tuning the temperature schedule in PT-MCMC (parallel tempering MCMC). Although I enjoyed reading both & I find the increasing frequency of discussion of these methods in astronomy encouraging, all the pedantic bones in my typing fingers were ultimately sent into spasms. So the following is a bit of a rambling list of thoughts and technical corrections I need to write down.

RE the RJMCMC paper:

While the method presented is probably quite a useful solution for the sort of RJMCMC problems routinely encountered in astronomy (cf. MultiNest), its appeal to the wider statistical audience may be much less: in part because the authors begin implicitly with the classic astronomy assumption that the only interesting priors live in R^n as densities with respect to Lebesgue measure (cf. “the model evidence is an integral of the unnormalized posterior probability distribution over the model parameter space”), such that their subsequent division of parameter space into hypercubes is a natural step.

Repeated references to the HME as being biased grate (since it is easily shown to be formally unbiased: its expectation equals the true Z); as does the lack of a qualifier on the statement that the HME is an infinite-variance estimator (it’s not *always*, e.g. for a proper, bounded prior density on a compact domain, though it is still badly behaved even in such cases!). A number of errors in the description of MCMC also stand out: e.g. the condition stated after their Eqn 2.5 is (despite what’s claimed) neither a description of the equilibrium distribution nor of detailed balance; rather, it is the definition of the invariant distribution (Tierney 1994 explains these distinctions).
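To make the variance point concrete, here’s a quick toy sketch (my own construction, not from the paper): a single Gaussian datum with a conjugate Gaussian prior, for which the evidence Z is known in closed form, so the posterior can be sampled exactly and the HME’s instability observed directly.

```python
# Toy demonstration of the harmonic mean estimator (HME) of the evidence
# Z = integral of prior * likelihood.  Model (my own, purely illustrative):
# one datum y with likelihood N(y; theta, 1) and prior theta ~ N(0, 9),
# so the posterior is N(1.8, 0.9) and Z = N(y; 0, 10) exactly.
import numpy as np

rng = np.random.default_rng(0)
y, prior_var = 2.0, 9.0
post_var = 1.0 / (1.0 / prior_var + 1.0)        # = 0.9
post_mean = post_var * y                        # = 1.8
true_z = np.exp(-0.5 * y**2 / (prior_var + 1)) / np.sqrt(2 * np.pi * (prior_var + 1))

def hme(n):
    # exact posterior draws, then the reciprocal of the mean reciprocal likelihood
    theta = rng.normal(post_mean, np.sqrt(post_var), size=n)
    like = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
    return 1.0 / np.mean(1.0 / like)

estimates = [hme(10_000) for _ in range(20)]
print(true_z, min(estimates), max(estimates))
```

Even with the posterior sampled exactly, repeated runs of this estimator scatter noticeably in this very easy problem; the rare small-likelihood draws that should pull the estimate down are easily missed.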

RE the adaptive PT paper:

The first thing that struck me was the (in my experience) unusual choice by the authors to target the auto-correlation time (ACT) of their T=1 chain during tuning, rather than specific posterior functional(s) as in, e.g., Miasojedow et al.’s 2012 paper on adaptive PT and more generally in the literature on simulated tempering with dynamic weights (STDW; a very similar problem to dynamic PT): cf. examples in Liu’s (2001) “Monte Carlo Strategies in Scientific Computing”. IMO the relationship between the optimal schedule and the target criterion should perhaps have been more heavily emphasised.
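For reference, the integrated ACT is usually estimated via a windowed sum of the autocorrelation function (Sokal’s adaptive window; the same construction appears in, e.g., emcee’s autocorrelation module). A sketch on a toy AR(1) chain of my own, whose true integrated ACT is (1+phi)/(1-phi) = 9:

```python
# Integrated autocorrelation time with Sokal's adaptive window,
# checked against an AR(1) chain with known answer (a toy of my own).
import numpy as np

def integrated_act(x, c=5.0):
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # autocovariance via FFT (zero-padded to avoid circular wrap-around)
    f = np.fft.rfft(x, n=2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n]
    rho = acov / acov[0]
    taus = 2.0 * np.cumsum(rho) - 1.0           # tau_M = 1 + 2 * sum_{k<=M} rho_k
    # adaptive window: smallest M with M >= c * tau_M
    window = np.argmax(np.arange(n) >= c * taus)
    return taus[window]

rng = np.random.default_rng(1)
phi, n = 0.8, 200_000
x = np.empty(n)
x[0] = 0.0
eps = rng.normal(size=n)
for i in range(1, n):                           # AR(1): x_i = phi * x_{i-1} + eps_i
    x[i] = phi * x[i - 1] + eps[i]
tau_hat = integrated_act(x)
print(tau_hat)                                  # should come out near 9
```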

Like all previous astronomy papers, this one ignores the fact that the draws from all temperatures in the parallel chains can be used (appropriately weighted) for estimating posterior functionals: either combined directly, weighted according to the known temperature at which each chain was run, or else treated as a pseudo-mixture using the reweighting-mixtures scheme of Geyer 1994 (cf. Vardi 1985; Cameron & Pettitt 2014, Statistical Science). The same scheme can be used to estimate the normalizing constants at every temperature in the chain, and therefore to estimate the KL divergences needed to adaptively implement an alternative tuning scheme from the literature discussed in the paper (thus solving a problem brought up by the authors).

The value of the above-mentioned KL divergence spacing scheme would presumably be greatest when one wants to use the chains for thermodynamic integration (i.e. for a quite different optimality criterion to the T=1 ACT), where the error is driven primarily by the J-divergence (Lefebvre et al. 2010; see also Friel & Pettitt 2008; and Friel, Hurn & Wyse 2014). The last paper also points to the stepping stone estimator of the marginal likelihood, which is shared with SMC samplers; and indeed SMC samplers run on the thermodynamic path (e.g. del Moral et al. 2006) carry their own in-built adaptive method, in which the decision whether or not to run a refreshment kernel is based on the effective sample size at each of many finely-spaced temperatures (similar in spirit to the ACT criterion).
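For illustration, a minimal thermodynamic integration sketch on a conjugate-Gaussian toy of my own (one datum, Gaussian prior): the tempered posteriors are themselves Gaussian, so exact draws at each rung stand in for MCMC, and the trapezoid rule over E_t[log L] recovers log Z along the path.

```python
# Thermodynamic integration: log Z = int_0^1 E_t[log L] dt, with the
# expectation at each temperature t estimated by Monte Carlo and the
# integral by the trapezoid rule.  Toy model: y ~ N(theta, 1), theta ~ N(0, 9).
import numpy as np

rng = np.random.default_rng(0)
y, prior_var = 2.0, 9.0
ts = np.linspace(0.0, 1.0, 41)                  # temperature grid
rung_means = []
for t in ts:
    v = 1.0 / (1.0 / prior_var + t)             # tempered posterior is N(t*y*v, v)
    theta = rng.normal(t * y * v, np.sqrt(v), size=10_000)
    loglike = -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2
    rung_means.append(loglike.mean())           # Monte Carlo estimate of E_t[log L]
rung_means = np.array(rung_means)
# trapezoid rule over the temperature path
logz_ti = np.sum((rung_means[1:] + rung_means[:-1]) * np.diff(ts)) / 2.0
# exact answer for this toy: log N(y; 0, prior_var + 1)
logz_true = -0.5 * np.log(2 * np.pi * (prior_var + 1)) - 0.5 * y**2 / (prior_var + 1)
print(logz_ti, logz_true)
```

Note how sharply E_t[log L] varies near t = 0 in even this trivial example; that curvature is exactly why the spacing of the temperature rungs (uniform here, for simplicity) matters so much for the discretization error.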

*And I couldn’t possibly conclude without reminding readers that one can also *temper on data* for iid observations: e.g. Chopin 2001 (SMC version); Cameron & Pettitt 2014 for a ‘PT’ version. Here a root-n schedule is generally successful.* But tuning would in any case be easily achieved, since the partition space is nicely bounded and discrete.

A few reasons I’m over parallel tempering, and everyone else should be too:

i) Phase changes (swap acceptance between neighbouring temperatures collapses across a first-order transition, however finely the schedule is tuned)

ii) In many problems the KL divergence from prior to posterior can be in the thousands, and I don’t ever want to run thousands of parallel chains.

Could you elaborate or provide a reference on your remark “they can be combined directly weighting with respect to the known temperature at which each chain was run”?

I had a look at the references you gave and from what I could gather, they are mainly concerned with the second option you describe, i.e. the pseudo-mixture, which I find rather cumbersome to implement in practice.

So there are a few ways to proceed in combining the draws at multiple temperatures for estimation of posterior functionals, depending on whether one estimates the ratios Z_j/Z going into the importance weights (w_{ij} = (Z_j/Z) * L(theta_{ij})^[1-t_j] / n, where n is the total number of draws over all temperatures), or else ‘a posteriori’ normalizes the weights from each temperature (such that w_{ij} = c_j * L(theta_{ij})^[1-t_j], with c_j chosen so that sum_i w_{ij} = m_j/n, where m_j is the number of draws at the j-th temperature). In the first case one could take as estimator for Z_j the partial thermodynamic integral up to temperature t_j, or else use the stepping stone estimator (Xie et al., Fan et al.).
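A minimal numerical sketch of the second option, on a toy of my own choosing (one Gaussian datum, conjugate Gaussian prior) so that each tempered posterior can be sampled exactly rather than by MCMC:

```python
# 'A posteriori' normalized combination of draws at multiple temperatures:
# draws at temperature t_j get weights w_ij proportional to L(theta_ij)^(1 - t_j),
# normalized within each chain so that sum_i w_ij = m_j / n, then pooled.
# Toy model: y ~ N(theta, 1), theta ~ N(0, 9); posterior mean = 1.8.
import numpy as np

rng = np.random.default_rng(0)
y, prior_var = 2.0, 9.0
temps = [1.0, 0.5, 0.25, 0.1]
m = 50_000                                      # draws per temperature (m_j)
n_total = m * len(temps)                        # n

def loglike(theta):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2

pooled_theta, pooled_w = [], []
for t in temps:
    v = 1.0 / (1.0 / prior_var + t)             # tempered posterior is N(t*y*v, v)
    theta = rng.normal(t * y * v, np.sqrt(v), size=m)
    logw = (1.0 - t) * loglike(theta)           # w_ij proportional to L^(1 - t_j)
    w = np.exp(logw - logw.max())
    w *= (m / n_total) / w.sum()                # c_j: sum_i w_ij = m_j / n
    pooled_theta.append(theta)
    pooled_w.append(w)

theta = np.concatenate(pooled_theta)
w = np.concatenate(pooled_w)
post_mean = np.sum(w * theta) / np.sum(w)
print(post_mean)                                # analytic posterior mean is 1.8
```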

I can’t think of a theoretical paper describing this approach: perhaps because most of the parallel tempering work has been done by advocates of the reverse logistic regression / density of states method, they mostly seem to use that (since it will have small variance); the philosophy is in any case within the remit of Hesterberg’s ‘Weighted Average Importance Sampling and Defensive Mixture Distributions’. One would have to search the applied stats literature for examples of the above weighting scheme. Worth noting is that, although the above method(s) will have larger variance than the full RLR solution, they will have smaller variance than using the posterior (t=1) draws alone [for most posterior functionals of interest].

Thank you for your reply. Do I understand correctly that the c_j will all be equal if I have an equal number of draws at each temperature? This is the case for the parallel tempered MCMC sampler in the emcee package (http://dan.iel.fm/emcee/current/).

I’ve tried to use the draws at all temperatures with their respective weights to construct an empirical posterior CDF for the parameters of my model from which I estimate the 0.16, 0.5, and 0.84 quantiles to report as the most likely parameters of my model given the data. But even the weights for the second-hottest temperature are on the order of 10^-1 and therefore have very little influence on the estimated CDF and the variance of my estimates.

Yeah, that’s right regarding the c_j’s.

For an L^0.1 temperature I can indeed imagine that its contribution to estimation of the 1 sigma confidence interval will be small, but the hotter draws would be useful for other tasks like estimating 3 sigma CIs (obviously, since by definition most of the t=1 posterior draws will lie far inside these quantiles), or the variance and other higher-order moments.
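For what it’s worth, here’s a sketch of the weighted empirical CDF / quantile computation on a single reweighted tempered chain (again a conjugate-Gaussian toy of my own, where the exact posterior quantiles are available for comparison):

```python
# Weighted quantiles from a reweighted tempered chain.  Toy model of my own:
# y ~ N(theta, 1), theta ~ N(0, 9), so the posterior is N(1.8, 0.9); draws
# are taken at t = 0.5 and reweighted by L^(1 - t) back to the posterior.
import numpy as np

def weighted_quantile(x, w, q):
    order = np.argsort(x)
    x, w = x[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)              # weighted empirical CDF
    return x[np.searchsorted(cdf, q)]

rng = np.random.default_rng(0)
y, prior_var, t = 2.0, 9.0, 0.5
v = 1.0 / (1.0 / prior_var + t)                 # tempered posterior is N(t*y*v, v)
theta = rng.normal(t * y * v, np.sqrt(v), size=100_000)
logw = (1.0 - t) * (-0.5 * (y - theta) ** 2)    # w proportional to L^(1 - t)
w = np.exp(logw - logw.max())
q16, q50, q84 = (weighted_quantile(theta, w, q) for q in (0.16, 0.5, 0.84))
print(q16, q50, q84)   # analytic posterior quantiles: approx 0.857, 1.8, 2.743
```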