It will be no surprise to my readers that this paper on astro-ph today concerning “mixed MCMC” by Hu, Hendry & Heng got my back up, since three of my pet hates appear early on in the manuscript: (i) the citing of astronomical papers for statements on MCMC theory rather than the original statistical source papers (e.g. citing Allison & Dunkley 2013 and Veitch & Vecchio 2010 for the dimensionality scaling of MCMC); (ii) the super-loose use of terminology like “convergence” (convergence in what sense? towards what?); and (iii) the “testing” of algorithms exclusively on low-dimensional toy examples. These prejudices aside, however, I find a couple of glaring flaws in the proposed idea.
First, in order to implement Hu et al.’s “mixed MCMC” one is supposed to already have “at least some rough prior knowledge about the posterior”, which turns out to mean that you know the locations of all the principal posterior modes, and better still that you have some idea of their relative importance. Already this vastly restricts the class of inference problems to which the method might be applied; moreover, it limits it to quite trivial cases in which the above information can be obtained cheaply, e.g. by sampling directly from the prior. For such problems one could just as well apply a simple importance sampling strategy using a multimodal proposal distribution with peaks on each mode (and thereby get a decent marginal likelihood estimate for free). In this sense the class of problems amenable to “mixed MCMC” should be considered much smaller than those amenable to, e.g., MultiNest nested sampling or parallel tempering (two established techniques for dealing with multimodal posteriors).
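To illustrate that point, here is a minimal sketch (Python, on a hypothetical 1D two-Gaussian toy target of my own invention, not anything from the paper) of the importance-sampling alternative: given the mode locations and rough relative weights, a Gaussian-mixture proposal peaked on each mode yields posterior expectations and a marginal-likelihood estimate from the same set of weights.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy "posterior" with two well-separated modes of known
# location; it happens to be normalised here, so the marginal-likelihood
# estimate should come out near 1.
modes = np.array([-5.0, 5.0])
scales = np.array([1.0, 1.0])
rel_weights = np.array([0.5, 0.5])   # assumed rough relative importance

def unnorm_post(theta):
    return (0.5 * stats.norm.pdf(theta, -5.0, 1.0)
            + 0.5 * stats.norm.pdf(theta, 5.0, 1.0))

# Multimodal proposal: Gaussians on each known mode, inflated a little so
# the proposal tails dominate the target's.
def proposal_pdf(theta):
    return sum(w * stats.norm.pdf(theta, m, 1.5 * s)
               for w, m, s in zip(rel_weights, modes, scales))

def proposal_sample(n):
    comp = rng.choice(len(modes), size=n, p=rel_weights)
    return rng.normal(modes[comp], 1.5 * scales[comp])

theta = proposal_sample(100_000)
w = unnorm_post(theta) / proposal_pdf(theta)   # importance weights
Z_hat = w.mean()                               # marginal-likelihood estimate
post_mean = np.sum(w * theta) / np.sum(w)      # self-normalised E[theta]
print(Z_hat, post_mean)
```

Note the proposal components are deliberately wider than the target modes so the importance weights stay bounded; with mismatched or too-narrow components the weight variance would blow up, which is the same pathology in different clothing.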
Second, the scheme will presumably only work efficiently for multimodal posteriors in which the modes are of very similar width and shape (as in the toy example given; not surprisingly!). To jump from mode to mode the authors use a proposal that is a symmetric perturbation about the current point plus the difference between the two mode centroids, i.e., theta_proposed = theta_current + (theta_mode_b_centroid − theta_mode_a_centroid) + N(0, sigma^2). If, for instance, the region of high posterior density around mode a is much wider than that around mode b, then we will spend a lot of time in the outskirts of mode a, which map via this proposal to regions of low posterior density around mode b, meaning a lot of rejected proposals. As the dimension increases I would expect this to become a serious problem.
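A quick numerical check of this claim (again on a hypothetical 1D two-Gaussian target; the centroids, widths, and jump scale are my own choices, not the paper's): estimate the mean Metropolis–Hastings acceptance probability of the a-to-b jump for current states drawn from mode a, once with equal mode widths and once with a wide mode a and a narrow mode b.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_a, mu_b = -10.0, 10.0   # assumed mode centroids

def log_target(theta, s_a, s_b):
    # Equal-weight two-Gaussian mixture (unnormalised is fine for MH ratios).
    return np.logaddexp(stats.norm.logpdf(theta, mu_a, s_a),
                        stats.norm.logpdf(theta, mu_b, s_b))

def jump_acceptance(s_a, s_b, n=50_000, jump_sigma=0.5):
    """Mean MH acceptance of the a->b mode jump for theta drawn from mode a."""
    theta = rng.normal(mu_a, s_a, n)  # current states within mode a
    theta_prop = theta + (mu_b - mu_a) + rng.normal(0.0, jump_sigma, n)
    log_ratio = log_target(theta_prop, s_a, s_b) - log_target(theta, s_a, s_b)
    return np.exp(np.minimum(0.0, log_ratio)).mean()  # E[min(1, ratio)]

acc_same = jump_acceptance(s_a=1.0, s_b=1.0)   # similar widths: healthy acceptance
acc_diff = jump_acceptance(s_a=3.0, s_b=0.3)   # wide a, narrow b: acceptance collapses
print(acc_same, acc_diff)
```

The mechanism is exactly as described above: points in the wide mode's outskirts are translated by the fixed offset into the narrow mode's far tails, where the posterior density (and hence the acceptance ratio) is tiny.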
UPDATE: While reading through the classic Tierney (1994) paper to follow up on some questions I had concerning the nature of convergence for functionals of MCMC chains, I noticed the description of autoregressive chains (Section 2.3.5) and their mixtures (Section 2.4), which rang a bell in relation to the above “mixed MCMC”. The latter can be recovered within the mixture-of-autoregressive-chains framework as the limiting case of a mixture of independence chains with proposal densities f(·) symmetric about the mode differences; but more interesting would be to use the given proposal form in autoregressive mode, making proposals that shrink the current state towards a randomly chosen mode with a zero-centred perturbation f(·) scaled to the width and shape of that target mode. This would address the second problem I identified above.
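To make that concrete, here is a minimal sketch (Python; the bimodal target with unequal widths, the shrinkage parameter B, and all numbers are my own illustration, not from Tierney or from Hu et al.) of a Metropolis–Hastings sampler whose proposal is a mixture of autoregressive kernels, one per known mode: pick a mode k at random, shrink the current state towards mu_k by factor B, and perturb with a zero-centred Gaussian matched to that mode's width. Setting B = 0 recovers the mixture-of-independence-chains limiting case.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical target: two Gaussian modes of very different widths --
# exactly the situation in which the fixed-offset jump proposal struggles.
mus = np.array([-5.0, 5.0])   # assumed known mode centroids
sds = np.array([2.0, 0.4])    # assumed known mode widths
wts = np.array([0.5, 0.5])    # assumed relative importance

def norm_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

def log_pi(x):
    return np.logaddexp.reduce(np.log(wts) + norm_logpdf(x, mus, sds))

B = 0.1  # shrinkage towards the chosen mode; B = 0 gives an independence mixture

def log_q(y, x):
    # Density of the mixture-of-autoregressive-kernels proposal q(y | x).
    means = mus + B * (x - mus)
    scales = sds * np.sqrt(1.0 - B ** 2)   # perturbation matched to mode width
    return np.logaddexp.reduce(np.log(wts) + norm_logpdf(y, means, scales))

def step(x):
    k = rng.choice(len(mus), p=wts)        # pick a target mode at random
    y = mus[k] + B * (x - mus[k]) + rng.normal(0.0, sds[k] * np.sqrt(1 - B ** 2))
    # Full MH correction with the mixture proposal density in both directions.
    log_alpha = (log_pi(y) + log_q(x, y)) - (log_pi(x) + log_q(y, x))
    return y if np.log(rng.uniform()) < log_alpha else x

x, chain = -5.0, np.empty(20_000)
for i in range(chain.size):
    x = step(x)
    chain[i] = x

frac_b = (chain > 0).mean()   # time spent in the narrow mode; should be near 0.5
print(frac_b)
```

Because each proposal component carries its own mode's scale, jumps from the wide mode land with the narrow mode's width rather than the wide mode's, and the chain moves freely between modes despite their very different shapes.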