I noticed this paper on astro-ph today by Hou, Goodman & Hogg describing their “new” idea for marginal likelihood estimation along the geometric path. A quick inspection reveals that the idea being presented is in fact exactly the same as Fan et al.’s (2011) generalised “stepping stone” estimator, which is in turn an extension of Xie et al.’s (2010) naive “stepping stone” estimator to integration from a well-chosen reference density, following the ideas of Lefebvre et al. (2010) (and Friel & Pettitt 2008) regarding the connection between the error of thermodynamic integration along the geometric path and the J-divergence between the two endpoint distributions.
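For readers new to this: in my own notation (not necessarily that of any of the above papers), the geometric path and the generalised stepping stone estimator of the marginal likelihood Z can be written as

```latex
p_\beta(\theta) \propto g(\theta)^{1-\beta}\,\big[\pi(\theta)\,L(\theta)\big]^{\beta},
\qquad 0 = \beta_0 < \beta_1 < \dots < \beta_K = 1,
```

```latex
\hat{Z} \;=\; \prod_{k=1}^{K} \frac{1}{n}\sum_{i=1}^{n}
\left[\frac{\pi(\theta_{k,i})\,L(\theta_{k,i})}{g(\theta_{k,i})}\right]^{\beta_k-\beta_{k-1}},
\qquad \theta_{k,i} \sim p_{\beta_{k-1}},
```

with g the (normalised) reference density; taking g equal to the prior recovers Xie et al.’s original stepping stone.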

It’s odd that the authors didn’t mention any of these previous presentations of this marginal likelihood estimator. For one thing, if ever you go on the internet you will find that the stepping stone estimator is all up in there like lolcats, memes, and bad selfies. (Just type “geometric path” and “marginal likelihood” into Google; see, e.g., the review by Baele et al. 2012.) For another thing, I sent Hogg & Hou a copy of my recursive marginal likelihood estimators paper (which discusses this stepping stone business and the choice of reference density for geometric path sampling) in October 2013 (and it had been on arXiv for many months before then). I even wrote on Hogg’s blog, just a few days ago, that I thought what I’d heard of their paper sounded a lot like Lefebvre’s ideas. At that time Hogg’s reply was rather cryptic: “*Yes, didn’t mean to claim novelty! Just invention*.” Overlooking the fact that the dictionary definition of invention gives novelty as its implication (e.g. “create or design [something that has not existed before]”: link), I would at least have thought they’d cite Lefebvre et al. in their arXived version. (Strangely, they did cite my recursive estimators paper in the context of prior-sensitivity; which was nice!)

For those interested in using the stepping stone estimator for geometric path sampling from an auxiliary density, it’s worth noting here a few other ideas overlooked by Hou, Goodman & Hogg. First, although they state that “their” method has computational advantages over reversible jump MCMC, parallel tempering, nested sampling, and population Monte Carlo, this is certainly not guaranteed in general and will very likely be untrue in the specific cases of multi-modal or fat-tailed posteriors. The point is that, since the error of this method depends on the J-divergence between the two endpoint densities (auxiliary and posterior), one can easily choose a poor auxiliary with a worse J-divergence against the posterior than the prior itself. Indeed, for some posterior–auxiliary combinations one can end up in the infinite-variance quagmire of the dreaded harmonic mean. As anyone who truly understands marginal likelihood estimation will tell you: it’s horses for courses, baby! One way to limit the potential damage caused by unwittingly adopting a poor auxiliary is a defensive importance sampling strategy (Hesterberg 1995): that is, taking instead of Hou et al.’s Laplace-approximation-esque g(theta) the defensive mixture h(theta) = alpha*pi(theta) + (1-alpha)*g(theta) [for 0 < alpha < 1, alpha small].
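A minimal sketch of the defensive mixture idea, on a toy problem of my own devising where the evidence is known in closed form (Gaussian prior and likelihood; all names and settings here are mine, not Hou et al.’s):

```python
import numpy as np

rng = np.random.default_rng(0)

def npdf(x, m, s):
    """Gaussian density N(x; m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Toy model: prior pi = N(0, 1); likelihood L(theta) = N(y | theta, sigma^2).
# The evidence is then known in closed form: Z = N(y; 0, 1 + sigma^2).
y, sigma = 1.0, 0.5
Z_true = npdf(y, 0.0, np.sqrt(1.0 + sigma**2))

# Laplace-style auxiliary g: here the posterior happens to be exactly Gaussian.
s2 = 1.0 / (1.0 + 1.0 / sigma**2)   # posterior variance
mu = s2 * y / sigma**2              # posterior mean

# Defensive mixture h = alpha*pi + (1 - alpha)*g, with alpha small.
alpha, n = 0.05, 200_000
from_prior = rng.random(n) < alpha
theta = np.where(from_prior,
                 rng.normal(0.0, 1.0, n),
                 rng.normal(mu, np.sqrt(s2), n))

pi_t = npdf(theta, 0.0, 1.0)
L_t = npdf(y, theta, sigma)
h_t = alpha * pi_t + (1.0 - alpha) * npdf(theta, mu, np.sqrt(s2))

Z_hat = np.mean(pi_t * L_t / h_t)   # importance sampling estimate of Z
```

Because h >= alpha*pi everywhere, the weights pi*L/h are bounded above by max L / alpha, which is exactly what rules out the harmonic-mean-style infinite-variance failure mode even when g is badly chosen.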

It’s also worth remembering, as I stress in my recursive estimators paper, that the stepping stone path is not only used by the stepping stone method: it existed previously as the preferred identity for estimating marginal likelihoods via the population Monte Carlo method (e.g. del Moral et al. 2006). In this case it is really the mechanics by which the tempered bridging distributions are explored that sets the difference between methods; again, which is better will ultimately come down to implementation- and problem-specific considerations. Finally, in my opinion (forged from experience in practical applications, though I have only a heuristic argument to back this up) the biased sampling (Vardi 1985) summation presented in my recursive estimators paper will always provide a more accurate estimator of the marginal likelihood than the stepping stone given the exact same draws. The reason is that the pseudo-importance sampling character of biased sampling makes for more efficient use of the available information, *and* the biased sampling version gives direct access to an efficient estimator of prior-sensitivity. So if you use Hou et al.’s method then you ought to try my (by which I mean Vardi’s, Geyer’s, Kong’s, etc.) summation!

Sorry we didn’t previously cite this prior work; I don’t think we claimed our method is “new”. If we did, we will correct that; we just did the simplest thing we thought would work. Sorry if our paper pissed you off!

Let me add a little to David’s reply. The general idea, in my mind, goes back at least to the multi-histogram methods of Swendsen and his collaborators. I don’t think that our method is “exactly” the same as Swendsen’s or later work, some of which is unaware of Swendsen. The questions are:

1. Did we help others by showing them a method better than the method they would have used? In the present application, the people we were talking to were using diffusive nested sampling. The method described in our paper (whatever its origin) is better for the planet counting problem.

2. Did we add useful details? I think so. We choose \Delta \beta using an MCMC variance estimator in a way that gives definite variance bounds for the answer. We choose the initial Gaussian trial distribution from the posterior. We use the stretch move ensemble sampler. All those things are suggestions others using this general circle of ideas might be interested in.

I’m sure there are problems that our method is bad for. That’s true of any computational procedure.

I apologize for writing this comment without having read all the papers you refer to. I will have a more detailed look and try to offer a more informed comment in a few days.

Hi David, Jonathan,

While it is true that you never say the method presented is “new” or your “invention”, you do refer to it multiple times as “our method”, and by omitting any reference to its origins or to other equivalent methods you leave the overwhelming impression that you are presenting it as a genuine innovation. To see its exact equivalence to Fan et al.’s generalised stepping stone (GSS) estimator, just change the notation from theirs to yours as follows: f() -> L() and pi_0() -> g(), and compare their Eqn 1 to your Eqns 9 and 10. The choice of a Gaussian at the posterior mode is quite standard for such applications: apart from its past uses with the GSS, it’s also the default choice for ordinary importance sampling and nested importance sampling. Convergence rates and the choice of temperature sequence (which can be non-uniform) are also well discussed by Xie et al. and Fan et al. While I do applaud your efforts to spread better statistical/computational ideas within the exoplanet community, I think it’s equally important to give your audience an accurate reference list and context for these ideas, to help them in their own efforts to explore the topic at hand.

cheers,

Ewan.

p.s. This is unrelated to the above, but it might be worth noting for readers new to Bayesian inference that the usual statistics term corresponding to the part of your analysis where you restrict ‘the periods of the companions to be in a monotonic order, for the sake of sampling’ is “forced identifiability”.

“1. Did we help others by showing them a method better than the method they would have used? In the present application, the people we were talking to were using diffusive nested sampling. The method described in our paper (whatever its origin) is better for the planet counting problem.”

The people we were talking to were ourselves, weren’t they? 🙂 Still, nobody has tried Diffusive Nested Sampling on this problem in a way that I would recommend (i.e. do all models at once, don’t use stretch moves).

Hi Ewan,

Thank you for all your comments.

I went through Xie’s, Fan’s, and Baele’s papers and summarize our differences from them as follows:

1. Xie et al. and Fan et al. set a beta-increasing schedule (Beta-distributed, equally spaced, … ) before their evaluations and don’t change it during a run. We don’t pre-set a beta-increasing schedule: how beta will be increased from 0 to 1 is determined in the course of the evaluation.

2. Xie et al. use the prior as their reference density. Fan et al. and we sample the posterior first and use those samples to parametrize our reference density. But Fan et al. use the variances of the individual dimensions, with each dimension independent in their reference density, while we use the whole covariance matrix to parametrize ours.
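Point 1 can be illustrated with a toy sketch (the variance criterion below is a generic one of my own, not necessarily the estimator used in the paper). With a Gaussian prior, likelihood, and reference, every tempered density g^(1-beta)*(pi*L)^beta along the path is itself Gaussian, so we can draw from it directly instead of running MCMC:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_npdf(x, m, v):
    """log of the Gaussian density N(x; m, v)."""
    return -0.5 * np.log(2.0 * np.pi * v) - 0.5 * (x - m) ** 2 / v

# Toy: prior N(0,1), likelihood N(y | theta, 0.25) with y = 1, so the
# evidence is Z = N(y; 0, 1.25) and the posterior is N(0.8, 0.2).
y, sig2 = 1.0, 0.25
sp2 = 1.0 / (1.0 + 1.0 / sig2)                 # posterior variance
mp = sp2 * y / sig2                            # posterior mean
logZ_true = log_npdf(y, 0.0, 1.0 + sig2)

mg, vg = 0.5, 0.4                              # deliberately imperfect reference g

def log_r(x):
    """log[ pi(x) L(x) / g(x) ] -- computable without knowing Z."""
    return log_npdf(x, 0.0, 1.0) + log_npdf(y, x, sig2) - log_npdf(x, mg, vg)

beta, logZ_hat, n, target_sd = 0.0, 0.0, 4_000, 0.05
while beta < 1.0:
    # p_beta ~ g^(1-beta) * (pi*L)^beta is Gaussian (precision-weighted moments).
    tau = (1.0 - beta) / vg + beta / sp2
    mu = ((1.0 - beta) * mg / vg + beta * mp / sp2) / tau
    x = rng.normal(mu, 1.0 / np.sqrt(tau), n)
    lr = log_r(x)
    # Adapt the increment so the per-step log-weight spread stays near target_sd.
    dbeta = min(1.0 - beta, 0.1, target_sd / max(np.std(lr), 1e-12))
    m = np.max(dbeta * lr)                      # log-sum-exp for stability
    logZ_hat += m + np.log(np.mean(np.exp(dbeta * lr - m)))
    beta += dbeta
```

The on-the-fly choice of dbeta is what replaces a pre-set schedule: the step size shrinks automatically wherever consecutive tempered densities differ most.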

We cited Xie and Fan in the updated preprint and also included the above distinctions.

Postscript: for problems on which the stepping stone from a Laplace approximation works well, you might also try the following trick, which (for certain types of problems) can be extremely efficient and requires only samples from the posterior (Perrakis, Ntzoufras, and Tsionas 2013):

http://arxiv.org/abs/1311.0674

Might allow for a (computationally) quick improvement on the BIC for this problem too:

http://hoggresearch.blogspot.com.au/2014/01/sampling-in-galaxy-shape-space.html