A recent arXival presents a method for estimating the marginal likelihood (or ‘evidence’) based on the technique of normalising flows. Normalising flows (mentioned in a previous post on modelling the colour-magnitude diagram) are a model for density estimation in which a standard Normal distribution of some moderate dimension is mapped through a machine-learning-style invertible transform to give a highly flexible (but properly normalised!) density representation in the target space. The authors propose to estimate the marginal likelihood, given an existing set of samples from the posterior (e.g. imagined to have been recovered via MCMC), in two steps: (i) train a normalising flow on the posterior samples, and (ii) use the fitted flow density as the reference distribution in a bridge sampling estimator of the marginal likelihood. As in the case discussed in my last blog post (on random forests for parameter estimation), the astronomers again confuse the specifics with the general.
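For concreteness, here is a minimal sketch of the two-step scheme on a toy conjugate model where the evidence is known in closed form. A Normal fitted to the posterior draws stands in for the trained flow (the paper would use an actual flow density here), and the iterative Meng–Wong bridge estimator does the rest; everything below is my own illustrative construction, not the paper's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy conjugate model y_i ~ N(theta, sigma^2), theta ~ N(mu0, tau0^2),
# chosen because the evidence p(y) is available in closed form.
sigma, mu0, tau0 = 1.0, 0.0, 2.0
y = rng.normal(0.5, sigma, size=20)
n = y.size

# Exact posterior draws stand in for a trusted MCMC run.
post_var = 1.0 / (n / sigma**2 + 1.0 / tau0**2)
post_mean = post_var * (y.sum() / sigma**2 + mu0 / tau0**2)
theta_post = rng.normal(post_mean, np.sqrt(post_var), size=5000)

# Step (i): a Normal fitted to the posterior draws stands in for the
# trained normalising flow; the paper's q2 would be the flow density.
ref = stats.norm(theta_post.mean(), theta_post.std())
theta_ref = ref.rvs(size=5000, random_state=rng)

def log_q1(theta):
    """Unnormalised log posterior: log likelihood plus log prior."""
    ll = stats.norm.logpdf(y[:, None], theta, sigma).sum(axis=0)
    return ll + stats.norm.logpdf(theta, mu0, tau0)

# Step (ii): iterative (Meng & Wong) bridge estimator of log Z.
l_ref = log_q1(theta_ref) - ref.logpdf(theta_ref)     # reference draws
l_post = log_q1(theta_post) - ref.logpdf(theta_post)  # posterior draws
n1, n2 = theta_post.size, theta_ref.size
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
log_z = l_post.mean()  # crude starting value
for _ in range(50):
    num = np.exp(l_ref - log_z) / (s1 * np.exp(l_ref - log_z) + s2)
    den = 1.0 / (s1 * np.exp(l_post - log_z) + s2)
    log_z += np.log(num.mean()) - np.log(den.mean())

# Analytic evidence for comparison: y is jointly Normal under the prior,
# with Cov(y_i, y_j) = tau0^2 + sigma^2 * delta_ij.
cov = sigma**2 * np.eye(n) + tau0**2
log_z_true = stats.multivariate_normal.logpdf(y, mean=np.full(n, mu0), cov=cov)
print(f"bridge estimate {log_z:.3f} vs analytic {log_z_true:.3f}")
```

With a reference this close to the posterior the bridge estimate agrees with the analytic evidence to within Monte Carlo noise, which is exactly the regime the flow is meant to buy you in less friendly problems.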
First of all, for the normalising flow to be a useful device we need a good approximation to the posterior from a prior run of an MCMC-style algorithm: for high dimensional problems one may as well assume this means HMC. Of course, for a multimodal problem (that is not a toy problem for which we know the number of modes and can check that our HMC run has been successful) there is no guarantee that HMC will explore all the modes efficiently. Methods for marginal likelihood estimation like nested sampling or sequential Monte Carlo often start from a diffuse reference distribution (like the prior) to give the user a better chance of discovering all the modes, at the cost of lower efficiency in marginal likelihood estimation than if they started from a reference distribution closer to the posterior. All these methods can be run from a reference distribution other than the prior, so they could easily be used with the normalising flow as well … if the available posterior samples are trusted to be representative.
The question then is how much the normalising flow improves on alternative reference densities: e.g. a (mixture of) multivariate t-distribution(s) based on the posterior mode(s) [requiring only maximisation and Hessian construction], or the same mixed with a KDE if posterior samples are already available. Since normalising flows are designed specifically for this purpose I’d imagine they do improve things, but it would be interesting to understand by how much. Moreover, one shouldn’t discount the value of samples taken along the nested sampling path (or the thermodynamic path) whenever the reference distribution is fatter-tailed than the posterior itself, since these can be valuable when estimating posterior functionals (e.g. moments) and credible intervals.
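The Laplace-style alternative above is cheap enough to sketch in a few lines: maximise the log posterior, take an (approximate) Hessian at the mode, and centre a fatter-tailed multivariate t there. The 2-d correlated-Gaussian target below is a hypothetical stand-in for a real unnormalised posterior, and the BFGS inverse-Hessian approximation stands in for an explicit Hessian construction.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical unnormalised negative log posterior: a correlated
# 2-d Gaussian, standing in for a real single-mode target.
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])

def neg_log_post(theta):
    return 0.5 * theta @ np.linalg.solve(cov_true, theta)

# Maximisation step; BFGS (the default here) accumulates an approximation
# to the inverse Hessian, which at the mode approximates the posterior
# covariance (exact in this Gaussian toy case, up to BFGS error).
res = optimize.minimize(neg_log_post, x0=np.ones(2))
cov_hat = res.hess_inv

# Multivariate t reference centred on the mode: the shape matrix is
# rescaled so the t's covariance matches the Laplace covariance, with
# df = 4 supplying the fatter tails.
df = 4
t_ref = stats.multivariate_t(loc=res.x, shape=cov_hat * (df - 2) / df, df=df)
draws = t_ref.rvs(size=2000, random_state=np.random.default_rng(1))
```

Repeating the maximisation from scattered starting points and mixing one such t component per located mode gives the mixture reference described above; the KDE variant would simply replace (or be mixed with) `t_ref` when posterior samples are to hand.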
Finally, I was not impressed by the numerical examples presented: the rival techniques were all ‘shown to fail’, but no investigation is made into how they fail (is it the software implementation and/or a sub-optimal choice of tuning parameters?); no comparisons are made on problems for which benchmarks have been published elsewhere (i.e., how does the new method compare to an older method on a problem for which the older method has been shown to perform well? all examples are ‘adapted from’, rather than replicates of, earlier ones); and no code is provided for readers to check these kinds of details.