On path integral evidence (marginal likelihood) computations …

I noticed the latest paper by Kitching & Taylor, entitled Path-integral evidence, on astro-ph recently and decided it was high time for me to take a careful read through the full series (i.e. including TK2010 and KT2011). After parsing the astronomers’ statistical writing into my own ‘Ewan-compileable’ language I believe that the general scheme presented for (perhaps approximate) analytic marginalisation over nuisance parameters (whether they be parametric descriptions of instrumental systematics or non-parametric [i.e., here Gaussian process] ‘fields’/functions representing non-standard cosmologies) is (a) quite appealing for cases where the conditional likelihoods for nuisance parameters at given (non-nuisance) model parameters are close to Gaussian; and (b) shares an interesting convergence of philosophy with the Integrated Nested Laplace Approximation (INLA) scheme I’m so fond of.

If I understand the astronomers’ method correctly then the similarities and differences with INLA are as follows.
(1) Both schemes seek to make efficient (perhaps doubly-, at least singly-) approximate inference for the posterior of a model consisting of some (typically) very high dimensional set of nuisance parameters, \psi, coupled to a low dimensional (non-nuisance) model parameter vector, \theta, of primary interest. In INLA the nuisance parameters are typically the mesh nodes of the latent Gaussian Markov Random Field and the model parameters of interest are its corresponding hyper-parameters and possibly the coefficients of some linear predictors; while in the astronomers’ case the nuisance parameters are the finite projections at the observed datapoints of a continuous space GP with fixed covariance structure and the model parameters of interest are cosmological parameters describing the mean of the main signal component at those points.
[Don’t make too big a fuss over the fact that INLA uses a GMRF and the astronomers use a continuous GP; see the GP -> GMRF SPDE connection: Lindgren et al. (2011).  Also, forgive the astronomers the (to a statistical eye) ‘hand wavy’ treatment of integration over functional priors and cancelling of infinities.]
(2) Both schemes seek computational efficiency by introducing an approximation to the conditional probability in \psi space at each given value of \theta, i.e., \pi(\psi|\theta,D), such that exploration of the (typically) lower-dimensional \theta space can be reduced to (e.g.) direct, brute force exploration.
(3) Whereas INLA approximates \pi(\psi|\theta,D) via a Laplace approximation about its conditional mode, \psi^\ast|\theta, the astronomers make their approximation using a second-order Taylor expansion of the log conditional likelihood, L(D|\theta,\psi), about a single \psi_0 fixed for all \theta at the global posterior mode. (I’ve tried to illustrate this in the diagram below.)

[Diagram: Laplace approximation about the conditional mode \psi^\ast|\theta (INLA) versus a second-order Taylor expansion about the single global mode \psi_0 (the path-integral evidence scheme).]
(4) If the conditional probability in \psi space were exactly Gaussian then both schemes would be exact, but since this assumption rarely holds in practice neither will be exact. Also, for non-Gaussian likelihoods the astronomers’ solution risks returning a non-positive-definite covariance matrix, which breaks the scheme; but for Gaussian likelihoods this approach could be faster than INLA since it is only necessary to compute a single vector of first derivatives and a matrix of second derivatives at each \theta, rather than having to do this after first finding \psi^\ast|\theta. (A toy numerical sketch comparing the two approximations follows this list.)
– In the introduction to Rue et al. (2008) [*the* INLA paper] the authors seem to suggest that this type of scheme, which is attributed to Breslow & Clayton (1993) / Ainsworth & Dean (2006), is, from experience, too inaccurate for the non-Gaussian conditional likelihoods one tends to encounter in applied spatial statistics (particularly geostatistics/forestry/mining/epidemiology).
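To make point (3) concrete, here is a toy numerical sketch of the two approximations as I understand them, applied to a deliberately non-Gaussian conditional likelihood; the Poisson count model, the N(0,1) prior on \psi and all the numbers below are my own inventions for illustration only, not anything taken from the papers.

```python
# Toy 1-D comparison: Laplace approximation about the conditional mode
# psi*(theta) (INLA-style) versus a second-order Taylor expansion about a
# single global mode psi_0 (my reading of the astronomers' scheme), with a
# brute-force quadrature as the reference. Model: y_i ~ Poisson(exp(theta + psi)),
# psi ~ N(0, 1); everything here is invented purely for illustration.
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.integrate import simpson

rng = np.random.default_rng(1)
y = rng.poisson(lam=np.exp(0.5 + 0.3), size=20)   # 'data': truth theta = 0.5, psi = 0.3

def log_joint(theta, psi):
    # log conditional likelihood (up to the y! constant) plus the N(0, 1) prior on psi
    eta = theta + np.atleast_1d(psi)[:, None]
    ll = np.sum(y[None, :] * eta - np.exp(eta), axis=1) - 0.5 * np.atleast_1d(psi) ** 2
    return ll if np.ndim(psi) else ll[0]

def dpsi(theta, psi):
    # first and second derivatives of log_joint with respect to psi
    g = np.sum(y) - len(y) * np.exp(theta + psi) - psi
    h = -len(y) * np.exp(theta + psi) - 1.0
    return g, h

# global joint posterior mode, giving the fixed expansion point psi_0
psi0 = minimize(lambda p: -log_joint(p[0], p[1]), x0=[0.0, 0.0]).x[1]

thetas = np.linspace(-0.5, 1.5, 41)
psigrid = np.linspace(-6.0, 6.0, 2001)
exact, laplace, taylor = [], [], []
for th in thetas:
    # 'exact' log marginal over psi by brute-force quadrature
    lj = log_joint(th, psigrid)
    exact.append(lj.max() + np.log(simpson(np.exp(lj - lj.max()), x=psigrid)))
    # INLA-style: Laplace approximation about the conditional mode psi*(theta)
    pstar = minimize_scalar(lambda p: -log_joint(th, p)).x
    _, h = dpsi(th, pstar)
    laplace.append(log_joint(th, pstar) + 0.5 * np.log(2.0 * np.pi / -h))
    # expansion about the single global psi_0, then integrating the resulting Gaussian
    g, h = dpsi(th, psi0)
    taylor.append(log_joint(th, psi0) + g ** 2 / (2.0 * -h) + 0.5 * np.log(2.0 * np.pi / -h))

for name, v in [("quadrature", exact), ("Laplace (INLA-like)", laplace),
                ("global-mode Taylor", taylor)]:
    v = np.asarray(v)
    print(name, np.round(v - v.max(), 3)[::10])   # compare profile shapes over theta
```

I would expect the Laplace version to track the quadrature result more closely as \theta moves away from the joint mode, with all three coinciding in a purely Gaussian toy model.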

If my thoughts above are correct then I must say I am intrigued as to the performance of the astronomers’ algorithm for some of the models we investigate in malariology: in particular, although we have non-Gaussian response data we often transform it such that we can approximately use a Gaussian likelihood (e.g. via the empirical logit transform), because even with INLA this tends to give us better performance in Bayesian coverage and cross-validation diagnostics. Also, with codes taking advantage of autodiff, like TMB (as used to great effect in the Global Burden of Disease project), finding the global posterior mode and computing the necessary first and second derivatives should be super quick (especially if the field is reduced to a GMRF using the INLA mesh approximation). Of course, I’m not suggesting that it will be as accurate as INLA, but it might be competitive for first-order solutions in some BIG DATA or model-building/testing regimes … ??

????????? Needs more thought ….

Some feedback from friends:
– The same scheme *might* also pass under the umbrella of ‘the Laplace approximation’ in machine learning; e.g. Flaxman et al.
– The same scheme *might* also be able to be run in INLA by nominating strategy=”gaussian”.
But in both cases I have yet to satisfy myself that the implementation is the same as the astronomers’ one above, with expansion about a global \psi_0, and not just a Gaussian approximation about each \psi^\ast|\theta.
– Both schemes still need to fill in a second-derivative matrix once at each \theta (?), which is likely the most expensive operation in either case, finding the conditional posterior mode being much cheaper by comparison.


Nested sampling identities everywhere …

The nested sampling identity for marginal likelihood estimation can be understood in two steps: the first is to observe that the expectation value of a non-negative random variable is identical to the integral of its survival function from 0 up to its maximum value (proof is via ‘integration by parts’); the second is to flip the axes of the integration so that it runs from 1 to 0 (the range of the survival function). Having first learnt of this identity from the nested sampling literature it’s always interesting to me to see it pop up in quite unexpected places. Today’s example comes from a proof in Amanda Turner’s old masters thesis in which it is shown that if a sequence of probability measures on a metric space, P_n, converges to a target measure, P, under the Prohorov metric then it converges in distribution as well. The first step of the nested sampling identity is used here to transform the expectation of a bounded continuous function, \int f dP_n, into an integral of the probabilities of closed sets, \int_0^{||f||} P_n(f \ge t) dt, to which the definition of convergence under the Prohorov metric can be applied. I.e.:

[Image: the relevant excerpt from Turner’s thesis, applying \int f dP_n = \int_0^{||f||} P_n(f \ge t) dt.]

Of course, such applications of this identity well & truly pre-date the nested sampling algorithm (and they appear in all the classic stats textbooks: Billingsley, Feller, etc.), but for me they will always be called the nested sampling identity.
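As a sanity check on both steps of the identity, here is a minimal numerical sketch; the Beta(2, 5) variable is just a convenient bounded, non-negative example of my own choosing.

```python
# Minimal check of the identity behind nested sampling. Step one: for a
# non-negative random variable X with maximum value x_max,
#   E[X] = int_0^{x_max} P(X >= t) dt.
# Step two ('flipping the axes'): integrate the inverse survival function over
# its range [0, 1] instead. Beta(2, 5) is used purely as a convenient example.
import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.beta(2, 5)                 # supported on [0, 1], mean 2/7
step_one, _ = quad(dist.sf, 0, 1)       # integral of the survival function
step_two, _ = quad(dist.isf, 0, 1)      # axes flipped: inverse survival function
print(step_one, step_two, dist.mean())  # all three ~ 0.2857
```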


Bayesian Statistics as a New* Tool for Spectral Analysis …

*warning: your understanding of the word ‘new’ and that of the authors may differ!

I noticed this paper on astro-ph today by J-M. Mugnes & C. Robert, the latter of whom I initially thought might be Xian, though it rapidly became apparent on reading the text that these two Bayesians could not be one and the same. I have to say it’s always surprising to me how each tiny subfield of astronomy can support multiple ‘first’ applications of Bayesian techniques, each time with a non-trivial ‘introduction to Bayesian statistics’ section. In describing their ‘new’ approach Mugnes & Robert ignore all past work on Bayesian fitting techniques for spectral lines for general astronomical purposes, noting only a single Bayesian study in the specific field of ‘stellar astronomy with high resolution spectra’, about which they write:

“More recently, Schoenrich & Bergemann (2014) developed a method which allow the combined use of photometric, spectroscopic, and astrometry informations for the determination of stellar parameters and metallicities. Nevertheless, none of these studies were applied on massive stars, nor did they performed a spectral analysis involving more than three parameters at the same time (the spectroscopic part of the work of Schoenrich & Bergemann 2014 focused only on the effective temperature, surface gravity, and global metallicity).”

So, how many parameters do you think are now analysed by Mugnes & Robert? Yeah, you guessed it, a grand total of four parameters! And in fact, although the target of the spectroscopic likelihood in Schoenrich & Bergemann is three parameters, they do include additional photometric data with its own parameters and nuisance parameters, so the full dimensionality of their posterior is also greater than three.

Accepting that many astronomers will probably never feel the need to read literature outside of their tiny sub-field, no matter how relevant, I suppose readers of the blog would prefer some more interesting insights. Well, there are some here: it turns out that both Mugnes & Robert and Schoenrich & Bergemann use quite unusual likelihood functions, which are better described as ad hoc goodness-of-fit criteria than as representations of a sampling distribution approximating the observational process.

In the former the reduced, background-subtracted spectral product is assumed to have iid Gaussian noise in each pixel, with a specific group of pixels a priori ascribed to each spectral line. In addition to the assumed observational noise there’s also assumed to be an iid Gaussian ‘model noise’ to account for model imperfections, which is, bizarrely for a Bayesian analysis, not assigned a prior and marginalised but is set empirically from the RMS under the best-fitting template. Another non-standard feature of Mugnes & Robert’s fit is that since they “want to treat all available lines equally” they interpolate each line (both broad and narrow) to the same number of bins. So ‘bin’ here shouldn’t even really be thought of as representing a pixel or group of pixels.

Schoenrich & Bergemann, on the other hand, introduce a type of ad hockery that I’ve not seen before: limiting the allowed precision of the posterior itself!

“To avoid over-confident estimates, we demand that either the temperature uncertainty σTeff > 80 K or the metallicity uncertainty σ[Fe/H] > 0.08 dex, and otherwise flatten the PDF by multiplying the χ2 distribution with a fixed factor until the condition is met.”

And that after criticising two earlier studies (Burnett & Binney 2010; Casagrande et al. 2011) for “simplifications of the observational likelihoods” …

I wonder if — no, not really, I know that — all of this could be rolled up into one principled approach with a single, simple idea: hierarchical Bayesian modelling.
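For what it’s worth, here is a minimal sketch of the sort of hierarchical treatment I have in mind for the ‘model noise’ term: give the model-error variance its own prior and marginalise over it, rather than fixing it from the RMS under the best-fitting template. The fake spectrum, the template grid, the half-Normal prior and all the numbers are invented for illustration; this is not a transcription of either paper’s setup.

```python
# A minimal sketch of the hierarchical alternative: treat the 'model noise'
# variance as an unknown with its own prior and marginalise over it, rather
# than fixing it empirically from the RMS under the best-fitting template.
# The fake 'observed spectrum', template grid and priors are all invented.
import numpy as np

rng = np.random.default_rng(0)
npix = 200
x = np.linspace(0, 6, npix)
obs = np.sin(x) + rng.normal(0, 0.1, npix)        # fake spectrum; sigma_obs = 0.1
sigma_obs = 0.1
amplitudes = np.linspace(0.5, 1.5, 41)            # stand-in 'physical' parameter grid
templates = [a * np.sin(x) for a in amplitudes]
log_sigma_mod_grid = np.linspace(-6.0, 0.0, 61)   # grid over log model-noise sd

def log_post(template, log_sigma_mod):
    sigma_mod = np.exp(log_sigma_mod)
    var = sigma_obs ** 2 + sigma_mod ** 2         # noise variances add in quadrature
    loglike = -0.5 * np.sum((obs - template) ** 2 / var + np.log(2.0 * np.pi * var))
    # half-Normal(0, 0.5) prior on sigma_mod, with the Jacobian for the log parametrisation
    logprior = -0.5 * (sigma_mod / 0.5) ** 2 + log_sigma_mod
    return loglike + logprior

# joint posterior on the grid, then marginalise over the model-noise sd
lp = np.array([[log_post(t, ls) for ls in log_sigma_mod_grid] for t in templates])
post = np.exp(lp - lp.max())
marg_amplitude = post.sum(axis=1) / post.sum()
print("posterior mode of the amplitude:", amplitudes[np.argmax(marg_amplitude)])
```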


On (un)falsifiability of paradigms with Bayesian model selection …

I noticed an unusual contribution on the philosophy of science with Bayesian model selection by Gubitosi et al. on astro-ph the other day, in which some rather bold claims are made, e.g.

“By considering toy models we illustrate how unfalsifiable models and paradigms are always favoured by the Bayes factor.”

Despite the authors making a number of sniping comments about the sociology of “proof of inflation” claims in astronomy, their meta-reflections did not reach a point of self-awareness at which they were able to escape my own sociological observation: the bolder the claims made by astronomers about Bayes theorem, the narrower their reading of the past literature on the subject. Indeed, in this manuscript there are no references at all to any previous work on the role of Bayes factors in scientific decision making, even from within the astronomical canon (leaving aside the history of statistics); put more bluntly, it seems the authors have been spending a lot of research time on something they could have looked up online before starting.

Perhaps the most relevant past work on this topic in astronomy is the paper by Jenkins & Peacock (2011) called “The power of Bayesian evidence in astronomy” in which the frequentist properties (type I/II error rates) of model selection via the Bayes factor are examined for a number of examples. Computing the power of the Bayes factor under a given base hypothesis is directly analogous to computing the ‘falsifiability’ score suggested by Gubitosi et al.; only the power has a precise statistical interpretation, whereas the ‘falsifiability’ is somewhat nebulous (especially when the “reasonable range of data” over which the induced distribution of the evidence is to be computed is not explicitly identified as the null hypothesis).
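For concreteness, here is a rough illustration of the kind of calibration exercise Jenkins & Peacock advocate, i.e. simulating data under each hypothesis and tabulating how often a Bayes-factor threshold selects the wrong model; the toy Normal-mean setup below is mine, not theirs.

```python
# Frequentist calibration of a Bayes-factor threshold in a toy model:
# M0: x_i ~ N(0, 1);  M1: x_i ~ N(mu, 1) with mu ~ N(0, tau^2).
# The Bayes factor depends on the data only through the sample mean, so we
# simulate sample means under each model and tabulate the error rates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, tau, log_threshold = 25, 1.0, np.log(10.0)   # 'strong evidence' threshold
nsim = 20000

def log_bf10(xbar):
    # Bayes factor for M1 vs M0 given n iid N(mu, 1) observations with mean xbar
    return (stats.norm.logpdf(xbar, 0, np.sqrt(tau ** 2 + 1 / n))
            - stats.norm.logpdf(xbar, 0, np.sqrt(1 / n)))

xbar_m0 = rng.normal(0, 1 / np.sqrt(n), nsim)    # data generated under M0
mu_m1 = rng.normal(0, tau, nsim)                 # mu drawn from the M1 prior
xbar_m1 = rng.normal(mu_m1, 1 / np.sqrt(n))      # data generated under M1

type_I = np.mean(log_bf10(xbar_m0) > log_threshold)   # P(choose M1 | M0 true)
power = np.mean(log_bf10(xbar_m1) > log_threshold)    # P(choose M1 | M1 true)
print(f"type I error rate ~ {type_I:.3f}, power ~ {power:.3f}")
```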

“These results are sensible from the point of view of probability theory: “if you want to have a high probability of winning, then hedge your bets”. However, science is not about playing the lottery and winning, but falsifiability instead, that is, about winning given that you have bore the full brunt of potential loss, by taking full chances of not winning a priori.”

The above discussion of betting is similarly naive, ignoring whole swathes of papers on mathematical decision theory, loss functions and expected utility, and experimental design. I just can’t even. Depending on one’s choice of loss function one could write the exact same paper from the opposite point of view: that wacky theories making very precise claims about a future observation are overly rewarded relative to the prevailing paradigm if the cost of producing the wacky theories is small and the loss of face in having them fail is also small relative to the reward of a lucky guess.

Some more specific points:

“If model M1 gets its prediction wrong it is penalised exponentially. In contrast, model M2 is penalised for not making a prediction merely as a power-law. We emphasise that this difference in penalisations is not related to the fact that one model has a Gaussian prior while the other has a flat prior. As we will see in Section IIIC, the same behaviour is observed for an intermediate model, and thus we conclude that this penalisation behaviour is the same regardless the particular form of the priors.”

(1) The authors seem confused about the difference between prior and likelihood. In the toy example presented one model has a delta function (dogmatic) prior and the other has a flat prior; neither has a Gaussian prior. The Gaussianity and exponential penalty emerge from the Gaussian likelihood function, so it is no surprise that it takes the same form in their intermediate model, which simply adds Gaussian blurring to the original delta function.
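To spell out my reading of the toy setup (notation mine): for a single datum D with Gaussian likelihood D|\theta \sim N(\theta,\sigma^2), the ‘sharp’ model M_1 with prior \delta(\theta-\theta_0) has evidence

Z_1 = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(D-\theta_0)^2}{2\sigma^2}\right],

while the ‘vague’ model M_2 with a flat prior of width b-a \gg \sigma covering D has evidence

Z_2 = \frac{1}{b-a}\int_a^b N(D;\theta,\sigma^2)\,d\theta \approx \frac{1}{b-a}.

The exponential penalty on M_1’s prediction error and the power-law penalty on M_2’s prior width both arise from the Gaussian likelihood meeting the respective priors; neither model has a Gaussian prior.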

“The denominator P(D) normalises the posterior distribution … “

(2) Interestingly, although the authors emphasise in their introduction that the evidence plays the role of normalising constant for the likelihood-weighted prior (aka the unnormalised posterior), in the toy example given the delta function prior is in fact a rare example of a measure for which this normalisation role is not fulfilled. To understand why from a technical point of view one needs to look at the measure-theoretic definition of conditional probability (e.g. Rosenthal’s “A First Look at Rigorous Probability Theory”), but informally it is simply because the posterior is identical to the prior under this model: theta is fixed and assumed exactly known.
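In symbols (again my notation): \pi(\theta|D) \propto N(D;\theta,\sigma^2)\,\delta(\theta-\theta_0) = N(D;\theta_0,\sigma^2)\,\delta(\theta-\theta_0) \propto \delta(\theta-\theta_0), so the posterior coincides with the prior whatever value the would-be normalising constant takes.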


Edgeworth expansions …

It’s interesting how the investigation of a given topic in statistics (in this example the topic will be the asymptotic behaviour of confidence intervals) can lead to the discovery that a key device for studying the said topic (in this example the device will be Edgeworth expansions) is also a key device for studying some unrelated statistical phenomenon (in this example the phenomenon will be non-Gaussianity in the cosmic microwave background). In brief, Edgeworth expansions provide an asymptotic series approximation to a given distribution function in terms composed of polynomial combinations of cumulants multiplied by derivatives of the Normal distribution of increasing order. For statistical purposes the Edgeworth series provides a powerful approximation to the distribution of a standardised sum of iid random variables, illuminating the rate at which it converges to the standard Normal as per the central limit theorem (e.g. Hall 1992); while in cosmology the Edgeworth series provides a metric against which to evaluate or represent non-Gaussianity in the CMB (e.g. Juszkiewicz et al. 1995).
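To make the statistical use case concrete, here is a quick sketch of the first-order Edgeworth correction to the CDF of a standardised sum of iid Exponential(1) variables, compared against Monte Carlo and the plain Normal limit; the choice of summand and the truncation at the skewness term are my own, purely for illustration.

```python
# First-order Edgeworth correction to the CDF of a standardised sum of n iid
# Exponential(1) variables, against Monte Carlo and the plain Normal limit.
# The Exponential summands and the truncation at the skewness term are my own
# choices, purely for illustration.
import numpy as np
from scipy import stats

n, nsim = 10, 200_000
rng = np.random.default_rng(3)
sums = rng.exponential(1.0, size=(nsim, n)).sum(axis=1)
z = (sums - n) / np.sqrt(n)          # standardised sum (unit-mean, unit-sd summands)

gamma1 = 2.0                         # skewness of Exponential(1)
x = np.linspace(-2.0, 2.0, 9)
clt_cdf = stats.norm.cdf(x)
# F_n(x) ~ Phi(x) - phi(x) * gamma1 / (6 sqrt(n)) * (x^2 - 1)
edgeworth_cdf = clt_cdf - stats.norm.pdf(x) * gamma1 / (6.0 * np.sqrt(n)) * (x ** 2 - 1.0)
mc_cdf = np.array([(z <= xi).mean() for xi in x])

for xi, mc, c, e in zip(x, mc_cdf, clt_cdf, edgeworth_cdf):
    print(f"x = {xi:+.1f}   MC = {mc:.4f}   CLT = {c:.4f}   Edgeworth = {e:.4f}")
```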

One of the first references I found online to help me get a handle on the Edgeworth series was the A&A paper “Expansions for nearly Gaussian distributions” by Blinnikov & Moessner (1998).  This paper has a nice description of the differences between three related expansions (Gram-Charlier, Gauss-Hermite and Edgeworth) and in particular a clear derivation of the Edgeworth expansion to arbitrary order (for a univariate random variable for which the density exists w.r.t. Lebesgue measure).  I say ‘clear’ but in fact it seems to me the authors’ proof falls over at the last hurdle.

In their Eqn 35 they define S_n = \kappa_n / \sigma^{2n-2}, which allows the authors to write Eqn 34 in such a way that it looks like it contains a series of increasing positive powers of \sigma … after which they take a Taylor expansion of the exponential of this power series about \sigma=0. This bothers me since the original characteristic function in Eqn 33 is ill-defined at \sigma = 0. It also seems they then claim that S_{r+2} \sigma^r = 0 when \sigma=0 (to get the 1 on the right hand side of Eqn 36), which ignores the dependence of S_{r+2} on \sigma from its definition, right? Likewise they treat S_n as no longer a function of \sigma when differentiating in Eqn 37. None of this sits well with me.

Wouldn’t a correct derivation be to forget this expansion in terms of \sigma and just use directly the definition of the exponential (\exp(X) = 1 + X + X^2/2! + \ldots) to expand Eqn 36 as a composition of power series from which the terms of each order can be grouped (the same method as used in their Appendix A)?
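For what it’s worth, here is a quick sympy sketch of the derivation I have in mind: write the logarithm of the non-Gaussian correction to the characteristic function as a power series in a bookkeeping parameter \epsilon (standing in for the small quantity, with scaled cumulants \lambda_r treated as order-one constants; this is my own notation, not a transcription of Blinnikov & Moessner’s equations), exponentiate using the series definition of \exp directly, and collect terms order by order.

```python
# A sympy sketch of the 'just expand the exponential and group by order'
# derivation; epsilon is a bookkeeping small parameter and lambda3, lambda4,
# lambda5 are scaled cumulants treated as order-one constants. My notation,
# not a transcription of Blinnikov & Moessner's equations.
import sympy as sp

eps, t = sp.symbols('epsilon t')
lam3, lam4, lam5 = sp.symbols('lambda3 lambda4 lambda5')
it = sp.I * t

# log of the non-Gaussian correction to the characteristic function,
# truncated at third order in epsilon
log_corr = (lam3 * eps * it**3 / sp.factorial(3)
            + lam4 * eps**2 * it**4 / sp.factorial(4)
            + lam5 * eps**3 * it**5 / sp.factorial(5))

# exponentiate via exp(X) = 1 + X + X^2/2! + ... and group terms order by order
corr = sp.series(sp.exp(log_corr), eps, 0, 4).removeO()
print(sp.collect(sp.expand(corr), eps))
```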


The KL divergence as test statistic …

I read this paper by Ben-David et al. on astro-ph the other day and, despite having some interest in the KL divergence (since KL estimation is similar to marginal likelihood estimation and the KL—or its symmetrised version, the J divergence—is a good rule of thumb for ranking instrumental densities in Monte Carlo computations), I was fairly unimpressed with this particular case study. Basically, the authors propose to use the KL divergence as a test statistic for non-Gaussianity in the CMB as an alternative to the K-S test. The motivation suggested by the authors is that “unlike the K-S test, the KL divergence is non-local”; which is obviously technically incorrect, since the ECDF used in the K-S test statistic depends (like the sample median, for instance) on the full sample, though one can level the criticism that it gives relatively less weight to the tails of the distribution than some other possible test statistics (e.g. the Anderson-Darling).

At the end of the day what bothered me most about this paper was the consistent lack of perspective—that is, a failure to recognise that both the K-S test and the proposed test based on the KL divergence are crappy solutions to the problem at hand since (1) the null distribution is not known analytically, though it can be simulated from to produce an approximate PDF or CDF; (2) the pixel values from which the K-S and KL statistics are compiled do not represent iid draws, rather (at least under the null) they’re a realisation from a Gaussian process [this point is noted in the paper but no thought is given to finding a more suitable test statistic for non-iid data]; and (3) all discussion is in terms of discrete distributions (which is non-standard for the K-S test, and usually unnecessary for the KL: except when you have to approximate the null distribution through simulation), meaning the continuous pixel values (temperatures) are ultimately binned here (which raises questions as to the sensitivity to the binning scheme).
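As a sketch of the kind of calibration I would want to see for points (1) and (2), one can simply simulate the null distribution of the binned KL statistic under a correlated Gaussian null rather than appealing to iid results; the covariance, binning and sample sizes below are all invented for illustration.

```python
# A sketch of the calibration I would want to see: simulate the null
# distribution of the binned KL statistic under a correlated Gaussian null
# (standing in for the CMB map), rather than appealing to iid results.
# The covariance, binning and sample sizes are all invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
npix, nsim, nbins = 500, 2000, 30
idx = np.arange(npix)
# toy stationary (squared-exponential) pixel covariance with unit variance
cov = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 5.0) ** 2) + 1e-6 * np.eye(npix)
chol = np.linalg.cholesky(cov)
edges = np.linspace(-4.0, 4.0, nbins + 1)
q = np.diff(stats.norm.cdf(edges))               # reference Q: binned standard Normal

def binned_kl(x):
    # KL(P_hat || Q) from the empirical bin proportions of x; empty bins of
    # P_hat contribute nothing (0 log 0 = 0), and Q is analytic here so the
    # absolute-continuity issue mentioned below doesn't bite in this direction
    p, _ = np.histogram(x, bins=edges)
    p = p / p.sum()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

null_kl = np.array([binned_kl(chol @ rng.standard_normal(npix)) for _ in range(nsim)])
observed = binned_kl(chol @ rng.standard_normal(npix))    # stand-in for the 'data' map
print("p-value from the simulated null ~", np.mean(null_kl >= observed))
```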

There were also a number of technical points that were poorly expressed or misconveyed, some of which can be seen in my margin notes in the sample pic below. For instance, the required absolute continuity of P w.r.t. Q (the symbol <<) is missed in both the review and the experiment with the discrete Gaussian example, where in fact the expectation for \mathrm{KL}(P,\hat{Q}) is theoretically infinite for all N; though it rears its head in the Planck data analysis when zero bins have to be eliminated. Likewise, the cumulative distribution and its empirical counterpart are normalised by definition. The statement “Unlike the case of vectors (where the scalar product provides a standard measure), there is no generic “best” measure for quantifying the similarity of two distributions.” is also funny since in fact (1) there are many choices for a metric on a vector space, and (2) discrete probability distributions (their P_i and Q_i as densities on the integers) are in fact just normalised vectors, so there’s no reason why the inner product couldn’t be used as a test statistic too.

[Image: my margin notes on a sample page of the paper.]

Those wary of multiple hypothesis testing (e.g. geneticists) will be hearing alarm bells upon reading the closing thoughts of the paper: “However, the non-locality of the KL divergence and its sensitivity to the tails of the distributions still suggest that it is a valuable complement to the KS test and might be a useful alternative. Indeed, one should utilize a variety of methods and tests to identify possible contamination of the cosmological product. Obviously, any suggestion of an anomalous result would indicate the need for more sophisticated analyses to assess the quality of the CMB maps.”


Training a cosmological likelihood emulator …

I had a read through this recent arXival by Aslanyan et al. presenting their ‘learn-as-you-go’ algorithm for speeding up posterior sampling via emulation of the cosmological likelihood function. As the authors note, emulation is now a common strategy for dealing with the long compute times of the exact likelihood—long being relative to the need to make many evaluations (~tens of thousands) during an MCMC run. One common strategy is to use a Gaussian Process model as a smooth interpolator for the likelihood (or alternatively an artificial neural network, which is more-or-less the same thing: a flexible model for learning from a set of training data, with the output function regularised in some sensible way). In the present work Aslanyan et al. adopt a ‘simpler’ (presumably faster) polynomial interpolation method but aim for a competitive advantage in the decision-making process for building the training set: that is, they aim to build it adaptively with respect to an error minimisation goal (as opposed to, e.g., building it step-wise in large, clunky batches via ‘history matching’; GPs being difficult to fit over large dynamical ranges). I think this is a worthy ambition, although I couldn’t satisfy myself that the error modelling approach in the current version is correct.

In the ‘learn-as-you-go’ scheme the posterior density at each sampled point is assigned a local error estimate via cross-validation, which is then adjusted to a global error estimate for the marginal posterior density in each parameter via some calculations [to follow the working use \log(1+x) \approx x in both directions] and CLT arguments (which I disagree with) presented in Section 5. My first problem is that I don’t think that errors in densities are a sensible target: what’s important are the errors in either some specific posterior functionals (like the mean and variance-covariance matrix) or perhaps the credible intervals, or more generally the errors over some large class of posterior functionals. Of course, this is a fiendishly difficult analysis problem: see, for example, the working in Delyon & Portier where a functional CLT is proved for a similar problem (importance sampling where the proposal density, though known precisely, is instead estimated via a leave-one-out KDE approach, to achieve theoretically a faster-than-root-n convergence rate). My point is not that I expect the astronomers to produce such a proof, but that it is important to recognise that posterior functionals are ultimately the target of inference and that the local error in density is irrelevant when one’s calculations will ultimately involve the single derived instance of the global density.
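Here is a toy illustration of the functional-versus-density point: perturb an emulated log-likelihood within an assumed (cross-validation style) local error band and look at the induced spread in a posterior functional, here the posterior mean. The quadratic ‘true’ log-likelihood and the emulator error model are both invented for illustration; this is not the authors’ error model.

```python
# A toy illustration of the point about posterior functionals: perturb an
# emulated log-likelihood within an assumed local error band and look at the
# induced spread in a posterior mean, rather than at pointwise errors in the
# density. The quadratic 'true' log-likelihood and the error model are invented.
import numpy as np

rng = np.random.default_rng(11)
theta = np.linspace(-3.0, 3.0, 301)
true_loglike = -0.5 * theta ** 2             # stand-in for the exact log-likelihood
emulator_sd = 0.05 * (1.0 + np.abs(theta))   # assumed local error in the emulated log-likelihood

def posterior_mean(loglike):
    w = np.exp(loglike - loglike.max())      # flat prior over the grid
    return np.sum(theta * w) / np.sum(w)

# draw many plausible emulated surfaces consistent with the local error estimates;
# independent errors are assumed here, though in reality they would be strongly
# correlated (which is exactly the worry about the Lyapunov CLT argument below)
means = np.array([posterior_mean(true_loglike + emulator_sd * rng.standard_normal(theta.size))
                  for _ in range(2000)])
print("posterior mean with the exact log-likelihood:", round(posterior_mean(true_loglike), 4))
print("spread in the posterior mean induced by emulation error:", round(means.std(), 4))
```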

{In the same spirit I would point out that the Lyapunov CLT invoked in the present work is only valid for independent random variables, whereas errors in the emulation problem are mutually dependent (governed by the non-local discrepancy between the true likelihood and the interpolation polynomial).}
