Interesting paper by Harrison et al. on astro ph today under the title, “*Validation of Bayesian posterior distributions using a multidimensional Kolmogorov-Smirnov test*“. The paper contains two proposed innovations: (1) use of the cumulative probability mass below a given likelihood as mapping for densities from R^n to [0,1] for the purpose of applying the K-S test in the multi-dimensional setting; and (2) a simulation-based approach using this test for the grand purpose: “validation of Bayesian posterior distributions of any dimensionality”. Indeed with regard to (2) the authors claim: “**This new approach goes beyond validating software implementations; it provides a sensitive test for all assumptions, explicit or implicit, that underlie the inference. In particular, the method assesses whether the inferred posterior distribution is a truthful representation of the actual constraints on the model parameters.**”

The proposed mapping of part (1) seems fine under some restrictions on the null hypothesis probability density as the authors discuss: e.g. no ‘likelihood plateaus’. Interestingly, this transformation is conceptually identical to that of nested sampling but this similarity isn’t discussed in the paper, despite one of the authors being a nested sampling expert.

Where I have a hard time agreeing with the authors is with regard to part (2). Straight out of the box I should note that I agree with Brendon‘s assertion that the term “validation” be put on the banned list for Bayesian papers, since it suggests the ability to make ‘absolute’ decisions when in fact the real-world Bayesian inference process should be one of continual model refinement with posterior predictive checks (+ experimental design -> experiment -> posterior update -> etc.). Nevertheless, ideas of posterior and/or model validation routinely crop up in certain applied sciences (e.g. this example from weather forecasting, or the DIP study [which I’ve blogged about previously] by Dorn et al.), so there is clearly a market for them.

The particular situation envisaged by Harrison et al. was not entirely clear to me, but via offline discussions I now think I understand. Basically, you have a ‘realistic’ model for the data generating process, which you believe in, and from which you can (easily/quickly) simulate mock datasets from; also, its likelihood function is tractable but you’d rather not go to the trouble of coding up a routine to do posterior inference for this ‘realistic’ model. Instead, you have a ‘simplified’ model for the data generating process with some key parameters matching those of the ‘realistic’ model, and you don’t mind coding up this ‘simplified’ model for posterior inference. Sooooo … now you do a “validation test” to see if the posteriors for the key parameters of interest you will recover from the ‘simplified’ model are acceptably close to what you would have got out from using the ‘realistic’ model. Got it?

*Now, the way the authors propose to do this is to simulate lots of mock datasets with known input parameters from the ‘realistic’ model and for each one fit the ‘simplified’ model and record how much posterior mass lies below the likelihood contour at which the true input parameters lie. The proposal is that if the collection of these posterior mass scores passes the K-S test against the uniform distribution then we have “validated” the ‘simplified’ model for use in place of the ‘realistic’ model.*

In my impression the only *sure-fire* way to make this process work is to know for sure that the ‘realistic’ model is the true model of the data generating process and to know exactly its true input parameters; which is clearly not a sensible scenario. We’d still have a good case for this method if the ‘realistic’ model had no unknown parameters except those in common with (and corresponding to those of) the simplified model, such that we could use our posterior from the ‘simplified’ model to refine our choice of input parameters for the “validation” algorithm. However, if the ‘realistic’ model contains some extra unknown parameter then we can try sampling from our prior but as my example below shows the testing strategy may fail miserably.

**Illustrative example
**Suppose our ‘realistic’ model is a Student’s t with df = 1 and unknown median, our ‘simplified’ model is a Standard Normal with unknown mean, and suppose our dataset is a collection of 50 draws from the ‘realistic’ model. Now, if we have an improper uniform prior on the mean then the posterior is another Normal with mean equal to the sample mean and standard deviation equal to the sample standard deviation divided by root 50. If we simulate 1000 mock datasets with true median’s set equal to draws from this posterior, then we can compute the ‘simplified’ model posterior from each mock dataset and use the Normal distribution function to find the mass cumulant corresponding to the input median. Testing this collection of mass cumulants against the uniform distribution via the K-S test gives a p-value near zero, allowing us to declare that the ‘simplified’ model is not appropriate for use in place of the ‘realistic’ one here. As one would hope in this case!

However, if we didn’t know that df=1 for the ‘realistic’ model but instead had a diffuse prior with a range of df in [1,200] and most of our posterior mass skewed towards the high end of this scale, then a repeat of the above procedure drawing our mock datasets from Student t’s with df’s drawn from our prior gives us the opposite conclusion: that we can use the ‘simplified’ model in place of the ‘realistic’ one! Sure, the prior is crummy in this example but the point is that unless we know a lot about the true data generating model the Harrison et al. method can easily break down. I.e., in contrast to their claims of providing **“a sensitive test for all assumptions, explicit or implicit, that underlie the inference”**.

Of course, ABC can save the day here: if we did a preliminary round of ABC to refine our prior for the unknown df against the actual observed data then we would find that the df is probably a lot closer to 1 than 200; using this proposal for our mock simulations gives a much better chance of rejecting the ‘simplified’ model in the “validation” process. Which all goes back to my first thought when I saw that the authors described the scenario of trying to fit a model for which one couldn’t (be bothered to) code up a posterior exploration algorithm for, but could easily simulate data from: Approximate Bayesian Computation.

An R code to demonstrate this example is pasted below.