An NS p-coin?

Update (27/9): I’ve slightly revised my thinking on the best use for this approach, from a p-coin for the posterior probability of two models to a pseudo-marginal MCMC correction to the Laplace approximation; the wording below has been revised accordingly.


An interesting question that occurred to me some time ago but which I haven’t had the motivation to carry further until now is the following, prompted by a neat paper from Krzysztof Łatuszyński on the Bernoulli factory (see also Pierre Jacob’s work for some background): namely,

“Can the upper and lower Riemann sums of nested sampling be used to approximate a p-coin for pseudo-marginal MCMC, where the p here is the Laplace approximation to a marginal likelihood versus the true marginal likelihood, p = \frac{Z_\mathrm{LA}}{Z_\mathrm{true}+Z_\mathrm{LA}}?”.

The motivation here lies in applications to approximate posterior simulation with unbiased estimators rather than exact accept-reject calculations (à la pseudo-marginal MCMC), and in particular in the case of Bayesian variable selection with many more models than observations (e.g. Lamnisos et al.). In this sense, given an initial approximation to the model space posterior based on Laplace approximations to the marginal likelihood of each candidate model, I’d like a p-coin that I could flip inside a pseudo-marginal MCMC algorithm run over the top of that initial estimate to ‘de-bias’ the Laplace approximation. And moreover I’d like this p-coin to be computable from typically just a few noisy runs of nested sampling with only a few live particles, instead of having to run an expensive nested sampling computation with many live particles to get a close approximation of each model’s marginal likelihood.

One way to do this would be to draw k from a negative binomial distribution using the above specified p-coin (i.e. k is the number of flips up to and including the first success, having mean 1/p, so that \mathcal{E}(k\,Z_\mathrm{LA}) = Z_\mathrm{LA}+Z_\mathrm{true}) and plug k \times Z_\mathrm{LA} into the MCMC acceptance formula to produce a chain consistently targeting \pi_i^{c} \propto \{Z_\mathrm{LA}+Z_\mathrm{true}\}_i (for model index i). From (an empirically normalised) \pi_i^c one could subtract off (a normalised) \pi_i^\mathrm{LA} \propto \{Z_\mathrm{LA}\}_i for an estimate of \pi^\mathrm{true}_i, albeit an ugly one with negative probabilities assigned to some models. Or perhaps one might prefer to compare \pi_i^c against \pi_i^\mathrm{LA} using the KL divergence or other (semi-)metric as a diagnostic for where the Laplace approximation of the posterior may be inadequate. In any case, if this further information about the true posterior could be obtained cheaply it would be nice to have!
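
To make this concrete, here is a minimal sketch of how such a coin would be used, assuming just two candidate models with equal prior probabilities, a hypothetical flip.p.coin(i) standing in for the nested sampling coin of model i constructed below, and a vector z.LA of stored Laplace estimates; the draw.k and pm.step helpers are illustrative only, not part of the script at the end of the post.

# Illustrative sketch only: one pseudo-marginal Metropolis step on the model index,
# targeting pi^c_i propto Z_LA,i + Z_true,i.

# k ~ Geometric(p_i): number of flips up to and including the first head, so that
# E[k] = 1/p_i and E[k * z.LA[i]] = Z_LA,i + Z_true,i (an unbiased estimate).
draw.k <- function(i, flip.p.coin) {
  k <- 1
  while (flip.p.coin(i) == 0) k <- k + 1
  k
}

pm.step <- function(i.current, zhat.current, z.LA, flip.p.coin) {
  i.prop <- 3 - i.current                     # with two models, propose the other one
  zhat.prop <- draw.k(i.prop, flip.p.coin) * z.LA[i.prop]
  if (runif(1) < zhat.prop/zhat.current) {    # reuse zhat.current: the pseudo-marginal requirement
    list(i = i.prop, zhat = zhat.prop)
  } else {
    list(i = i.current, zhat = zhat.current)
  }
}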

If my reading of Łatuszyński is correct then I believe this can be done with his Algorithm 4 as described below …

To apply Algorithm 4 we need to be able to generate a sequence of random lower and upper bounds, L_n and U_n, converging to the true p. I propose to construct these from the lower and upper sums of the nested sampling integral (as illustrated in the diagram below), with successively more live particles, N(n), and a stopping point, c(n), chosen so as to explore prior mass increasingly close to the posterior mode as n increases. Given that the maximum likelihood value, L_\mathrm{max}, is already available for each model from the existing Laplace approximation, I will use this in the upper sum; for practical purposes imagine it recovered from a downhill gradient search (which only needs to be done once and then stored for further p-coin flips).

[Diagram: lower and upper Riemann sums of the nested sampling integral.]

If Z_{\mathrm{lower}}(n) and Z_{\mathrm{upper}}(n) are the nested sampling lower- and upper-sum estimates of the marginal likelihood at a given n for the model targeted by the current MCMC proposal, then I would suggest L_n = \frac{Z_{\mathrm{LA}}}{Z_{\mathrm{upper}}(n) + Z_{\mathrm{LA}}} and U_n = \frac{Z_{\mathrm{LA}}}{Z_{\mathrm{lower}}(n) + Z_{\mathrm{LA}}}. Supposing (as in most NS descriptions) that the inverse likelihood survival function, L(X), is strictly decreasing, I believe this construction trivially fulfils the first three essential conditions of Krzysztof’s recipe (a single-run sketch of the construction follows the list):
(1) \mathcal{P}(L_n < U_n)=1 for every n=1,2,\ldots
(2) \mathcal{P}(L_n \in [0,1])=1 and \mathcal{P}(U_n \in [0,1])=1 for every n=1,2,\ldots
(3) \mathcal{E}L_n = l_n \nearrow p and \mathcal{E}U_n = u_n \searrow p.
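
To make the construction concrete, here is a minimal single-run sketch (a simplified, illustrative version of the compute.bounds() function in the script below), assuming a fixed number of live particles N, the deterministic compression X_i = e^{-i/N}, and the stored L_\mathrm{max} and Z_\mathrm{LA} from the Laplace step:

# Minimal single-run sketch: lower/upper Riemann sums of the NS integral and the induced
# coin bounds. sortedL = discarded likelihoods in increasing order, N = live particles,
# Lmax = maximum likelihood from the Laplace/gradient step, z.LA = Laplace estimate.
ns.bounds.single <- function(sortedL, N, Lmax, z.LA) {
  n <- length(sortedL)
  X <- c(1, exp(-(1:n)/N), 0)   # shrinking prior-mass grid, padded at 1 and 0
  delX <- -diff(X)              # widths of the n+1 intervals
  z.low <- sum(delX[2:(n+1)] * sortedL)                   # first interval contributes 0
  z.high <- sum(delX[1:n] * sortedL) + Lmax * delX[n+1]   # tail capped at Lmax
  c(L = z.LA/(z.high + z.LA), U = z.LA/(z.low + z.LA))    # z.low <= z.high gives L <= U in [0,1]
}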

Finally, I believe the relaxed fourth condition for Algorithm 4 can also be satisfied (provided c(n) and N(n) are chosen to grow ‘faster’ than the curvature of L(X) in some loose sense!), but here things get more complicated and I can only argue in a hand-waving manner at the moment.
(4 alt) \mathcal{E}(L_{n-1}|\mathcal{F}_{n,\infty}) = \mathcal{E}(L_{n-1}|\mathcal{F}_{n}) \le L_n almost surely, and
\mathcal{E}(U_{n-1}|\mathcal{F}_{n,\infty}) = \mathcal{E}(U_{n-1}|\mathcal{F}_{n}) \ge U_n almost surely.
That is, L_n and U_n are a super- and a sub-martingale, respectively.

The way I imagine L_n as a random variable here is as L_n(w): \mathcal{R}^\infty \rightarrow \mathcal{R}, equivalent to L_n(w_n): \mathcal{R}^n \rightarrow \mathcal{R}, meaning that each L_n carves out an increasingly fine sequence of sigma-algebras on the cylinder sets of \mathcal{R}^\infty. And that the pullback of a single point r \in \mathcal{R}, i.e. w_n = L_n^{-1}(r), is a collection of cylinder sets equivalent to n! points in \mathcal{R}^n corresponding to all possible permutations of a canonical w^{(1)}, w^{(2)}, \ldots, w^{(n)}.

All this being a convoluted way to say that I believe each \mathcal{E}(L_{n-1}|\mathcal{F}_{n}) \le L_n can be computed from the \Pi-weighted sum over all {N(n) \choose N(n-1)} combinations of the discarded points drawn up to the current n, subsampled to the size of the draws used up to n-1, where \Pi represents the probability of drawing a given ordering of likelihoods in each contributing run of nested sampling. In principle one could write an algorithm to enumerate and weight all of these, but in my testing I’ve decided simply to cheat and make do with a brute-force estimate from simulating t_i series (see Skilling’s nested sampling paper).
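
For concreteness, the basic ingredient of that brute-force estimate is just a simulation of the prior-mass sequence of a run, along the following lines (the generate.pseudo.sample() function in the script below does this for the pooled runs; simulate.X.sequence here is illustrative only):

# Simulate the prior-mass sequence X_1 > X_2 > ... for one NS run with N live particles
# and n discarded points: t_j = max of N uniforms, X_j = t_1 * ... * t_j.
simulate.X.sequence <- function(n, N) {
  cumprod(replicate(n, max(runif(N))))
}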

As a (not yet water-tight) proof of concept I’ve coded up a simple example of a \frac{Z_\mathrm{LA}}{Z_\mathrm{LA}+Z_\mathrm{true}} coin with exact sampling for the replacement of points. Allowing for floating point and other approximation issues, this code is not certified for very small scale terms on the likelihood; likewise, the nested sampling summation doesn’t use logarithms, so it’s susceptible to numerical issues there too. Nevertheless it works well for a couple of simple examples, so it gives a feel for the problem and a tool to start playing with. In the runtime print-out one can get a feel for whether the N(n) and c(n) scalings are practically sufficient by noting the number of failures of the super/sub-martingale requirements (i.e., negative values in the fifth printed column, L_n - \mathcal{E}(L_{n-1}|\mathcal{F}_n), or positive values in the sixth, U_n - \mathcal{E}(U_{n-1}|\mathcal{F}_n)).
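
One obvious hardening, which I haven’t implemented in the script below, would be to accumulate the lower and upper sums on the log scale; a minimal sketch, assuming the logged inputs (log-likelihoods, log prior-mass widths, log L_\mathrm{max}) are available, might look like the following (the log.bounds and logsumexp helpers are illustrative only, and computing the log widths stably needs its own care):

# Sketch of a log-scale version of the lower/upper sums, to guard against underflow when
# the likelihood scale is very small; logSortedL and logDelX are the logged analogues of
# the inputs used in compute.bounds below.
logsumexp <- function(x) {m <- max(x); m + log(sum(exp(x-m)))}
log.bounds <- function(logSortedL, logDelX, logLmax) {
  n <- length(logSortedL)
  c(low = logsumexp(logSortedL + logDelX[2:(n+1)]),
    high = logsumexp(c(logSortedL + logDelX[1:n], logLmax + logDelX[n+1])))
}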


### R script to implement exact nested sampling for demonstration of the Bernoulli factory algorithm in the self-calibration case: skew-t likelihood example

library(skewt)

# exact NS for Normal prior – Skew t likelihood model
draw.nested.model <- function(N=10,c=5,prior.mu=0,prior.sd=5,likelihood.mu=1,likelihood.scale=0.1,likelihood.skew=2) {
  # Returns the sequence of discarded (minimum) likelihoods from a run with N live
  # particles and floor(c*N) iterations
  L <- numeric(floor(c*N))
  Live <- list()
  Live$Particles <- rnorm(N,prior.mu,prior.sd)
  Live$Likelihoods <- dskt((Live$Particles-likelihood.mu)/likelihood.scale,df=1,likelihood.skew)
  for (i in 1:(floor(c*N))) {
    # discard the lowest-likelihood live particle
    L[i] <- min(Live$Likelihoods)
    r.index <- which.min(Live$Likelihoods)
    # the constrained region {theta: L(theta) > L[i]} is an interval about the likelihood
    # mode; find its other boundary by matching the discarded likelihood value
    if (Live$Particles[r.index] > likelihood.mu) {
      pos.two <- Live$Particles[r.index]
      opt <- function(theta) {return(abs(Live$Likelihoods[r.index]-dskt((theta-likelihood.mu)/likelihood.scale,df=1,likelihood.skew)))}
      pos.one <- optimize(opt,lower=-10000,upper=likelihood.mu)$minimum
    } else {
      pos.one <- Live$Particles[r.index]
      opt <- function(theta) {return(abs(Live$Likelihoods[r.index]-dskt((theta-likelihood.mu)/likelihood.scale,df=1,likelihood.skew)))}
      pos.two <- optimize(opt,lower=likelihood.mu,upper=10000)$minimum
    }
    # exact replacement draw from the prior restricted to (pos.one,pos.two) via inverse-CDF sampling
    plim <- pnorm(sort(c(pos.one,pos.two)),prior.mu,prior.sd)
    Live$Particles[r.index] <- qnorm(runif(1,plim[1],plim[2]),prior.mu,prior.sd)
    Live$Likelihoods[r.index] <- dskt((Live$Particles[r.index]-likelihood.mu)/likelihood.scale,df=1,likelihood.skew)
  }
  return(L)
}

# compute z.LA
compute.zLA <- function(prior.mu=0,prior.sd=5,likelihood.mu=1,likelihood.scale=0.1,likelihood.skew=2) {
  # Laplace approximation: locate the mode of the (unnormalised) posterior and apply the
  # one-dimensional Gaussian integral with curvature taken from the numerical Hessian
  obj <- function(x) {-dnorm(x,prior.mu,prior.sd,log=T)-log(dskt((x-likelihood.mu)/likelihood.scale,df=1,likelihood.skew))}
  x <- nlm(obj,likelihood.mu,hessian=T)
  z.LA <- sqrt(2*pi)*sqrt(1/x$hessian)*exp(-x$minimum)
  return(as.numeric(z.LA))
}

# compute upper & lower bounds
compute.bounds <- function(N.collection,LXs,Lmax,z.LA) {

  output <- list()

  # Number of live particles in play at each pooled discarded likelihood: run i contributes
  # N.collection[i] live points up to its highest recorded likelihood
  sortedL <- sort(unlist(LXs))
  Nlive <- numeric(length(sortedL))
  for (i in 1:length(N.collection)) {
    Nlive[sortedL <= max(LXs[[i]])] <- Nlive[sortedL <= max(LXs[[i]])] + N.collection[i]
  }
  # Prior-mass interval widths on the grid X_i = exp(-i/Nlive_i), padded with X_0 = 1 and X_{n+1} = 0
  delXs <- -diff(c(1,exp(-(1:(length(sortedL)))/Nlive),0))
  # sanity check: sum(delXs)=1
  # Lower sum weights each interval by the likelihood at its larger-X end (the first interval
  # contributes zero); upper sum uses the smaller-X end, capping the final interval at Lmax
  z.low <- sum(delXs[2:length(delXs)]*sortedL[1:(length(sortedL))])
  z.high <- sum(delXs[1:(length(delXs)-1)]*sortedL)+Lmax*delXs[length(delXs)]
  output$z <- c(z.low,z.high)

  output$L <- z.LA/(z.high+z.LA)
  output$U <- z.LA/(z.low+z.LA)

  return(output)
}

# reorders the LXs
generate.pseudo.sample <- function(c.collection,N.collection,LXs) {

  # Simulate t_i series (Skilling): for each run, X_j is the cumulative product of maxima of
  # N uniforms, giving a fresh random ordering of prior masses across the pooled runs
  mock.ts <- mock.indices <- list()
  for (i in 1:length(LXs)) {
    mock.ts[[i]] <- numeric(floor(c.collection[i]*N.collection[i]))
    mock.indices[[i]] <- 1:floor(c.collection[i]*N.collection[i])
    for (j in 1:(floor(c.collection[i]*N.collection[i]))) {mock.ts[[i]][j] <- max(runif(N.collection[i]))}
    for (j in 2:(floor(c.collection[i]*N.collection[i]))) {mock.ts[[i]][j] <- mock.ts[[i]][j]*mock.ts[[i]][j-1]}
  }
  for (i in 2:length(mock.indices)) {mock.indices[[i]] <- mock.indices[[i]]+max(mock.indices[[i-1]])}

  # Reassign the pooled, sorted likelihoods to runs according to the simulated prior-mass ordering
  sortedL <- sort(unlist(LXs))
  ts.order <- match(1:length(sortedL),sort.list(unlist(mock.ts),decreasing=T))

  # Keep only the first length(LXs)-1 runs: a pseudo-sample of the data available at step n-1
  dummy.LXs <- list()
  for (i in 1:(length(LXs)-1)) {dummy.LXs[[i]] <- sortedL[ts.order[mock.indices[[i]]]]}

  return(dummy.LXs)
}

# Bernoulli factory
simulate.BayesFactor.coin <- function(prior.mu=0,prior.sd=5,likelihood.mu=1,likelihood.scale=0.1,likelihood.skew=2,Nfactor=10,cfactor=2) {

  ## Following NS version of Algorithm 4 from Latuszynski paper
  # Initialize
  G0 <- runif(1)
  n <- 1
  L <- U <- Lhat <- Uhat <- Lstar <- Ustar <- numeric(1)
  L[1] <- 0
  U[1] <- 1
  Lhat[1] <- 0
  Uhat[1] <- 1
  Lstar[1] <- 0
  Ustar[1] <- 1
  decided <- F
  LXs <- list()
  Zs <- list()
  Lmax <- dskt(0,df=1,likelihood.skew)
  N.collection <- c.collection <- numeric(0)
  z.LA <- compute.zLA(prior.mu,prior.sd,likelihood.mu,likelihood.scale,likelihood.skew)

  while (!decided) {
    # Obtain L[n] and U[n] from a fresh NS run with N(n) live particles and stopping point c(n)
    N <- floor(sqrt(n)*Nfactor)
    c <- n*cfactor
    N.collection[length(N.collection)+1] <- N
    c.collection[length(c.collection)+1] <- c
    LXs[[length(LXs)+1]] <- draw.nested.model(N,c,prior.mu,prior.sd,likelihood.mu,likelihood.scale,likelihood.skew)

    n <- n + 1
    x <- compute.bounds(N.collection,LXs,Lmax,z.LA)
    zz <- mean(x$z)
    #cat(x$L,x$U,"\n")
    L[n] <- x$L
    U[n] <- x$U

    # Monte Carlo approximation to E{L[n-1]|F[n]} and E{U[n-1]|F[n]} by pseudo-sampling t_i series
    if (n > 2) {
      dummy.N <- N.collection[1:(length(N.collection)-1)]
      Lmean <- numeric(1)
      Umean <- numeric(1)
      for (k in 1:100) {
        LXs.dummy <- generate.pseudo.sample(c.collection,N.collection,LXs)
        x <- compute.bounds(dummy.N,LXs.dummy,Lmax,z.LA)
        Lmean[k] <- x$L
        Umean[k] <- x$U
      }
      Lstar[n] <- mean(Lmean)
      Ustar[n] <- mean(Umean)
    } else {Lstar[n] <- 0; Ustar[n] <- 1}

    # Algorithm 4: update the monotone envelopes Lhat, Uhat and check for a decision against G0
    if (L[n] > Lstar[n]) {
      Lhat[n] <- Lhat[n-1]+(L[n]-Lstar[n])/(Ustar[n]-Lstar[n])*(Uhat[n-1]-Lhat[n-1])} else {Lhat[n] <- Lhat[n-1]}
    if (U[n] < Ustar[n]) {
      Uhat[n] <- Uhat[n-1]-(Ustar[n]-U[n])/(Ustar[n]-Lstar[n])*(Uhat[n-1]-Lhat[n-1])} else {Uhat[n] <- Uhat[n-1]}
    if (G0 <= Lhat[n]) {status <- 1; decided <- T}
    if (G0 > Uhat[n]) {status <- 0; decided <- T}
  }
  # columns: status, G0, Lhat, Uhat, L-Lstar, U-Ustar, z.LA, mean NS estimate of z
  cat(status,G0,Lhat[n],Uhat[n],L[n]-Lstar[n],U[n]-Ustar[n],z.LA,zz,"\n")
  return(status)
}

### Test coin
# p=0.525
x <- numeric(1000)
for (i in 1:1000) {x[i] <- simulate.BayesFactor.coin(prior.mu=0,prior.sd=5,likelihood.mu=0,likelihood.scale=1,likelihood.skew=2,Nfactor=5,cfactor=5)}
cat("p=0.525 : <coin>[1000] =",mean(x),"\n")

# p=0.219
x <- numeric(1000)
for (i in 1:1000) {x[i] <- simulate.BayesFactor.coin(prior.mu=0,prior.sd=5,likelihood.mu=3,likelihood.scale=0.5,likelihood.skew=2,Nfactor=5,cfactor=5)}
cat("p=0.219 : <coin>[1000] =",mean(x),"\n")
