## A most unusual algorithm …

I discovered a most unusual algorithm for finding the posterior mean in a recent arXival called “Temperature as a third dimension in column-density mapping of dusty astrophysical structures associated with star formation” by Marsh et al.  The target of inference is a latent pixelized and discretized image representation assumed to be the realisation of a point process on the pixel grid, which is observed through some convolution matrix in the presence of Gaussian noise.  The space of the discrete state image, $x \in \mathbb{Z}^n$ not $\mathbb{R}^n$, is considered prohibitively large to tackle with a posterior sampler, and being discrete it is not possible to apply a gradient-based mode-finding algorithm.  Instead, the authors look towards the continuous family of likelihood-tempered posterior means, $\rho_t = \sum_x x L(y|x)^t P(x) / Z_t$ (my notational edits), and take advantage of the particular structure of their likelihood function to formulate a hierarchy of integro-differential equations for evolving from the prior mean to the posterior mean, truncated to $\frac{d\rho_t}{dt} + \phi_1 \rho_t = 0$.  It is then straightforward to construct a numerical solution for $\rho_1$!
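For intuition about the tempered path itself (not the IDE machinery), here’s a minimal brute-force sketch of $\rho_t$ on a toy binary image small enough to enumerate; the dimensions, response matrix $F$, noise scale, and Bernoulli prior weight $q$ are all my own illustrative choices, not the paper’s:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# Toy problem: p latent binary pixels observed through an M x p response
# matrix F with Gaussian noise (all numbers here are illustrative).
p, M, sigma, q = 4, 6, 0.5, 0.3
F = rng.normal(size=(M, p))
x_true = np.array([1, 0, 1, 0])
y = F @ x_true + rng.normal(scale=sigma, size=M)

# Enumerate all 2^p binary states (feasible only for tiny p).
states = np.array(list(itertools.product([0, 1], repeat=p)), dtype=float)
n_on = states.sum(axis=1)
log_prior = n_on * np.log(q) + (p - n_on) * np.log(1 - q)
log_like = np.array([-0.5 * np.sum((y - F @ x) ** 2) / sigma**2 for x in states])


def tempered_mean(t):
    """rho_t: mean of x under the posterior with likelihood tempered to power t."""
    logw = t * log_like + log_prior
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ states


rho_0 = tempered_mean(0.0)  # prior mean: q in every pixel
rho_1 = tempered_mean(1.0)  # full posterior mean
```

At $t=0$ the path starts at the prior mean ($q$ in every coordinate) and at $t=1$ it reaches the posterior mean; the point of the paper’s method is to traverse this path by solving an ODE rather than by the exhaustive enumeration used here.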

At first I thought, “Isn’t this just some weird version of thermodynamic integration?”, but I quickly realised it isn’t, because at no step is any sampling required.  I then thought, “Isn’t this assuming posterior Normality?”, but it turns out the answer here (as we’ll see later) is again “No”, though it does require Normal noise (and hence a Normal likelihood).  I then had a look at the earlier Marsh and/or Richardson papers building this methodology in the acoustic-imaging and remote-sensing literature to see who cites them, and perhaps thereby determine a common name for the algorithm, but alas it seems that if anyone else is using this technique they’re referencing some other source material.  Hence my first question for readers: do you know of anything similar that I could read about?

Thanks to Ken Marsh, who supplied me with a copy of Richardson & Marsh 1987 in which an early version of the algorithm is developed in detail, I can now properly explain the nature of the approximation in the integro-differential equation (IDE) method.  The observational model is supposed as follows: $y = Fx + \nu$, with $x$ a $p$-dimensional vector with binary elements representing the on/off state of a given pixel in the latent image, $F$ an $M \times p$-dimensional convolution matrix describing the response of the observational system, $\nu$ an $M$-dimensional vector of zero-mean Gaussian noise with covariance matrix $C_\nu$, and $y$ an observed $M$-dimensional image vector.  The prior for $x$ penalises the total number of ‘on’ pixels, $N=N(x)=\sum_{i=1}^p x_i$, whilst remaining agnostic as to their arrangement, i.e., $P(x) \propto \exp(\alpha N)P^\ast(x)$ with $P^\ast(x) = \prod_i q^{x_i}(1-q)^{1-x_i}$ (an independent Bernoulli weighting; I write $q$ for the success probability to avoid a clash with the dimension $p$) and $\alpha$ a hyper-parameter determining the ‘dilution’ of $x$ expected under the prior.
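As a concrete (and entirely illustrative) instance of this model, the following sketch draws a latent binary image from the prior and pushes it through the observational model; the dimensions, $F$, $C_\nu$, and the hyper-parameter values are my own stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative dimensions and hyper-parameters (not from the paper).
p, M = 16, 12
alpha, q = -1.0, 0.5

# Prior P(x) ∝ exp(alpha * N) * prod_i q^{x_i} (1-q)^{1-x_i}: because the
# base weighting is an independent product, the tilt by exp(alpha * N)
# just gives each pixel independent 'on' probability q e^a / (q e^a + 1 - q).
prob_on = q * np.exp(alpha) / (q * np.exp(alpha) + (1 - q))
x = (rng.random(p) < prob_on).astype(float)

# Observation y = F x + nu, with nu ~ N(0, C_nu); C_nu diagonal here.
F = rng.normal(size=(M, p))
C_nu = 0.25 * np.eye(M)
nu = rng.multivariate_normal(np.zeros(M), C_nu)
y = F @ x + nu
```

Note how a negative $\alpha$ tilts the per-pixel probability below $q$, which is exactly the ‘dilution’ role the hyper-parameter plays.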

The first thing you’ll notice is that this is a bit of an odd model: a discrete (here binary) latent image subject to a real-valued convolution matrix and real-valued Gaussian noise.  However, if you can put that weirdness aside for now then the IDE algorithm can be derived by first noting that differentiation of the posterior in $\alpha$ yields $\frac{d}{d\alpha}P(x|y) = (N-E(N|y))P(x|y) = \sum_i (x_i-E(x_i|y))P(x|y)$, which is already an IDE, but in the posterior rather than the posterior mean.  To get an algorithm targeting the posterior mean the authors construct a hierarchy of IDEs by multiplying both sides of the above equation first by $x_j$, then $x_j x_k$, then $x_j x_k x_h$, and so on, and finally summing each side over all $x$.  For instance, the first member becomes $\frac{d}{d\alpha}E(x_j|y) = E(x_j|y) - E(x_j|y)^2 + \sum_{i \ne j}(E(x_ix_j|y)-E(x_i|y)E(x_j|y))$, which is otherwise self-contained except for the ‘two-point correlation function’ term, $E(x_ix_j|y)$, which is the target of the second member of the hierarchy.
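This first-member identity is easy to verify by brute force on a model small enough to enumerate.  The sketch below (my own check, with illustrative numbers throughout) compares a central finite difference of $E(x_j|y)$ in $\alpha$ against the right-hand side:

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)

# Tiny illustrative model: y = F x + noise, binary x, Bernoulli(q) base prior.
p, M, sigma, q = 4, 6, 0.7, 0.4
F = rng.normal(size=(M, p))
y = F @ np.array([1, 1, 0, 0]) + rng.normal(scale=sigma, size=M)

states = np.array(list(itertools.product([0, 1], repeat=p)), dtype=float)
n_on = states.sum(axis=1)
log_pstar = n_on * np.log(q) + (p - n_on) * np.log(1 - q)
log_like = np.array([-0.5 * np.sum((y - F @ x) ** 2) / sigma**2 for x in states])


def moments(alpha):
    """First and second moments of x under P(x|y) ∝ L(y|x) e^{alpha N} P*(x)."""
    logw = log_like + alpha * n_on + log_pstar
    w = np.exp(logw - logw.max())
    w /= w.sum()
    m1 = w @ states                        # E(x_i | y)
    m2 = states.T @ (w[:, None] * states)  # E(x_i x_j | y)
    return m1, m2


alpha, eps = 0.2, 1e-5
m1, m2 = moments(alpha)
fd = (moments(alpha + eps)[0] - moments(alpha - eps)[0]) / (2 * eps)

# Right-hand side of the first member of the hierarchy.
rhs = np.array([
    m1[j] - m1[j] ** 2
    + sum(m2[i, j] - m1[i] * m1[j] for i in range(p) if i != j)
    for j in range(p)
])
```

The $i=j$ term is why the quadratic piece $E(x_j|y)-E(x_j|y)^2$ appears: for binary pixels $x_j^2 = x_j$, so the diagonal covariance term collapses to the Bernoulli variance.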

Exact solution would of course require following each member of the hierarchy up to infinity, so the authors truncate the series after the second member by an assumption which I first thought was equivalent to posterior Normality but is in fact equivalent to supposing that the third-order copula is unity, i.e., $E(x_jx_kx_h|y) = E(x_jx_k|y)E(x_jx_h|y)E(x_kx_h|y)/E(x_j|y)/E(x_k|y)/E(x_h|y)$!  The authors then suggest solving the two-level hierarchy of IDEs by numerical integration from a point equivalent to $\alpha = -\infty$, stopping at an $\alpha$ such that $E(N|y) = E(N)$, which adds a level of empirical Bayes to the algorithm.
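The stopping rule can be illustrated without the IDE machinery at all: on an enumerable toy model one can simply scan $\alpha$, computing the posterior expectation $E(N|y)$ (which is non-decreasing in $\alpha$, since its derivative is $\mathrm{Var}(N|y)$) alongside the prior expectation $E(N)$, and look for the crossing.  Everything below is my own illustrative setup:

```python
import itertools

import numpy as np

rng = np.random.default_rng(2)

# Illustrative model, small enough for exhaustive enumeration.
p, M, sigma, q = 4, 6, 0.6, 0.5
F = rng.normal(size=(M, p))
y = F @ np.array([0, 1, 1, 0]) + rng.normal(scale=sigma, size=M)

states = np.array(list(itertools.product([0, 1], repeat=p)), dtype=float)
n_on = states.sum(axis=1)
log_pstar = n_on * np.log(q) + (p - n_on) * np.log(1 - q)
log_like = np.array([-0.5 * np.sum((y - F @ x) ** 2) / sigma**2 for x in states])


def EN_posterior(alpha):
    """E(N | y) under the posterior with prior tilt exp(alpha * N)."""
    logw = log_like + alpha * n_on + log_pstar
    w = np.exp(logw - logw.max())
    return (w @ n_on) / w.sum()


def EN_prior(alpha):
    # Under P(x) ∝ e^{alpha N} P*(x) each pixel is independently 'on'
    # with tilted probability q e^alpha / (q e^alpha + 1 - q).
    r = q * np.exp(alpha) / (q * np.exp(alpha) + 1 - q)
    return p * r


alphas = np.linspace(-8, 4, 241)
post = np.array([EN_posterior(a) for a in alphas])
prior = np.array([EN_prior(a) for a in alphas])
```

The empirical-Bayes flavour is visible here: both curves are monotone in $\alpha$, and the stopping condition $E(N|y) = E(N)$ picks out the tilt at which the data neither adds nor removes expected flux relative to the prior.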

In conclusion, although this approach is currently developed only for a very particular hierarchical model, one does wonder what novel techniques might be gained from attempting to extend it to other Bayesian inference problems featuring high-dimensional, discrete latent variables.  To be continued, I suppose …