## Some Bayesian unsupervised learning …

This morning I noticed the latest IFT group paper on astro-ph (as a replacement), “SOMBI*: Bayesian identification of parameter relations in unstructured cosmological data” by Frank et al. [* which I imagine to be pronounced thusly], and since I have some free time while waiting for simulations to finish, I thought I would comment on it for today’s blog entry.

The idea of the SOMBI code/model is first to ‘automatically identify data clusters in high-dimensional datasets via the Self Organizing Map’, and then to recover parameter relations ‘by means of a Bayesian inference within respective identified data clusters’.  Notionally we are to imagine that ‘each sub-sample is drawn from one generation process and that the correlations corresponding to one process can be modeled by unique correlation functions.’  In effect, the authors cut their model: the identification of clusters is not informed by the nature of any inferred correlation functions, nor is any shrinkage achieved through a hierarchical structure of shared hyperparameters during the regression on each subset.  This seems to me an unfortunate choice, because it turns out in the application given (and one would expect more generally) that some parameter relationships are not entirely dissimilar between groups, in which case shrinkage would be sensible; likewise, two groups that turn out to have similar relationships might sensibly be pooled back into one group in a model without cuts (and so forth).
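To make the ‘cut’ concrete, here is a minimal Python sketch of the two-stage idea: cluster first, then regress within each cluster, with no information flowing back from the regressions to the clustering.  This is not the SOMBI code.  The toy one-dimensional SOM (`train_som`, `assign_clusters`), the synthetic two-population data and all parameter values are my own assumptions for illustration, and ordinary least squares stands in for the paper's Bayesian regression.

```python
import numpy as np


def train_som(data, n_nodes=2, n_iter=2000, lr0=0.5, width0=1.0, seed=0):
    """Toy 1-D self-organising map (hypothetical helper); returns node weights."""
    rng = np.random.default_rng(seed)
    # Initialise the nodes on randomly chosen data points.
    weights = data[rng.choice(len(data), size=n_nodes, replace=False)].astype(float)
    grid = np.arange(n_nodes)  # node positions along the 1-D map
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        width = max(width0 * (1.0 - frac), 1e-2)   # shrinking neighbourhood
        v = data[rng.integers(len(data))]          # one random sample
        bmu = np.argmin(((weights - v) ** 2).sum(axis=1))  # best-matching unit
        h = np.exp(-0.5 * ((grid - bmu) / width) ** 2)     # neighbourhood kernel
        weights += lr * h[:, None] * (v - weights)
    return weights


def assign_clusters(data, weights):
    """Label each point by its nearest SOM node."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# Synthetic data (an assumption): two populations with different linear relations.
rng = np.random.default_rng(1)
n = 150
x_a = rng.normal(-3.0, 0.5, n)
y_a = 2.0 * x_a + 1.0 + rng.normal(0.0, 0.2, n)
x_b = rng.normal(3.0, 0.5, n)
y_b = -1.0 * x_b + rng.normal(0.0, 0.2, n)
data = np.column_stack([np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b])])

# Stage 1: clustering, blind to any regression structure.
weights = train_som(data)
labels = assign_clusters(data, weights)

# Stage 2: an independent straight-line fit per cluster (no shared hyperparameters).
slopes = {}
for k in np.unique(labels):
    xk, yk = data[labels == k, 0], data[labels == k, 1]
    slope, intercept = np.polyfit(xk, yk, 1)
    slopes[int(k)] = slope
```

Because stage 2 never feeds back into stage 1, two clusters with near-identical fitted relations would stay separate here, and no cluster borrows strength from another; a hierarchical model without cuts would allow both.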

For those interested, the model for regression in each group is simply:
$y \sim N(X'B,\sigma^2)$,
$B_{1:m} \sim U(-\infty,\infty)$,
$\sigma \sim U(0,\infty)$
with $X$ being a design matrix consisting of the first through $m$th powers of the independent variable, $x$.  The normal-uniform likelihood-prior choice is used to allow explicit marginalisation, and the empirical Bayes solution for $\sigma|m$ is adopted, with $m$ chosen via the BIC.
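As a rough illustration of choosing the polynomial order by BIC, here is a short Python sketch.  It is only a sketch under stated assumptions: the function names (`design_matrix`, `bic_for_order`, `select_order`), the synthetic data and the maximum order are all hypothetical, and the maximum-likelihood plug-in for $\sigma$ stands in for the paper's empirical Bayes solution, which may differ in detail.

```python
import numpy as np


def design_matrix(x, m):
    """Columns are x**1 ... x**m (no intercept, following the description above)."""
    return np.column_stack([x ** k for k in range(1, m + 1)])


def bic_for_order(x, y, m):
    """Least-squares fit of order m; returns (BIC, coefficients, sigma estimate)."""
    X = design_matrix(x, m)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = y.size
    sigma2_hat = resid @ resid / n  # ML plug-in variance (an assumption here)
    # Gaussian log-likelihood evaluated at the ML estimates.
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)
    n_params = m + 1  # m coefficients plus sigma
    bic = n_params * np.log(n) - 2.0 * loglik
    return bic, beta, np.sqrt(sigma2_hat)


def select_order(x, y, max_order=6):
    """Pick the order with the smallest BIC among 1..max_order."""
    fits = {m: bic_for_order(x, y, m) for m in range(1, max_order + 1)}
    best = min(fits, key=lambda m: fits[m][0])
    return best, fits[best][1], fits[best][2]


# Synthetic example (an assumption): a quadratic relation with Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 200)
y = 1.5 * x - 0.8 * x ** 2 + rng.normal(0.0, 0.3, x.size)

m_best, beta_hat, sigma_hat = select_order(x, y)
print("selected order:", m_best)
print("coefficients:", beta_hat)
print("sigma estimate:", sigma_hat)
```

Note that with unstandardised inputs the higher power columns quickly span very different scales, which is one reason the scale of the prior on each power term matters.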

One criticism I would make is that this work is presented in vacuo, without reference to the myriad other approaches to automatic data clustering and/or variable selection/regression already existing in the stats and ML literature.  Certainly the BIC has a long history of use for selection of regression coefficients in linear (and GLM) models and its properties are well understood; but this is typically in the context of Bayesian models with priors that inform the expected scale of power terms, and without the free parameter of unknown $\sigma$.  (For reference, Gelman et al. recommend a standardisation of inputs before construction of power terms under their Cauchy prior structure for logistic regression.)  Alternatives for automatic covariate selection or shrinkage include (e.g.) the LASSO and ridge regression penalties (the latter shrinking coefficients without setting any exactly to zero), and Bayesian model selection with Zellner’s g-prior.  Likewise, there are numerous Bayesian schemes which aim to achieve clustering and regression inference without cuts to the model, typically involving a clustering over some latent space (with dimensional reduction; e.g. Titsias & Lawrence).  In particular, it would have been useful to see such a comparison performed over a common dataset (e.g. one from R's default datasets package), as is common in NIPS submissions.