A novel approach to likelihood-free inference: just use the likelihood!

A new arXival describes “a novel inference framework based on Approximate Bayesian Computation” for a modelling exercise in the field of strong gravitational lensing.  Since they acknowledge insightful discussions from the guys who seem to deliberately refuse to cite my work on ABC for astronomy, it’s no surprise that the statistical analysis again goes astray.  Basically, the proposal here is to use the negative log marginal likelihood—the marginalisation being over a set of nuisance parameters conditional on the observed data and a given set of hyper-parameters—as a distance in ABC.  Usually ABC is motivated by the likelihood being unavailable, but that doesn’t seem to be understood here.  The only thing that seems to stop the posterior collapsing to the mode is the early stopping of the ABC-SMC algorithm at an arbitrary number of steps.  There’s also some silliness with respect to the choice of prior for the source image (the component of the model that is marginalised out exactly), which the paper itself emphasises as disfavouring realistic source brightness distributions.
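To make the point concrete, here’s a toy sketch (my own construction, not the paper’s lensing model) of what happens when the ABC “distance” is just the negative log-likelihood itself: as the tolerance shrinks, the accepted draws pile up around the likelihood mode rather than sampling the Bayesian posterior, and only stopping early preserves any spread at all.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    y_obs = 1.3                     # a single observed datum
    prior = norm(0.0, 3.0)          # prior on theta

    def neg_log_lik(theta):
        # "distance" = -log p(y_obs | theta) under a unit-variance Gaussian likelihood
        return -norm.logpdf(y_obs, loc=theta, scale=1.0)

    for eps in [2.0, 1.2, 0.95]:    # ever tighter ABC tolerances
        theta = prior.rvs(size=200_000, random_state=rng)
        accepted = theta[neg_log_lik(theta) < eps]
        print(f"eps={eps:4.2f}  n={accepted.size:6d}  "
              f"mean={accepted.mean():+.3f}  sd={accepted.std():.3f}")
    # The accepted-sample sd shrinks towards zero around theta = y_obs (the mode)
    # instead of stabilising at the true posterior sd.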

The only thing I found interesting here was the reference to Vegetti & Koopmans (2009) [Note: this is clearly the paper the authors intended to cite, rather than the other Vegetti & Koopmans 2009: MNRAS 400 1583; this kind of mistake suggests the level of care going into these arXivals is even less than the few minutes I spend cranking out a blog post].  The Vegetti & Koopmans (2009) method involves construction of a source image prior via a Voronoi tessellation scheme with regularisation terms.  An interesting project would be to examine how the SPDE approach could allow for a more nuanced prior choice, introduction of non-stationarity, etc. (see Lindgren et al. 2011).
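For the record, the flavour of SPDE construction I have in mind is something like the following rough sketch (my own toy, not Vegetti & Koopmans’ scheme): build a sparse precision matrix Q ~ (kappa^2 I + Laplacian)^2 on the source pixel grid to get an approximately Matern-smooth Gaussian prior, then let kappa vary across the grid to introduce non-stationarity, a la Lindgren et al. (2011).

    import numpy as np
    import scipy.sparse as sp

    def spde_precision(n, kappa=0.5):
        # graph Laplacian of an n x n pixel lattice
        ones = np.ones(n)
        d1 = sp.diags([-ones[:-1], 2 * ones, -ones[:-1]], [-1, 0, 1])
        lap = sp.kron(d1, sp.eye(n)) + sp.kron(sp.eye(n), d1)
        k2 = kappa**2 * sp.eye(n * n)   # swap in a spatially varying diagonal
                                        # here for a non-stationary prior
        op = k2 + lap                   # discretised (kappa^2 - Delta)
        return (op.T @ op).tocsc()      # alpha = 2, Matern-like precision

    Q = spde_precision(32)
    print(Q.shape, Q.nnz)               # sparse: only ~13 non-zeros per row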

Actually there is another interesting thing that could be investigated with this model.  The authors choose to look at $-2 \log P(d|\theta)$ for their distance: the factor of 2 is of course irrelevant in the ABC context and with respect to collapse onto the posterior mode; but for likelihood-based inference it would represent an extra-Bayesian calibration factor, which could actually be chosen to reduce exposure to excess concentration in the mis-specified setting via, e.g., the loss-likelihood bootstrap.
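To flesh that out: the “2” is really a learning rate on the log-likelihood, and one way to calibrate it under misspecification is the loss-likelihood bootstrap, i.e., repeatedly minimising a Dirichlet-weighted version of the loss and comparing the resulting spread to the naive posterior spread.  A minimal sketch of that idea (my own toy Gaussian-location loss, with the final mapping back to a learning rate only indicated):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    y = rng.standard_t(df=3, size=200)   # heavier-tailed DGP than the working model

    def weighted_loss(theta, w):
        return np.sum(w * 0.5 * (y - theta) ** 2)   # Gaussian negative log-lik loss

    draws = []
    for _ in range(1000):
        w = rng.dirichlet(np.ones(y.size)) * y.size
        draws.append(minimize_scalar(lambda t: weighted_loss(t, w)).x)
    draws = np.asarray(draws)

    # Ratio of the bootstrap spread to the naive (learning rate = 1) posterior
    # spread under the unit-variance working model suggests the calibration
    # factor one could fold into the "-2" in front of log P(d|theta).
    print(draws.std(), 1.0 / np.sqrt(y.size))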


More dumping on the NeurIPS Bayesian Deep Learning Workshop …

Today I noticed another paper on astro-ph that irked me, and again it turns out to be accepted at this year’s NeurIPS Bayesian Deep Learning Workshop.  This particular arXival proposes a Bayesian approach to the construction of super-resolution images and, in particular, sets out to explore uncertainty quantification since “in many scientific domains this is not adequate and estimations of errors and uncertainties are crucial”.  What irked me?  One thing was the statement: “to the extent of our knowledge, there is no existing work measuring uncertainty in super-resolution tasks”.  That might be true if you consider only a particular class of machine learning algorithms that have addressed the challenge of creating high resolution images from low resolution inputs, but this general problem (PSF deconvolution, drizzling, etc.) has been a core topic in astronomical imaging since the first CCDs, and in this context there are many studies of accuracy and uncertainty.  Likewise, the general problem of how to build confidence in reconstructions of images via statistical models without a ground truth to validate against is also well explored in astronomy.  The first ever black hole image (‘the Katie Bouman news story’) addressed this challenge through a structured comparison of images separately created by four independent teams using different methods.

Another thing that irks me: I find the breakdown between types of uncertainty—“Epistemic uncertainty relates to our ignorance of the true data generating process, and aleatoric uncertainty captures the inherent noise in the data.”—to be inadequate.  Here this is really just proposing a separation between the prior and the likelihood, which runs against the useful maxim for applied Bayesian modelling that the prior can only be understood in terms of the likelihood.  That said, I also wouldn’t call this a Bayesian method, since the approximation of the posterior implied by dropout is zeroth order at best.  Don’t get me wrong: dropout is a great technique for certain applications, but I find the arguments suggesting it has a Bayesian flavour rather unconvincing, however attractive they may be for citations.
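For concreteness, the MC-dropout recipe in question amounts to something like the following minimal sketch (my own toy network in PyTorch): keep dropout switched on at test time, run a bunch of stochastic forward passes, and report the spread as the “epistemic” uncertainty.  Whether that spread approximates any posterior worth the name is exactly the point in dispute.

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2),
                        nn.Linear(64, 1))
    x = torch.randn(8, 16)            # a batch of 8 dummy inputs

    net.train()                       # leave dropout active at prediction time
    with torch.no_grad():
        samples = torch.stack([net(x) for _ in range(100)])   # T = 100 passes

    mean, std = samples.mean(dim=0), samples.std(dim=0)
    print(mean.shape, std.shape)      # per-input predictive mean and spread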


On unnecessarily general paper titles

Last week a postdoc in my lab received a rejection letter from a high profile stats journal, with the stated reason for rejection being that the problem was already solved in existing software such as INLA.  Which was odd, because we use INLA all the time at work, and the whole reason we decided to embark on the project described in the rejected manuscript was that INLA did not offer a solution for this particular problem.  My suspicion is that the editor or associate editor did a quick Google search on the topic and found a paper with an unnecessarily general title.  That is, a paper whose title suggests that a general problem is solved therein, rather than the very restricted problem that is actually examined.  (In this case the problem is the combination of areal and point data, which is trivially solved in INLA under the Normal likelihood with linear link function, but is not solved in INLA for non-Normal likelihoods with non-linear link functions.)

For this reason I would say that I’m more than a little skeptical about the clickbait motivation for the title given to this recent arXival: “Uncertainty Quantification with Generative Models”.  Which is sufficiently broad as to encompass the entirety of Bayesian inference and most of machine learning!!  And under which you would probably expect to find something more substantial than a proposal to approximate the posteriors of VAE-style models via a mixture of Gaussians, obtained by local mode finding (optimisation from random starting points) followed by computation of the Hessian at each of those modes.  But apparently this novel idea has been accepted to the Bayesian Deep Learning workshop at NeurIPS this year, so what do I know?!
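For those who haven’t seen it before, the generic recipe looks something like the following hedged sketch (a toy bimodal target of my own, not the paper’s VAE posterior): run an optimiser from random starting points, keep the distinct modes, attach a Gaussian at each via the inverse Hessian, and weight the components Laplace-style.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_post(z):              # toy 2-D bimodal target
        return -np.logaddexp(-0.5 * np.sum((z - 2.0) ** 2),
                             -0.5 * np.sum((z + 2.0) ** 2))

    rng = np.random.default_rng(2)
    modes, covs, log_heights = [], [], []
    for _ in range(20):               # random restarts
        res = minimize(neg_log_post, rng.normal(scale=4.0, size=2), method="BFGS")
        if not any(np.allclose(res.x, m, atol=1e-2) for m in modes):
            modes.append(res.x)
            covs.append(res.hess_inv)       # BFGS inverse-Hessian as local covariance
            log_heights.append(-res.fun)

    # Laplace weight of each component: mode height times local Gaussian volume
    log_w = np.array([h + 0.5 * np.linalg.slogdet(2 * np.pi * c)[1]
                      for h, c in zip(log_heights, covs)])
    w = np.exp(log_w - np.logaddexp.reduce(log_w))
    print(len(modes), w)              # should find two modes with weights ~0.5 each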

If I’m going to start beef with the machine learning community then I may as well say something else on the topic.  Recently it came to light that an Australian engineering professor was fired from Swinburne University for having published a huge amount of duplicate work: i.e., submitting essentially the same paper to multiple journals in order to spin each actual project out into multiple near-identical publications.  The alleged motivation for doing so was the pressure to juke one’s own research output stats (total pubs and total cites).  Which is funny, because I don’t know of many machine learning professors who don’t have the same issue with their publications: multiple versions of the same paper given at NeurIPS, AISTATS, ICLR, etc., and then maybe submitted to a stats journal as well!

And in other indignities, the third author on this arXival is on a salary that is over 2.5 times my Oxford salary.


Priors make models like styles make fights …

A ubiquitous saying in boxing analysis is “styles make fights”, which means that to predict what a match up will look like you need to think about how the characteristic styles of the two opponents might work (or not) with respect to each other.  Two strong counter-punchers might find themselves circling awkwardly for twelve rounds, neither willing to come forward and press the action.  While a match up between two aggressive pressure fighters might turn on the question of whether their styles only work while moving forwards.  As an analyst of Bayesian science my equivalent maxim is “priors make models”.  Well-chosen priors can sensibly regularise the predictions of a highly flexible model, achieve powerful shrinkage across a hierarchical structure, or push a model towards better Frequentist coverage behaviour.  For that reason I don’t understand why cosmologists are so keen on ‘uninformative’ priors.  It’s like throwing away the best part of Bayesian modelling.

Anyway, two papers from the arXiv last week caught my eye.  The first proposes a statistic for ‘quantifying tension between correlated datasets with wide uninformative priors’.  So, aside from the focus on a type of prior (wide, uninformative) that I don’t care for, I’m also puzzled by the obsession of cosmologists with searching for ‘tension’ in the posteriors of models with shared parameters fitted to different datasets (or different aspects of the same dataset), as an indicator of either systematic errors or new physics.  As this paper makes clear, there is a huge variety of techniques proposed for this topic, but all of them come from the cosmology literature.  How is it that no other field of applied statistics has got itself twisted up in this same problem?  An example use of this statistic is given in which a model with shared parameters is fitted to four redshift slices from a survey and the decision whether or not to combine the posteriors is to be made according to how much their separately fitted posteriors overlap.
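For readers outside cosmology, the simplest member of this family of statistics is just the difference of posterior means measured against the summed posterior covariances, assuming roughly Gaussian posteriors and independent datasets; the arXival’s contribution is precisely about correcting this sort of thing for correlated datasets, so treat the sketch below (my own, with placeholder numbers) as the naive baseline.

    import numpy as np
    from scipy.stats import chi2

    def gaussian_tension(mu1, cov1, mu2, cov2):
        # chi-squared style tension between two roughly Gaussian posteriors
        d = np.asarray(mu1) - np.asarray(mu2)
        q = d @ np.linalg.solve(np.asarray(cov1) + np.asarray(cov2), d)
        return q, chi2.sf(q, df=d.size)     # statistic and its tail probability

    # placeholder posterior summaries for two redshift slices
    q, p = gaussian_tension([0.31, 0.81], np.diag([0.010, 0.020]) ** 2,
                            [0.34, 0.76], np.diag([0.015, 0.025]) ** 2)
    print(q, p)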

The other paper I read this week concerns the coupling of a variational autoencoder model, as a generative distribution for galaxy images, with a physics-based gravitational lensing model.  The proposal for this type of model and the authors’ advocacy for modern auto-diff packages like PyTorch make a lot of sense.  However, it seems that a lot of work is still to be done to improve the prior on galaxy images and the posterior inference technique, because the two examples shown suggest that there is a serious under-coverage problem in the moderate and high signal-to-noise regime.  Also, the cost of recovering a small number of HMC samples is very high here (many hours); I don’t think that HMC is a viable posterior approximation for this type of model.  And why bother when the coverage is so bad?  Most likely a better option will be some kind of variational approximation that will be quicker to fit and will improve coverage partly by accident and partly by design through its approximate nature; i.e., deliberately slowing the learning rate.  Sounds crazy to some, perhaps, but remember that the variational autoencoder here is trained via stochastic gradient descent with a predictive accuracy-based stopping rule, which is just another way of slowing the learning rate or artificially regularising a model.
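By under-coverage I mean the outcome of a standard simulation-based calibration check, a stripped-down version of which looks like the sketch below: my own toy conjugate-Gaussian stand-in for the VAE-plus-lensing pipeline, in which one simulates truths from the prior, simulates data, runs the (approximate) posterior, and counts how often the nominal 68% interval contains the truth.

    import numpy as np

    rng = np.random.default_rng(3)
    n_sims, sigma = 2000, 0.2               # sigma plays the role of the noise level
    covered = 0
    for _ in range(n_sims):
        theta = rng.normal()                # truth drawn from the N(0, 1) prior
        y = theta + sigma * rng.normal()    # simulated data at this SNR
        post_var = 1.0 / (1.0 + 1.0 / sigma**2)   # exact conjugate posterior here;
        post_mean = (y / sigma**2) * post_var     # swap in the approximation under test
        lo, hi = post_mean - post_var**0.5, post_mean + post_var**0.5
        covered += (lo < theta < hi)
    print(covered / n_sims)                 # ~0.68 if the posterior is well calibrated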


How to push back against bullies in academia

I’m sure most of us know of at least one notorious workplace bully (or other type of shithead) in our field; the kind of professor whose bad behaviour is an open secret, but who, given that universities don’t give a toss about this issue, stands no chance of being formally sanctioned.  An effective solution I’ve recently come across is the following: simply refuse to have anything to do with their work.  If you’re asked to review one of their papers, decline and let the journal know that you don’t feel you can ethically review their work.  If you’re asked to review one of their grant applications, do the same.  If you’re invited to speak in the same session as them at a conference, decline and let the organisers know your reasoning.

On this topic, one thing that amazes me is the fact that most universities don’t do exit interviews with departing staff members.  If you want to test a hypothesis (e.g. there is or isn’t a problem with workplace bullying in a given group or department) you need to gather data.  Simply relying on a passive monitoring system (i.e., self-reporting of incidents by victims) is nothing short of a deliberate strategy to avoid seeing the problem.


Copula models for astronomical distributions

Yesterday I read through this new arXival by an old friend from my ETH Zurich days, which presents a package (called LEO-py) for likelihood-based inference in the case of Gaussian copula models and linear regressions with missing data, censoring, or truncation.  I never quite understand the demand for astronomer-specific expositions and software on topics like this, since as soon as one understands what a hierarchical Bayesian model is and how to code one up in a standard statistical programming language like Stan or JAGS, the world is your oyster.  (Indeed, back when we were at ETH, an errors-in-variables logistic regression model for predicting the barred galaxy fraction as a function of noisily estimated stellar mass was one of my first forays into this field; of course it never saw the light of day because my supervisor at the time—she who shall not be named—was completely opposed to any statistical methods beyond ordinary linear regression!)  The key contribution here, to my mind, is rather the emphasis on copula models, which are certainly under-utilised in the literature.  If this package helps popularise copulae (copulation?) that will be a very good contribution.
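For anyone wondering what the Gaussian copula buys you, the core building block is small enough to sketch in a few lines (my own illustration, not LEO-py’s API): push each margin through its CDF, map to normal scores, evaluate a correlated Gaussian density there, and divide out the independent margins.

    import numpy as np
    from scipy.stats import norm, expon, multivariate_normal

    def gaussian_copula_logpdf(x1, x2, m1, m2, rho):
        # joint log-density of (x1, x2) with margins m1, m2 coupled by correlation rho
        z = np.array([norm.ppf(m1.cdf(x1)), norm.ppf(m2.cdf(x2))])
        cov = np.array([[1.0, rho], [rho, 1.0]])
        log_copula = (multivariate_normal(mean=[0.0, 0.0], cov=cov).logpdf(z)
                      - norm.logpdf(z).sum())
        return log_copula + m1.logpdf(x1) + m2.logpdf(x2)

    # e.g. an exponential (luminosity-like) margin coupled to a Normal margin
    print(gaussian_copula_logpdf(1.2, 0.3, expon(scale=2.0), norm(0.0, 1.0), 0.6))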

Note: Whenever I think of truncated astronomical data analysis problems I’m reminded of the example (described in JS Liu’s Monte Carlo Strategies in Scientific Computing) of a permutation test for doubly truncated (redshift, log-luminosity) data developed by Efron & Petrosian (1999).


Realizing the potential of astrostatistics and astroinformatics

I had a quick read of the new whitepaper from which this post borrows its title.  The main recommendation is for greater funding towards astrostatistics/informatics education, which I would generally support.  Except that I feel one of our main barriers to progress in the field is the under-valuing of deep scholarship in applied statistics (and applied mathematical/numerical methods generally): if I imagine what more astrostatistics education looks like, it’s presumably going to be giving a wide audience a very basic introduction to Bayesian inference, showing them how to run an MCMC code and a nested sampling code, then sending them on their way.  What I’d like to see is more focussed funding on advanced methods: e.g. paying tens of PhD students per UK-sized country per year an extra stipend to allow them to focus for one calendar year solely on taking an advanced course in (one of) time series modelling, foundations of probability theory, stochastic processes, etc.  More challenging is to effect a cultural shift such that hiring decisions become more meritocratic and less nepotistic.  Some of the astrostatisticians I’ve seen leave the field for want of viable employment opportunities did everything that one would expect for a tenured position: developed new astrostatistical methods, published highly used astrostatistical open-source software, made applications to challenging applied astrostatistical problems, published a bunch of highly cited papers.  Meanwhile, some of the actual morons you see get tenure, even at top ranked universities … urgh!

I also read a paper on a Bayesian approach to distinguishing signals from noise in gravitational wave detector data.  After spending the time to read it I realised that I’d accidentally allowed through noise as if it were signal: the method presented sidestepped all the interesting challenges of this problem just to show that Bayesian model selection in a simple, well-specified scenario works pretty well.  The obvious challenge for BMS in the GW setting is with regard to introducing an effective noise model that is flexible enough to adapt to the glitch distribution while allowing effective population-level inference of faint GW signals at an efficient learning rate.  One could think of a mixture of a semi-parametric glitch model and a parametric quasi-coherent glitch model, with the former acting something like a Bayesian bootstrap, thereby avoiding throwing the baby out with the bathwater.

Edit: I should also add that I quickly skimmed another paper by the same group on the efficiency of parallel nested sampling, which mentioned earlier work of theirs on reweighting chains under efficient reference likelihoods.  The latter is an approach I’ve been advocating for some time, and the original idea goes back to even before Hastings (1970).  But at this stage it doesn’t look like many people have realised that one can also use a reference distribution other than the prior for nested sampling: e.g. one can run a reference distribution approach and, when its effective sample size is small, improve on it by nested sampling.  For instance, start from a “prior” that is a mixture of the real prior and a parametric approximation to the reference likelihood posterior; a toy illustration of why this leaves the inference untouched is sketched below.
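The identity that makes this work is simply that one swaps the prior for the reference density and folds the ratio prior/reference into the likelihood; the product, and hence both the posterior and the evidence, is untouched.  A toy numerical illustration (my own, with a deliberately over-dispersed prior) of the variance reduction on offer:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    prior = norm(0.0, 5.0)                          # the real prior
    like = lambda t: norm(1.0, 0.3).pdf(t)          # toy likelihood
    approx = norm(1.0, 0.35)                        # cheap posterior approximation
    ref_pdf = lambda t: 0.2 * prior.pdf(t) + 0.8 * approx.pdf(t)   # mixture reference

    n = 50_000
    t_prior = prior.rvs(size=n, random_state=rng)
    pick = rng.random(n) < 0.2                      # draw from the mixture reference
    t_ref = np.where(pick, prior.rvs(size=n, random_state=rng),
                     approx.rvs(size=n, random_state=rng))

    Z_naive = like(t_prior).mean()                                    # E_prior[L]
    Z_ref = (like(t_ref) * prior.pdf(t_ref) / ref_pdf(t_ref)).mean()  # E_ref[L * pi / ref]
    print(Z_naive, Z_ref)     # both target the evidence; the second has far lower variance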
