Banf & only Banf …

Thanks to Shane Walsh for the terrible joke (on the theme of iff = if and only if) in the title.

This week I’m in Banff for the “Validating and Expanding Approximate Bayesian Computation Methods” workshop at the BIRS conference centre.  If you ever have the opportunity to attend a conference here I can thoroughly recommend it: amazing scenery and wildlife (I’ve seen elk wandering the campus; another participant has seen a grey wolf!) and great facilities (all meals and accommodation provided free on site; free pool, gym, climbing wall, etc.).  The discussion in this particular workshop has been very interesting so far, a key theme being whether or not posterior regression adjustment is a good idea (for sure according to asymptotic theory when the model is well specified; see Li & Fearnhead) or a bad idea (for sure according to asymptotic theory when the model is mis-specified; talk by David Frazier).

All the talks will be available online here; mine (reviewing some motivations for ABC in astronomy and epidemiology) is already online here.  (On my Mac I have to ‘download’ the video to make it play in my browser.  There’s also a break in the middle of my talk when we left to take a group photo: feel free to fast forward, or else listen to see if anyone is caught making candid statements when they thought the camera was switched off!)

Adjusting the p-value when performing a likelihood ratio test on a random effect …

While reading Justice Aheto’s paper (mentioned in my last post) on modelling childhood malnutrition (in Africa, as opposed to what goes on at a Versace fashion show … bazinga!) I fell into a rabbit hole of reading on the topic of maximum likelihood asymptotics for constrained parameter spaces.  In Justice’s paper the model is an ordinary linear regression but (possibly) with a source of additional noise operating on distinct groups of the observations, i.e.:
$Y_{ij} = x_{ij}^\prime \beta + H_j + \epsilon_{ij}$
where $x_{ij}^\prime \beta$ is the product of design matrix and slope coefficients, $\epsilon_{ij} \sim N(0,\sigma_\epsilon^2)$ is the usual noise term operating on each individual datapoint, and $H_j \sim N(0,\sigma_H^2)$ is the group-level noise.  In this epidemiological example the individual datapoints represent real individuals (children in the study) while the groupings represent households: the logic being that some of the factors driving malnutrition not captured by the design matrix may well act on all members of a household (e.g. all the children in the household suffer in the same way from its proximity to a swamp full of mosquitoes).  In astronomy this type of problem can (and does, thanks to an example from Marina Trevisan) occur when galaxies in a study come grouped into clusters: some of the cluster properties might be available as group-level covariates (like halo mass), but they won’t capture all the information about the cluster’s assembly history that might affect the properties of the galaxies it contains.
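For concreteness, here’s a minimal sketch of simulating from this random-intercept model (the dimensions and parameter values below are entirely made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dimensions: 50 households ("groups"), 4 children per household
n_groups, n_per = 50, 4
n = n_groups * n_per
sigma_eps, sigma_H = 1.0, 0.5     # individual- and household-level noise scales
beta = np.array([2.0, -1.0])      # intercept and one slope coefficient

# Design matrix: an intercept column plus one covariate per child
x = np.column_stack([np.ones(n), rng.normal(size=n)])
group = np.repeat(np.arange(n_groups), n_per)   # household index j for each child i

H = rng.normal(0.0, sigma_H, size=n_groups)     # household-level noise H_j
eps = rng.normal(0.0, sigma_eps, size=n)        # individual noise eps_ij
y = x @ beta + H[group] + eps                   # Y_ij = x_ij' beta + H_j + eps_ij
```

The key design choice is that `H` is drawn once per household and then shared by all of its children, which is exactly what induces the within-household correlation the fixed-effects-only model lacks.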

The line in Justice’s paper that caught my eye was in regard to performing a likelihood ratio test (LRT) to decide in a frequentist sense whether this group-level noise term should be preferred ($\sigma^2_H > 0$) over the null model ($\sigma^2_H=0$).  The quote: “To assess the significance of the household-level random effect, we divided the p-values from the LRT by two because the null hypothesis that $\sigma^2_H=0$ is on the boundary of the parameter space [citation to Self & Liang 1987]”.  That is, since the null model lies at the smallest possible value of $\sigma^2_H$ allowed (negative variances aren’t allowed!) the decision threshold of the test statistic needs some adjusting.  At face value this made sense to me, but since I rarely (if ever) find myself using the likelihood ratio test I was certainly intrigued.

Following up the references one finds that Self & Liang (1987) is a short but densely-packed theoretical paper offering proofs of the consistency and convergence in distribution of the MLE for these bounded-domain cases; mostly by setting up regularity conditions on the maximum likelihood estimator (boring) and the topology of the boundary in the neighbourhood of the null parameter value (interesting) in order to draw on earlier work by Chernoff (1954).  All of which gives theoretical rigour to the intuition that the limiting distribution for the MLE will be (perhaps multivariate-) Normal (as usual) but truncated/projected somehow on the domain of interest.  The topological conditions introduced (that the boundary in the neighbourhood of the null can be approximated by a cone) ensure that the nature of this truncation/projection can be made clear.

The first reason I fell down a rabbit hole is that the definition of an approximating cone they give wasn’t clear to me: in particular, because they use the little-o notation, which I assumed referred to a limit at infinity but in fact must refer to a limit at zero.  Maybe that’s obvious to their primary audience, but it confused me and sent me on a long googling of stack exchange answers!  The second reason I fell down a rabbit hole was that during my stack exchange reading I ended up on Geyer (1994) (his M-estimator paper, not the reverse logistic regression one), which points out that the cone condition given by Self & Liang is in fact not sufficient for their proof and that they should have insisted on something called Clarke regularity.  Trying to understand the differences between these two conditions and the pathological examples where one or the other breaks down (Geyer’s Section 6) was the second rabbit hole.  Reader beware.

All that to say, there is a simple way to compute critical boundaries and p-values for these cases: Justice Aheto describes the most common one for a single parameter on the boundary (divide your p-value by two: it turns out that the distribution under the null is a 50:50 mixture of a point mass on zero and a chi-squared with one degree of freedom), but for more details there’s the excellent applied statistics description of the problem by Stram & Lee (1994).

Edit/Note: There’s also a ‘typo’ in Geyer’s explanations for the ordinary and derivable tangent cones: the convergence statement is repeated verbatim for each one, so obviously there’s something wrong: the one for the derivable cone should include that the sequence of $x_n$ admits a suitable gradient definition on a closed ball around the critical point.

Astro ph round up …

Some brief notes on papers I digested from the recent crop of astro ph submissions:

(1) Cibrika et al.’s CODEX Weak Lensing: Concentration of Galaxy Clusters at z∼0.5: A study of the halo mass–concentration relation with a stacked weak lensing analysis.  As input to their likelihood function the authors estimate the precision matrix for a multivariate error term across radial bins designed to account for various noise sources in the stacking, such as intrinsic profile differences at fixed mass.  In particular, a bootstrap procedure is used to produce an estimate of the covariance matrix, to which is applied the de-biasing correction noted by Hartlap et al. (2007), because this estimator for the covariance matrix is seen to be noisy and known to be biased (though consistent).  My concern would be along much the same lines as Sellentin & Heavens (2015), in the sense that if the covariance matrix estimator is noisy then so too is the precision matrix estimator, and its uncertainty should be carried through into the Bayesian analysis stage.  I would also note that the de-biasing formula used is only for the case of independent observations, an assumption violated by the bootstrapping approach (as pointed out in Hartlap et al.’s example).
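For reference, the Hartlap et al. de-biasing is just a scalar multiplier on the naive inverse of the sample covariance; a quick sketch (with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up setup: n_sim independent realisations of a p-dimensional data vector
n_sim, p = 100, 10
draws = rng.normal(size=(n_sim, p))

# Unbiased sample covariance, then the naive (biased) precision estimate
cov_hat = np.cov(draws, rowvar=False)
prec_naive = np.linalg.inv(cov_hat)

# Hartlap et al. (2007) de-biasing factor; note the derivation assumes
# independent realisations, which bootstrap resamples are not
hartlap = (n_sim - p - 2) / (n_sim - 1)
prec_debiased = hartlap * prec_naive
```

The factor shrinks towards zero as p approaches n_sim, which is a reminder of how badly behaved the inverse sample covariance becomes when the number of realisations is not much larger than the dimension.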

(2) White et al.’s Prospects for Distinguishing Dark Matter Models Using Annual Modulation: A prospective analysis of the potential for direct detection experiments to distinguish particular dark matter models.  This is a good example of sanity checking Bayesian model selection approaches through calibration of frequentist style Type I/II error rates with mock datasets a la Jenkins & Peacock (2011).

(3) El Badry et al.’s The statistical challenge of constraining the low-mass IMF in Local Group dwarf galaxies: Similar to the above but looking at parameter estimation and model selection via the BIC for distinguishing log-Normal IMFs from power-law IMFs; cautions against drawing conclusions where the data just aren’t sufficient to support a strong conclusion either way: more-or-less where the presence or absence of a turnover just isn’t identifiable at the depth of the available imaging.

Some non-astro statistics:

– A recent episode of BBC Radio 4’s More-or-Less podcast featured work by Peter Diggle’s group at Lancaster in much the same vein as what I do at the Malaria Atlas Project, but in this case for childhood malnutrition.  Hopefully I will hear more about this paper by Justice Aheto at the (yet to be advertised) Lancaster Workshop on Disease Mapping in September. (Not to be confused with the Spatial Statistics 2017: One World, One Health conference also held at Lancaster this year, but in July!).

Some might say it’s bad form to blog about one’s failed job applications, you know, keep a stiff upper lip and all that.  On the other hand, to do so is deeply cathartic and may give others a smile when thinking back on their own rejections.  So here goes.

As I note on the ‘About Me’ page of this blog, I’m very happy in my work post-astronomy but I do hold some interest in the possibility of returning should the right opportunity to make a difference in the field of astrostatistics present itself.  By ‘right’ I mean two things: that the department has a genuine enthusiasm for astrostatistics (whether in terms of research, teaching, or both), and that the salary is sufficient to provide a reasonable standard of living.  With this in mind I decided to send in an application for an advertised position as lecturer in astrostatistics at the University of Cambridge.

To be fair, the salary attached to this faculty position (£38,896-£49,230) was woeful, which might signal to a more perceptive potential applicant than myself that the sought candidate would be a junior researcher.  That said, one might also find it hard to imagine a junior researcher coping well with the responsibility of lecturing students at a high level in a complex discipline encompassing elements of both astronomy and statistics.  Anyway, since I’m happy with my current work at the University of Oxford in the MAP team—and in particular since I’m rather curious about the opportunities at our new ‘Big Data’ institute, which we’ll move into in February of this year—I felt I could afford to state up front in my application letter that I couldn’t accept an offer below £55,000.  If that’s what sunk my application then I have no regrets about asking, but I would then be sad for potential students that the university would have such a miserly attitude towards the staff they employ to provide tuition.

Regarding my other (not-entirely-unrelated) condition—that the department has a genuine enthusiasm for the teaching of astrostatistics—I do also wonder.  On the one hand I’ve seen their past astrostatistics course plan and it’s pretty damn softball from a statistical point of view.  That is, there’s no mention in the syllabus of (and therefore I assume no time given in the course itself to) any substantive foundations for the probability theory upon which modern statistics is built: no measure theory, no Kolmogorov’s axioms, no probability triplets and random variables, no concepts of distributional convergence, no characteristic functions, etc. etc.—and then nothing for modern computational statistics beyond MCMC samplers: no importance sampling, no particle filters / SMC / population Monte Carlo, no samplers for complex stochastic processes (GPs, Dirichlet process, Markov Jump process, etc.) etc.  Instead it’s just the quick set of basics for getting familiar with a couple of statistical software packages; the same kind of course I see in fields that astronomers might otherwise view as being for the less mathematically gifted: zoology, social science, psychology.  While they do have some dudes at Cambridge who do strong work in astrostatistics, like Mike Hobson and Lindley Lentati, I might guess that they are not the ones steering the ship here.

Anyway, whatever, their loss …

Yeah, that felt good.

ps. I’m only assuming that I’m rejected because the job ad said they’d let successful candidates know by late December; if in fact I’m not rejected that would be rather amusing.
If anyone’s still reading here’s a little piece from my application regarding my thoughts on the importance of a strong astrostatistics program.

The development and application of complex statistical methodology has a long history in astronomical research, from Le Verrier’s predictions via perturbing functions for the position of Neptune to the theory & simulation of random fields in cosmology (e.g. Bardeen et al. 1986; Kamionkowski et al. 1997)—and more recently, the probabilistic detection of transiting exoplanets amongst noisy stellar light curves (e.g. Ford et al. 2007), and the search for gravitational waves via pulsar timing arrays (Cornish & Sampson 2016). Astronomical problems have also served as inspiration for a host of important statistical investigations; see, for instance, the role of Postman, Huchra & Geller’s (1986) ‘galaxy dataset’ in studies of Normal mixture models (Roeder 1990; Richardson & Green 1997), or the role of Efron & Petrosian’s (1998) ‘quasar dataset’ in the study of restricted permutations (Diaconis, Graham & Holmes 2001; Liu 2001). Despite this rich history, the cross-over of statistical methodology between astronomy and statistics (and vice versa) has been disappointingly limited to date.

Innovations from astronomy, such as the method for conditional simulation of Gaussian processes identified by Hoffman & Ribak (1991), are left to be rediscovered independently by statisticians (see Doucet 2010), while algorithms first applied to, and largely developed on, astronomical problems, such as nested sampling (Skilling 2005, Mukherjee et al. 2006; subsequently MULTINEST and POLYCHORD; Feroz & Hobson 2008, Handley et al. 2015), are often ignored by the statistics community (modulo Chopin & Robert 2010 in the case of nested sampling). On the other hand, many modern statistical techniques—for example, particle filtering & sequential Monte Carlo schemes (Del Moral et al. 2006), semi-parametric Bayesian algorithms based on the Dirichlet process (Escobar & West 1995; Roy & Teh 2009), and SPDE-based approximations for mapping continuous GPs to discrete GMRFs with sparse precision matrices (Lindgren et al. 2010, Rue et al. 2008)—are yet to be adopted by the astronomical community (with few exceptions: e.g. basic SMC samplers for marginal likelihood estimation along the thermodynamic path; Kilbinger et al. 2010).

Behind this paucity of knowledge exchange between the two communities is a lack of training: few astronomers receive any statistical tuition during either their undergraduate or post-graduate degrees. Instead they will typically learn on the job while re-coding the old algorithms from a supervisor’s paper, or while flipping through NUMERICAL RECIPES upon facing a problem that seems not to be solvable by a ‘chi-squared fit’. More recently, they may receive some further instruction from a self-styled ‘astro-statistician’, a well-meaning post-doc who has managed to digest a portion of Gelman et al.’s BAYESIAN DATA ANALYSIS or Jaynes’ PROBABILITY THEORY: THE LOGIC OF SCIENCE but has very likely not found the motivation to delve deeper or to take an active interest in the statistical literature. Indeed, usually the latter two will go hand-in-hand: how is one to understand, say, Rao & Teh’s (2013) algorithm for posterior simulation from Markov Jump processes (perhaps with which to model state switching in variable X-ray sources) without the measure theory to tackle their description of the Borel sigma-algebra over paths? Or to phrase the description of the astronomers’ ‘path-integral evidence’ approach (Kitching & Taylor 2015) in a manner readily parsed by statisticians?

My vision is for the establishment of an ‘astro-statistics’ program (used here in the best sense to mean a genuine fusion between the astronomy and statistics departments) of teaching and research, in which astronomical problems motivate new statistical inquiries and new statistical methodologies reveal new possibilities for astronomical study.

Reconstructing an X-ray luminosity function from noisy data …

Last week I visited Basel (again!) to give an overview talk, entitled ‘a quantitative history of malaria in maps’, at the STPH Winter Symposium.  The subsequent speakers were all very interesting and I learnt a lot about new vector control methods (e.g. eave traps) and the search for new anti-malarial compounds by brute-force testing of 20,000 organic compounds from the Novartis ‘library’.  Back in Oxford now with a spare five minutes to mention a paper I read a couple of weeks ago …

In this arXival by Aird et al. the authors use a Bayesian model to reconstruct the X-ray luminosity functions of various populations of star-forming galaxies.  The challenge is that the observed X-ray emission from the pointing to each galaxy is a mix of emission from a background of uncertain intensity and perhaps some photons from the target itself.  The galaxy X-ray luminosity function the authors aim to estimate thus appears as a latent function sitting in their hierarchical model below the top level of the observed-data likelihood, by which it is in places only weakly constrained.  The model adopted for this latent function is similar to that used by Warren et al. in their solar corona study (which I’ve blogged about here recently): namely, a mixture of regular functions (here gamma densities) spaced evenly on a grid.  (Similar also to the NPMLE mixture model set-up of Feng & Dicker 2016.)  The prior on their weights is taken to be something like a moving average model with a sum-to-one constraint.
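To illustrate the flavour of such a latent-function representation (this is my own toy sketch with arbitrary values, not the authors’ actual parameterisation):

```python
import numpy as np
from scipy.stats import gamma

# Made-up grid of gamma-density components for the latent function
n_comp = 8
shapes = np.linspace(2.0, 16.0, n_comp)   # shape parameters "spaced evenly on a grid"

# Mixture weights with the sum-to-one constraint (arbitrary scores here;
# in the real model these would carry a smoothing prior and be inferred)
scores = np.linspace(1.0, -1.0, n_comp)
weights = np.exp(scores) / np.exp(scores).sum()

# The modelled function is the weighted sum of the fixed components;
# inference targets the weights, not the components themselves
L = np.linspace(0.1, 40.0, 1000)
phi = sum(w * gamma.pdf(L, a) for w, a in zip(weights, shapes))
```

The appeal of this construction is that flexibility lives entirely in the low-dimensional weight vector, while each component remains a perfectly regular density.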

Importantly, the authors recognise that the success of this type of Bayesian inversion problem is ultimately heavily dependent on the suitability of the chosen prior and model structure for the situation at hand, which they therefore test extensively with mock datasets drawn from a range of plausible luminosity functions outside their prior class of functions.  While this raises some theoretical questions as to how one can hold a prior that doesn’t encompass all of one’s prior belief about the possible set of luminosity functions one might find in the wild, in practice this is a very common situation: one often has to choose a model that can be readily mathematically described and fit to the data, and simply remember going forwards that it is indeed only a model.  Hence, I’m very much in favour of such model testing procedures.

More hyper parameters for softening likelihoods in cosmological inference …

I noticed this paper, “Determining $H_0$ with Bayesian hyper-parameters”, on astro ph today by Cardona et al. The problem is one that arises often in astronomy whereby the observed data are assumed to have a Normal likelihood function in which the given standard deviation for each datapoint may in fact be ‘not quite right’. Sometimes (and indeed in today’s paper) this is called a ‘systematic error’, although I would disagree with that terminology because for me a systematic error is one that can’t be overcome simply by observing more of the same type of data. The motivating examples given are where the quoted standard deviations are correct in a relative sense but their absolute scale is incorrect (as per Lahav et al. 2000), or where most of the quoted standard deviations given are correct but there is a small population of extreme outliers not drawn from the supposed Normal (my suggested citation for this would be Taylor et al. 2015). For the former scenario the previously given solution was to introduce a Jeffreys prior on the unknown scale and to marginalise this out explicitly, while for the latter the previously given solution was to introduce a mixture model with latent variables representing labels for ‘good’ and ‘bad’ datapoints.

Interestingly though, in this case Cardona et al. have decided to use neither of the previous approaches and to instead suppose each quoted standard deviation is to be divided by the square root of its own (latent) independent uniform random variable, i.e., $\sigma_i^\mathrm{new} \leftarrow \sigma_i^\mathrm{old}/\sqrt{\alpha_i}$ for $\alpha_i \sim U(0,1)$. To me this seems like a strange choice because now the softening parameters, $\alpha_i$, on each likelihood contribution can only influence each other weakly through the product of likelihoods and not more directly through a hyperprior structure. In the Lahav et al. version they directly influence each other through the shared absolute scale parameter, while in the Taylor et al. version they influence each other not quite as directly through learning the proportion of data that is ‘bad’.  In principle one can get close to the Lahav et al. version in the Cardona et al. framework by introducing a beta distribution, $\alpha_i \sim \mathrm{Beta}(\beta_1,\beta_2)$ with $\beta_1,\beta_2$ shared, and the Taylor et al. version could be approximated by a mixture of two beta distributions.  And indeed one can imagine even more sophisticated schemes based on the Dirichlet process or Dirichlet process mixtures, as we explored in Cameron & Pettitt (2013).  Certainly this type of hyper-parameter problem in astronomy requires a prior-sensitivity analysis to properly understand how the assumed error model influences the conclusions.
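One can see the effect of this softening scheme by marginalising a single $\alpha_i$ out numerically for one datapoint; a quick sketch (my own illustration of the $\sigma_i/\sqrt{\alpha_i}$ construction, not the authors’ code):

```python
import numpy as np
from scipy import integrate, stats

def softened_marginal(y, mu, sigma):
    """Per-datapoint likelihood with sigma -> sigma / sqrt(alpha),
    alpha ~ U(0,1) marginalised out by numerical integration."""
    integrand = lambda a: stats.norm.pdf(y, mu, sigma / np.sqrt(a))
    val, _ = integrate.quad(integrand, 0.0, 1.0)
    return val

# At the centre of the Normal the softening deflates the density by a
# factor of exactly 2/3 (the mean of sqrt(alpha) for alpha ~ U(0,1)) ...
plain = stats.norm.pdf(0.0, 0.0, 1.0)
soft = softened_marginal(0.0, 0.0, 1.0)

# ... while in the tails the marginalised density is heavier than the
# un-softened Normal, which is what absorbs apparent outliers
tail_plain = stats.norm.pdf(4.0, 0.0, 1.0)
tail_soft = softened_marginal(4.0, 0.0, 1.0)
```

So each datapoint individually buys itself a heavier-tailed error model; the point made above is that, without a shared hyperprior on the $\alpha_i$, the datapoints do this essentially independently of one another.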

Xtreme deconvolution …

A sociological observation segueing into a note on Drton & Plummer.

I noticed on today’s astro ph a paper on ‘extreme deconvolution’, which is an astro-statistics term for fitting a Normal mixture model to noisy data; I’m not sure if the technique is extreme per se or if it needs to be applied to a large dataset (as in a pioneering example) to properly garner the complete appellation.  In this instance the application envisaged (and demonstrated with a small example) is the modelling of the multivariate distribution of supernovae in the SALT2 dataset, which I believe to be on the order of 800 objects.

So, the sociological observation I have is that the new generation of astronomers—let’s call them Gen Y, while defining my 1981 birth year as the (left closed, right open) boundary of Gen X—seem to view their wrapper scripts for running functions as a worthwhile contribution to the published literature.  That is, the paper at hand doesn’t present a new algorithm or methodology for performing extreme deconvolution, or add any novel contribution to thinking about extreme deconvolution methodologies in astronomy; rather, it simply describes the authors’ wrapper script for running either of two existing extreme deconvolution models and computing conditional densities from their output.  The latter is itself simply an application of the well-known rules for manipulating multivariate Normals, which are not in any sense difficult to implement—and certainly not to the extent that I could imagine anyone seeking a third-party application rather than just looking up wikipedia for themselves.  But I guess if editors are happy to publish it then c’est la vie.

Another thing that surprised me about this paper was that two methods for selecting the number of mixture components are presented—the BIC and a cross-validation-based mean log-likelihood method—but the motivations of each are barely discussed and nothing is mentioned of the (order 1) nature of these approximations (especially so when we’re in the extreme deconvolution case and the observational uncertainties mean that n is not quite n as envisaged by Schwarz). Moreover, it’s observed that on the toy example the BIC points towards 5 components and the cross-validation method towards something greater than 10, but the reader is advised to use the BIC anyway because it’s too expensive to add lots of components.  So, uh, yeah.
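For what it’s worth, in the noise-free (i.e. non-‘extreme’) case the BIC comparison is easy to reproduce with a hand-rolled EM; a toy sketch (my own illustration on simulated data, not the authors’ code):

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated data: two well-separated Gaussian components
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])

def fit_gmm_1d(x, k, n_iter=200):
    """Plain EM for a 1-D Gaussian mixture (no per-point measurement noise,
    so NOT the extreme-deconvolution case); returns (loglik, BIC)."""
    n = x.size
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    sd = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, standard deviations (with a floor on sd)
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.maximum(np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-3)
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    loglik = np.log(dens.sum(axis=1)).sum()
    n_params = 3 * k - 1          # (k-1) weights + k means + k standard deviations
    return loglik, n_params * np.log(n) - 2.0 * loglik

bic = {k: fit_gmm_1d(x, k)[1] for k in (1, 2, 3)}  # lower BIC preferred
```

In the extreme deconvolution setting the E-step would additionally convolve each component with the per-point observational error, which is precisely where the effective n underlying Schwarz’s approximation becomes murky.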

Having got that off my chest, I thought it worth pointing out the existence of the recently ‘read’ (at the RSS) Drton & Plummer paper looking at the problem of model selection for the case of nested singular models, of which Gaussian mixtures (sans observational noise) are one of the given examples.  In particular, the authors offer a strategy for applying Watanabe’s method to this problem that side-steps the self-defeating requirement of knowing in advance which is the true model, but does not side-step the requirement to be able to identify the ‘learning rate’ of each model.