## Bayesian inference from a fixed set of simulations …

I noticed this recent paper on the arXiv by Patel et al. entitled, “Orbits of massive satellite galaxies – II. Bayesian Estimates of the Milky Way and Andromeda masses using high precision astrometry and cosmological simulations”, in which the authors consider an inference problem on the periphery of current ABC & pseudo-marginal methods.  In this case the authors are interested in estimating the virial mass of the Milky Way given observational estimates of some of the orbital elements (proper motion, radial velocity, etc.) with respect to Andromeda.  These data enter their model as (independent) Normal likelihoods for the observations given the latent (‘true’) values of those observables; the latter being imagined drawn from a prior which is represented by the drawing of Milky Way-like galaxy groups from a cosmological N-body simulation.  The catch here is that the simulation is expensive and is not able to be run directly by the authors, so all inferences must be made from a pre-compiled catalog containing a finite number of mock galaxy groups drawn from this prior.

This looks at face values like an ABC or pseudo-marginal problem in the sense that there is a reliance on mock datasets, but in fact there are (ignoring cosmological parameters, which are held fixed) no free parameters in the simulations: all the simulations are there for is to represent the prior on the latent variables of Milky Way-like galaxy groups: virial masses and orbital elements.  For the present paper the authors approximate the posterior for the Milky Way virial mass using an importance sample re-weighting scheme formed by weighing each mock galaxy by the likelihood of observing a subset of its orbital elements.   In one set of experiments these are the the satellite’s maximum circular velocity, its separation from the host, and its total velocity today relative to the host galaxy; and in another set of experiments these are the satellite’s maximum circular velocity and the magnitude of its orbital angular momentum.

In this scheme the posterior is behaving much like a kernel-density estimate in which the kernel acts on distance between the observed and latent orbital elements, using these to predict the virial mass.  As one adds (informative) observations the number of mock galaxy groups contributing non-trivially to the estimate decreases such that although the posterior estimate will typically contract it be also be noisier.  Indeed the authors recognise this effect and consider some of their decisions with respect to the effective sample size, although primarily their investigation here is in terms of sensitivity to the definition of a ‘Milky Way-like’ group (which is a hard threshold on the mock galaxy group catalog I’ve glossed over in the above).

Imaging the extension of this inference procedure to further applications, it would be helpful to have a general framework for deciding which orbital elements and other observational data to use and which to leave out.  Such a framework would need to balance the information gain of each possible orbital element against the reduction in effective sample size over the available simulations; and would presumably be based on cross-validation estimates of bias and coverage for mock galaxy groups in the simulation sample.  This is of course a very similar problem to summary statistic selection in ABC analyses and I’m particularly reminded of the paper by Nunes & Balding (2010) in which those authors run a preliminary ABC analysis with unrefined summary statistics to identify the important region of parameter space over which the summary statistics should be optimised.

Another way to attack this problem would be via a flexible parametric model for the prior as in the RNADE-based scheme of Papamakarios & Murray (2016).  Indeed this looks very similar to the example problem described (no slides available as far as I can see) by Iain Murray at the MaxEnt 2013 conference in Canberra.

## Vale Joe Hilbe (1944-2017) …

I had the sad news this week of the passing of Joseph Hilbe, an important statistician and educator in the field of regression models for count data and an enthusiastic advocate of astro-statistics as an emerging discipline.  Emilie (my wife) & I met Joe three years ago at the IAUS 306 (Statistical Challenges in 21st Century Cosmology) in Lisbon where he helped to establish the COIN (Cosmostatistic Initiative) project which in turn led to a series of three papers on the Overlooked Potential of Generalised Linear Models for Astronomical Data Analysis (1 de Souza et al., 2 Elliott et al., 3 de Souza et al.).  It was actually a lucky coincidence that Joe’s interests in GLMs aligned with the concerns a number of us had been having about how often we’d seen count data mangled into bins for astronomical regression studies, but I suspect he would have found a way to steer us on to his favourite topic in any case!  On a personal level I found Joe to be a good conversationalist, although I only learnt of his track and field exploits through wikipedia (so it seems he was a modest guy too).  Even in his 70s Joe was in high demand from various groups and companies looking for statistical analysis help, and one regret I have is that I never had the time to take on some of the jobs he was hoping to offload on us younger statisticians (I suspect they would have be quite lucrative relative to my academic salary).  All that to say, Joe will be sorely missed and for those who haven’t read his books yet a good place to start is his new one coming out in June this year: Bayesian Models for Astrophysical Data.

## Intrigue in the neutrino kingdom …

I noticed this brief note on the arXiv today by Schwetz et al. highlighting some of those authors’ concerns with a recent arXival by Simpson et al. regarding Bayesian evidence for the normal neutrino hierarchy.  The question examined by Simpson et al.—and in a recent paper by Hannestad & Schwetz—is to what extent the available data from neutrino oscillation experiments and cosmological modelling should lead us to favour the normal hierarchy (NH) over the inverted hierarchy (IH) of neutrino masses: the former being $m_1 \leq m_2 \leq m_3$ and the later $m_3 \leq m_1 \leq m_2$.  The available neutrino oscillation experiments place constraints on the squared differences between two of the three possible mass pairings; these constraints being slightly different for the NH and IH models.  The cosmological data, on the other hand, place an upper bound on the sum of masses ($m_1+m_2+m_3 \leq 0.13$ eV) which is close to a lower limit on the sum of masses under the IH scenario of $0.0982$ eV.  Hence, it is felt that future cosmological experiments may well rule out the IH scenario; notwithstanding the authors’ present interest in precisely how likely should we consider the IH scenario for the time being.

The approach taken by Simpson et al. is to (1) set equal a priori odds on the NH and IH scenarios, (2) specify a generative model for the neutrino masses under each scenario, and then (3) form posterior Bayes odds using the marginal likelihoods with respect to the observed data from both the neutrino oscillation and cosmology experiments together.  Their models are given hierarchical priors such that for the NH scenario:

$\{\log m_1,\log m_2,\log m_3\} = {X_{(1)},X_{(2)},X_{(3)}}|{X_1,X_2,X_3}$ (order statistics)
for $i=1,\dots,3,\ X_i \sim \mathrm{Normal}(\mu,\sigma)$
$\log \mu \sim U(\log 0.001,\log 0.5)$ (eV)
$\log \sigma \sim U(-4,3)$ (reading numerical values off their figure labels).

Their prior for the IH scenario is the same except for the assignment of order statistics in the first line.  Because the sample space is small and likelihood evaluations are cheap they are able to compute posterior Bayes factors and the marginal posteriors in the hyper-parameters $\{\mu,\sigma\}$ via brute force (the arithmetic mean estimator in the mass space over a grid of hyperparameters).

All this seems pretty straightforwards, and indeed when I skimmed this paper (the Simpson et al. one) when it first appeared on the arXiv I didn’t think anything particular other than, “seems sensible, not relevant for blogging because the marginal likelihood was so easy to compute”.  If I understand Schwetz et al.‘s note correctly, their complaint translates (in my interpretation) to be that Simpson et al. set their prior odds on the models before considering the neutrino oscillation data.  By contrast, in their earlier (but still quite recent paper) Hannestad & Schwetz decided to begin from the neutrino oscillation experiment likelihoods, letting the posteriors from non-informative priors updated by those constraints be their priors for examining the cosmological data with equal prior odds of each scenario assigned after the neutrino oscillation experiment.  By the nature of their analysis, Simpson et al. must go further in specifying a prior on their neutrino masses prior to observing any data.  Schwetz et al.‘s objection is that this introduces an artificial weighting of the evidence from the neutrino experiment.

For sure the Simpson et al. approach will be sensitive to the priors adopted (like all Bayesian model selection problems) but it doesn’t seem to me to be a silly thing to do in the context of forming betting odds on the hypotheses from the totality of available data.  Further prior sensitivity analyses might be warranted and the nuances of the analysis could be more clearly explained but I personally would conclude that the Simpson et al. analysis is a worthwhile contribution to the debate.

## Pseudo-marginal MCMC for cosmological statistics …

Having been (re-)inspired in pseudo-marginal methods by Patrick Muchmore’s talk at the recent ABC in Banff conference I’ve started evangelising to cosmologists who might be interested in fits based on multivariate Normal (Gaussian) distributions with estimated covariance matrices.  For example, in the case of estimating cosmological parameters from the (empirical) two point correlation function of a galaxy survey.  Here, although the power spectrum of the cosmological field is given by the model parameters, the covariance matrix of the observable two point correlation function (in a given set of radial bins) cannot be directly computed due to (e.g.) the irregular survey window, variations of imaging depth across the field, missing data due to bright star cut-outs, redshift fibre collisions, and so forth.

In such circumstances a sensible solution is to use mock datasets to side step the intractable calculation of exact likelihoods.  Sometimes this will be done with full cosmological N-body simulations [Option A] which will give the most faithful representation of the random field but is extremely expensive.  Typically one will only have the means to produce $\lesssim 100$ independent N-body realisations; these can be subsampled and bootstrapped to generate estimates of the observational covariance matrix.  Other times the mock datasets will be constructed by directly bootstrapping the original dataset [Option B], though of course this approach cannot capture variability (cosmic variance) on scales beyond the survey footprint.  Yet another approach employing mock datasets is that of approximating the cosmological field as a log-Normal process [Option C] which brings fast Fourier methods for field simulation into play.  For the following discussion I am interested in Option C as perhaps the only computationally feasible method of these three with which to address the dependence of the observational covariance matrix on the true (unknown) cosmological parameters.

It is well known (thanks to Hartlap et al.) that inversion of the sample covariance matrix does not provide an unbiased estimator for the precision matrix, though a small correction factor allows for simple construction of an unbiased estimate of the precision matrix.  This in turn provides an unbiased estimate of the log-likelihood (of a multivariate Normal), but not of the likelihood itself.  Unfortunately, it can be an impossible job to convert an unbiased estimator taking values on the entire real line into an unbiased estimator for a non-negative statistic.  That said, there is a different set of estimators—mostly developed in the 1950s, 60s, and 70s; sometimes for the purpose of studying bombing accuracy—that target directly the multivariate Normal likelihood.  These have been identified by Price et al. and Muchmore & Marjoram as key ingredients for constructing pseudo-marginal MCMC algorithms on multivariate Normal targets.

The achievement of the pseudo-marginal MCMC algorithm is that in the unlucky case that one cannot exactly compute the likelihood function at a given set of parameters one can still construct an MCMC chain targeting the same Bayesian posterior as in the exact case as long as one has access to an unbiased estimate of the likelihood.  Just plug the unbiased likelihood estimate for each new proposal into the Metropolis-Hastings accept/reject step (and, importantly, save and re-use the likelihood estimate attached to each accepted state).  While the pseudo-marginal MCMC chain will always be noisier and stickier (and hence, demand running of a longer chain) than the equivalent MCMC algorithm using exact (non-noisy) likelihoods it will—quite amazingly!—nevertheless represent the same posterior.  This is not guaranteed to be the case for, e.g., a noisy MCMC algorithm built by plugging in the likelihood estimate corresponding to the Hartlap et al estimate for the unbiased precision matrix.  Likewise, for plugging in the marginalised t-distribution likelihood given by Selentin & Heavens or the likelihood corresponding to the shrinkage estimator of Joachimi.  (These latter two should rather be considered as methods relating to Options A and B mentioned earlier.)

The reason the cost of producing mock datasets is an important consideration for this approach is that at each proposal of the MCMC chain a new unbiased estimate has to be computed, and for a multivariate Normal likelihood this will usually mean generating more than one mock dataset.  For instance, the estimator used by Price et al requires at least $d+3$ datasets where $d$ is the dimension of the observed summary statistic (typically ~20-50 for an empirical two point correlation function).  The Muchmore & Marjoram estimator can deal with just one mock dataset but will be very noisy in that regime.  Tuning of the pseudo-marginal algorithm should focus on adjusting this number of mock datasets used, along with the proposal distribution, to target a sensible acceptance rate (see Andreiu & Vihola and Sherlock et al.).  Efficient use of parallel computing architectures would be worthwhile when using this algorithm in anger; even brute force approaches such as batching the $n$ mock datasets to be produced over the available cores at each proposal.

## Banf & only Banf …

Thanks to Shane Walsh for the terrible joke (on the theme of iff = if and only if) in the title.

This week I’m in Banff for the “Validating and Expanding Approximate Bayesian Computation Methods” workshop at the BIRS Conference centre.  If you ever have the opportunity to attend a conference here I can thoroughly recommend it: amazing scenery and wildlife (I’ve seen elk wandering the campus; another participant has seen a grey wolf!) and great facilities (all meals and accommodation provided free on site; free pool, gym, climbing wall, etc).  The discussion in this particular workshop has been very interesting so far: a key theme being whether or not posterior regression adjustment is a good idea (for sure according to asymptotic theory when the model is well specified; see Li & Fearnhead) or a bad idea (for sure according to asymptotic theory when the model mis-specified; talk by David Frazier).

All the talks will be available online here; mine (reviewing some motivations for ABC in astronomy and epidemiology) is already online here.  (On my Mac I have to ‘download’ the video to make it play in my browser; there’s a break in the middle of my talk when we left to take a group photo, feel free to fast forward or else listen to see if anyone is caught making candid statements when they thought the camera was switched off!).

## Adjusting the p-value when performing a likelihood ratio test on a random effect …

While reading Justice Aheto’s paper (mentioned in my last post) on modelling childhood malnutrition (in Africa, as opposed to what goes on at a Versace fashion show … bazingha!) I fell into a rabbit hole of reading on the topic of maximum likelihood asymptotics for constrained parameter spaces.  In Justice’s paper the model is an ordinary linear regression but (possibly) with a source of additional noise operating on distinct groups of the observations, i.e.:
$Y_{ij} \sim x_{ij}^\prime \beta + \mathbf{H_j} +\epsilon_{ij}$
where $x_{ij}^\prime \beta$ is the product of design matrix and slope coefficients, $\epsilon_{ij} \sim N(0,\sigma_\epsilon^2)$ is the usual noise term operating on each individual datapoint, and $H_j \sim N(0,\sigma_H^2)$ is the group-level noise.  In this epidemiological example the individual datapoints represent real individuals (children in the study) while the groupings represent households: the logic being that some of the factors driving malnutrition not captured by the design matrix may well act on all members of a household (e.g. all the children in the household suffer in the same way from its proximity to a swamp full of mosquitoes).  In astronomy this type of problem can (and does, thanks to an example from Marina Trevisan) occur when galaxies in a study come grouped into clusters: some of the cluster properties might be available as group-level covariates (like halo mass), but they won’t capture all the information about the cluster’s assembly history that might affect the properties of the galaxies it contains.

The line in Justice’s paper that caught my eye was in regard to performing a likelihood ratio test (LRT) to decide in a frequentist sense whether this group-level noise term should be preferred ($\sigma^2_H > 0$) over the null model ($\sigma^2_H=0$).  The quote: “To assess the significance of the household-level random effect, we divided the p-values from the LRT by two because the null hypothesis that $\sigma^2_H=0$ is on the boundary of the parameter space [citation to Self & Liang 1987]”.  That is, since the the null model lies on the smallest possible value of $\sigma^2_H$ allowed (negative variances aren’t allowed!) the decision threshold of the test statistic needs some adjusting.  At face value this made sense to me but since I rarely (if ever) find myself using the likelihood ratio test I was certainly intrigued.

Following up the references one finds that Self & Liang (1987) is a short but densely-packed theoretical paper offering proofs of the consistency and convergence in distribution of the MLE for these bounded domain cases; mostly by setting up regularity conditions on the MLE estimator (boring) and the topology of the boundary in the neighbourhood of the null parameter value (interesting) in order to draw on earlier work by Chernoff (1954).  All of which gives theoretical rigor to the intuition that the limiting distribution for the MLE will be (perhaps Multivariate-) Normal (as usual) but truncated/projected somehow on the domain of interest.  The topological conditions introduced (that the boundary in the neighbourhood of the null can be approximated by a cone) ensure that the nature of this truncation/projection can be made clear.

The first reason I fell down a rabbit hole is that the definition of an approximating cone they give wasn’t clear to me: in particular, because they use the little-o notation which I assumed referred to a limit at infinity but in fact must be referring to a limit at zero.  Maybe that’s obvious to their primary audience but it confused me and sent me on a long googling of stack exchange answers! The second reason I fell down a rabbit hole was that during my stack exchange reading I ended up on Geyer (1994) (his M-estimator paper not the reverse logistic regression one) which points out that the cone condition given by Self & Liang is in fact not sufficient for their proof and that they should have insisted on something called Clarke regularity.  Trying to understand the differences between these two and the pathological examples where one or other breaks down (Geyer’s Section 6) was the second rabbit hole.  Reader beware.

All that to say, there is a simple way to compute critical boundaries and p-value for these  cases: Justice Aheto describes the most common one for a single parameter boundary (divide your p-value by two: it turns out that the distribution under the null is a 50:50 mixture of a point mass on zero and a chi-sq with d.o.f of 1) but for more details there’s the excellent applied statistics description of the problem by Stram & Lee (1994).

Edit/Note: There’s also a ‘typo’ in Geyer’s explanations for the ordinary and derivable tangent cones: the convergence statement is repeated verbatim for each one, so obviously there’s something wrong: the one for the derivable cone should include that the sequence of $x_n$ admits a suitable gradient definition on a closed ball around the critical point.

## Astro ph round up …

Some brief notes on papers I digested from the recent crop of astro ph submissions:

(1) Cibrika et al.’s CODEX Weak Lensing: Concentration of Galaxy Clusters at z∼0.5: A study of the halo mass-concentration relation with a stacked weak lensing analysis. As input to their likelihood function the authors estimate the precision matrix for a multivariate error term across radial bins designed to account for various noise sources in the stacking such as intrinsic profile differences at fixed mass.  In particular, a bootstrap procedure is used to produce an estimate of the covariance matrix to which is applied the de-biasing correction noted by Hartlap et al. (2007), because this estimator for the covariance matrix is seen to be noisy and known to be biased (though consistent).  My concern would be along much the same lines as Selentin & Heavens (2015) in the sense that if the covariance matrix estimator is noisy, then so too is the precision matrix estimator and its uncertainty should be carried through into the Bayesian analysis stage.  I would also note that the de-biasing formula used is only for the case of independent observations, an assumption violated by the bootstrapping approach (as pointed out in Hartlap et al.’s example).

(2) White et al.’s Prospects for Distinguishing Dark Matter Models Using Annual Modulation: A prospective analysis of the potential for direct detection experiments to distinguish particular dark matter models.  This is a good example of sanity checking Bayesian model selection approaches through calibration of frequentist style Type I/II error rates with mock datasets a la Jenkins & Peacock (2011).

(3) El Badry et al.’s The statistical challenge of constraining the low-mass IMF in Local Group dwarf galaxies: Similar to the above but looking at parameter estimation and model selection via the BIC for distinguishing log-Normal IMFs from power-law IMFs; cautions against drawing conclusions where the data just aren’t sufficient to support a strong conclusion either way: more-or-less where the presence or absence of a turn over just isn’t identifiable at the depth of the available imaging.

Some non-astro statistics:

– A recent episode of BBC Radio 4’s More-or-Less podcast featured work by Peter Diggle’s group at Lancaster in much the same vein as what I do at the Malaria Atlas Project, but in this case for childhood malnutrition.  Hopefully I will hear more about this paper by Justice Aheto at the (yet to be advertised) Lancaster Workshop on Disease Mapping in September. (Not to be confused with the Spatial Statistics 2017: One World, One Health conference also held at Lancaster this year, but in July!).