## Errors-in-variables, or not …

One can see from this new arXival that errors-in-variables models are not yet widely known within astronomy, though astronomers are trying to find ways to deal with this sort of modelling scenario.  The errors-in-variables regression problem occurs when you want to regress $Y$ against $X$, which would usually be described as $Y_i = f(X_i) + \epsilon_i$ with $X$ a precisely measured covariate and $f(\cdot)$ some kind of model taking input $X$—e.g. linear regression, $f(X) = X^\prime \beta$, or Gaussian process regression, $f \sim \mathrm{GP}_\theta$.  Unfortunately, $X$ is now observed with error, so our model features an extra layer describing the relationship between the true but latent (hidden) $X$ and the available, noisily-measured $\tilde{X}$, e.g. $\tilde{X}_i = X_i + \xi_i$.  If the error term $\xi_i$ is substantial then ignoring it (and fitting the base model) leaves us exposed to model misspecification errors.  A simple Bayesian solution to this problem is to introduce a further layer describing the population distribution of the latent $X$’s (one featuring hyper-parameters that allow shrinkage is a good choice) and then integrate out all the latent variables via posterior simulation (e.g. MCMC).
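As a minimal sketch of why ignoring the measurement error is dangerous (not the model from the arXival, just a toy simulation): in the linear case a naive regression of $Y$ on $\tilde{X}$ attenuates the slope by the reliability ratio $\sigma_X^2 / (\sigma_X^2 + \sigma_\xi^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5000
beta = 2.0
sigma_x, sigma_xi, sigma_eps = 1.0, 0.5, 0.1

x = rng.normal(0.0, sigma_x, n)             # latent, precisely-defined covariate
x_tilde = x + rng.normal(0.0, sigma_xi, n)  # noisy measurement of it
y = beta * x + rng.normal(0.0, sigma_eps, n)

# naive least-squares of y on x_tilde: the slope is attenuated
slope_naive = np.cov(x_tilde, y)[0, 1] / np.var(x_tilde)
reliability = sigma_x**2 / (sigma_x**2 + sigma_xi**2)  # = 0.8 here
print(slope_naive)         # close to beta * reliability = 1.6, not beta = 2.0
```

The hierarchical Bayesian fix amounts to putting a population model on the latent $x$ and marginalising it out, which undoes exactly this attenuation.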

The astronomers’ approach here is in fact not to bother adding a model: they simply spread the uncertainty in the $\tilde{X}$’s out by drawing mock $X_i$ for each data point independently and then taking a finely-binned non-parametric estimator.  There are some nice advantages to modelling in this context, even if one decides to use a semi-parametric functional prior (like a Gaussian process).  One of these advantages is that you get a ‘structural shrinkage’ of the noisy $\tilde{X}$’s towards values that ‘make sense’ given their corresponding $Y$’s and the assumed functional form.  There are some challenges to fitting such a model in the case of a Gaussian process EIV regression: without a nugget term there are multiple ‘crossing points’ that a sampler moving the $X$’s must negotiate, at which the covariance matrix becomes non-invertible (i.e., when $X$’s tie).  A nice solution to this is to use a random Fourier feature representation of the GP.
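A quick sketch of the random Fourier feature idea (a generic construction, not code from any paper): the squared exponential kernel is approximated by an inner product of finite-dimensional cosine features, so the GP becomes a Bayesian linear model in the feature weights and no $n \times n$ covariance matrix ever needs inverting, tied $X$’s or not.

```python
import numpy as np

rng = np.random.default_rng(1)

def rff_features(x, n_features=2000, lengthscale=1.0, rng=rng):
    """Random Fourier features approximating a squared exponential kernel."""
    omega = rng.normal(0.0, 1.0 / lengthscale, n_features)  # spectral draws
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)           # random phases
    return np.sqrt(2.0 / n_features) * np.cos(np.outer(x, omega) + b)

x = np.array([0.0, 0.3, 0.3])   # note the tie: the exact kernel matrix is singular
Phi = rff_features(x)
K_approx = Phi @ Phi.T
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :])**2)
err = np.max(np.abs(K_approx - K_exact))   # Monte Carlo error, O(1/sqrt(n_features))
```

With $f(x) = \Phi(x) w$, $w \sim N(0, I)$, a sampler moving the latent $X$’s only ever touches the (always well-defined) feature matrix.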

P.s. One of the canonical examples of EIV regression in astronomy is Kelly et al.; that model can be made fancier in a few fun ways: one is to replace the finite mixture of Normals for the population distribution with an infinite mixture model.
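For the infinite-mixture version, the standard construction is a (truncated) Dirichlet process mixture via stick-breaking; a minimal sketch of drawing one such population distribution for the latent $X$’s (illustrative hyper-parameter choices only):

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(alpha, n_atoms, rng=rng):
    """Truncated stick-breaking weights for a Dirichlet process mixture."""
    v = rng.beta(1.0, alpha, n_atoms)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

alpha = 1.0                        # DP concentration (hypothetical choice)
w = stick_breaking(alpha, 50)      # mixture weights (50 atoms is ample here)
mu = rng.normal(0.0, 3.0, 50)      # component means for the X population
sd = np.full(50, 1.0)              # component scales, fixed for simplicity

# population density of the latent X evaluated on a grid:
grid = np.linspace(-5.0, 5.0, 101)
dens = sum(w_k * np.exp(-0.5 * ((grid - m) / s)**2) / (s * np.sqrt(2 * np.pi))
           for w_k, m, s in zip(w, mu, sd))
```

Replacing Kelly et al.’s finite mixture of Normals with this prior lets the data choose the effective number of components.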

## Prescience

I mentioned the concept of a Korean remake yesterday and today on the arXiv one appeared, almost as if I predicted it: this time it’s H0liCow Redux via GPs.  It’s quite a cottage industry: obtain new cosmological data, whack a GP on it, and claim to be adding value by offering an ‘unbiased’, ‘model independent’ fit.  Personally I feel ‘model ignorant’ would be a more honest term here: if you’re not going to show me coverage of your GP model (here with a squared exponential kernel!) under mock data from different cosmological scenarios then I’m not going to trust your credible sets.
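The coverage check I’m asking for is not exotic; a generic skeleton (with a trivial conjugate-Gaussian stand-in for the fit, since the point is the loop, not the model) looks like:

```python
import numpy as np

rng = np.random.default_rng(3)

def coverage_check(fit_interval, simulate_mock, truth, n_mocks=500):
    """Fraction of mocks whose 95% credible interval contains the truth."""
    hits = 0
    for _ in range(n_mocks):
        y = simulate_mock()            # mock data from a known scenario
        lo, hi = fit_interval(y)       # your GP (or other) fit goes here
        hits += (lo <= truth <= hi)
    return hits / n_mocks

# toy stand-in: flat-prior posterior for a Gaussian mean with unit noise
truth = 70.0
def simulate_mock():
    return rng.normal(truth, 1.0, 20)
def fit_interval(y):
    m, se = y.mean(), 1.0 / np.sqrt(len(y))
    return m - 1.96 * se, m + 1.96 * se

cov = coverage_check(fit_interval, simulate_mock, truth)
print(cov)   # should sit near 0.95 for a well-calibrated interval
```

Swap in mocks drawn from different cosmological scenarios and the GP posterior as `fit_interval`, and you learn whether the credible sets mean anything.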

(I notice here also the use of re-fitting without one of the lensing systems thought to be ‘at tension’ with the others; the same concerns about a misunderstanding of Bayesian inference and coverage that I’ve mentioned previously apply here too.)

## Some more abuse of Gaussian processes …

Two arXivals today show us ways to mis-use or mis-apply Gaussian processes.  The first is another in this crazy series of papers trying to use GPs as extrapolators to estimate $H_0$ from $H(z)$ datapoints.  Yes, GPs can kind of extrapolate, but it’s really not their strength: as you get further and further from the domain of observed data the GP reverts to its mean.  The only interesting thing is the rate at which it reverts, which is controlled by the GP kernel hyper-parameters: unfortunately, as we’ve discussed before, these are very difficult to estimate well for many types of experimental design.  As far as I can see from the scant details in this submission, the authors only look at how small the credible intervals on $H_0$ are from their GP fits, not at whether the true value from the model in the mocks is actually contained within them with any reasonable Frequentist coverage.  Interestingly, they do see that their credible intervals increase if they switch from a squared exponential kernel to various flavours of Matérn, which recalls the observation from asymptotic theory that we should probably be using the Matérn kernel if we actually want decent coverage.
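The reversion-to-the-mean behaviour is easy to demonstrate with a bare-bones GP regression (a toy sine-curve example, nothing to do with any particular $H(z)$ dataset):

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel on 1-d inputs."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

rng = np.random.default_rng(4)
x_obs = np.linspace(0.0, 2.0, 20)               # observed domain
y_obs = np.sin(x_obs) + rng.normal(0, 0.05, 20)

K = se_kernel(x_obs, x_obs) + 0.05**2 * np.eye(20)   # nugget = noise variance
alpha = np.linalg.solve(K, y_obs)

def gp_mean(x_star):
    """Posterior predictive mean (prior mean is zero)."""
    return se_kernel(np.atleast_1d(x_star), x_obs) @ alpha

print(gp_mean(1.0))    # near sin(1), well inside the data
print(gp_mean(10.0))   # far outside: collapses back to the prior mean, zero
```

Everything about the extrapolated value is dictated by the prior mean and the lengthscale, which is exactly the hyper-parameter that is hardest to pin down from the data.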

The other GP paper today presents some ‘novel’ work that looks a lot like the Korean re-make version of an already published paper I was involved in via the COIN collaboration.  The idea there is to use spatial GPs as an interpolator to fill in gaps in irregularly sampled galaxy data (in our case it was IFU data) to construct a complete model ‘image’ (with uncertainties); plus maybe gain a little shrinkage at the observed locations.  The reason I believe our version is a lot better is that we jointly fit a radial parametric function along with the GP, whereas today’s paper fits the radial term first and then separately fits the GP to the residuals.  There’s no computational need or theoretical justification to cut the fit at this point, especially when the errors from the first step are not propagated coherently to the second.  Also, the first fit here is likely to be about as sensible as most badly misspecified model fitting exercises (i.e., not very); especially when you look at how irregular the ‘design’ of sampling locations is for some of these galaxies.

## Gaussian process with 10 datapoints

This arXival (accepted in Icarus) proposes to offer insights into the relative support for a periodic (seasonal) signal in Martian methane levels with respect to an aperiodic alternative.  The method adopted is to fit various flavours (kernels) of Gaussian process regression and examine the posteriors.  With just 10 datapoints this is a very silly exercise, since the posterior coverage will be nonsense and the sensitivity to prior choice incredibly high; neither of which is considered, despite input from a professor who regularly uses and advocates GPs plus two referees.

(*) An aside: I once tried to convince the students of this professor that a wider class of stochastic process models could be considered beyond GPs (facilitated by sequential Monte Carlo methods for posterior sampling), but their response was that since a GP can fit anything it was all they needed for any job.  I needn’t go on.


## Free-knot linear splines: Value of asymptotic theory

Another recent arXival proposes to use a local linear regression as a type of ‘non-parametric’ Bayesian model, although this time the locations of the break points are learnt during fitting, which creates a free-knot linear spline model.  The application here is reconstruction of the primordial power spectrum.  Again I don’t generally advocate for such models, but if I did I would want to see a hierarchical prior structure that could promote shrinkage, such as towards a data-driven distribution over the mean slope and differences between neighbouring slopes.  Interestingly, it seems the asymptotics literature has something to say on this topic: in particular, that the prior on knot locations should not be a Poisson process (such as the independent draws from the Uniform distribution adopted in this arXival); rather it should favour a regular spacing (see Remark 5 in Belitser & Serra).
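For concreteness, a free-knot linear spline is just a piecewise-linear interpolant whose knot positions are themselves parameters; a minimal sketch (the uniform knot prior shown is the choice the asymptotics literature argues against):

```python
import numpy as np

def linear_spline(x, knots, values):
    """Piecewise-linear function through (knot, value) pairs; in a free-knot
    model the knot locations are sampled alongside the values."""
    order = np.argsort(knots)
    return np.interp(x, knots[order], values[order])

# one draw from a toy prior: i.i.d. uniform knot locations on [0, 1]
rng = np.random.default_rng(5)
knots = np.sort(rng.uniform(0.0, 1.0, 6))
values = rng.normal(0.0, 1.0, 6)

x = np.linspace(0.0, 1.0, 11)
f = linear_spline(x, knots, values)
```

A prior enforcing near-regular knot spacing would replace the `rng.uniform` draw with, e.g., jittered grid positions; the shrinkage I’d want would go on `values` via a hierarchical prior.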

## BIC is trash

Just a quick reminder that the BIC is not really an approximation to anything meaningful.  Definitely not what’s suggested in this arXival.

In a recent arXival it is proposed to estimate the relationship between redshift and the log comoving density of ionising photons with a “non-parametric” Bayesian approach, such that “$N_\mathrm{ion}(z)$ can take any value at any redshift”. In fact, the authors then introduce the following model structure: at the midpoints of 11 redshift bins we have a unique value of $\log N_\mathrm{ion}$ and in-between we have a linear interpolation (evaluated at a discrete set of grid points; so more of a fine step interpolation). The maximum variation in $\log N_\mathrm{ion}$ between redshift bins is assigned a uniform prior between -1 and 1 dex.

So, if the bounds of the uniform prior play a negligible role in constraining the observed fit, what we have is a series of local linear regressions in which information is shared only between observations within each redshift bin, with a prior favouring logarithmic trends in $N_\mathrm{ion}(z)$. I can’t say whether or not this is a good model for the observed data, but it’s certainly not a non-parametric model in the strict sense: the hard linear regression within fixed redshift bins is a strong structural parameterisation. It’s also not the kind of model I would propose for this purpose.
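To make the parametric nature of the construction plain, here is a sketch of one prior draw from a model of this shape (the bin midpoints and anchor value below are hypothetical placeholders, not the paper’s numbers):

```python
import numpy as np

rng = np.random.default_rng(6)

n_bins = 11
z_mid = np.linspace(0.5, 10.5, n_bins)   # hypothetical bin midpoints

# successive bin-to-bin changes uniform on [-1, 1] dex, per the stated prior
log_n0 = 51.0                            # hypothetical anchor value at the first bin
deltas = rng.uniform(-1.0, 1.0, n_bins - 1)
log_n = log_n0 + np.concatenate(([0.0], np.cumsum(deltas)))

def log_n_ion(z):
    """Linear interpolation between bin midpoints: 22 parameters, not 'any value'."""
    return np.interp(z, z_mid, log_n)
```

Whatever its merits as a fit, a function pinned down by 11 node values and straight lines in-between is a (flexible) parametric model.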