Errors-in-variables, or not …

One can see from this new arXival that errors-in-variables models have not yet become widely known within astronomy, though astronomers are trying to find ways to deal with this sort of modelling scenario.  The errors-in-variables regression problem occurs when you want to regress Y against X: the usual description would be Y_i = f(X_i) + \epsilon_i with X a precisely measured covariate and f(\cdot) some kind of model taking input X—e.g. linear regression, f(X) = X^\prime \beta, or Gaussian process regression, f \sim \mathrm{GP}_\theta—but unfortunately X is now observed with error.  Our model therefore needs an extra layer describing the relationship between the true but latent (hidden) X and the available, noisily-measured \tilde{X}, e.g. \tilde{X}_i = X_i + \xi_i.  If the error term \xi_i is substantial then ignoring it (and fitting the base model) leaves us exposed to model misspecification errors.  A simple Bayesian solution to this problem is to introduce a further layer describing the population distribution of the latent X’s (one that features hyper-parameters allowing shrinkage is a good choice) and then integrate out all the latent variables via posterior simulation (e.g. MCMC).
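
For concreteness, here’s a minimal sketch (in Python, with made-up data and illustrative priors, not anything from the arXival) of that hierarchical set-up: a regression layer, a measurement layer linking the latent X’s to the observed \tilde{X}’s, a population layer providing shrinkage, and a crude Metropolis sampler integrating out the latent variables.

```python
# A minimal sketch (not the paper's method) of the Bayesian errors-in-variables
# set-up described above: Y_i = beta0 + beta1 * X_i + eps_i, with X_i latent and
# only X_tilde_i = X_i + xi_i observed.  A Normal population layer on the latent
# X's provides the shrinkage; everything is integrated out by MCMC.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# --- mock data (all names and values here are illustrative) ---
n = 50
X_true = rng.normal(2.0, 1.0, n)          # latent covariates
Y = 1.0 + 0.5 * X_true + rng.normal(0, 0.2, n)
X_tilde = X_true + rng.normal(0, 0.5, n)  # noisy measurements, known sigma_xi
sigma_xi, sigma_eps = 0.5, 0.2

def log_post(theta):
    """Joint log posterior over (beta0, beta1, mu_pop, log sigma_pop, X_1..X_n)."""
    beta0, beta1, mu_pop, log_s_pop = theta[:4]
    X = theta[4:]
    s_pop = np.exp(log_s_pop)
    lp = stats.norm.logpdf(beta0, 0, 10) + stats.norm.logpdf(beta1, 0, 10)
    lp += stats.norm.logpdf(mu_pop, 0, 10) + stats.norm.logpdf(log_s_pop, 0, 2)
    lp += stats.norm.logpdf(X, mu_pop, s_pop).sum()          # population layer
    lp += stats.norm.logpdf(X_tilde, X, sigma_xi).sum()      # measurement layer
    lp += stats.norm.logpdf(Y, beta0 + beta1 * X, sigma_eps).sum()  # regression layer
    return lp

# --- crude random-walk Metropolis, just to show the latent X's being updated ---
theta = np.concatenate([[0.0, 0.0, 0.0, 0.0], X_tilde.copy()])
current = log_post(theta)
step = np.concatenate([[0.05] * 4, 0.1 * np.ones(n)])
for _ in range(5000):
    prop = theta + step * rng.normal(size=theta.size)
    new = log_post(prop)
    if np.log(rng.uniform()) < new - current:
        theta, current = prop, new

# last state of the (short, correlated) chain
print("current draw of (beta0, beta1):", theta[:2])
```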

The astronomers’ approach here is in fact not to bother adding a model: they simply spread the uncertainty in the \tilde{X}’s out by drawing mock X_i for each data point independently and then take a finely-binned non-parametric estimator.  There are some nice advantages to modelling in this context, even if one decides to use a semi-parametric functional prior (like a Gaussian process).  One of these advantages is that you get a ‘structural shrinkage’ of the noisy \tilde{X}’s towards values that ‘make sense’ given their corresponding Y’s and the assumed functional form.  There are some challenges to fitting such a model in the case of a Gaussian process EIV regression: without a nugget term a sampler moving the X’s must negotiate multiple ‘crossing points’ at which the covariance matrix becomes non-invertible (i.e., when X’s tie).  A nice solution to this is to use a random Fourier feature representation of the GP.
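
A rough illustration of why the random Fourier feature route is attractive here (again just a sketch, with arbitrary hyper-parameter values): the GP draw becomes a finite weighted sum of cosine features, so the likelihood of the latent X’s never involves forming or inverting a covariance matrix, and tied X’s pose no problem.

```python
# A minimal sketch of the random Fourier feature (RFF) trick mentioned above:
# approximate a squared-exponential GP by a finite weighted sum of cosine
# features, so moving the latent X's never requires inverting a kernel matrix
# (and tied X's cause no trouble).  Names and settings are illustrative only.
import numpy as np

rng = np.random.default_rng(2)

def rff_basis(x, omegas, phases):
    """Feature map Phi(x): f(x) ~ Phi(x) @ w approximates a GP draw."""
    M = omegas.size
    return np.sqrt(2.0 / M) * np.cos(np.outer(x, omegas) + phases)

# hyper-parameters of the approximated squared-exponential kernel
lengthscale, sigma_f, M = 1.0, 1.0, 200
omegas = rng.normal(0.0, 1.0 / lengthscale, M)   # spectral frequencies
phases = rng.uniform(0.0, 2 * np.pi, M)

# a GP-like prior draw: weights ~ N(0, sigma_f^2)
w = rng.normal(0.0, sigma_f, M)
x_grid = np.linspace(-3, 3, 400)
f_grid = rff_basis(x_grid, omegas, phases) @ w
print("prior draw at first grid points:", f_grid[:3])

# even with exactly tied inputs the likelihood is well defined,
# because we never form (let alone invert) a covariance matrix:
x_tied = np.array([0.5, 0.5, 0.5])
print("tied inputs, no problem:", rff_basis(x_tied, omegas, phases) @ w)
```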

P.S. One of the canonical examples of EIV regression in astronomy is Kelly et al.; that model can be made fancier in a few fun ways: one is to replace the finite mixture of Normals for the population distribution with an infinite mixture model.
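
As a toy illustration of that last idea (not Kelly et al.’s implementation), the population distribution could be drawn from a truncated Dirichlet process mixture via stick-breaking:

```python
# A minimal sketch of the "fancier" population prior mentioned above: replace a
# finite mixture of Normals with a (truncated) Dirichlet process mixture drawn
# via stick-breaking.  Purely illustrative settings throughout.
import numpy as np

rng = np.random.default_rng(3)

def draw_dp_mixture(alpha=1.0, K_trunc=50, mu0=0.0, tau0=3.0, sigma=0.5, size=500):
    """Draw `size` latent X's from one realisation of a truncated DP mixture of Normals."""
    v = rng.beta(1.0, alpha, K_trunc)                         # stick-breaking fractions
    w = v * np.concatenate([[1.0], np.cumprod(1 - v)[:-1]])   # mixture weights
    mus = rng.normal(mu0, tau0, K_trunc)                      # component means from the base measure
    comps = rng.choice(K_trunc, p=w / w.sum(), size=size)
    return rng.normal(mus[comps], sigma)

X_pop = draw_dp_mixture()
print(X_pop[:5])
```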


Prescience

I mentioned the concept of a Korean remake yesterday and today on the arXiv one appeared, almost as if I predicted it: this time it’s H0liCow Redux via GPs.  It’s quite a cottage industry: obtain new cosmological data, whack a GP on it, and claim to be adding value by offering an ‘unbiased’, ‘model independent’ fit.  Personally I feel like ‘model ignorant’ would be a more honest term here: if you’re not going to show me coverage of your GP model (here with a squared exponential kernel!) under mock data from different cosmological scenarios then I’m not going to trust your credible sets.
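
For what it’s worth, the kind of check I have in mind is cheap to sketch (illustrative toy below, not the authors’ pipeline): simulate mock data from a known cosmological scenario, fit the squared exponential GP, and count how often the nominal 95% credible interval actually contains the truth.

```python
# A minimal sketch of the coverage check described above (toy H(z)-like data,
# not the arXival's pipeline): simulate mocks from a known cosmology, fit a
# squared-exponential GP, extrapolate to z = 0, and count how often the
# nominal 95% interval contains the true H_0.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(4)

def mock_Hz(z, H0=70.0, Om=0.3):
    """Flat LCDM expansion rate; other scenarios could be slotted in here."""
    return H0 * np.sqrt(Om * (1 + z) ** 3 + 1 - Om)

z_obs = np.linspace(0.1, 2.0, 30)
sigma = 5.0
hits, n_mocks = 0, 100
for _ in range(n_mocks):
    y = mock_Hz(z_obs) + rng.normal(0, sigma, z_obs.size)
    gp = GaussianProcessRegressor(
        kernel=ConstantKernel(100.0) * RBF(length_scale=1.0),
        alpha=sigma ** 2, normalize_y=True)
    gp.fit(z_obs[:, None], y)
    mu, sd = gp.predict(np.array([[0.0]]), return_std=True)
    hits += abs(mu[0] - 70.0) < 1.96 * sd[0]

print(f"empirical coverage of the nominal 95% interval: {hits / n_mocks:.2f}")
```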

(I notice here also the use of re-fitting without one of the lensing systems thought to be ‘at tension’ with the others, to which the same concerns about a misunderstanding of Bayesian inference and coverage that I’ve mentioned previously apply.)


Some more abuse of Gaussian processes …

Two arXivals today show us ways to mis-use or mis-apply Gaussian processes.  The first is another in this crazy series of papers trying to use GPs as extrapolators to estimate H_0 from H(z) datapoints.  Yes, GPs can kind of extrapolate, but it’s really not their strength: as you get further and further from the domain of observed data the GP reverts to its mean.  The only interesting thing is the rate at which it reverts, which is controlled by the GP kernel hyper-parameters: unfortunately, as we’ve discussed before, these are very difficult to estimate well for many types of experimental design.  As far as I can see from the scant details in this submission, the authors only look at how small the credible intervals on H_0 are from their GP fits, not at whether or not the true value from the model in the mocks is actually contained within them with any reasonable Frequentist coverage.  Interestingly, they do see that their credible intervals increase if they switch from a squared exponential kernel to various flavours of Matern, which recalls the observation from asymptotic theory that we should probably be using the Matern kernel if we actually want decent coverage.
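
To illustrate the reversion effect and its kernel dependence, here’s a toy sketch (arbitrary mock data and settings, nothing from the submission) fitting squared exponential and Matern GPs to H(z)-like points over 0.1 < z < 2 and extrapolating down to z = 0.

```python
# A minimal sketch (illustrative settings only) of the reversion-to-the-mean
# behaviour described above: fit SE and Matern GPs to H(z)-like data on
# 0.1 < z < 2 and extrapolate to z = 0, where the prediction gets pulled back
# towards the (normalised) mean at a rate set by the kernel hyper-parameters.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, ConstantKernel

rng = np.random.default_rng(5)
z_obs = np.linspace(0.1, 2.0, 30)
y = 70.0 * np.sqrt(0.3 * (1 + z_obs) ** 3 + 0.7) + rng.normal(0, 5.0, 30)

for name, base in [("squared exponential", RBF(1.0)),
                   ("Matern 3/2", Matern(1.0, nu=1.5)),
                   ("Matern 5/2", Matern(1.0, nu=2.5))]:
    gp = GaussianProcessRegressor(kernel=ConstantKernel(100.0) * base,
                                  alpha=25.0, normalize_y=True)
    gp.fit(z_obs[:, None], y)
    mu, sd = gp.predict(np.array([[0.0]]), return_std=True)
    print(f"{name}: H_0 extrapolation = {mu[0]:.1f} +/- {sd[0]:.1f}")
```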

The other GP paper today presented some ‘novel’ work that looks a lot like the Korean re-make version of an already published paper I was involved in via the COIN collaboration.  The idea there is to use spatial GPs as an interpolator to fill in gaps in irregularly sampled galaxy data (in our case it was IFU data) to construct a complete model ‘image’ (with uncertainties), plus maybe gain a little shrinkage at the observed locations.  The reason I believe our version is a lot better is that we jointly fit a radial parametric function along with the GP, whereas today’s paper fits the radial term first and then separately fits the GP to the residuals.  There’s no computational need or theoretical justification to cut the fit at this point, especially if the errors in the first step are not going to be propagated coherently to the second.  Also, the first fit here is only likely to be as sensible as any other badly misspecified model fitting exercise, especially when you look at how irregular the ‘design’ of sampling locations is for some of these galaxies.
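
To make the joint-versus-two-step distinction concrete, here’s a hypothetical sketch (toy radial profile and data, nothing from either paper) in which the parametric profile parameters and the GP hyper-parameters are optimised against a single marginal likelihood.

```python
# A minimal sketch (hypothetical profile and data) of the "joint vs two-step"
# point above: the parametric radial profile and the GP hyper-parameters are
# optimised against a single marginal likelihood, rather than fitting the
# profile first and the GP to its residuals afterwards.
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(6)
r = np.sort(rng.uniform(0, 5, 60))                      # irregular sampling radii
y = 3.0 * np.exp(-r / 1.5) + 0.3 * np.sin(2 * r) + rng.normal(0, 0.1, r.size)

def se_kernel(r1, r2, amp, ls):
    return amp ** 2 * np.exp(-0.5 * ((r1[:, None] - r2[None, :]) / ls) ** 2)

def neg_log_marglik(params):
    """Joint objective: exponential profile (A, scale) plus GP residual (amp, ls, noise)."""
    A, scale, amp, ls, noise = np.exp(params)           # log-parameterised for positivity
    resid = y - A * np.exp(-r / scale)
    K = se_kernel(r, r, amp, ls) + (noise ** 2 + 1e-8) * np.eye(r.size)
    L, low = cho_factor(K, lower=True)
    alpha = cho_solve((L, low), resid)
    return 0.5 * resid @ alpha + np.log(np.diag(L)).sum()

fit = minimize(neg_log_marglik, x0=np.log([1.0, 1.0, 0.5, 1.0, 0.1]), method="Nelder-Mead")
print("jointly fitted (A, scale, amp, ls, noise):", np.exp(fit.x))
# The two-step alternative would freeze (A, scale) from a least-squares profile
# fit and only then optimise (amp, ls, noise), ignoring the first step's error.
```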


Gaussian process with 10 datapoints

This arXival (accepted in Icarus) proposes to offer insights into the relative support for a periodic (seasonal) signal in Martian methane levels with respect to an aperiodic alternative.  The method adopted is to fit various flavours (kernels) of Gaussian process regression and examine the posteriors.  With just 10 datapoints this is a very silly exercise, since the posterior coverage is nonsense and the sensitivity to prior choice is incredibly high; neither issue is considered, despite input from a professor who regularly uses and advocates GPs plus two referees.
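
To see why the prior matters so much at n = 10, a quick sensitivity sketch (entirely fake data): compare periodic and aperiodic kernels under two different hyper-parameter bounds, a crude stand-in for two prior choices, and see how much (or little) the comparison moves.

```python
# A minimal illustration (with fake data) of the kind of sensitivity check that
# 10 datapoints demand: compare periodic and aperiodic GP kernels under two
# different hyper-parameter bounds (a crude stand-in for two prior choices).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, ConstantKernel

rng = np.random.default_rng(7)
t = np.sort(rng.uniform(0, 5, 10))[:, None]
y = rng.normal(0, 1, 10)                       # placeholder measurements

for bounds in [(1e-2, 1e2), (1e-1, 1e1)]:      # "prior" choice no. 1 and no. 2
    for name, k in [("periodic", ExpSineSquared(length_scale_bounds=bounds,
                                                periodicity_bounds=bounds)),
                    ("aperiodic", RBF(length_scale_bounds=bounds))]:
        gp = GaussianProcessRegressor(kernel=ConstantKernel() * k, alpha=0.1)
        gp.fit(t, y)
        print(bounds, name, gp.log_marginal_likelihood_value_)
```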

An aside: I once tried to convince the students of this professor that a wider class of stochastic process models could be considered beyond GPs (facilitated by sequential Monte Carlo methods for posterior sampling), but their response was that since a GP can fit anything it was all they needed for any job.  I needn’t go on.


Free-knot linear splines: Value of asymptotic theory

Another recent arXival proposes to use a local linear regression as a type of ‘non-parametric’ Bayesian model, although this time the locations of the break points are learnt during fitting, which makes it a free-knot linear spline model.  The application here is reconstruction of the primordial power spectrum.  Again I don’t generally advocate for such models, but if I did I would want to see a hierarchical prior structure that could promote shrinkage, for instance towards a data-driven mean slope and a data-driven distribution of neighbouring slope differences.  Interestingly, it seems that the asymptotics literature has something to say on this topic: in particular, that the prior on knot locations should not be a Poisson process (such as the independent draws from the Uniform distribution adopted in this arXival), but rather should favour a regular spacing (see Remark 5 in Belitser & Serra).
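
The distinction between those two priors on knot locations is easy to see with a quick sketch (illustrative only, not the arXival’s parameterisation): spacings drawn from a symmetric Dirichlet with concentration one reproduce the iid-Uniform knots, while a larger concentration favours regular spacing.

```python
# A minimal sketch of the prior distinction discussed above: independent Uniform
# draws for the knot locations (a Dirichlet(1,...,1) on the spacings) versus a
# prior that favours regular spacing (a symmetric Dirichlet with concentration
# well above 1).  Illustrative only; not the arXival's parameterisation.
import numpy as np

rng = np.random.default_rng(8)
K = 8                                           # number of interior knots

def knots_from_spacings(concentration):
    """Knot locations on (0, 1) built from Dirichlet-distributed spacings."""
    spacings = rng.dirichlet(concentration * np.ones(K + 1))
    return np.cumsum(spacings)[:-1]

uniform_knots = knots_from_spacings(1.0)        # equivalent to iid Uniform order statistics
regular_knots = knots_from_spacings(20.0)       # strongly favours near-even spacing
print(np.round(uniform_knots, 3))
print(np.round(regular_knots, 3))
```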


BIC is trash

Just a quick reminder that the BIC is not really an approximation to anything meaningful.  Definitely not what’s suggested in this arXival.


Non-parametric N[ion]

In a recent arXival it is proposed to estimate the relationship between redshift and the log comoving density of ionising photons with a “non-parametric” Bayesian approach, such that “N_\mathrm{ion}(z) can take any value at any redshift”. In fact, the authors then introduce the following model structure: at the midpoints of 11 redshift bins we have a unique value of \log N_\mathrm{ion}, and in between we have a linear interpolation (evaluated at a discrete set of grid points, so really more of a finely stepped interpolation). The maximum variation in \log N_\mathrm{ion} between redshift bins is assigned a uniform prior between -1 and 1 dex. So, if the bounds of the uniform prior play a negligible role in constraining the observed fit, what we have is a series of local linear regressions in which information is shared only between observations within each redshift bin, with a prior favouring logarithmic trends in N_\mathrm{ion}(z). I can’t say whether or not this is a good model for the observed data, but it’s certainly not a non-parametric model in the strict sense: the hard linear regression within fixed redshift bins is a strong structural parameterisation. It’s also not the kind of model I would propose for this purpose.
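
For clarity, here’s how I read that model structure, written out as a prior draw (hypothetical bin midpoints, grid, and anchor value; none of these numbers come from the paper).

```python
# A minimal sketch of the model structure as I read it (illustrative bin edges,
# grid, and anchor value): log N_ion is anchored at 11 bin midpoints, the step
# between neighbouring anchors gets a Uniform(-1, 1) dex prior, and the curve
# in between is a linear interpolation evaluated on a fine redshift grid.
import numpy as np

rng = np.random.default_rng(9)
z_mid = np.linspace(0.5, 10.5, 11)              # hypothetical bin midpoints
z_grid = np.linspace(z_mid[0], z_mid[-1], 500)  # fine evaluation grid

def prior_draw(logN0=50.7):                     # hypothetical anchor value
    """One draw of log N_ion(z) from the prior as described in the text."""
    steps = rng.uniform(-1.0, 1.0, z_mid.size - 1)   # max +/- 1 dex between bins
    logN_anchor = logN0 + np.concatenate([[0.0], np.cumsum(steps)])
    return np.interp(z_grid, z_mid, logN_anchor)     # piecewise-linear in between

logN_ion = prior_draw()
print(logN_ion[:5])
```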

The beauty of Bayesian models is that they can be built hierarchically with the ability to take advantage of shrinkage at a data-adaptive scale, hopefully thereby exploiting the bias-variance trade-off to improve estimation accuracy. For instance, here we might add a hyper-prior structure to allow the model to learn an appropriate prior for the slopes, shrinking them towards their empirical mean. Or one could introduce a redshift-dependent hyper-prior over the slopes, shrinking together only those neighbouring a given bin. Since we’re focusing on the slope and we’re going to model at a sub-bin scale, why not switch to an integrated Gaussian process model (i.e., a Gaussian process for the rate of change), such as is used in sea-level reconstructions in the climate-change literature? Of course, care needs to be taken to prevent the credible sets from these models becoming over-confident (as mentioned in the previous post), but the advantages of smoothing provided by this type of model are usually worth the effort of checking that it’s well behaved. (And this checking is something one should be doing for any model.)
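
A prior-draw sketch of that integrated-GP idea (with made-up scales and an arbitrary anchor value) looks something like this: a GP on the rate of change, integrated up to give a smooth realisation of \log N_\mathrm{ion}(z).

```python
# A minimal sketch of the integrated-GP alternative floated above: place a GP
# prior on the rate of change, d log N_ion / dz, and integrate it to obtain
# log N_ion(z) itself.  Purely a prior-draw illustration with made-up scales.
import numpy as np

rng = np.random.default_rng(10)
z = np.linspace(0.5, 10.5, 300)
dz = z[1] - z[0]

# squared-exponential covariance for the derivative process
amp, ls = 0.3, 2.0
K = amp ** 2 * np.exp(-0.5 * ((z[:, None] - z[None, :]) / ls) ** 2)
slope = rng.multivariate_normal(np.zeros(z.size), K + 1e-10 * np.eye(z.size))

# trapezoid-integrate the slope draw to get one smooth prior realisation
logN0 = 50.7                                     # hypothetical anchor at the lowest redshift
logN_ion = logN0 + np.concatenate([[0.0], np.cumsum(0.5 * (slope[1:] + slope[:-1]) * dz)])
print(logN_ion[:5])
```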

One minor technical issue I noticed in the same arXival regards what seems to be a method for marginalisation over uncertainties in some photometric redshifts. The authors write that “we draw 1000 redshifts from the photometric redshift distribution for sources used in each measurement, maintaining detailed balance by using same drawn redshifts in every likelihood calculation, and use the median of the likelihoods calculated for the drawn redshifts”. It sounds to me like the drawn redshifts are serving as fixed sample points for an arithmetic-mean estimate of the marginal likelihood, in which case it would be the mean rather than the median that is desired. In particular, if the median was chosen because the likelihoods over these sample points show a large dynamic range, that’s probably a good sign that the current strategy is breaking down and that the unknown redshifts should just be treated as nuisance parameters and integrated out as latent variables in the MCMC sampler. For reference, it is possible to construct an MCMC sampler that targets the true posterior given noisy arithmetic-mean estimates of the marginal likelihood, but apart from using the mean the other important trick is to draw a new sample at each proposal while not updating the estimate of the marginal likelihood at the current chain position. This is the pseudo-marginal MCMC method. The version the authors suggest would preserve detailed balance only for a chain targeting a biased version of the posterior, so the detailed balance isn’t really helping.
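
A toy sketch of the pseudo-marginal recipe (nothing to do with the arXival’s actual likelihood; the set-up here is invented for illustration): fresh redshift draws give an unbiased arithmetic-mean estimate of the marginal likelihood at each proposal, and the estimate attached to the current state is carried forward, not recomputed.

```python
# A minimal toy sketch of pseudo-marginal MCMC as described above (not the
# arXival's code): the marginal likelihood is estimated by an arithmetic mean
# over freshly drawn redshifts at each proposal, and the estimate attached to
# the current state is kept, not recomputed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# toy set-up: a datum d depends on a parameter theta and a latent redshift z
d_obs = 1.3
z_phot_mu, z_phot_sd = 1.0, 0.2        # stand-in "photometric redshift distribution"

def lhat(theta, n_draws=1000):
    """Unbiased arithmetic-mean estimate of p(d_obs | theta), fresh z draws each call."""
    z = rng.normal(z_phot_mu, z_phot_sd, n_draws)
    return stats.norm.pdf(d_obs, loc=theta * z, scale=0.1).mean()

theta, Lhat_curr = 1.0, lhat(1.0)
chain = []
for _ in range(5000):
    theta_prop = theta + 0.1 * rng.normal()
    Lhat_prop = lhat(theta_prop)                  # new estimate only at the proposal
    prior_ratio = stats.norm.pdf(theta_prop, 1, 1) / stats.norm.pdf(theta, 1, 1)
    if rng.uniform() < (Lhat_prop / Lhat_curr) * prior_ratio:
        theta, Lhat_curr = theta_prop, Lhat_prop  # the accepted estimate is carried forward
    chain.append(theta)

print("posterior mean of theta:", np.mean(chain[1000:]))
```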
