I noticed the recent paper by Vallisneri & Haasteren on the arXiv entitled “*Taming outliers in pulsar-timing datasets with hierarchical likelihoods and Hamiltonian sampling*” in which the authors point out that Bayesian hierarchical models can, and perhaps should, be used to account for outliers when modelling pulsar-timing datasets with Gaussian processes. I wasn’t convinced that this point justifies an entire paper, since the idea of using such a model with a mixture of likelihoods for good and bad data points is well known already in astronomy as per Hogg, Bovy & Lang (as pointed out by V&H) and others (e.g. I would point to Taylor et al.); the only novelty here being that it’s described and applied to a single mock pulsar dataset. I was also not convinced by the authors’ arguments to show that the principled approach would surely beat the ad hoc alternative of 3-clipping, which was argued on the grounds that since the pulsar model includes estimation of the error term there is no single 3 threshold to work with. One would think that an simple marginalised estimator of model ‘discrepancy’ (marginalising over the unknown and underlying GP) for each data point would provide a tractable equivalent to threshold possible outliers. It would then be meaningful to compare the performance of this approach against the hierarchical model in terms of bias and runtime under a variety of bad data scenarios.

Having said that I was very interested to see that random Fourier features are being embraced by the pulsar timing community, as described in greater detail in an earlier paper by the same authors and another by Lentati et al. I was recently talking to Seth Flaxman about how odd it is that random Fourier bases aren’t more popular, with one possibility being that people find they need a huge number of components to get a close approximation to their targeted GP. But of course this is going to depend on the application, or whether you even care: like with the INLA/SPDE approximation we use a lot at MAP, in practical applications it may be sufficient to say ‘we adopt the approximate model as our model’ and then just make sure you do a damn good job testing its performance on mock data and out-of-sample cross-validation (etc.).

### Like this:

Like Loading...

*Related*

Thanks for reading us, Ewan!

Indeed the mixture model is well known, at least in some corners of astronomy. However, the main point of our paper is that sampling the hierarchical likelihood makes it possible to apply the mixture model even in the presence of time-correlated GPs; in the case of pulsar timing, we’ve only recently learned how to achieve such sampling efficiently. Furthermore, our method has already been applied to real datasets (not just mock data); even this short paper contains one example (Fig. 2).

This said, I’m interested in your suggestion about obtaining a marginalized estimator of model discrepancy for each point. I suppose it would be something like the squared residual, marginalized over all the model parameters. Since this requires sampling the GP weights (using the hierarchical likelihood, although in its “pure” Gaussian form), it seems to me that the complexity is similar to what we’re doing.

From a more philosophical standpoint, the mixture model seems cleaner than iterative deletion, and avoids the possibility that a few huge outliers bias the deletion process. (I admit though that this danger remains remote as long as the fraction of outliers is small.)

Hi Michele,

I was imagining the case of the linear mean GP in which the marginal likelihood at fixed noise parameter could be computed directly (at the cost of decomposition for the non-sparse covariance matrix), allowing the 1d noise parameter to be sampled (either stochastically or on a fixed grid); and in this simple case the marginal distribution of the latent signal at each location at fixed noise parameter is analytic too so one could sample from that to form something like the (within sample) marginal probability integral transform (PIT-statistic) of each data point, choosing the worst for clipping, and so on.

I certainly agree that the mixture model approach should be the best, not just theoretically, but it’s worth checking in case there is a much faster ad hoc method that’s almost as good. Looking at the above I’m not sure that my suggestion would be faster (or much faster) or that it would be almost as good: that needs a numerical experiment.

cheers,

Ewan.

Instead of mixture modeling, what about replacing the Gaussian observation model with something robust to outliers, like the Student-t distribution (http://www.jmlr.org/papers/volume12/jylanki11a/jylanki11a.pdf)? With standard GP regression this is going to mean you lose conjugacy (ideas on handling large datasets efficiently are in my ICML paper http://www.jmlr.org/proceedings/papers/v37/flaxman15.html where we use the Laplace approximation) but the nice thing with random Fourier features is that you’re not relying on conjugacy, so all you need to do is swap out the Gaussian observation model for a Student-t, e.g. you just change one line of Stan code here:

https://bitbucket.org/flaxter/random-fourier-features-in-stan