## Another MNRAS stats disaster …

Continuing the tradition of great MNRAS stats cock-ups is the recently published paper by Rampino & Caldeira, entitled “Periodic impact cratering and extinction events over the last 260 million years”, which was announced to much fanfare with a press release quickly picked up by the usual crowd of content-farming ‘news’ websites.  The key question being asked of their two datasets (one a time series of impact cratering events over the past 260 Myr, the other a time series of mass extinction events over the same interval) is whether there is statistical evidence of periodicity.

The frequentist testing procedure applied by the authors to answer this question is conceptually straightforward.  Choose a test statistic suitable for probing periodicity in a time series of events and define your null hypothesis to be a distribution of aperiodic event times; compute the distribution of your chosen test statistic under that null hypothesis (using simulated mock datasets) in order to identify an appropriate threshold value of the test statistic for rejection of the null; then finally compute the test statistic for your observed dataset and either reject or fail to reject the null hypothesis by comparison with that threshold. Unfortunately, the authors fail in this endeavour in two ways.
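For concreteness, the whole procedure can be sketched in a few lines of code; here is a minimal Python version (Python rather than R purely for illustration), with a toy test statistic and toy null model standing in for the real ones:

```python
import random

def monte_carlo_p_value(observed_stat, simulate_null, test_stat, n_sims=2000, seed=1):
    """Generic Monte Carlo test: simulate mock datasets under the null,
    build the null distribution of the test statistic, and return the
    fraction of simulations at least as extreme as the observation."""
    rng = random.Random(seed)
    null_stats = [test_stat(simulate_null(rng)) for _ in range(n_sims)]
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (exceed + 1) / (n_sims + 1)  # add-one correction avoids p = 0

# Toy example: is a sample mean of 0.9 surprising for 20 Uniform(0,1) draws?
p = monte_carlo_p_value(
    observed_stat=0.9,
    simulate_null=lambda rng: [rng.random() for _ in range(20)],
    test_stat=lambda xs: sum(xs) / len(xs),
)
```

The crucial point, to which we return below, is that the test statistic here is a single number per dataset, not a whole curve.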

The first mistake the authors make is to choose an unreasonable null hypothesis.  Following an earlier application of the testing procedure to an analysis of the magnetic reversal record by Lutz (1985), the authors adopt a Gamma distribution for the waiting time between events with a shape parameter of two.  This implies a distribution for the full time series in which no two events occur very close to each other.  In contrast to the canonical Exponential waiting time distribution (equivalent to a Gamma distribution with shape parameter of one), which describes a ‘memoryless’ (Poisson) process, a Gamma-distributed waiting time with shape parameter of two supposes that each event ‘knows’ the time at which the last event occurred.  For magnetic reversals this is a reasonable model, since one can imagine that the physical system dictating the long-term behaviour of the Earth’s inner core has a certain degree of inertia, such that large-scale rearrangements are unlikely to recur on an extremely short timescale.  However, I would argue that this is difficult to justify for large craters caused by asteroid impacts from outer space.

Another implication of the non-memorylessness of the Gamma distribution with shape parameter of two is that very long gaps between events are disfavoured; that is, its distribution is much ‘thinner-tailed’ than the Exponential.  One could imagine justifying a thin-tailed waiting time for a system like that driving major eruptions from a given volcano, in which the build-up of magma pressure demands release ever more insistently.  However, I imagine one would find it harder to argue for such a demand in the population of potential impactor asteroids.
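To make the contrast concrete, one can compare the two waiting time models at equal mean.  The following Python sketch (the scale parameter is illustrative, not a value from the paper) confirms that the Gamma with shape two suppresses both very short and very long gaps relative to an Exponential of the same mean:

```python
import random

rng = random.Random(42)
n = 100_000
scale = 1.0  # illustrative scale; both models below have mean 2*scale

# A Gamma(shape=2, scale) draw is the sum of two Exponential(mean=scale) draws
gamma_gaps = [rng.expovariate(1/scale) + rng.expovariate(1/scale) for _ in range(n)]
exp_gaps = [rng.expovariate(1/(2*scale)) for _ in range(n)]

mean = 2 * scale
frac_gamma_short = sum(g < 0.1 * mean for g in gamma_gaps) / n  # gaps < 10% of mean
frac_exp_short = sum(g < 0.1 * mean for g in exp_gaps) / n
frac_gamma_long = sum(g > 3 * mean for g in gamma_gaps) / n     # gaps > 3x the mean
frac_exp_long = sum(g > 3 * mean for g in exp_gaps) / n
# The Gamma(2) model produces fewer very short gaps AND fewer very long gaps
```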

In this case I would argue that an appropriate model for the null hypothesis of impact cratering waiting times is the Exponential distribution; hence I posit that impact cratering follows a Poisson process.  That the human mind is prone to perceive ‘non-randomness’ amongst the realisations of the truly random Poisson process is well documented and receives attention in various ‘popular science’ (if this label can be used non-pejoratively) books including The Better Angels of Our Nature and The Signal and the Noise.  Hence the need for careful statistical testing to avoid over-interpretation of datasets deriving from this process.  A good place to start is of course by not modelling them as having Gamma distributed waiting times with a shape parameter of two!
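The defining property of the Exponential null is memorylessness: knowing that a long time has already passed since the last impact tells you nothing about when the next one is due.  A quick Python check of $P(X > s + t \mid X > s) = P(X > t)$, with an arbitrary illustrative rate:

```python
import random

rng = random.Random(0)
rate = 0.5  # illustrative rate (events per Myr); the property holds for any rate
draws = [rng.expovariate(rate) for _ in range(200_000)]

# Memorylessness: among gaps already longer than s, the excess beyond s
# behaves like a fresh draw: P(X > s + t | X > s) = P(X > t)
s, t = 2.0, 3.0
survivors = [d for d in draws if d > s]
cond = sum(d > s + t for d in survivors) / len(survivors)
uncond = sum(d > t for d in draws) / len(draws)
# cond and uncond agree to Monte Carlo error (both ~ exp(-rate*t))
```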

The second major statistical error of Rampino & Caldeira’s paper concerns an issue oft-noted in contemporary discussions of non-reproducibility in science, namely multiple hypothesis testing.  The confusion arises because the authors do not adopt a univariate test statistic; instead they examine a test statistic (which I will call an R-curve) that is a function of the trial period, allowed to range from 5 to 50 Myr.  The mistake is that they compare the observed peaks in this function to the pointwise 95% confidence threshold from simulations under their Gamma-distributed null hypothesis and treat any exceedance as ‘significant’ at the 95% confidence level.  The error of this interpretation appears to have been sensed by Lutz (1985), who illustrates a typical R-curve for a mock dataset drawn from the null against the pointwise 95% CI and notes that (despite being typical) this curve exceeds the 95% CI in some places; this should only happen rarely (less than 5% of the time) if the interpretation of the pointwise CI as a global CI were correct.  Indeed, from simulations under both the Gamma- and Exponential-distributed waiting time versions of the null hypothesis one can confirm that a single peak exceedance of the pointwise 95% CI happens in 97.5% and 94.9% of mock datasets respectively, which is very different from 5%!
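This look-elsewhere effect is easy to demonstrate from first principles: even if the statistic at each individual trial period respects its own pointwise 95% threshold, scanning many periods makes a crossing somewhere almost inevitable.  A stripped-down Python illustration with independent null values at each grid point (real R-curves are correlated across neighbouring periods, which changes the rate somewhat, but the point stands):

```python
import random

rng = random.Random(7)
m = 20           # number of trial-period grid points (illustrative)
n_curves = 5000

# Null "curves": independent Uniform(0,1) values, whose pointwise 95%
# threshold is simply 0.95 at every grid point
curves = [[rng.random() for _ in range(m)] for _ in range(n_curves)]
frac_any_crossing = sum(any(v > 0.95 for v in c) for c in curves) / n_curves
# For independent points the crossing rate is 1 - 0.95**m ~ 0.64, not 0.05
```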

The correct way to proceed in this case is to transform the R-curve-to-pointwise-null-CI comparison into a univariate test statistic.  One option is simply to count the peaks at which the R-curve for the observed dataset exceeds the pointwise CI curve, and compare this count to the distribution of the corresponding counts for R-curves of mock datasets generated under the null hypothesis.  In an effort to compute a valid p-value for the impact cratering dataset I transcribed the impact time series given in the authors’ Table 1 and wrote a simple R script (see end of post below) to implement their null hypothesis simulation model (as best I could from the description in their Section 2) and to compute the R statistic curve.  Strangely, despite the R statistic seemingly being well defined by their Equation 1, neither I nor a couple of colleagues were able to reproduce the curve shown in Rampino & Caldeira’s Fig 1a.  There is a mistake in part of the notation, where they write $S = 1/N \sum a_i$ and then in the next line $S = (\sum \sin a_i)/N$, but this is not responsible for the difference between my curve and theirs.  Nevertheless I am confident to proceed, as my pointwise 95% CI curve for the Gamma-distributed waiting time null hypothesis tracks very close to that shown in their Fig 1a.
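For readers who want a second, independent implementation of the statistic as I read their Equation 1 (this is my interpretation, not code from the paper), here is the same R statistic transliterated into Python:

```python
import math

def rayleigh_R(times, period):
    """Circular (Rayleigh-type) R statistic for event times folded at a
    trial period: R = sqrt(S^2 + C^2), where S and C are the mean sine
    and cosine of the phase angles 2*pi*t/period."""
    angles = [2 * math.pi * t / period for t in times]
    S = sum(math.sin(a) for a in angles) / len(angles)
    C = sum(math.cos(a) for a in angles) / len(angles)
    return math.sqrt(S**2 + C**2)

# Perfectly periodic events are fully in phase, giving R = 1;
# evenly spread phases cancel, giving R = 0
r_periodic = rayleigh_R([10 * k for k in range(10)], 10.0)
r_spread = rayleigh_R([0.0, 2.5, 5.0, 7.5], 10.0)
```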

Under the Gamma-distributed null hypothesis I find that the R-curve for the observed dataset exceeds the pointwise 95% CI at 8 of its peaks, whereas only 2.6% of R-curves for mock datasets generated under that null have 8 or more such ‘significant’ peaks.  Hence, modulo the uncertainty concerning the reproducibility of the authors’ observed-data R-curve, it does seem that the impact cratering dataset is unlikely to have been generated from the Gamma-distributed null hypothesis.  Conversely, under the canonical Exponential-distributed null hypothesis I find that the R-curve for the observed dataset exceeds the pointwise 95% CI at 5 of its peaks, which is less impressive since 21.5% of R-curves for mock datasets generated under the Exponential-distributed null have 5 or more such ‘significant’ peaks. That is, if one prefers a Poisson process for the null model of impact cratering events, then the observed dataset is not statistically inconsistent with this hypothesis.
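The peak-counting statistic used here is simple to state: find the local maxima of the observed curve and count how many sit above the pointwise threshold at the same trial period.  A small Python equivalent of the `diff(sign(diff()))` trick used in the R script at the end of this post:

```python
def count_exceeding_peaks(curve, threshold):
    """Count local maxima of `curve` that exceed the pointwise `threshold`
    at the same index (interior points only, strict inequalities)."""
    return sum(
        1
        for i in range(1, len(curve) - 1)
        if curve[i] > curve[i-1] and curve[i] > curve[i+1] and curve[i] > threshold[i]
    )

# Toy curve with three local maxima, all above a flat 0.25 threshold
curve = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.1]
n_peaks = count_exceeding_peaks(curve, [0.25] * len(curve))  # -> 3
```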

So the debate can indeed go on as to which is the better null hypothesis (Gamma or Exponential), and therefore whether or not there is statistical evidence for periodicity in the dataset (modulo the potential for mis-use of p-values in various ways).  But for me the problems in this paper are clear: (1) use of what I believe is an inappropriate null hypothesis, with no justification given for its choice, and (2) incorrect handling of multiple hypothesis testing.  The bigger picture, though, is that the editorial-plus-peer-review pipeline at MNRAS is simply unable to properly assess the validity of statistical analyses submitted to the journal; without a substantial overhaul of the editorial board I foresee many more such cock-ups … most likely distributed as a Poisson process! 🙂

Obviously, because this is science, I hope that my honest criticism of the MNRAS team is not taken personally but rather is seen as what I believe to be an important service to the community.  Having left the field, it’s no skin off my nose if they were to take revenge on any future submissions I’m on, but I do worry that my outspoken criticism could affect my future coauthors.  If that turns out to be the case then it would be a great shame, but there are other journals.

Also worth noting is that last time I pointed out a massive cock-up to the editors at MNRAS I was fobbed off with this from Kim Clube: “Thank you for your email. The editor notes that you believe there are scientific errors in the paper you mention. In that case, the normal course would be for you to set out these errors in a paper which, if submitted to MN, would be subject to expert review.” Sorry, I do believe that the responsibility for proper statistical reviewing starts with the journal, and that the buck should not be passed back to the community who have better things to do than waste time writing rebuttal papers for silly errors that should be picked up by any stats-literate astronomer.

### R script to reproduce Rampino & Caldeira (2015)’s periodicity analysis and contrast with a Poisson process null hypothesis

## Simulate from their null hypothesis under their Gamma distributed waiting time
# Note: this is how I think the authors implemented this algorithm; the proper version, at least for the inhomogeneous Poisson process, would be Cinlar's algorithm
# Note: although I disagree with the decision to condition on there being exactly 37 craters in the mock dataset I have followed the authors’ lead here

draw.mock.dataset <- function() {
  finished <- FALSE
  while (!finished) {
    time <- 500
    impacts <- list()
    while (time > 5) {
      waiting.time <- rgamma(1, shape=2, scale=min(1.02*exp(time/79.2), 1.02*exp(260/79.2)))
      if (time - waiting.time > 5) {impacts[[length(impacts)+1]] <- time - waiting.time}
      time <- time - waiting.time
    }
    if (length(which(as.numeric(impacts) < 260)) == 37) {finished <- TRUE}
  }
  return(as.numeric(impacts)[as.numeric(impacts) < 260])
}

## Simulate from my null hypothesis under my exponentially distributed waiting time (i.e. Poisson process)

draw.mock.dataset.alt <- function() {
  finished <- FALSE
  while (!finished) {
    time <- 500
    impacts <- list()
    while (time > 5) {
      waiting.time <- rgamma(1, shape=1, scale=2*min(1.02*exp(time/79.2), 1.02*exp(260/79.2)))
      if (time - waiting.time > 5) {impacts[[length(impacts)+1]] <- time - waiting.time}
      time <- time - waiting.time
    }
    if (length(which(as.numeric(impacts) < 260)) == 37) {finished <- TRUE}
  }
  return(as.numeric(impacts)[as.numeric(impacts) < 260])
}

## Function to compute the R statistic for a given dataset (mock or real) at a specific P

R <- function(data, P) {
  a <- sin(2*pi/P*data)
  b <- cos(2*pi/P*data)
  S <- mean(a)
  C <- mean(b)
  R <- sqrt(S^2 + C^2)
  return(R)
}

## Input observed crater dataset and plot R statistic from P = 5 to 50 as per their Fig 1a
# Note: I cannot see how the authors came up with their R curve in Fig 1a, since I follow their description and find a similar result but with a notably different scale; however, repeating this process on the mock datasets I can indeed reproduce their 95% CI, so I am confident that there is no mistake here.

x <- c(15.1,15.1,35.3,35.7,36.4,37.2,42.3,45,46,46,49.0,49.0,50.5,58,64.98,65.17,70.3,74.1,76.2,81,89,91,99,115,120,121,123,128,142,142.5,145,165,167,169,201,214,254.7)

PP <- seq(5, 50, length.out=1000)
RRR <- RR <- numeric(1000)
for (i in 1:1000) {RRR[i] <- R(x, PP[i])}
quartz(width=6, height=4)  # quartz() is macOS-only; use dev.new() on other platforms
plot(PP, RRR, type='l', ylim=c(0,0.4), yaxs='i', xlim=c(0,60), xaxs='i', xlab="Period (Myr)", ylab="R-statistic", main="Circular analysis of crater age periodicity", col="blue")

abline(h=seq(0.05, 0.35, by=0.05), col="grey80")  # horizontal grid lines
abline(v=seq(10, 50, by=10), col="grey80")        # vertical grid lines

lines(PP, RRR, col="blue")

## Draw 10000 mock datasets under their Gamma distributed null model and build up the (pointwise) 95% CI as described in their Section 2

pointwise.null.draws <- matrix(0, nrow=10000, ncol=1000)
for (i in 1:10000) {
  data <- draw.mock.dataset()
  for (j in 1:1000) {RR[j] <- R(data, PP[j])}
  pointwise.null.draws[i,] <- RR
}
pointwise.null <- numeric(1000)
for (i in 1:1000) {pointwise.null[i] <- quantile(pointwise.null.draws[,i], 0.95, na.rm=FALSE)}

lines(PP, pointwise.null, col="red")
text(52, 0.25, "95% CI: Gamma dist null", cex=0.5)

## Draw 10000 mock datasets under my exponentially distributed null model and build up the (pointwise) 95% CI as described in their Section 2

pointwise.null.draws.alt <- matrix(0, nrow=10000, ncol=1000)
for (i in 1:10000) {
  data <- draw.mock.dataset.alt()
  for (j in 1:1000) {RR[j] <- R(data, PP[j])}
  pointwise.null.draws.alt[i,] <- RR
}
pointwise.null.alt <- numeric(1000)
for (i in 1:1000) {pointwise.null.alt[i] <- quantile(pointwise.null.draws.alt[,i], 0.95, na.rm=FALSE)}

lines(PP, pointwise.null.alt, col="red")
text(52, 0.35, "95% CI: Exp dist null", cex=0.5)

## Expose fallacy that a single crossing of the 95% pointwise CI occurs only 5% of the time under their Gamma distributed null

test.stat <- numeric(10000)
for (i in 1:10000) {test.stat[i] <- as.integer(any(pointwise.null.draws[i,] > pointwise.null))}
cat("The pointwise 95% CI is crossed at least once in ", mean(test.stat)*100, "% of R statistic curves computed under the Gamma distributed null.\n", sep="")

## Expose fallacy that a single crossing of the 95% pointwise CI occurs only 5% of the time under my exponentially distributed null

test.stat <- numeric(10000)
for (i in 1:10000) {test.stat[i] <- as.integer(any(pointwise.null.draws.alt[i,] > pointwise.null.alt))}
cat("The pointwise 95% CI is crossed at least once in ", mean(test.stat)*100, "% of R statistic curves computed under the exponentially distributed null.\n", sep="")

## Compute a proper test statistic for the Gamma distributed null

test.stat <- numeric(10000)
for (i in 1:10000) {
  # local maxima via the diff(sign(diff())) trick
  highlights <- which(diff(sign(diff(pointwise.null.draws[i,]))) == -2) + 1
  test.stat[i] <- length(which(pointwise.null.draws[i, highlights] > pointwise.null[highlights]))
}
#hist(test.stat)
highlights <- which(diff(sign(diff(RRR))) == -2) + 1
#points(length(which(RRR[highlights] > pointwise.null[highlights])), 0, col="red")
cat("There are ", length(which(RRR[highlights] > pointwise.null[highlights])), " maxima exceeding the pointwise 95% CI (Gamma null) in the R statistic curve for the observed dataset.\nA value at least this extreme occurs in ", length(which(test.stat >= length(which(RRR[highlights] > pointwise.null[highlights]))))/10000*100, "% of mock datasets under the Gamma distributed null.\n", sep="")

## Compute a proper test statistic for the exponentially distributed (Poisson process) null

test.stat <- numeric(10000)
for (i in 1:10000) {
  highlights <- which(diff(sign(diff(pointwise.null.draws.alt[i,]))) == -2) + 1
  test.stat[i] <- length(which(pointwise.null.draws.alt[i, highlights] > pointwise.null.alt[highlights]))
}
#hist(test.stat)
highlights <- which(diff(sign(diff(RRR))) == -2) + 1
#points(length(which(RRR[highlights] > pointwise.null.alt[highlights])), 0, col="red")
cat("There are ", length(which(RRR[highlights] > pointwise.null.alt[highlights])), " maxima exceeding the pointwise 95% CI (Exponential null) in the R statistic curve for the observed dataset.\nA value at least this extreme occurs in ", length(which(test.stat >= length(which(RRR[highlights] > pointwise.null.alt[highlights]))))/10000*100, "% of mock datasets under the exponentially distributed null.\n", sep="")


### 3 Responses to Another MNRAS stats disaster …

1. Phillip Helbig says:

> Sorry, I do believe that the responsibility for proper statistical reviewing starts with the journal, and that the buck should not be passed back to the community who have better things to do than waste time writing rebuttal papers for silly errors that should be picked up by any stats-literate astronomer.

Since you’ve already done the work, you essentially have a MNRAS paper for free. Why not submit something based on this blog post?

• That’s true, but in general it seems to me that rebuttal papers receive far fewer citations than the original. (Let’s see how http://arxiv.org/pdf/1510.05598.pdf fares!). I do intend to look through the citations to Lutz (1985) and see whether the misuse of this statistical procedure is common or otherwise; if it’s a widespread mistake then I will be motivated to write this up.

• Phillip Helbig says:

“That’s true, but in general it seems to me that rebuttal papers receive far fewer citations than the original.”

This is certainly true. Usually the original paper makes a Cool Claim, so people are interested in it. When a rebuttal paper shows that the claim was false, there is little reason to cite it. The reverse happens less often: the original paper is an uninteresting, or null, result, and the rebuttal shows that the correct result is actually a Cool Claim. This might be due in part to the fact that Cool Claims are more interesting and hence other people investigate them and rebut them if necessary. But the main reason is probably that if the result looks interesting, a) people rush it to press and b) editors aren’t sceptical enough, so too much low-quality stuff gets through, whereas it is sometimes difficult to publish a null-result paper, or one which confirms another result. Such publication bias is a real problem.

About a third of my publications debunk claims in the literature. I’m not aware of anyone having debunked my debunking. 😐