While reading Justice Aheto’s paper (mentioned in my last post) on modelling childhood malnutrition (in Africa, as opposed to what goes on at a Versace fashion show … bazingha!) I fell into a rabbit hole of reading on the topic of maximum likelihood asymptotics for constrained parameter spaces. In Justice’s paper the model is an ordinary linear regression but (possibly) with a source of additional noise operating on distinct groups of the observations, i.e.:

where is the product of design matrix and slope coefficients, is the usual noise term operating on each individual datapoint, and is the group-level noise. In this epidemiological example the individual datapoints represent real individuals (children in the study) while the groupings represent households: the logic being that some of the factors driving malnutrition not captured by the design matrix may well act on all members of a household (e.g. all the children in the household suffer in the same way from its proximity to a swamp full of mosquitoes). In astronomy this type of problem can (and does, thanks to an example from Marina Trevisan) occur when galaxies in a study come grouped into clusters: some of the cluster properties might be available as group-level covariates (like halo mass), but they won’t capture all the information about the cluster’s assembly history that might affect the properties of the galaxies it contains.

The line in Justice’s paper that caught my eye was in regard to performing a likelihood ratio test (LRT) to decide in a frequentist sense whether this group-level noise term should be preferred () over the null model (). The quote: “*To assess the significance of the household-level random effect, we divided the p-values from the LRT by two because the null hypothesis that is on the boundary of the parameter space* [citation to Self & Liang 1987]”. That is, since the the null model lies on the smallest possible value of allowed (negative variances aren’t allowed!) the decision threshold of the test statistic needs some adjusting. At face value this made sense to me but since I rarely (if ever) find myself using the likelihood ratio test I was certainly intrigued.

Following up the references one finds that Self & Liang (1987) is a short but densely-packed theoretical paper offering proofs of the consistency and convergence in distribution of the MLE for these bounded domain cases; mostly by setting up regularity conditions on the MLE estimator (boring) and the topology of the boundary in the neighbourhood of the null parameter value (interesting) in order to draw on earlier work by Chernoff (1954). All of which gives theoretical rigor to the intuition that the limiting distribution for the MLE will be (perhaps Multivariate-) Normal (as usual) but truncated/projected somehow on the domain of interest. The topological conditions introduced (that the boundary in the neighbourhood of the null can be approximated by a cone) ensure that the nature of this truncation/projection can be made clear.

The first reason I fell down a rabbit hole is that the definition of an approximating cone they give wasn’t clear to me: in particular, because they use the little-o notation which I assumed referred to a limit at infinity but in fact must be referring to a limit at zero. Maybe that’s obvious to their primary audience but it confused me and sent me on a long googling of stack exchange answers! The second reason I fell down a rabbit hole was that during my stack exchange reading I ended up on Geyer (1994) (his M-estimator paper not the reverse logistic regression one) which points out that the cone condition given by Self & Liang is in fact *not* sufficient for their proof and that they should have insisted on something called Clarke regularity. Trying to understand the differences between these two and the pathological examples where one or other breaks down (Geyer’s Section 6) was the second rabbit hole. Reader beware.

All that to say, there *is* a simple way to compute critical boundaries and p-value for these cases: Justice Aheto describes the most common one for a single parameter boundary (divide your p-value by two: it turns out that the distribution under the null is a 50:50 mixture of a point mass on zero and a chi-sq with d.o.f of 1) but for more details there’s the excellent applied statistics description of the problem by Stram & Lee (1994).

Edit/Note: There’s also a ‘typo’ in Geyer’s explanations for the ordinary and derivable tangent cones: the convergence statement is repeated verbatim for each one, so obviously there’s something wrong: the one for the derivable cone should include that the sequence of admits a suitable gradient definition on a closed ball around the critical point.