The KL divergence as test statistic …

I read this paper by Ben-David et al. on astro-ph the other day and, despite having some interest in the KL divergence, I was fairly unimpressed with this particular case study.  (My interest stems from the fact that KL estimation is similar to marginal likelihood estimation, and that the KL, or its symmetrized version, the J divergence, is a good rule of thumb for ranking instrumental densities in Monte Carlo computations.)  Basically, the authors propose to use the KL divergence as a test statistic for non-Gaussianity in the CMB as an alternative to the K-S test.  The motivation they offer is that “unlike the K-S test, the KL divergence is non-local”.  This is obviously technically incorrect, since the ECDF used in the K-S test statistic depends (like the sample median, for instance) on the full sample, though one can fairly criticise the K-S statistic for giving relatively less weight to the tails of the distribution than some other possible test statistics (e.g. the Anderson-Darling).
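For reference, the standard definitions make the tail-weighting point concrete. The notation here is mine rather than the paper's: F_n is the ECDF of the n observations and F is the CDF under the null.

```latex
% Kolmogorov-Smirnov: unweighted supremum of the ECDF discrepancy
D_n = \sup_x \left| F_n(x) - F(x) \right|

% Anderson-Darling: squared discrepancy weighted by 1 / [F (1 - F)]
A_n^2 = n \int_{-\infty}^{\infty}
        \frac{\left( F_n(x) - F(x) \right)^2}{F(x)\,\bigl(1 - F(x)\bigr)} \, \mathrm{d}F(x)
```

Both statistics depend on every observation through F_n, so neither is “local”. The difference is in the weighting: under the null, \mathrm{Var}[F_n(x)] = F(x)(1 - F(x))/n, which shrinks in the tails, so the unweighted supremum in D_n is rarely driven by tail discrepancies, whereas the Anderson-Darling weight compensates for exactly this.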

At the end of the day what bothered me most about this paper was the consistent lack of perspective: that is, a failure to recognise that both the K-S test and the proposed test based on the KL divergence are crappy solutions to the problem at hand.  The reasons are that (1) the null distribution is not known analytically, though it can be simulated from to produce an approximate PDF or CDF; (2) the pixel values from which the K-S and KL statistics are compiled do not represent iid draws; rather (at least under the null) they’re a realisation from a Gaussian process [this point is noted in the paper but no thought is given to finding a more suitable test statistic for non-iid data]; and (3) all discussion is in terms of discrete distributions (which is non-standard for the K-S test, and usually unnecessary for the KL, except when you have to approximate the null distribution through simulation), meaning the continuous pixel values (temperatures) are ultimately binned here (which raises questions as to the sensitivity to the binning scheme).
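To illustrate point (2), here is a minimal sketch (mine, not from the paper) of what happens when a K-S critical value calibrated on iid draws is applied to correlated data with exactly Gaussian marginals. The AR(1) process, the correlation of 0.8, and the sample sizes are all illustrative assumptions standing in for the Gaussian process on the sky; nothing here is tuned to the CMB.

```python
import numpy as np
from math import erf, sqrt


def ks_statistic(x):
    """One-sample K-S statistic of x against the standard normal CDF."""
    x = np.sort(x)
    n = len(x)
    cdf = 0.5 * (1.0 + np.array([erf(v / sqrt(2.0)) for v in x]))
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)


def ar1_sample(n, rho, rng):
    """Stationary AR(1) series whose marginal distribution is N(0, 1)."""
    x = np.empty(n)
    x[0] = rng.normal()
    eps = rng.normal(size=n) * sqrt(1.0 - rho**2)
    for i in range(1, n):
        x[i] = rho * x[i - 1] + eps[i]
    return x


rng = np.random.default_rng(1)
n, n_sim = 200, 500  # illustrative sizes only

# Critical value at the 5% level, calibrated by simulation on iid draws.
iid_stats = np.array([ks_statistic(rng.normal(size=n)) for _ in range(n_sim)])
crit = np.quantile(iid_stats, 0.95)

# Apply that iid critical value to correlated data that is still Gaussian.
corr_stats = np.array(
    [ks_statistic(ar1_sample(n, 0.8, rng)) for _ in range(n_sim)]
)
rejection_rate = np.mean(corr_stats > crit)
print(f"iid critical value: {crit:.3f}")
print(f"rejection rate on correlated Gaussian data: {rejection_rate:.2f}")
```

Every series here is Gaussian, so any rejection rate above the nominal 5% is a false alarm caused purely by ignoring the correlation. The same logic applies to a binned KL statistic: the null distribution has to be simulated from the correlated process, not from iid draws.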

There were also a number of technical points that were poorly expressed or misconveyed, some of which can be seen in my margin notes in the sample pic below.  For instance, the required absolute continuity of P w.r.t. Q (the symbol <<) is overlooked in both the review and the experiment with the discrete Gaussian example, where in fact the expectation of \mathrm{KL}(P,\hat{Q}) is theoretically infinite for all N: for any finite sample there is a positive probability that some bin with P_i > 0 is left empty, and on that event the divergence is infinite.  The issue does rear its head in the Planck data analysis, when zero bins have to be eliminated.  Likewise, the cumulative distribution and its empirical counterpart are normalised by definition.  The statement “Unlike the case of vectors (where the scalar product provides a standard measure), there is no generic “best” measure for quantifying the similarity of two distributions.” is also funny, since in fact (1) there are many choices for a metric on a vector space, and (2) discrete probability distributions (their P_i and Q_i as densities on the integers) are just normalised vectors, so there’s no reason why the inner product couldn’t be used as a test statistic too.
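The absolute continuity point is easy to check numerically. The sketch below is my own construction, not the paper's setup: the bin edges, sample size and seed are arbitrary assumptions. It bins a standard normal, builds an empirical \hat{Q} from a finite sample, and evaluates the divergence in both directions.

```python
import numpy as np
from math import erf, sqrt


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability arrays.

    Returns inf when P is not absolutely continuous w.r.t. Q, i.e. when
    some bin has p > 0 but q == 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


# Binned standard normal on unit-width bins (an arbitrary choice of edges).
edges = np.arange(-8.5, 9.5, 1.0)
cdf = np.array([0.5 * (1.0 + erf(e / sqrt(2.0))) for e in edges])
p = np.diff(cdf)
p /= p.sum()

# Empirical distribution from a finite sample: the far-tail bins stay empty.
rng = np.random.default_rng(0)
counts, _ = np.histogram(rng.normal(size=1000), bins=edges)
q_hat = counts / counts.sum()

kl_p_qhat = kl_divergence(p, q_hat)  # P has mass where q_hat has none
kl_qhat_p = kl_divergence(q_hat, p)  # q_hat << P, so this one is finite
print("KL(P || Q_hat) =", kl_p_qhat)
print("KL(Q_hat || P) =", kl_qhat_p)
```

Dropping the empty bins, as is done in the Planck analysis, makes the first quantity finite, but the resulting statistic then depends on which bins happened to be empty and on the binning scheme itself.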


Those wary of multiple hypothesis testing (e.g. geneticists) will be hearing alarm bells upon reading the closing thoughts of the paper: “However, the non-locality of the KL divergence and its sensitivity to the tails of the distributions still suggest that it is a valuable complement to the KS test and might be a useful alternative. Indeed, one should utilize a variety of methods and tests to identify possible contamination of the cosmological product. Obviously, any suggestion of an anomalous result would indicate the need for more sophisticated analyses to assess the quality of the CMB maps.”


One Response to The KL divergence as test statistic …

  1. “The statement “Unlike the case of vectors (where the scalar product provides a standard measure), there is no generic “best” measure for quantifying the similarity of two distributions.” is also funny”

    I think it’s funny too but for different reasons. The KL divergence *does* have many unique properties that other measures of ‘difference’ do not have*. So in making this particular statement, they seem to be arguing against the point of their paper from a premise that is false! Odd.

    * These special properties probably aren’t relevant in this particular situation.
