One of today’s arXivals examines the use of random forest regression to predict exoplanetary radii given a training set with approximately ten observed properties of exoplanets and their host stars. I don’t quite understand the motivation for a random forest approach in this instance because (a) the learned relationship seems not to be particularly complex and (b) the observables all come with a range of non-negligible heteroscedastic uncertainties. A regularised, low-order polynomial regression within an errors-in-variables framework could handle this problem easily and would offer advantages in terms of interpretability and shrinkage on the uncertain observables.
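To make the alternative concrete, here is a minimal numpy sketch of the weighted (heteroscedastic), ridge-regularised low-order polynomial fit I have in mind; the data, degree, and penalty are all invented for illustration, and a full errors-in-variables treatment would additionally model the uncertainty on the predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the exoplanet problem: one observed property, a smooth
# low-order relation, and heteroscedastic measurement errors on the response.
n = 200
x = rng.uniform(0.0, 1.0, n)
sigma = 0.05 + 0.2 * rng.uniform(size=n)          # per-point uncertainty
y = 1.0 + 2.0 * x - 1.5 * x**2 + sigma * rng.standard_normal(n)

# Ridge-regularised quadratic fit, weighting each point by 1 / sigma^2.
degree, lam = 2, 1e-3
X = np.vander(x, degree + 1)                      # columns: x^2, x, 1
W = np.diag(1.0 / sigma**2)
beta = np.linalg.solve(X.T @ W @ X + lam * np.eye(degree + 1), X.T @ W @ y)
```

The weights handle the heteroscedasticity and the ridge penalty supplies the shrinkage; the three coefficients remain directly interpretable, unlike a forest of trees.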
Today I read an arXival on the topic of data compression, or likelihood compression, for cosmology, which translates to the construction of approximately (or exactly) sufficient summary statistics via a transformation with reference to the assumed likelihood function. As Alsing & Wandelt point out, this type of transformation falls in the class of score function likelihood summaries. Both papers mention the possibility of using these summaries as low-dimensional targets for likelihood-free inference methods comparing mock data simulations to the observed cosmological data. In this case the technique becomes the ABC-IS of Gleim & Pigorsch (see also here & here), and has a connection to indirect inference and the method of moments. I’m somewhat skeptical that a realistic application would meet the conditions needed for sufficiency (with respect to the simulation-based model) of the auxiliary summary statistics; but some of the general insights might cross over (e.g. the auxiliary model should aim to be as richly descriptive as the simulation-based model, even if structurally/computationally simpler).
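For concreteness, here is a toy numpy sketch of score-function compression in the simplest Gaussian setting (the linear model is entirely my own invention, not either paper’s cosmological setup): the score evaluated at a fiducial parameter point compresses the full data vector down to dim(θ) summaries, and a single Fisher-weighted update recovers the MLE.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: d_i = theta0 + theta1 * x_i + Gaussian noise with known cov C.
n = 50
x = np.linspace(0.0, 1.0, n)
Cinv = np.eye(n) / 0.1**2                          # inverse noise covariance
d = 0.5 + 2.0 * x + 0.1 * rng.standard_normal(n)   # observed data vector

theta_fid = np.array([0.4, 1.8])                   # fiducial expansion point
dmu = np.stack([np.ones(n), x], axis=1)            # d mu / d theta, shape (n, 2)
mu_fid = dmu @ theta_fid

# Score compression: n data points -> dim(theta) = 2 summary statistics.
t = dmu.T @ Cinv @ (d - mu_fid)

# Fisher-weighted update; exact MLE here because the model is linear in theta.
F = dmu.T @ Cinv @ dmu
theta_hat = theta_fid + np.linalg.solve(F, t)
```

In a realistic cosmological application the mean, covariance, and their parameter derivatives would of course come from theory codes or simulations rather than closed form.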
A much improved approach to simulation-based inference for cosmology, or maybe some sleight-of-hand?
A recent arXival presents an alternative method for approximating the posterior of some cosmological parameters given an observational dataset for which it is also required to marginalise over a set of nuisance parameters describing the non-linear regime. Previous approaches have run expensive simulations of the cosmology at a set of test cosmological parameters and attached to these a round of simulations of the observable data under the non-linear regime model at a set of test nuisance parameters. Against this collection of mock data an emulator for the conditional mean (conditional on both cosmological and non-linear regime parameters) of the observable data is created; the emulator is then plugged into an MCMC sampler as a proxy model. The new approach instead approximates the marginal likelihood of the observed data at each set of cosmological parameters, marginalising over the non-linear regime part of the model, and then applies an emulator (currently a simple variational approximation) to the posterior in cosmological parameter space.
This is an approach I’ve encouraged cosmologists to explore in the past, although it wasn’t obvious to me that it would be a guaranteed improvement, since it seemed to me (for the case of cosmological covariance matrices) that the marginal likelihood approximation would still require mock data simulations at each test set of cosmological parameters. In this arXival the authors seem to side-step the need for mock data simulations from the non-linear model component by adopting a so-called tabulation strategy. Since I’m unfamiliar with that particular approach it is unclear to me whether this gives an unbiased approximation to the likelihood or is somehow a crude estimator, which would mean that some of the efficiency gain may simply be a sleight-of-hand.
In any case, since I’m again discussing this problem I may as well say the same thing I always say, which is ‘why not go further and do multi-fidelity Bayesian optimisation?’. Estimation of the marginal likelihood (using the tabulation trick and nested sampling) takes approximately 0.3 hours to achieve a reasonable estimate. Presumably one could get the broad shape of the posterior by running nested sampling with far fewer live points, and then decide where to refine estimates with more live points in an automatic manner. Ditto for the scheduling of further cosmological simulations.
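To gesture at what I mean, here is a deliberately simple (and entirely hypothetical) numpy sketch of the surrogate-plus-acquisition idea: noisy, cheap log-marginal-likelihood estimates are fit with a Gaussian process, and an upper-confidence-bound rule picks where to spend the next expensive refinement. Real multi-fidelity Bayesian optimisation would also model the cost/fidelity trade-off explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_logml(theta):                             # hypothetical target surface
    return -0.5 * ((theta - 0.6) / 0.2)**2

def kernel(a, b, ell=0.15, amp=2.0):               # squared-exponential kernel
    return amp**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# Cheap pilot estimates (few live points => noisy) at a coarse grid.
theta_obs = np.linspace(0.0, 1.0, 8)
noise = 0.3
y_obs = true_logml(theta_obs) + noise * rng.standard_normal(theta_obs.size)

# GP posterior mean/variance over a fine grid of candidate refinements.
grid = np.linspace(0.0, 1.0, 200)
K = kernel(theta_obs, theta_obs) + noise**2 * np.eye(theta_obs.size)
Ks = kernel(grid, theta_obs)
mean = Ks @ np.linalg.solve(K, y_obs)
var = kernel(grid, grid).diagonal() - np.einsum(
    "ij,ji->i", Ks, np.linalg.solve(K, Ks.T))

# Refine next wherever the optimistic estimate (mean + 2 sd) is highest.
next_theta = grid[np.argmax(mean + 2.0 * np.sqrt(np.clip(var, 0.0, None)))]
```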
A recent arXival presents a (the arXival authors’ favourite) family of conditional density estimation (CDE) methods as a system for (A) propagating uncertainties downstream in astronomical analyses and (B) conducting likelihood-free inference in astronomy. While some of the individual methods and tools have merits, I feel that the whole package is grossly over-sold in this context and that the examples presented don’t stand up to scrutiny. There are some applications where this might be the right approach, but there are plenty of applications (including some of those suggested in the paper) where it will not be. For that reason it would have been better if the authors had provided a balanced comparison against alternatives and related methods. So, to specifics …
My main gripe is that the authors put a lot of emphasis on the advantages of the (estimator of!) their preferred CDE loss function, but nowhere examine: (i) how close their estimator of the CDE loss is to the actual loss; (ii) how close their estimated-loss-minimising method is to the true posterior; or (iii) what implications this loss function might have on model building for the purpose of error propagation. [The first two of these are similar concerns to those raised on Xi’an’s ‘og in regards to an earlier paper by some of the authors presenting these ideas.] Instead, in Example 1 the authors claim that a goodness-of-fit test can’t distinguish a clearly nonsense model from a good model, but that their CDE loss can. In fact, the supposed goodness-of-fit test that fails here is the PIT test, which is not a goodness-of-fit test at all; it is a means to investigate an aspect of Bayesian uncertainty calibration, not accuracy in location. In Example 2 the authors examine their RFCDE approach as a post-adjustment method for ABC posteriors and establish its supposed superiority according to their estimator of the CDE loss function. But if you visually compare (by inspecting their Figure 3) the RFCDE posterior approximation against the raw ABC posterior, it is clear that their method’s contribution is to reform the ABC posterior so as to place vanishingly small amounts of posterior mass on the ground-truth line of degeneracy (along which the generated data are indistinguishable) in some places (e.g. for all of ). Is that really an improvement?
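The PIT point can be demonstrated in a few lines: in the toy sketch below (my own construction, not the paper’s example), a “model” that ignores the covariate and reports the marginal distribution of the response for every object nonetheless produces uniform PIT values, because the PIT checks probabilistic calibration rather than conditional accuracy.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# True conditional distribution of the response: y | x ~ N(x, 0.1).
n = 5000
x = rng.standard_normal(n)
y = x + 0.1 * rng.standard_normal(n)

# Nonsense conditional "model": N(0, sqrt(1.01)) for every object. This is
# exactly the marginal of y, so its PIT values are exactly Uniform(0, 1)
# even though the model has no conditional (location) accuracy at all.
pit = norm.cdf(y, loc=0.0, scale=np.sqrt(1.01))
counts, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
```

A flat PIT histogram therefore tells you nothing about whether the model has learned the dependence on the covariates.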
With regards to the topic of error propagation, I’m not sure that approaches adding kernel density estimators actually improve on point-cloud-based posterior summaries or other representations (such as clever parametric summaries). A previous application of random forests to reconstruct the Bayesian posterior as a point cloud (the first step of RFCDE, before the final KDE is added) in an astronomical context can be seen here. (Although note that the figures and results shown in that paper were not actually created with the exact method referred to in the paper, which is another discussion entirely!) The topic that really needs investigating in this area is how to construct efficient posterior representations that allow for prior-reweighting: such as when individual photo-z posteriors fit under a simple prior might later be reweighted to reflect a model with shrinkage based on spatial clustering. Draws from nested sampling help here because they naturally include samples from a broader region of parameter space than a purely posterior-focussed sampler. But indeed there can be challenges in storing the draws for fits to a large catalog of objects, so maybe we’re back to clever parametric summaries that allow some separation of the prior and likelihood contributions?
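A minimal sketch of the prior-reweighting operation I have in mind (all densities here are illustrative stand-ins, not any real photo-z pipeline): stored posterior draws made under a broad interim prior are importance-reweighted to a tighter shrinkage prior, with the effective sample size flagging when the stored draws no longer cover the new prior.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_pi_old(theta):                             # broad interim prior N(0, 2^2)
    return -0.5 * (theta / 2.0)**2 - np.log(2.0 * np.sqrt(2.0 * np.pi))

def log_pi_new(theta):                             # tighter shrinkage prior N(1, 0.5^2)
    return -0.5 * ((theta - 1.0) / 0.5)**2 - np.log(0.5 * np.sqrt(2.0 * np.pi))

# Pretend these are stored posterior draws from the original per-object fit.
draws = rng.normal(0.8, 0.6, size=20000)

# Importance weights w = pi_new / pi_old, computed stably in log space.
logw = log_pi_new(draws) - log_pi_old(draws)
w = np.exp(logw - logw.max())
w /= w.sum()

# Posterior mean under the new prior, plus the effective sample size, which
# collapses when the new prior sits in the tails of the stored draws.
new_mean = np.sum(w * draws)
ess = 1.0 / np.sum(w**2)
```

This is exactly where nested sampling draws help: their broader coverage of parameter space keeps the effective sample size healthy under a wider family of replacement priors.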
As Bayesian buzzwords continue to pop onto arXiv daily, one often runs across ad hoc methods for Monte Carlo integration of posteriors. Of course, there are ready-made tools that can be downloaded and easily used for most problems; still, one occasionally finds a lack of literacy in statistics leading to methods which are, to use the words of Ewan, without principle. Strangely, there is a concentration of confusion concerning Bayesian methods in x-ray polarization studies. Much of this comes from the classic misunderstanding of credible regions and confidence regions. You’ll see astronomers worried about not being able to detect zero polarization as it lies on a parameter boundary. This led to some Bayesian magic here that touches a bit on the issue, but applies only to a very fragile and specific type of polarization data: one where the data directly measure the latent polarization and everything is beautifully Gaussian. In other words, not x-ray data! The point of that work was to derive an analytic posterior for these types of measurements. We’ll come back to this.
Nevertheless, the idea was picked up by x-ray astronomers and non-principled methods have taken over. For example, here is an application that ignores all the fragility inherent in Poisson-distributed data to derive an “analytic” posterior assuming that low-count Poisson data are Gaussian; something that is simply not true. But, as always, throw enough fancy statistics buzzwords in and no one will notice. The real problem with the use of Vaillancourt’s work for x-ray data is that Vaillancourt derived a posterior for the measurements: assuming a prior and a Gaussian likelihood, the Rice distribution was derived. But the works above, and these here, use the Rice distribution as the likelihood! They then place further priors on its parameters, do things like background-subtract Poisson data, and run some form of random number generator to obtain trace plots. It is difficult even to articulate why these approaches are wrong, as they are simply made up via some form of heuristic intuition and miss the core concepts of what a Bayesian analysis is. But the word Bayesian is in the papers, so it must be sophisticated, right?
As a side note, we attempted to solve this issue here by deriving the likelihood, a conditional Poisson likelihood, and performing Bayesian estimation of the latent parameters. We, of course, recover the Rice distribution; because that is the posterior.
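The distinction is easy to check by simulation: in the Gaussian-measurement setting described above (with illustrative numbers of my own choosing), the measured polarization amplitude p = sqrt(q² + u²) follows a Rice distribution; nothing of the sort is guaranteed for Poisson counts.

```python
import numpy as np
from scipy.stats import rice

rng = np.random.default_rng(5)

# Gaussian-measured Stokes parameters with illustrative true values.
q0, u0, sigma = 0.3, 0.4, 0.1
n = 200000
q = rng.normal(q0, sigma, n)
u = rng.normal(u0, sigma, n)
p = np.hypot(q, u)                                # measured amplitude

# scipy parametrises the Rice distribution by shape b = nu / sigma with
# scale sigma, where nu is the true polarization amplitude.
b = np.hypot(q0, u0) / sigma
emp_mean = p.mean()                               # should match rice.mean(b, scale=sigma)
```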
And yet, this week we had a new heuristic approach to Bayesian estimation for x-ray polarization here. There are many misconceptions in the work, but we can concentrate on the issues with statistics. First, there is a claim of doing Bayesian estimation to derive confidence intervals. The fun part is the appendix on how the Monte Carlo was performed. It appears the authors drew random values of the data from Gaussian distributions (yes, it is Poisson data), then performed a least-squares fit to this randomly drawn data for the parameter values… full stop… what the <insert favorite Bayesian buzzword here>. Thus, we have a long way to go in communicating how to perform Bayesian analysis, even though you can download every tested sampler, easily write a Metropolis-Hastings algorithm, or just google for a few interesting Bayesian blog posts. Intuition is a great tool, but principled Monte Carlo integration techniques are not proven on intuition alone. When in doubt, consult a statistician.
Trying to produce works that both illuminate the proper use of statistical methodology and perhaps tackle some relevant scientific question is a tough game. Any of us who get comments from the referee such as “why are you using this complicated <insert something from proper stats literature here> when you can just check chi2” know that it takes an immense amount of momentum to move the field in a statistically sound direction. Yet the ML buzzword hype is moving into statistics, and astro papers are becoming littered with words such as information criteria and marginalization. It is great that researchers are aware of these terms. But like any method in science, they cannot just be thrown around: the words matter. A few improper usages of these words and a trend begins, and soon we look back and find “but what’s the comparison between the BIC and AIC in your work?” asked as haphazardly as “what’s your reduced chi2?”. Here are a few examples of the common comparison between the AIC and BIC: here, here, and here. The point is that these numbers aren’t telling you the same information (if the BIC is telling you anything!) about the model and data. But that takes a bit deeper reading in the stats literature. Then again, the same can be said for reduced chi2 or Wilks’ theorem.
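As a reminder of why they aren’t interchangeable, here is a toy numpy computation (my own example) of both criteria for nested Gaussian regression models: AIC = 2k - 2 log L penalises each parameter by a constant, while BIC = k log n - 2 log L penalises by log n, so the two need not agree and certainly don’t measure the same thing.

```python
import numpy as np

rng = np.random.default_rng(6)

# Data generated from the simpler (degree-1) model.
n = 500
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 0.3 * x + 0.5 * rng.standard_normal(n)

def gaussian_ic(degree):
    """AIC and BIC for a polynomial fit with Gaussian errors."""
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta)**2)            # MLE of the noise variance
    k = degree + 2                                 # coefficients + sigma
    loglike = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return 2 * k - 2 * loglike, k * np.log(n) - 2 * loglike

aic1, bic1 = gaussian_ic(1)                        # true model
aic5, bic5 = gaussian_ic(5)                        # over-parameterised model
```

The gap between the two criteria grows with both k and n, which is precisely why quoting “the AIC and BIC” as a single interchangeable diagnostic is meaningless.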
Still, things can get completely out of hand: here is an example where everything from maximum likelihood to MCMC to distribution is misused so badly that the terms become meaningless. However, with enough of the right stats buzzwords, a reviewer who is not an expert might just accept the sophistication of the work based on the language alone. We have to be careful and investigate the methods we use, the questions they address, and the words associated with them. Otherwise, we might as well use chi-by-eye fits, as the random BIC numbers being spewed out are just as meaningless.
*Editor’s note [E.C.]: The silly title is entirely my own; ‘polishing the unpolishable’ is a euphemism for an English expression that describes one’s endeavours when fruitlessly trying to fix something that is fundamentally broken, i.e., ‘polishing a turd’.
An interesting arXival from last week describes a “Bayesian neural flow” (i.e., masked auto-encoder) model as a data-adaptive prior for estimation of the de-noised (latent) distribution of Gaia catalogue stars in colour-magnitude space. The advantage of neural flow models in this context (density estimation) is the ability to produce a highly flexible distribution model with readily computable normalisation (via the Jacobian); this is otherwise a problem for more common semi-parametric distribution models (e.g. log-Gaussian Cox process models). The authors demonstrate the potential use of this model as a prior (what I would call a ‘highly flexible Bayesian shrinkage prior’) for refining Gaia distance estimates. The learned colour-magnitude diagram is impressive, although, as the authors illustrate with the visualisation of this distribution on a log-density colour scale, there is some weird-looking filamentary residual structure where the data are sparse. It seems to me that an outstanding problem for this type of model is how one might best introduce an additional penalty towards smoothness in the projected (colour-magnitude diagram) space of the neural flows to damp down such filaments. Of course, one can easily add an arbitrary smoothness penalty to the likelihood function, but ideally what we need is a penalty that’s reasonably data-adaptive. Inside the Bayesian paradigm the nearest methods circle back to Gaussian processes, while outside of it the nearest comparable methods are those for choosing the penalty on spline fits or kernel smoothing.
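For readers unfamiliar with the mechanics, here is a one-dimensional toy (an affine map standing in for the paper’s masked autoregressive architecture, with made-up parameter values) showing why flows give a tractable, exactly normalised density via the change-of-variables Jacobian.

```python
import numpy as np

mu, s = 1.5, 0.7                                   # learned-parameter stand-ins

def log_px(x):
    # Change of variables for x = f(z) = mu + s * z with base z ~ N(0, 1):
    # log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1} / d x|.
    z = (x - mu) / s
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi) - np.log(s)

# Numerical check that the flow density is exactly normalised.
grid = np.linspace(-10.0, 10.0, 20001)
f = np.exp(log_px(grid))
total = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))
```

A real flow stacks many such invertible maps with neural-network parameters, but the normalisation argument is identical: the log-Jacobian terms simply accumulate.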