An arXival from today has made me even more skeptical of the value of this notion of splitting machine learning uncertainty into so-called aleatoric and epistemic contributions à la Kendall & Gal (2017): aleatoric = “noise inherent in the observations … for example sensor noise or motion noise, resulting in uncertainty which cannot be reduced even if more data were to be collected” vs epistemic = “uncertainty in the model parameters … This uncertainty can be explained away given enough data”. My original objection is that this creates an artificial distinction between what are essentially the prior and the likelihood. My additional objection is that these terms have been defined with insufficient clarity, so it is no surprise that their definitions drift across downstream publications like words in a game of telephone. In today’s arXival we see that epistemic uncertainties are now supposedly accounting for model misspecification: “errors due to things that can be known but are neglect[ed] in the current investigation, for instance, certain effects that are not modelled”, which is only a little further than the network-structure uncertainty implied by the usage in this earlier astronomical publication: “more flexible networks or more training can reduce them”.
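For concreteness, the Kendall & Gal (2017) decomposition is usually operationalised via the law of total variance over stochastic forward passes. A minimal numpy sketch, with the passes simulated rather than produced by an actual network (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for T stochastic forward passes (e.g. MC dropout):
# each pass t returns a predictive mean mu_t and a predicted noise variance
# sigma2_t for a single test input. Here we just simulate them.
T = 1000
mu = rng.normal(loc=2.0, scale=0.3, size=T)   # means scatter across passes
sigma2 = rng.uniform(0.8, 1.2, size=T)        # per-pass "aleatoric" variance

# Kendall & Gal-style decomposition via the law of total variance:
# total predictive variance = E[sigma2] (aleatoric) + Var[mu] (epistemic)
aleatoric = sigma2.mean()
epistemic = mu.var()
total = aleatoric + epistemic
```

Note that nothing in this arithmetic distinguishes the two terms beyond where they sit in the decomposition: relabel the sources of scatter and the split changes, which is part of why the definitions drift so easily downstream.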
In these particular astronomical applications things become even more squirrelly, because the exercise there is essentially to train a network such that input lensing-system images return output unimodal, diagonal-covariance Gaussian approximations to the posterior in a small number (<10) of model parameters. That is, the network is the simplest possible neural density estimator, with the width of the predicted Gaussian being the approximation of its “aleatoric” uncertainty and the jittering of this prediction via dropout (to improve its frequentist coverage properties) being the “epistemic” uncertainty. In neither of these publications do we get to see how these two sources of uncertainty compare with each other or what effect they’re having on the modelling, although it is stated without demonstration that “the total probability distributions could be highly non-Gaussian, due to the contribution of the epistemic uncertainties”. Of course, no comparisons are actually made to the full posteriors from a likelihood-based Bayesian analysis to examine the value of the resulting approximations, nor are any tests run to see whether the recovered posterior approximations saturate the Cramér-Rao bound. Still, the medium is the message, so delivering inference via machine learning (whether that is necessary or not) is probably what you want to be doing these days to position yourself best for funding opportunities.
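To make the recipe above concrete, here is a toy numpy sketch of the same construction: a network outputting the mean and variance of a diagonal Gaussian, with dropout kept on at test time. The architecture, weights, and dropout rate are invented for illustration; none of this is taken from either publication.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the trained network: one hidden layer mapping an input
# to the mean and log-variance of a diagonal Gaussian posterior
# approximation over 3 "model parameters". All values are made up.
W1 = rng.normal(size=(16, 4))
W_mu = rng.normal(size=(3, 16)) / 4.0
W_logvar = rng.normal(size=(3, 16)) / 4.0
p_drop = 0.2

def forward(x, rng):
    """One stochastic pass: the dropout mask stays ON at test time."""
    h = np.maximum(W1 @ x, 0.0)               # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop       # Bernoulli keep-mask
    h = h * mask / (1.0 - p_drop)             # inverted dropout scaling
    return W_mu @ h, np.exp(W_logvar @ h)     # mean, predicted variance

x = rng.normal(size=4)
passes = [forward(x, rng) for _ in range(200)]
mus = np.stack([m for m, _ in passes])
vars_ = np.stack([v for _, v in passes])

aleatoric = vars_.mean(axis=0)  # width of the predicted Gaussians
epistemic = mus.var(axis=0)     # jitter induced by the dropout masks
```

The point of the sketch is how little machinery is involved: the “epistemic” term is nothing more than the scatter of the network’s outputs under resampled dropout masks, which is exactly why one would want to see it compared against the “aleatoric” term, and against a full likelihood-based posterior, before trusting the decomposition.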
One final gripe about the way that the ‘dropout as variational inference’ meme is presented in the earlier astronomical publication: here it is made to seem that the rando choice of the layered Bernoulli distribution as a variational family is somehow just a natural thing to do, and then ‘hey presto, dropout is Bayesian’. (As opposed to it being ‘chosen’ a posteriori once the dropout technique was known.) See, e.g., the text around the line “The form of this variational distribution is an arbitrary choice.” Arbitrary = ‘based on random choice or personal whim, rather than any reason or system’! Unless … is that actually a clever dig at Gal & Ghahramani (2016)?!
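Stripped to a single layer, the reverse-engineering is easy to see: the variational family puts whole columns of a fixed weight matrix to zero via Bernoulli draws, which is precisely a dropout mask. A minimal numpy sketch, with the matrix M and keep probability being arbitrary illustrative values (not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Single-layer version of the layered Bernoulli variational family:
# a weight sample is W = M * diag(z) with z_j ~ Bernoulli(p_keep).
# M and p_keep are arbitrary illustrative values.
M = rng.normal(size=(8, 5))
p_keep = 0.5

z = (rng.random(5) < p_keep).astype(float)  # one Bernoulli per input unit
W = M * z          # broadcasting zeroes out the dropped columns of M

x = rng.normal(size=5)
out_q = W @ x              # applying a weight sample from q ...
out_dropout = M @ (x * z)  # ... equals dropout on the input, then M
```

That the two outputs coincide is the entire content of the equivalence, which is rather the point: the family fits the pre-existing technique like a glove, a posteriori.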