At a recent epi-statistics tutorial I attended (which included an hour on ‘introduction to Bayesian inference’ for graduate epidemiology students) the speaker made what I felt to be an overly zealous argument to the effect that one should never report a maximum a posteriori (MAP) parameter estimate; instead one should report the posterior mean. That is, I specifically noted down a paraphrasing of his commentary in my journal as: “Don’t use MAP estimators, use expectations; the mode doesn’t tell you anything!”, along with paraphrasings of his two main arguments: “the mode is sensitive to parameter rescaling” and “the mode (& density) doesn’t trace the posterior mass in high dimensions”. For what it’s worth, here is my rebuttal.
Re: “the mode doesn’t tell you anything!”
But of course it does: it tells you where the maximum density of posterior mass is with respect to the base measure encoded in the prior! Is this necessarily invariant to a change of base measure? No, but that doesn’t have to be a problem: see point (i) below. Does this mean the bulk of posterior mass is located within a small neighbourhood of the MAP location? Not necessarily, but again, this doesn’t mean we have to throw away the MAP completely: see point (ii) below.
Sure, the density is only a representation of the posterior distribution of a random variable, $\theta$, which is itself a mapping from the point set of physical model parameters to numbers on the real line; but under a few basic regularity conditions it can tell us a lot about physical reality. For a large class of models, supposing continuity of the prior and likelihood in the neighbourhood of the true parameter value, the Bernstein–von Mises theorem gives almost sure convergence of the MAP estimate to the truth as we gather more and more data: a thoroughly reassuring result! Yet the same is not guaranteed for the posterior mean. Another attractive property of the MAP estimator is that (if it exists*) it must lie within the prior support, i.e. in a region of parameter space corresponding to an allowed model parameterisation. Not so with the posterior mean, which can lie in a region of parameter space with $\pi(\theta) = 0$, for which there may simply be no physical interpretation: not very helpful as a point summary of our posterior!
*Yes, there can be cases (e.g. two equal peaks) for which the posterior mode is not unique, or cases (e.g. updating a flat prior on the population proportion given one Bernoulli distributed observable) where the mode lies at the boundary of the prior support. But, equally, there can be cases (e.g. for a Cauchy-like posterior) whereby the posterior mean doesn’t exist (though the mode does). This is a generic problem of posterior point summaries, rather than a problem with the MAP itself.
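Both pathologies are easy to exhibit numerically. Here is a minimal sketch, using two toy densities of my own choosing (nothing from the tutorial): first, a posterior whose mean falls outside the prior support; second, the truncated absolute first moment of the standard Cauchy, which grows without bound even though the Cauchy mode is perfectly well defined.

```python
import numpy as np

# Toy posterior supported on the disjoint intervals [-2, -1] and [1, 2]:
# by symmetry its mean lands at 0, where the density is exactly zero.
grid = np.linspace(-2.5, 2.5, 100001)
dx = grid[1] - grid[0]
density = np.where((np.abs(grid) >= 1.0) & (np.abs(grid) <= 2.0), 1.0, 0.0)
density /= density.sum() * dx                  # normalise numerically

posterior_mean = (grid * density).sum() * dx   # ~0.0: outside the support
density_at_mean = density[np.argmin(np.abs(grid - posterior_mean))]  # 0.0

# The standard Cauchy has a mode at 0, but its mean does not exist: the
# truncated absolute first moment, (1/pi) * log(1 + T^2), diverges with T.
def truncated_abs_moment(T):
    return np.log1p(T ** 2) / np.pi

moments = [truncated_abs_moment(T) for T in (1e1, 1e3, 1e5, 1e7)]

print(posterior_mean, density_at_mean)
print(moments)
```

The first demonstration makes the point about non-convex support concrete; the second shows divergence analytically rather than by (noisy) Monte Carlo averaging of Cauchy draws.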
(i) Re: “the mode is sensitive to parameter rescaling”
Yes, this is true: if we make a non-linear transformation, $\phi = g(\theta)$, to our parameter, $\theta$, we can shift the posterior mode in such a way that $\mathrm{mode}(\phi) \neq g(\mathrm{mode}(\theta))$. For example, take $\theta \sim \mathrm{LogNormal}(\mu, \sigma^2)$ and $g(\theta) = \log \theta$: thus $g(\mathrm{mode}(\theta)) = \log e^{\mu - \sigma^2} = \mu - \sigma^2$ while $\mathrm{mode}(\phi) = \mu$. BUT this sensitivity to parameter rescaling is also true for the mean! In our example, $g(\mathrm{mean}(\theta)) = \log e^{\mu + \sigma^2/2} = \mu + \sigma^2/2$ while $\mathrm{mean}(\phi) = \mu$. This is again a generic problem with point estimators.
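A lognormal-to-normal example of this kind can be checked in a few lines, using nothing beyond the standard closed-form moments of the two distributions:

```python
import numpy as np

# If the posterior for theta is LogNormal(mu, sigma^2), then phi = log(theta)
# is Normal(mu, sigma^2), for which mode and mean coincide at mu.
mu, sigma = 0.5, 1.2

mode_theta = np.exp(mu - sigma ** 2)       # closed-form lognormal mode
mean_theta = np.exp(mu + sigma ** 2 / 2)   # closed-form lognormal mean
mode_phi = mean_phi = mu                   # normal: mode = mean = mu

# Neither point estimator commutes with the transform g = log:
gap_mode = np.log(mode_theta) - mode_phi   # = -sigma^2, not 0
gap_mean = np.log(mean_theta) - mean_phi   # = +sigma^2 / 2, not 0
print(gap_mode, gap_mean)
```

Both gaps are non-zero, so neither the mode nor the mean is equivariant under the non-linear reparameterisation.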
(ii) Re: “the mode (& density) doesn’t trace the posterior mass in high dimensions”
Sure, even for a well-behaved Normal posterior, as the dimension increases the concentration of mass shifts from directly around the mode out to larger radii, so the importance of the immediate neighbourhood of the posterior mode does indeed decrease. But we know this (it’s in the nature of high-dimensional geometry), and in algorithms making use of the posterior mode (like importance nested sampling) the addition of information about the curvature of the posterior from the observed Fisher information (numerically, the Hessian) effectively acknowledges this point and helps us begin designing effective proposals for whatever our purpose (e.g. posterior simulation, or estimation of a particular posterior functional). Likewise for the demonstrable utility of Laplace approximations about the posterior mode of the GMRF component in INLA: if Håvard Rue had stuck to the mantra that the mean is always more informative than the mode, where would spatial statistics be now?!
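To make the mode-plus-curvature idea concrete, here is a sketch of the basic Laplace-approximation step on a toy Gamma “posterior” of my own choosing (picked because its mode and variance have closed forms for comparison); this illustrates the generic construction, not INLA itself:

```python
import numpy as np

# Toy target: an unnormalised Gamma(alpha, beta) density, treated as a posterior.
alpha, beta = 20.0, 4.0

def neg_log_post(theta):
    # negative log of theta^(alpha-1) * exp(-beta * theta)
    return -(alpha - 1.0) * np.log(theta) + beta * theta

# Locate the mode on a fine grid (closed form: (alpha - 1) / beta = 4.75).
grid = np.linspace(1e-3, 20.0, 400001)
mode = grid[np.argmin(neg_log_post(grid))]

# Curvature at the mode via a central finite difference: the numerical
# stand-in for the observed Fisher information (Hessian).
h = 1e-4
hess = (neg_log_post(mode + h) - 2 * neg_log_post(mode)
        + neg_log_post(mode - h)) / h ** 2

laplace_var = 1.0 / hess      # variance of the Gaussian (Laplace) proposal
true_var = alpha / beta ** 2  # exact Gamma variance, for comparison

print(mode, laplace_var, true_var)  # ~4.75, ~1.1875 vs 1.25
```

The Gaussian built from the mode and the inverse Hessian already tracks the true spread closely here, which is exactly why mode-based curvature information is a useful starting point for designing proposals.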
In summary, while I agree with the broad thesis perhaps underlying the speaker’s statements—that no point estimator should be regarded as a meaningful summary of the full posterior—I strongly disagree with the assertion that the posterior mean is universally more informative or more useful than the posterior mode.