## In defense of the maximum a posteriori (posterior mode) …

At a recent epi-statistics tutorial I attended (which included an hour on ‘introduction to Bayesian inference’ for graduate epidemiology students) the speaker made what I felt to be an overly zealous argument to the effect that one should never report a maximum a posteriori parameter estimate; instead one should report the posterior mean. Specifically, I noted down a paraphrasing of his commentary in my journal as: “Don’t use MAP estimators, use expectations; the mode doesn’t tell you anything!”, along with paraphrasings of his two main arguments: “the mode is sensitive to parameter rescaling” and “the mode (& density) doesn’t trace the posterior mass in high dimensions”. For what it’s worth, here is my rebuttal.

Re: “the mode doesn’t tell you anything!”
But of course it does: it tells you where the maximum density of posterior mass is with respect to the base measure encoded in the prior! Is this necessarily invariant to a change of base measure? No, but that doesn’t have to be a problem: see point (i) below. Does this mean the bulk of posterior mass is located within a small neighbourhood of the MAP location? Not necessarily, but again, this doesn’t mean we have to throw away the MAP completely: see point (ii) below.

Sure, the density is only a representation of the posterior distribution of a random variable, $\theta$, which is itself a mapping from the point set of physical model parameters to numbers on the real line; but under a few basic regularity conditions it can tell us a lot about physical reality. For a large class of models, supposing continuity of the prior and likelihood in the neighbourhood of the true parameter value, the Bernstein-von Mises theorem gives almost sure convergence of the MAP estimate to the truth as we gather more and more data—which is a thoroughly reassuring result! Yet the same is not guaranteed for the posterior mean. Another attractive property of the MAP estimator is that (if it exists*) it must lie within the prior support: i.e. in parameter space corresponding to an allowed model parameterisation. Not so with the posterior mean, which can lie in a region of parameter space with $\pi(\theta)=0$, for which there may simply be no physical interpretation: not very helpful as a point summary of our posterior!

*Yes, there can be cases (e.g. two equal peaks) for which the posterior mode is not unique, or cases (e.g. updating a flat prior on a population proportion given one Bernoulli-distributed observation) where the mode lies at the boundary of the prior support. But, equally, there can be cases (e.g. for a Cauchy-like posterior) whereby the posterior mean doesn’t exist (though the mode does). This is a generic problem of posterior point summaries, rather than a problem with the MAP itself.
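To make the support point concrete, here is a small numerical sketch (a toy construction of my own, not from the tutorial): a posterior with equal mass on two disjoint intervals, where the mean lands in the zero-density gap between them while the mode sits comfortably inside the support.

```python
import numpy as np

# Toy sketch (my construction): a posterior with equal mass on the two
# disjoint intervals [0, 1] and [2, 3].  The mode lies inside the
# support, but the mean lands in the gap, where the density is zero.
rng = np.random.default_rng(0)
n = 200_000
samples = np.concatenate([
    rng.uniform(0.0, 1.0, n // 2),   # mass on [0, 1]
    rng.uniform(2.0, 3.0, n // 2),   # mass on [2, 3]
])

post_mean = samples.mean()   # ≈ 1.5, outside the support
mean_in_support = (0 <= post_mean <= 1) or (2 <= post_mean <= 3)
print(round(post_mean, 2), mean_in_support)   # ≈ 1.5, False
```

As a point summary of this posterior, 1.5 corresponds to a parameter value the model assigns zero prior probability: exactly the "no physical interpretation" problem above.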

(i) Re: “the mode is sensitive to parameter rescaling”
Yes, this is true: if we make a non-linear transformation, $\theta^\prime=g(\theta)$, to our parameter, $\theta$, we can shift the posterior mode in such a way that $g^{-1}(\mathrm{mode}[\theta^\prime]) \neq \mathrm{mode}[\theta]$. For example, take $\theta \sim \mathrm{Beta}(4,4)$ and $\theta^\prime = \theta^2$: then $\mathrm{mode}[\theta] = 0.5$ while $g^{-1}(\mathrm{mode}[\theta^\prime]) = 0.4$. BUT this sensitivity to parameter rescaling is also true of the mean! In our example, $\mathrm{mean}[\theta] = 0.5$ and $g^{-1}(\mathrm{mean}[\theta^\prime]) = 0.53$. This is again a generic problem with point estimators.
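As a sanity check on these numbers, here is a quick grid-based computation of the Beta(4,4) example (pure NumPy; densities evaluated up to normalisation):

```python
import numpy as np

# Grid check of the Beta(4,4) example under theta' = theta^2.
t = np.linspace(1e-6, 1 - 1e-6, 200_001)
dens = t**3 * (1 - t)**3                         # Beta(4,4) density, unnormalised

mode_theta = t[np.argmax(dens)]                  # mode[theta] = 0.5

# Density of theta' = theta^2 by change of variables:
#   f'(u) = f(sqrt(u)) / (2 sqrt(u))  on (0, 1)
s = np.sqrt(t)
dens_prime = s**3 * (1 - s)**3 / (2 * s)
mode_theta_prime = t[np.argmax(dens_prime)]      # mode[theta'] = 0.16
print(round(mode_theta, 3), round(np.sqrt(mode_theta_prime), 3))   # ≈ 0.5, 0.4

# Means by grid integration; the normalising constant cancels.
mean_theta = (t * dens).sum() / dens.sum()            # = 0.5 by symmetry
mean_theta_sq = (t**2 * dens).sum() / dens.sum()      # E[theta^2] = mean[theta']
print(round(mean_theta, 3), round(np.sqrt(mean_theta_sq), 3))      # ≈ 0.5, 0.527
```

Both point summaries move under the transformation; neither commutes with $g$.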

(ii) Re: “the mode (& density) doesn’t trace the posterior mass in high dimensions”
Sure, even for a well-behaved Normal posterior, as the dimension increases the concentration of mass shifts from directly around the mode out to larger radii, so indeed the importance of the immediate neighbourhood of the posterior mode decreases. But we know this—it’s in the nature of high-dimensional geometry—and in algorithms making use of the posterior mode (like importance nested sampling) the addition of information about the curvature of the posterior from the observed information matrix (numerically, the Hessian) effectively acknowledges this point and helps us to begin designing effective proposals for whatever our purpose (e.g. posterior simulation, or estimation of a particular posterior functional). Likewise, consider the demonstrable utility of Laplace approximations about the posterior mode of the GMRF component in INLA: if Håvard Rue had stuck to the mantra that the mean is always more informative than the mode, where would spatial statistics be now?!
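The concentration effect itself is easy to demonstrate: for a standard Normal in $d$ dimensions the mode is at the origin, yet draws concentrate on a shell of radius roughly $\sqrt{d}$ away from it.

```python
import numpy as np

# Concentration of measure: for a standard Normal posterior in d
# dimensions the mode is at the origin, but the bulk of the mass sits
# on a shell of radius ~ sqrt(d) around it.
rng = np.random.default_rng(1)
for d in (1, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))
    r = np.linalg.norm(x, axis=1)        # distance of each draw from the mode
    print(d, round(r.mean(), 1), round(np.sqrt(d), 1))
```

The typical distance from the mode tracks $\sqrt{d}$ closely, which is exactly why mode-plus-curvature (rather than the mode alone) is the useful object in high dimensions.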

In summary, while I agree with the broad thesis perhaps underlying the speaker’s statements—that no point estimator should be regarded as a meaningful summary of the full posterior—I strongly disagree with the assertion that the posterior mean is universally more informative or more useful than the posterior mode.


### 26 Responses to In defense of the maximum a posteriori (posterior mode) …

1. hogghogg says:

I couldn’t agree more

2. Pierre Jacob says:

what’s an example of Bernstein von Mises saying anything about the mode but not about the mean?

• Pierre: To construct a toy example I would start with something like this and dick around with the parameterisation to keep the mode at 0 but a non-trivial part of the mass accelerating towards infinity as more data comes in:
truth: x = 0
y ~ N(x, (x+0.1)^2) for -1 < x < 1
y ~ N(1/x, (1/x+0.1)^2) for x > 1
pi(x) ~ Cauchy(0,1) for x > -1; 0 otherwise

• Pierre Jacob says:

so that Bernstein von Mises does not hold?…

• so that Bernstein von Mises does hold but the posterior mean is not consistent

• Pierre Jacob says:

I think you might be referring to asymptotic consistency of the MLE (and thus of the posterior mode), but not asymptotic normality (which is what’s usually called Bernstein von Mises: the posterior looks more and more like a Gaussian, thus the mean and the mode converge to the same value).

• Both! Pragmatically I think of Bernstein von Mises as speaking to the convergence of the MLE and hence posterior mode (under minimal assumptions), but also to the convergence to Normality, which I understand to be in total variation distance—so offering consistency of the mode but not necessarily of the mean (i.e., convergence in total variation controls probabilities, but not unbounded functionals like the mean, unlike convergence in a Wasserstein metric). Does this sound right to you?

Edit: forget my remark about convergence in distribution implying convergence of means; obviously f(X)=X is unbounded in my example above.

4. Pierre Jacob says:

I’m a bit confused about this: if a distribution pi_n converges to a Gaussian pi^star in TV, I would have thought eventually the mean of pi_n would exist for some n large enough (with large enough probability), but you’re saying it does not need to exist for any n?

• I reckon so, if the small amount of mass not attached to the increasingly Normal part becomes shifted increasingly far towards infinity.
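For instance (a toy construction with numbers of my own choosing): take pi_n = (1 - eps_n) N(0,1) + eps_n * delta(m_n), with eps_n = exp(-n) and m_n = exp(2n). The total variation distance to N(0,1) is at most eps_n, which vanishes, yet the mean of pi_n is eps_n * m_n = exp(n), which blows up:

```python
import numpy as np

# Toy construction (mine): pi_n = (1 - eps_n) N(0,1) + eps_n * delta(m_n),
# with eps_n = exp(-n) and m_n = exp(2n).  The total variation distance
# to N(0,1) is bounded by eps_n -> 0, yet the mean eps_n * m_n = exp(n)
# diverges: TV-convergence without convergence of the mean.
for n in (1, 5, 10, 20):
    eps = np.exp(-n)          # wandering mass, exponentially decreasing
    m = np.exp(2 * n)         # ...but escaping to infinity even faster
    tv_bound = eps
    mean = eps * m            # = exp(n)
    print(n, tv_bound, mean)
```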

• Pierre Jacob says:

Yes I guess, the amount of that wandering mass has to be exponentially decreasing, but if it moves away even faster than that you would never have any finite mean. Makes sense! I wouldn’t call that a practical argument in favour of posterior modes though 🙂

• I was hoping you’d say:

5. I just want to emphasize that, despite the fact that we live in times where Bayesian-anything is the coolest and most are jumping on the bandwagon, likelihood analysis is the crucial part, especially considering the quasi-ubiquitous use of flat priors. But somehow this is not recognized as it should be. Nice post.

6. betanalpha says:

“I strongly disagree with the assertion that the posterior mean is universally more informative or more useful than the posterior mode.” Cool. It’s easy to have an argument when you base your entire reasoning on strawmen, huh?

As you very briefly alluded to above, _any_ point estimate is rife with pathologies and is the source of almost all mistakes in statistical practice. Exactly because of this I never advocated for point estimates in the first place! The point I was making is that _expectations_ with respect to the _posterior distribution_ are well-defined, parameterization-invariant objects that should be the basis of any proper decision. And it turns out decisions are kind of important in clinical inference.

In particular, regarding (i) of course the mean changes! You’re looking at the expectation of an entirely different function! What’s important is that in either parametrization the expectation of a given parameter, say of theta: sample space -> R or g(theta): sample space -> R, are consistent. As they should be.

Ultimately the problem with trying to intuit MAP estimators is that they don’t tell you anything about how to integrate against the posterior, and hence compute expectations (which is what I spent almost the entire time talking about — glad you were listening to that part!). The seductive aspect of MAPs is that in the asymptotic limit of sufficiently simple models the typical set surrounds the mode with enough symmetry that any marginal falls close to the mode; in particular, close enough that the curvature around the mode can be used to make higher-order corrections.* But add any realistic nonlinearity to your measurement model and that symmetry vanishes, leaving you with hugely biased expectation estimates and no way to even estimate that bias without going ahead and doing a full analysis.

This is why tools like INLA are so dangerous. If the Gaussian hierarchies faithfully model a given problem then INLA is awesome, but if any of those assumptions are poor (which they will be in any realistic model with complex measurement models and sparse data) then the answer can be hugely misleading (with few diagnostics available). I trust INLA in the hands of experts, but not in the hands of novices.

*Incidentally, will people _please_ stop saying things like “empirical Fisher information”? Removing the expectation over data completely destroys all of the important properties of the Fisher information — you’re just making a Laplace approximation.

• >> And it turns out decisions are kind of important in clinical inference.
Yes, I know, I actually work in that field! (cf. MAP’s collaboration with the WHO, and my personal collaboration with the Kirby Institute) 🙂

>> What’s important is that in either parametrization the expectation of a given parameter, say of theta: sample space -> R or g(theta): sample space -> R, are consistent.

Okay, so long as we agree that here we mean *invariant*, not technically consistent (in the sense of an estimator of a parameter converging in probability to the truth as the data increase): as in my discussion with Pierre.

>> Ultimately the problem with trying to intuit MAP estimators is that they don’t tell you anything about how to integrate against the posterior

They tell you where to start looking: more so than the mean if e.g. the posterior is multimodal.

So while I think we agree on the limitations of point estimators, I still think we disagree about the importance/significance of probability densities: IMO they are NOT just a computational convenience but actually give a lot of insight into our state of beliefs encoded in the mapping from reality to the real line via our random variable. All that any regularity conditions (such as continuity of the likelihood in a neighbourhood of the truth) are doing is making sure our map is not arbitrarily silly.

• betanalpha says:

>> Okay, so long as we agree that here we mean *invariant*, not technically consistent (in the sense of an estimator of a parameter converging in probability to the truth as the data increase): as in my discussion with Pierre.

Of course. Although formal consistency, given its assumption that a choice of small world has any hope of capturing the true data generating process, is an entirely different issue. In practice any simple model won’t have the complexity to capture the true data generating process so consistency is irrelevant. Similarly, as you make your model more complex in an attempt to capture the truth you push further and further away from the “large data” limit and any asymptotic considerations at all are questionable.

Asymptotics, to paraphrase Johnson, are the last refuge of the statistical scoundrel. 😀

>>They tell you where to start looking: more so than the mean if e.g. the posterior is multimodal.

>> So while I think we agree on the limitations of point estimators, I still think we disagree about the importance/significance of probability densities: IMO they are NOT just a computational convenience but actually give a lot of insight into our state of beliefs encoded in the mapping from reality to the real line via our random variable. All that any regularity conditions (such as continuity of the likelihood in a neighbourhood of the truth) are doing is making sure our map is not arbitrarily silly.

Formally our beliefs are encoded in a measure. Now I agree that measures are completely non-intuitive, but I don’t think that densities are necessarily any better. The problem is that all of our intuitions about how densities manifest properties of the corresponding measure are limited to low-dimensional cases (high-dimensional Lebesgue measures are fricken crazy), and outside of a few dimensions they are more misleading than helpful. Really it’s low-dimensional _marginal_ densities that offer useful intuition into a given problem, but marginalization requires high-dimensional integration which requires a principled, scalable algorithm in the first place!

Multimodality is an even more subtle issue that goes way deeper than “local MAP” vs “global expectation”, but that’s best left to a conversation in of itself.

7. “The point I was making is that _expectations_ with respect to the _posterior distribution_ are well-defined, parameterization-invariant objects that should be the basis of any proper decision. And it turns out decisions are kind of important in clinical inference.”

Depends on the loss function; it’s quite trivial to define a loss function that makes the MAP the optimal point estimate. “Mean is best” is not universal, end of conversation.

• betanalpha says:

Nope, in fact it’s incredibly nontrivial. In order to try to formalize a MAP estimator you need to construct $\lim_{n \rightarrow \infty} E[ L_{n} (g, \hat{g}) ]$, but if you’re careful in each of your steps you’ll find that the limit and the expectation do not commute — the $L_{\infty}$ norm does not define a valid loss function! While the mean and median are well-defined and parameterization-invariant the MAP is not!

Also, thanks for parroting the point estimate strawman I never proposed!

• FYI I just edited your comment to turn on latex: {dollar sign}latex {dollar sign}

As you say it’s a strawman argument, I guess there are two points to clarify: (1) this is definitely the message I took home from the talk—“one should never report a maximum a posteriori parameter estimate, instead one should report the posterior mean”—so it may be worth tweaking your presentation to convey exactly what you mean in future talks; and (2) what do you propose people should report as the estimate (and credible interval) of a parameter (if you accept this is a valid thing one might want to do to summarise results in a paper)? Mean, mode, or median?

Interestingly, we just had a talk from Axel Finke on Monte Carlo Optimisation to find the MAP as the limit of a sequence of auxiliary distributions.

• betanalpha says:

> this is definitely the message I took home at the talk—“one should never report a maximum a posteriori parameter estimate, instead one should report the posterior mean”

Let’s be very careful here. In the general set up we construct a posterior measure that encodes our information (and assumptions) about a system and then we can compute expectations of functions with respect to that measure — that’s it. There’s not even a MAP estimator to argue about yet.

Perhaps the most important functions whose expectations we care about are loss (or utility) functions. Taking expectations of these allow us to make Bayes-optimal decisions, which is obviously hugely important in fields like epidemiology.

In practice one typically also defines $n$ coordinate or parameter functions that map our sample space into $R^{n}$. Given this map we can define the probability density of our posterior with respect to the Lebesgue measure on $R^{n}$ with which most people are more familiar. Now the problem with this density is that it’s not unique — we can construct an infinite family of parameter functions that all define different densities. Consequently anything that depends on the choice of parameter functions (such as the densities and hence MAP values) is not a proper probabilistic manipulation.

Now here’s where I think much of the confusion arises. Sometimes there are certain components of the parameter functions that have particular meaning with respect to the model, so those components serve double duty as functions of interest and parameters. As functions of interest, however, we are still limited to expectations as the only well-defined operations. MAP estimates essentially assume a limit where the posterior measure concentrates at a single point so that expectations can be approximated by point values, $\mathbb{E} [ g ] \approx g( \theta_{\mathrm{MAP}})$. But this limit is hard to construct mathematically (which is, for example, why the $L_{\infty}$ loss function is not well-defined), let alone being a reasonable assumption in practice.
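To see that naive construction in action, here is a toy sketch (an Exponential(1) “posterior” and numbers of my own choosing): under the 0-1 loss L_eps(theta, a) = 1{|theta - a| > eps}, the Bayes action maximises P(|theta - a| <= eps) and is perfectly well-defined for any fixed eps, sliding towards the mode only in the eps -> 0 limit:

```python
import numpy as np

# Toy sketch: Bayes action under the shrinking 0-1 loss
#   L_eps(theta, a) = 1{|theta - a| > eps}
# for an Exponential(1) "posterior" (mode 0, mean 1).  The Bayes action
# maximises the covered probability P(|theta - a| <= eps).
def bayes_action(eps, grid=np.linspace(0.0, 3.0, 3001)):
    cdf = lambda x: 1.0 - np.exp(-np.maximum(x, 0.0))   # Exp(1) CDF
    cover = cdf(grid + eps) - cdf(grid - eps)           # P(|theta - a| <= eps)
    return grid[np.argmax(cover)]

for eps in (0.5, 0.1, 0.01):
    print(eps, round(bayes_action(eps), 3))   # action ≈ eps, drifting to the mode at 0
```

For each finite eps the action sits at eps, well away from both the mean (1) and, for moderate eps, the mode (0); only the limit, with all its pathologies, recovers the MAP.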

So if a parameter does have meaning you’re still stuck with proper expectations, whether they be means or variance or quantiles (such as the median).

> what do you propose people should report as the estimate (and credible interval) of a parameter (if you accept this is a valid thing that one might want to do to summarise their results in a paper)? Mean, mode, or median?

This is a very different question. Really the (usually unwelcome) right answer is to not report such estimates — if you have a loss function then report the expected loss, otherwise report your entire model (probabilistic programming languages are great for this!) and data, perhaps even with code to enable readers to compute expectations of their own (hence tools like Stan).

Often the situation in practice is that one function has been singled out as important (i.e. all loss functions would depend only on that function) but an exact loss function has not been chosen. In this case we’re interested in the _marginal_ distribution of that function which, you guessed it, is another expectation! The mean, variance, and quantiles provide some intuition as to the shape of that marginal distribution and hence can be helpful, but you might as well just display the marginal distribution (via a KDE or a histogram, for example, which can also be posed as expectations!!!eleven!). Again, no single expectation will summarize the entire marginal distribution and hence no single expectation will be sufficient for reporting!
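Concretely (a toy sketch with a Gamma “posterior” of my choosing): each normalised histogram bin height is just a Monte Carlo estimate of an indicator expectation, E[1{theta in bin}]:

```python
import numpy as np

# Histograms as expectations: each normalised bin height is the
# posterior expectation of an indicator, E[1{theta in bin}], estimated
# here by Monte Carlo from posterior draws (toy Gamma posterior).
rng = np.random.default_rng(2)
draws = rng.gamma(shape=3.0, scale=1.0, size=100_000)   # stand-in posterior draws

edges = np.linspace(0.0, 10.0, 21)
indicator_means = np.array([
    np.mean((draws >= lo) & (draws < hi))   # E[1{lo <= theta < hi}]
    for lo, hi in zip(edges[:-1], edges[1:])
])

# These indicator expectations match an ordinary normalised histogram:
hist, _ = np.histogram(draws, bins=edges)
print(np.allclose(indicator_means, hist / draws.size))
```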

Hopefully this provides further intuition as to why the MAP value is a bad idea — in complex problems it’s often very far away from the mass of any marginal distribution.

> Interestingly, we just had a talk from Axel Finke on Monte Carlo Optimisation to find the MAP as the limit of a sequence of auxiliary distributions.

The important detail here being a limit of distributions! A limit which will depend on the chosen parameterization.

• What are you talking about? What’s n and why should I take the limit as it goes to infinity? Fancy irrelevant distractions. Consider a discrete example with two possibilities, that’s all you need.

• betanalpha says:

> What are you talking about? What’s n and why should I take the limit as it goes to infinity?

n is the order of the distance-based loss function — this is exactly the argument people make for the MAP from a decision-theoretic perspective, as you advocated. Again, it works for any finite n but not in the limit of n going to infinity, which would naively recover the MAP estimator.

> Fancy irrelevant distractions.

That’s one way to describe the mathematics trying to keep you from making mistakes! If the math weren’t relevant I wouldn’t be bringing it up (and there is a wealth of irrelevant pedantry to be found in measure theory).

> Consider a discrete example with two possibilities, that’s all you need.

All you need in the discrete case, but only because of the uniqueness of the counting measure. The argument absolutely does not carry over to the continuous case.

• Pierre Jacob says:

Hey,

That’s an interesting point I never saw discussed before. Can you provide references?

Cheers,

Pierre

• Pierre Jacob says:

Sorry I meant: this point.

“Nope, in fact it’s incredibly nontrivial. In order to try to formalize a MAP estimator you need to construct lim_{n \rightarrow \infty} E[ L_{n} (g, \hat{g}) ], but if you’re careful in each of your steps you’ll find that the limit and the expectation do not commute — the L_{\infty} norm does not define a valid loss function! While the mean and median are well-defined and parameterization-invariant the MAP is not!”

• betanalpha says:

Pierre,

There don’t seem to be too many great papers that directly criticize the naive loss functions, they either assume their validity a priori or never consider them at all, so I’d just forward you to X’s great discussions on his blog, for example https://xianblog.wordpress.com/2011/04/25/map-mle-and-loss/, which are peppered with references.

• Pierre Jacob says:

Thanks a lot!

8. It doesn’t need to carry over to the continuous case. I’m only trying to say that there exist questions for which the mode is the answer and the mean isn’t. You stated above (by calling my post a straw man) that that wasn’t your claim at all, in which case there’s nothing to disagree about. Sorry for being overly grumpy about it.