Fugitive and cloistered virtues

There’s a new revolution, a loud evolution that I saw. Born of confusion and quiet collusion of which mostly I’ve known. — Lana Del Ray

While an unexamined life seems like an excellent idea, an unexamined prior distribution probably isn’t. And, I mean, I write and talk and think and talk and write and talk and talk and talk and talk about this a lot. I am pretty much obsessed with the various ways in which we need to examine how our assumptions map to our priors (and vice versa).

But also I’m teaching a fairly low-level course on Bayesian stats (think right after a linear regression course) right now, which means I am constantly desperate for straightforward but illuminating questions that give some insight into what can and can’t go wrong with Bayesian models.

So I asked myself (and subsequently my students) to work out exactly what a prior that leads to a pre-determined posterior distribution looks like. Such a prior, which is necessarily data-dependent, is at the heart of some serious technical criticisms of Bayesian methods.

So this was my question: Consider the normal-normal model $latex y_i\mid\mu\sim N(\mu,1)$, $latex i=1,\ldots,n$. What prior $latex \mu\sim N(m,s^2)$ will ensure that the posterior expectation is equal to $latex \mu^*$?

The answer will probably not shock you. In the end, it’s any prior that satisfies $latex m=ns^2(\mu^*-\bar{y})+\mu^*$. It’s pretty easy to see what this looks like asymptotically: if we fix s, then $latex m\rightarrow\mathcal{O}\left(\text{sign}(\mu^*-\mu^\text{true})n\right)$, which is to say the mean gets HUGE.

Alternatively, we can set the prior as $latex \mu\sim N(2-\bar{y},n^{-1})$, which is probably cleaner.

One thing that you notice if you simulate from this is that when the number of observations is small, these priors look kinda ordinary. It’s not until you get a large sample that either the mean buggers off to infinity or the variance squishes down to zero.

And that should be a bit of a concern. Because if you’ve got a model with a lot of parameters (or–heaven forfend!– a non-parametric component like a Gaussian process or BART prior), there is likely to be very little data “per parameter”.

So it’s probably not enough to just look at the prior and say “well the mean and variance aren’t weird, so it’s probably safe”.

(BTW, we know this! This is just another version of the “concentration of measure” phenomenon that makes perfectly ok marginal priors turn into hideous monster joint priors. In that sense, priors are a lot like rat kings–rats are perfectly cute by themselves, but disgusting and terrifying when joined together.)

So what can you do to prevent people tricking you with priors? Follow my simple three step plan.

1. Be very wary of magic numbers. Ask why the prior standard deviation is 2.43! There may be good substantive reasons. But it also might just give the answer people wanted.
2. Simulate. Parameters. From. Your. Prior. Then. Use. Them. To. Simulate. Data.
3. Investigate the stability of your posterior by re-computing it on bootstrapped or subsetted data.

The third thing is the one that is actually really hard to fool. But bootstraps can be complex when the data is complicated, so don’t just do iid splits. Make your subsampling scheme respect the structure of your data (and, ideally, the structure of your model). There is, as always, a fascinating literature on this.

Part of why we’re all obsessed with developing workflows—rather than just isolated, bespoke results—is that there’s no single procedure that can guard against bad luck or malicious intent. But if we expose more of the working parts of the statistical pipeline, there become fewer and fewer places to hide.

But also sometimes we just need a fast assignment for week 2 of the course.

1. Martha (Smith) says:

I don’t mean to cast aspersions on your post, but I must vehemently disagree with your statement, “rats are perfectly cute by themselves”.

2. jd says:

How many times have you seen “magic numbers” in the prior? (just curious. I have never seen this, but I also only have a few years experience. Mainly I see lots of default priors if they are even reported at all.)

“And that should be a bit of a concern. Because if you’ve got a model with a lot of parameters”
Does this depend on what you mean by a “model with a lot of parameters”? It seems like you are talking about a situation where you set a prior for each parameter individually. When I run a hierarchical model in brms, for example, disease ~ time + (time|state/county/site); I have a lot of parameters (and maybe few data points per group). But setting the prior involves a prior on the sd of group level intercepts and slopes, not each varying intercept or slope…like setting a prior on the partial pooling. Do you mean a situation where you have a lot of population level or ‘fixed’ effects?

For step 2, with a model with lots of varying effects, like disease ~ time + (1|group1) + (1|group2) + (1|group3), do you still simulate data from the prior? The reason that I ask is that it seems that the priors for the sd of group level effects are often much too large if you have a bunch of effects. Especially if the model is on the log scale, then those effects add up quick, and if you simulate data from those priors it is often relatively easy to get data that would be outside the realm of possibility. I think this could happen even if you put priors that were the same sd as the actual data, because when you simulate, it is possible to get all high or all low values from the groups.

• gec says:

> what you mean by a “model with a lot of parameters”?

Setting a prior on the upper level of a hierarchy automatically and implicitly sets priors for all parameters at lower levels. You provide an example of such a model in your second question

> a model with lots of varying effects

And you point out the danger involved in such a model, wherein seemingly benign prior choices at higher levels cascade downward into wonky predictions by virtue of their joint effects at lower levels. So this is a situation in which it is absolutely critical to simulate data from the prior because that’s the only way to understand the consequences of the priors you’ve set at the higher level.

To be clear, I’m not saying that complex hierarchical models are bad, just that simulation is a universally valuable tool for wrapping one’s head around them. And this is particularly so *because* it is generally not possible to set explicit priors at every level of the hierarchy.

• jd says:

Thanks.
For the question about simulating data from the prior (last paragraph), I was actually referring to a model with many group level effects but not hierarchical. Like disease ~ time + (1|group1) + (1|group2) + (1|group3), where groups are not nested. I had a model like this where I set what I thought were reasonable priors for the sd of each group level effect (reasonable in the sense that they constrained the sd of each group into something within the realm of possibility). However, when I simulated data from the model, sampling from the priors, then there was a lot of data that was much outside the realm of possibility. This was because while any one particular group in the real world might reasonably have an sd of say 2.5, not all groups would. When you simulate data from the priors though, it’s possible to get data from the model where the sample from the prior was 2.5 for every group, which in something like a negative binomial model, will add up fast to produce wonky data.
It just seems like the more group level effects you have, then the smaller the prior may need to get when trying to simulate data from the model sampling from the priors (especially for something like a negative binomial model).
Does that make any sense? I’ve actually asked this before somewhere else on this blog, but never got a response. So I figured it was a dumb question/observation or poorly explained by me.

3. You convinced me with the Gabry et al. RSS paper.

I’m almost done writing this and related material up for the Stan user’s guide. It’s going in a whole new part with chapters covering prediction, simulation-based calibration, prior and posterior predictive checks, cross-validation, and the bootstrap. The current version’s still a only a pull request until I finish the x-val chapter.

4. Ben says:

> is at the heart of some serious technical criticisms of Bayesian methods

I would very much like a follow up to: https://gelmanstatdev.wpengine.com/2019/12/03/whats-wrong-with-bayes/ if you feel inspired to type out those technical criticisms of Bayesian methods.

I like where this is going but it feels fishy that u* is fixed and isn’t going to u_{true} as n -> infinity or whatnot.

5. > Simulate. Parameters. From. Your. Prior. Then. Use. Them. To. Simulate. Data.
Here, when you point out that each point in that simulated joint distribution is a valid fake reality which can be used to check the veracity* of your current Bayesian analysis model (ideally somewhat different from the model that generated the joint distribution to reflect models always being wrong), most folks seem to get why/how unexamined priors (and data generating models) can be disastrous.

* veracity as habitual truthfulness (and away to avoid mention frequency of error explicitly).

6. Justin says:

Why stop there? I do

N(mu,s), with
mu~N(m1,s1)
m1~N(m2,s2)
m2~N(m3,s3)
m3~N(m4,s4)
m4~N(m5,s5)
m5~N(m6,s6)
m6~N(m7,s7)
m7~N(m8,s8)

Justin