## Simulation-based calibration: Two theorems

Throat-clearing

OK, not theorems. Conjectures. Actually not even conjectures, because for a conjecture you have to, y’know, conjecture something. Something precise. And I got nothing precise for you. Or, to be more precise, what is precise in this post is not new, and what is new is not precise.

Background

OK, first for the precise part (which is not new): Simulation-based calibration. You have a computer program to get posterior simulation draws from p(theta|y). You want to check that this program is doing what it’s supposed to do. You perform this check by first drawing theta_tilde from the prior, p(theta), then drawing y_tilde from the data model, p(y|theta_tilde), then running the program to get a bunch of draws of theta from p(theta|y_tilde). Putting it together, the three things you’ve sampled come from a joint distribution, p(theta_tilde, y_tilde, theta), that is symmetric in theta_tilde and theta. Another way to put it is that posterior inferences based on the simulations of theta should have the nominal coverage, averaging over the prior. One way to do this checking would be to repeat the above process (drawing theta_tilde, then y_tilde, then inference for theta) 1000 times, and then check that the posterior 50% interval includes the true theta_tilde approximately 500 times, that the posterior 95% interval includes the true theta_tilde approximately 950 times, etc. What we actually recommend is to look at the quantile of theta_tilde within the posterior draws from each fit and check that these quantiles are uniformly distributed.
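Here is a minimal sketch of that loop in Python. The model is an assumption chosen only because its posterior is known in closed form: theta ~ N(0, 1) and y_i | theta ~ N(theta, 1), so the "program" being checked is just an exact sampler. The names `posterior_draws` and `sbc_quantiles` are made up for this illustration.

```python
import numpy as np

def posterior_draws(y, n_draws, rng):
    """Exact posterior draws for the assumed toy model:
    theta ~ N(0, 1), y_i | theta ~ N(theta, 1)."""
    n = len(y)
    post_var = 1.0 / (1.0 + n)
    post_mean = post_var * y.sum()
    return rng.normal(post_mean, np.sqrt(post_var), size=n_draws)

def sbc_quantiles(n_reps=1000, n_obs=10, n_draws=200, seed=0):
    rng = np.random.default_rng(seed)
    q = np.empty(n_reps)
    for r in range(n_reps):
        theta_tilde = rng.normal(0.0, 1.0)                  # draw from the prior
        y_tilde = rng.normal(theta_tilde, 1.0, size=n_obs)  # draw from the data model
        theta = posterior_draws(y_tilde, n_draws, rng)      # run the program under test
        q[r] = np.mean(theta < theta_tilde)                 # quantile of theta_tilde among the draws
    return q

q = sbc_quantiles()
# theta_tilde is inside the central 50% interval exactly when its quantile is in (0.25, 0.75)
cover50 = np.mean((q > 0.25) & (q < 0.75))
cover95 = np.mean((q > 0.025) & (q < 0.975))
hist, _ = np.histogram(q, bins=10, range=(0.0, 1.0))
print(cover50, cover95, hist)
```

If the sampler is correct, the two coverage numbers should land near 0.50 and 0.95 and the ten histogram counts should be roughly equal, up to simulation noise and the discreteness of using 200 draws per fit.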

This will not diagnose all problems. It’s only testing for a form of coherence. For example, if your putative posterior draws are really from the prior, or if they’re from a model only fit to a part of your data, you’ll have simulation-based calibration even though you’re not fitting the model you’re supposed to be fitting. As with tests in general, you can reject but never accept.
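To make the first of those failure modes concrete, here is a small sketch under the same kind of assumed toy model (theta ~ N(0, 1), y_i | theta ~ N(theta, 1)). The "sampler" is deliberately broken: it ignores y_tilde and returns prior draws. Everything here is illustrative, not a real fitting program.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps, n_obs, n_draws = 1000, 10, 200

q_wrong = np.empty(n_reps)
for r in range(n_reps):
    theta_tilde = rng.normal(0.0, 1.0)
    y_tilde = rng.normal(theta_tilde, 1.0, size=n_obs)  # simulated, then never used
    theta = rng.normal(0.0, 1.0, size=n_draws)          # broken "posterior": prior draws
    q_wrong[r] = np.mean(theta < theta_tilde)

cover50_wrong = np.mean((q_wrong > 0.25) & (q_wrong < 0.75))
exact_post_sd = np.sqrt(1.0 / (1.0 + n_obs))  # what the sd should be in this toy model
print(cover50_wrong, exact_post_sd)
```

The 50% coverage should still come out near 0.5, because theta_tilde and the fake draws are exchangeable. But the fake draws have standard deviation 1, while the correct posterior standard deviation here is about 0.30, so the check passes even though the inferences are much vaguer than they should be.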

Anyway, simulation-based calibration exists. This is not new. But it has a couple of problems that limit its use in practice. First is the requirement of looping the procedure 1000 times or whatever so as to average over the prior. Fitting the model 1000 times takes time. Even 10 times can be annoying. The second restriction is that we’re testing that the program is exactly drawing posterior simulations. But often we want to check approximate fits using ADVI or some other quick-and-dirty method. So what do we do then?

The theorems

Now for the parts that are new but not precise.

“Theorem” 1. Suppose we do simulation-based calibration but instead of starting by taking draws theta_tilde from the prior, p(theta), we draw from an alternative distribution, g(theta), and then follow up by drawing y_tilde from the data distribution, p(y|theta_tilde), as before. The theorem is that the posterior simulations will still be approximately calibrated. How approximate this is would depend on the differences between g and p, in relation to the posterior distribution. In a real problem, we’d have simulations from the posterior, so we could possibly compute this measure. The rough idea is that the theta_tilde values we’d draw from g should represent reasonable values, nothing too far in the tails of the distribution.
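A quick numerical poke at this idea, not a proof of anything. The sketch again assumes the toy model theta ~ N(0, 1), y_i | theta ~ N(theta, 1), with an exact posterior sampler, and the particular choices of g (one mild, one far in the prior's tail) are hypothetical values picked for illustration.

```python
import numpy as np

def coverage50(g_mean, g_sd, n_obs=10, n_reps=1000, n_draws=200, seed=0):
    """Coverage of central 50% posterior intervals when theta_tilde ~ N(g_mean, g_sd^2)
    but the posterior is computed under the prior theta ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    post_sd = np.sqrt(1.0 / (1.0 + n_obs))
    q = np.empty(n_reps)
    for r in range(n_reps):
        theta_tilde = rng.normal(g_mean, g_sd)              # draw from g, not the prior
        y_tilde = rng.normal(theta_tilde, 1.0, size=n_obs)  # data model as before
        post_mean = y_tilde.sum() / (1.0 + n_obs)
        theta = rng.normal(post_mean, post_sd, size=n_draws)
        q[r] = np.mean(theta < theta_tilde)
    return np.mean((q > 0.25) & (q < 0.75))

cover = {
    "prior": coverage50(0.0, 1.0),                        # g equal to the prior
    "mild": coverage50(0.5, 0.5),                         # g inside the bulk of the prior
    "far": coverage50(4.0, 0.5),                          # g four prior sds out in the tail
    "far_more_data": coverage50(4.0, 0.5, n_obs=400),     # same g, far more informative data
}
print(cover)
```

In this toy model you can work out what happens: with g = N(m, s^2) and n observations, the standardized error (theta_tilde minus posterior mean, divided by posterior sd) is normal with mean m/sqrt(n+1) and variance (s^2+n)/(n+1), rather than N(0, 1). So the mild g stays close to calibrated, the far-out g does not, and piling on data pulls even the far-out g back toward calibration. That is at least consistent with the claim that what matters is the gap between g and p relative to the posterior.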

“Theorem” 2. Suppose we have a hierarchical model so that the parameter vector theta can be divided into a long vector of local or latent parameters, xi, and a short vector of hyperparameters, phi, so that the full prior is p(theta) = p(phi)p(xi|phi). Now suppose we start by drawing phi_tilde from some alternative distribution, g(phi), then draw xi_tilde from p(xi|phi_tilde) and y_tilde from p(y|phi_tilde, xi_tilde). The theorem is that the posterior simulations will still be approximately calibrated for the parameters in xi in the limit as the dimension of xi increases. Here the idea is that we could have weaker conditions on g(phi), compared to theorem 1.
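Here is a sketch of the kind of experiment that could probe this, using an assumed hierarchical normal model with one hyperparameter: phi = mu with prior mu ~ N(0, 1), xi_j | mu ~ N(mu, 1), and y_j | xi_j ~ N(xi_j, 1). The alternative distribution g(mu) = N(3, 0.5^2) and the function names are hypothetical. The posterior is available in closed form here, so the only thing being varied is the length J of the vector xi.

```python
import numpy as np

def xi1_quantile(J, g_mean, g_sd, n_draws, rng):
    mu_tilde = rng.normal(g_mean, g_sd)            # hyperparameter from g, not its prior
    xi_tilde = rng.normal(mu_tilde, 1.0, size=J)   # local parameters from p(xi | phi_tilde)
    y_tilde = rng.normal(xi_tilde, 1.0)            # one observation per local parameter
    # Exact posterior under the prior mu ~ N(0, 1): marginally y_j | mu ~ N(mu, 2)
    v = 1.0 / (1.0 + J / 2.0)
    m = v * y_tilde.sum() / 2.0
    mu = rng.normal(m, np.sqrt(v), size=n_draws)
    # xi_1 | mu, y_1 ~ N((mu + y_1) / 2, 1/2)
    xi1 = rng.normal((mu + y_tilde[0]) / 2.0, np.sqrt(0.5))
    return np.mean(xi1 < xi_tilde[0])

def coverage50(J, n_reps=1000, n_draws=200, seed=0):
    rng = np.random.default_rng(seed)
    q = np.array([xi1_quantile(J, 3.0, 0.5, n_draws, rng) for _ in range(n_reps)])
    return np.mean((q > 0.25) & (q < 0.75))

cover_by_J = {J: coverage50(J) for J in (1, 5, 50, 200)}
print(cover_by_J)
```

With J = 1 the 50% intervals for xi_1 should cover well under half the time, because the prior on mu drags the fit away from where g put the truth. As J grows, the data pin down mu and the coverage for xi_1 should drift toward 0.5, even though g is nowhere near the prior. This only checks one local parameter in one conjugate model, so treat it as a sanity check on the intuition, nothing more.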

“Theorem” 3. I’m not sure, but something about evaluating approximate computation such as ADVI or embedded Laplace. Here you would not expect exact calibration even if you were drawing from the prior, so the theorem is something about an approximation to an approximation.

Don’t ask me to prove these theorems—I don’t even know what they are. As Lakatos taught us, the way things work is not that someone states a theorem and then it gets proved. Rather, the theorem, the proof, the conditions of the proof, and the counterexamples all come together.

Let me tell you about a piranha

Even though I don’t know what to do here, I trust my intuition that there’s something under all this hay.

Remember back in 2018 when I wrote, “Important statistical theory research project! Perfect for the stat grad students (or ambitious undergrads) out there”? That post was about the then-nonexistent piranha theorem. Well, since then, we stated and proved some of these:

The piranha problem: Large effects swimming in a small pond. (Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, and Daniel Hsu)

So, yeah, it happens. We do math and make progress.