Someone who would like to remain anonymous writes:

I have read your blog posts discussing the benefits of Bayesian inference and partial pooling in a multilevel modeling framework. Recently, I’ve begun thinking about designing a simulation to prove to myself that these methods provide superior performance.

Here’s what I have thus far: I could simulate data for an experimental design with, say, 10 groups, and generate perhaps 10,000 data sets. For each data set, I could use R’s lm function to estimate group differences the “business as usual” way, and the stan_lmer function from the rstanarm package to regularize estimates via partial pooling and Bayesian inference. What I’m unsure about is how to evaluate the lm and stan_lmer estimates for each data set. How do I show that the stan_lmer estimates are superior? Or is the right strategy to show that the lm estimates overfit? Am I on the right track?

Any help would be appreciated. I want to prove to myself (rather than just assume) that partial pooling and Bayesian inference are useful tools for a problem that I could theoretically encounter in my own work.

My reply: There are two general ideas that should work here:

1. Fake-data simulation. Simulate fake data under some realistic assumptions. Repeat this simulation many times, each time fitting your different models to the simulated data. Each time, compare the parameter estimates to the true parameter values, and you can see if your preferred method performs better on average.
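A minimal sketch of this idea in base R, so it runs without rstanarm: it uses per-group means (what lm with group indicators gives you) as the no-pooling estimate, and a classical shrinkage formula toward the grand mean as a stand-in for the partial pooling that stan_lmer would do. The sample sizes, variance parameters, and the shrinkage-based estimator are all illustrative assumptions, not the original setup.

```r
# Fake-data simulation: repeatedly simulate grouped data, estimate the
# group effects two ways, and compare each estimate to the truth.

set.seed(123)
n_sims   <- 1000  # number of simulated data sets (illustrative)
n_groups <- 10    # groups per data set
n_per    <- 5     # observations per group
sigma_a  <- 1     # sd of true group effects (assumed known here)
sigma_y  <- 2     # residual sd (assumed known here)

rmse_nopool  <- numeric(n_sims)
rmse_partial <- numeric(n_sims)
for (s in 1:n_sims) {
  a <- rnorm(n_groups, 0, sigma_a)             # true group effects
  g <- factor(rep(1:n_groups, each = n_per))
  y <- a[g] + rnorm(n_groups * n_per, 0, sigma_y)
  # No pooling: a separate mean per group, as lm(y ~ g - 1) would give
  est_nopool <- tapply(y, g, mean)
  # Partial pooling: shrink each group mean toward the grand mean by the
  # classical factor sigma_a^2 / (sigma_a^2 + sigma_y^2 / n_per);
  # stan_lmer does something similar but estimates the variances itself
  shrink <- sigma_a^2 / (sigma_a^2 + sigma_y^2 / n_per)
  est_partial <- mean(y) + shrink * (est_nopool - mean(y))
  rmse_nopool[s]  <- sqrt(mean((est_nopool  - a)^2))
  rmse_partial[s] <- sqrt(mean((est_partial - a)^2))
}
mean(rmse_nopool)   # average error of the no-pooling estimates
mean(rmse_partial)  # should be smaller on average
```

In a full version of the exercise you would replace the shrinkage line with a call to stan_lmer and extract its group-effect estimates, but the comparison logic stays the same.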

2. Cross validation. Remove some of your data at random, fit each model to the rest of the data, then see how these fits do at predicting the held-out data. Here’s an example from one of our applied projects.
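The cross-validation loop can be sketched the same way. Here is a minimal version on made-up data, fitting lm and scoring held-out predictions by RMSE; for stan_lmer you would generate predictions from the fitted model (for example with rstanarm’s posterior_predict) and score those the same way. The data-generating numbers are arbitrary.

```r
# Cross-validation: hold out a random subset, fit to the rest,
# then measure prediction error on the held-out data.

set.seed(456)
n <- 200
g <- factor(sample(1:10, n, replace = TRUE))  # group labels
y <- rnorm(10)[g] + rnorm(n, 0, 2)            # group effect + noise
dat <- data.frame(y = y, g = g)

holdout <- sample(n, n %/% 5)                 # hold out 20% at random
fit  <- lm(y ~ g, data = dat[-holdout, ])     # fit to the rest
pred <- predict(fit, newdata = dat[holdout, ])
rmse_heldout <- sqrt(mean((dat$y[holdout] - pred)^2))
rmse_heldout                                   # held-out prediction error
```

Repeating this over many random splits, or over folds, gives you a stable comparison of the two models’ out-of-sample error.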