No;

I was all horns and thorns

Sprung out fully formed, knock-kneed and upright

— Joanna Newsom

Far be it from me to be accused of liking things. Let me, instead, present a corner of my hateful heart. (That is to say that I’m supposed to be doing a really complicated thing right now and I don’t want to so I’m going to scream into a void for a little while.)

The object of my ire: The 8-Schools problem.

Now, for those of you who aren’t familiar with the 8-schools problem, I suggest reading any paper by anyone who’s worked with Andrew (or has read BDA). It’s a classic.

So why hate on a classic?

Well, let me tell you. As you can well imagine, it’s because of a walrus.

I do not hate walruses (I only hate geese and alpacas: they both know what they did), but I do love metaphors. And while sometimes a walrus is just a walrus, in this case it definitely isn’t.

The walrus in question is the Horniman Walrus (please click the link to see my smooth boy!). The Horniman walrus is a mistake that you can see, for a modest fee, at a museum in South London.

The story goes like this: Back in the late 19th century someone killed a walrus, skinned it, hopefully did some other things to it, and sent it back to England to be stuffed and mounted. Now, it was the late 19th century and it turned out that the English taxidermist maybe didn’t know what a walrus looked like. (The museum’s website claims that “only a few people had ever seen a live walrus” at this point in history which, even for a museum, is really [expletive removed] white. **Update:** They have changed the text on the website! It now says “Over 100 years ago, not many people (outside of Artic regions) had ever seen a live walrus”!!!!!!!!!!)

But hey. He had sawdust. He had glue. He had other things that are needed to stuff and mount a dead animal. So he took his dead animal and his tools, introduced them to each other and proudly displayed the results.

(Are you seeing the metaphor?)

Now, of course, this didn’t go well. Walruses, if you’ve never seen one, are huge creatures with loose skin folds. The taxidermist did not know this and so he stuffed the walrus full, leading to a photoshop disaster of a walrus. Smooth like a beachball. A glorious mistake. And a genuine tourist attraction.

So this is my first problem. Using a problem like 8 schools as default test for algorithms has a tendency to lead to over-stuffed algorithms that are tailored to specific models. This is not a new problem. You could easily call it the *NeurIPS Problem* (aka how many more ways do you want to over-fit MNIST?). (Yes, I know NeurIPS has other problems as well. I’m focussing on this one.)

A different version of this problem is a complaint I remember from back in my former life when I cared about supercomputers. This was before the whole “maybe you can use big computers on data” revolution. In these dark times, the benchmarks that mattered were the speed at which you could multiply two massive dense matrices, and the speed at which you could do a dense LU decomposition of a massive matrix. Arguably neither of these things were even then the key use of high-performance computers, but as the metrics became goals, supercomputer architectures emerged that could only be used to their full capacity on very specialized problems that had enough arithmetic intensity to make use of the entire machine. (NB: This is quite possibly still true, although HPC has diversified from just talking about Cray-style architectures)

So my problem, I guess, is with benchmark problems in general.

A few other specific things:

*Why so small?* 8 Schools has 8 observations, which is not very many observations. We have moved beyond the point where we need to include the data in a table in the paper.

*Why so meta?* The weirdest thing about the 8 Schools problem is that it has the form

$latex y_j\mid\mu_j\sim N(\mu_j,\sigma_j)$

$latex \mu_j\mid\mu,\tau\sim N(\mu,\tau)$

with appropriate priors on $latex \mu$ and $latex \tau$. The thing here is that the observation standard deviations $latex \sigma_j$ are known. Why? Because this is basically a meta-analysis. So 8-schools is a very specialized version of a Gaussian multilevel model. By fixing the observation standard deviation, the model has a much nicer posterior than the equivalent model with an unknown observation standard deviation. Hence, 8-schools doesn’t even test an algorithm on an ordinary linear mixed model.

*But it has a funnel!* So does Radford Neal’s funnel distribution (in more than 17 dimensions). Sample from that instead.
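For anyone who wants to take that suggestion literally, here is a minimal sketch of drawing exact samples from a funnel of the kind Neal describes. It assumes the usual parameterization (a log-scale coordinate `v` with standard deviation 3, and the remaining coordinates normal with standard deviation `exp(v / 2)`). The function name `sample_funnel` and the default of 18 dimensions, picked only to honour “more than 17”, are illustrative choices and not anything from the original.

```python
import math
import random


def sample_funnel(n_draws, dim=18, v_sd=3.0, seed=0):
    """Draw exact samples from a Neal-style funnel (assumed parameterization).

    v ~ Normal(0, v_sd) and, given v, each of the other dim - 1
    coordinates is Normal(0, exp(v / 2)). Second arguments are
    standard deviations throughout.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        v = rng.gauss(0.0, v_sd)
        scale = math.exp(v / 2.0)  # small v gives the narrow neck of the funnel
        xs = [rng.gauss(0.0, scale) for _ in range(dim - 1)]
        draws.append((v, xs))
    return draws


draws = sample_funnel(n_draws=5000)
v_values = [v for v, _ in draws]
v_mean = sum(v_values) / len(v_values)
v_sd = math.sqrt(sum((v - v_mean) ** 2 for v in v_values) / (len(v_values) - 1))
print(f"mean of v: {v_mean:.2f}, sd of v: {v_sd:.2f}")
```

Because these are exact draws, they give a known answer to hold a sampler’s output against, which is the thing a benchmark is supposed to provide.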

**But it’s real data!** Firstly, no it isn’t. You grabbed it out of a book. Secondly, the idea that testing inference algorithms on real data is somehow better than systematically testing on simulated data is just wrong. We’re supposed to be statisticians so let me ask you this: *How does an algorithm’s success on real data set A generalize to the set of all possible data sets?* (Hint: It doesn’t.)

**So, in conclusion,** I am really really really sick of seeing the 8-schools data set.

**Postscript:** There’s something I want to clarify here: I am *not* saying that empirical results are not useful for evaluating inference algorithms. I’m saying that they’re only useful if the computational experiments are clear. Experiments using well-designed simulated data are unbelievably important. Real data sets are not.

Why? Because real data sets are not indicative of data that you come across in practice. This is because of *selection bias*! Real data sets that are used to demonstrate algorithms come in two types:

- Data that is smooth and lovely (like 8-Schools or an over-stuffed walrus)
- Data that is pointy and unexpected (like StackLoss, which famously has an almost singular design matrix, or this excellent photo a friend of mine once took)

Basically this means that if you have any experience with a problem at all, you can find a data set that makes your method look good or that demonstrates a flaw in a competing method, or makes your method look bad. But this choice is opaque to people who are not experts in the problem at hand. Well-designed computational experiments on the other hand are clear in their aims (e.g. *this data has almost co-linear covariates*, or *this data has outliers*, or *this data should be totally pooled*).

Simulated data is clearer, more realistic, and more honest when evaluating or even demonstrating algorithms.

So I think we need to make a clear distinction in purposes. Is the purpose of 8 schools to:

1) Demonstrate how well an algorithm works for sampling from a posterior distribution?

or

2) Demonstrate in about the simplest form possible how and why one should build a hierarchical measurement error model from real world considerations?

It seems to me your complaint is primarily about 8 schools as an instance of (1) not 8 schools as an instance of (2), and yet, it also seems to me that 8 schools is primarily aimed at (2), which is an educational objective, not an algorithmic one.

I don’t think 2 is true either. Again, the standard deviations are known. This is an extremely specific case of a hierarchical measurement error model.

I also don’t think it successfully shows very much about that class of models. What it does show could be demonstrated more clearly and more efficiently by comparing simulated data with three different values of $latex \tau$ (small, intermediate, and large)
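As a rough sketch of that comparison, the following simulates one data set from the model in the post for each of three values of $latex \tau$. The specific numbers (eight groups, observation standard deviation 10, $latex \mu = 5$, and $latex \tau$ of 0.1, 10 and 100) are assumptions chosen for illustration, not values taken from the 8-schools data.

```python
import random


def simulate_schools(mu, tau, sigmas, seed):
    """Simulate one data set from the model with known sigma_j.

    mu_j ~ Normal(mu, tau) and y_j ~ Normal(mu_j, sigma_j), where the
    second argument is a standard deviation throughout.
    """
    rng = random.Random(seed)
    true_effects = [rng.gauss(mu, tau) for _ in sigmas]
    ys = [rng.gauss(m, s) for m, s in zip(true_effects, sigmas)]
    return true_effects, ys


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5


# Hypothetical settings: eight groups, all with observation sd 10.
sigmas = [10.0] * 8
results = {}
for label, tau in [("small", 0.1), ("intermediate", 10.0), ("large", 100.0)]:
    true_effects, ys = simulate_schools(mu=5.0, tau=tau, sigmas=sigmas, seed=1)
    results[label] = (true_effects, ys)
    print(
        f"{label:>12}: tau={tau:6.1f}  "
        f"sd(true effects)={sample_sd(true_effects):7.2f}  "
        f"sd(y)={sample_sd(ys):7.2f}"
    )
```

With small $latex \tau$ the spread in the observations is almost entirely measurement noise, and with large $latex \tau$ it is almost entirely real variation between groups. Here, unlike with a single real data set, you know which regime you are in before you fit anything.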

It took me a long time coming from the wilds of machine learning to understand the point of the 8-schools example. That’s because, as Dan says, it’s an “extremely specific case of a hierarchical measurement error model.” That makes it very challenging for students as a first example of a hierarchical model.

I also prefer examples where we have some idea of what to expect in the posterior. With 8 schools, the posterior is so sensitive to the prior it feels like we’re just making up answers because we don’t have enough data to cross-validate meaningfully. So I never know what to make of any posteriors I’m shown from 8 schools.

P.S. I’m not asking for details, but I am curious if Spooky Horrible Goose either is the goose or derived their name from the goose mentioned in the body of the post.

Spooky Horrible Goose is Charlie’s halloween name. I imagine it’s related to the Untitled Goose Game that all the cool kids have been talking about.

My own personal favorite introduction model is either my own dropping balls or the Stan group’s lotka-volterra lynxes because these emphasize that Bayesian statistics is capable of allowing us to infer things about mechanistic models of how the world works. Too much of science these days is just “Measure some stuff and see if there’s a statistically significant difference between condition A and condition B” and we need to move towards “think about how things work, come up with simplified assumptions and see how well they can do to explain real world data”

I also like the golf putting example for the same reasons.

In these cases, the real-world aspect can be important, because the important part isn’t so much the statistics as the building of the predictive model from real-world considerations. Though it’s still very useful to simulate how well your model would fit if your model were really and truly correct, which only happens in simulation.

Thanks—that’s exactly why I wrote the Lotka-Volterra case study. I’d been talking to Michael Betancourt a lot. He’d run a course for physics students where they estimated gravity constants. I really like that example and we’d love to have a Stan case study on it. Michael reported that the students did what physics students are good at—they devised clever measurement schemes involving their phones.

Another great example of this is Mahar et al.’s lung clearance diagnostic model. It’s a physically motivated use of an exponential decay mixture.

A Stan case study on my dropping paper balls experiment would be a good idea. It has some nice aspects where I intentionally included model error (the paper balls are nothing like smooth spheres, but they can be treated as such for this purpose and after inferring a best fit radius, it induces a sufficiently small model error that it doesn’t matter, all models are wrong, but some are useful kinda thing). Maybe I’ll look into writing it up.

“Because real data sets are not indicative of data that you come across in practice.”

Oh c’mon. That cannot possibly mean anything.

Sometimes the real world data is available in a spreadsheet, comprehensive and detailed, objectively gathered and presented.

It just sounds nutty to insist that a simulated data set is always better if the goal is to gain greater understanding of the real world data.

The presentation is not the point. Sometimes real data fits smoothly within the model assumptions, but you can’t know that before you fit the model and investigate. So build data that tests your model/algorithm in various ways so that you at least know what you are looking for when real data doesn’t happen to behave itself.

The only question that real data can answer when exploring a model or an algorithm (not, of course, when you’re actually interested in what the real data measures) is this: Does there exist at least one data set with properties that make this new method better than or different to existing methods.

The problem with real data is that it may not generalize well enough. The problem with constructed data is that you may get the tails of the distributions wrong (i.e., not representative of real data), and neglect bias, in which case any calculations that are sensitive to the tails or bias can be far off.

Making the whole thing harder is that it’s nearly impossible to know the tails well enough with most real data sets.

“when real data doesn’t happen to behave itself”

A thought experiment:

Your grad student tells you that she has an algorithm that will predict the next 100 entries in a real-world data set of 1000 existing entries. But she is having trouble getting the data set to test it. You say “No problem, you know the basic parameters, test your algorithm on a synthetic data set. That’s better anyway.”

She develops a method to generate 1000 synthetic entries intended to simulate the real-world data. She runs her algorithm and predicts the next 100 entries. She then generates 100 new synthetic entries using the method she developed, but the numbers don’t match very well.

The next morning, she gets an e-mail with the full, real-world, up-to-date data set. She runs her algorithm and generates a prediction. She waits until 100 real-world entries are added to the data set, and discovers that her algorithm did an excellent job of predicting them. As more data accumulates, the fit gets better.

Do you tell her:

1. It failed when it mattered, which was with the synthetic data. Move on to a new project.

– or –

2. Your method of generating synthetic data must have been faulty. Nice algorithm!

Now swap “real world” and “synthetic” in the thought experiment. The algorithm works on the synthetic data but not on the real data. Do you say:

1. The algorithm may well be fine, the data set must not be representative. [Grad student: Not representative OF WHAT?]

– or –

2. Your method of generating synthetic data must have been faulty.

I know that statisticians would want to dive in, figuring out more of the statistical properties of the real-world data to understand what went wrong. And they wouldn’t be happy until they could generate synthetic data that works. But does any of that add to the confidence in the algorithm, or does it just allow the statisticians to sleep better at night?

Dan wrote:

“Experiments using well-designed simulated data are unbelievably important. Real data sets are not.”

Can my thought experiment be shoehorned into this? I am genuinely interested in how that would happen.

Matt, I think Dan didn’t do a good job of putting the proper context on his post, for example see here and the follow up from Phil:

https://gelmanstatdev.wpengine.com/2019/10/15/a-heart-full-of-hatred-8-schools-edition/#comment-1141665

The point isn’t about how important real world examples are for testing *data models* it’s about how important real world examples are for testing *computational methods for sampling from posteriors*

You can’t test models of how electrons interact in a plasma using simulated data, you need to collect some measurements from real plasma. Similarly, you can’t test models for economic decision making using random number generators, you need to collect some data on people’s spending habits… But these are tests of models of real world processes, data models.

On the other hand, you *can* test a sampling algorithm on *any* probability density in N dimensions, and there’s no reason to think that posterior probability distributions from real world small sample information poor statistical problems out of textbooks are good for testing those algorithms.

At least that’s my interpretation of his point. I hope I’m not misstating it.

“Secondly, the idea that testing inference algorithms on real data is somehow better than systematically testing on simulated data is just wrong.” Well, I agree with this and love that you point it out that clearly here, but… how results on simulated data generated from artificial models generalise to the real data that we may want to analyse isn’t that clear either. I think we need both really, they complement each other. Even after doing both, in order to generalise to the data for which the investigated method really matters, we still need some untestable and pretty, let’s say, courageous assumptions. Or, in other words, quite a bit of uncertainty will still remain, always.

I think you’re right. A thing that I think would be more useful than 8-schools (or StackLoss or any of the others) is building field-specific directories of data sets. Because the structure of actual data in, say, neuroscience differs from the structure of actual data in, say, education. 8-Schools is a perfectly ok example of a certain type of educational data. We should have more and, using that, get insight on our models and algorithms in that specific field.

My problem is that when you rip these things out of context they become essentially useless. It is unclear what “population” of datasets you can generalize to.

(Going by memory on the problem) I got something very different out of the 8 schools problem. In trying to think about it, I ended up realizing that you could tackle it from one extreme to the other. That is, all schools are sampled independently, and you can rank their performance by their means to see which ones did well or badly. The implication would be that some schools carried out the intervention better or worse than others.

Or you could treat them as all being part of one and the same distribution, in which case the data seems to show that the same treatment effect can have quite different results.

Or, more likely, the truth is somewhere in between. The key point for me was that the problem as posed gives us no way to choose the location on that continuum.

That last paragraph is what I meant in my comment above by the posterior being driven by the prior in 8 schools and us having no way to evaluate it.

From this perspective it could be an important example because it shows that different assumptions based on real world information lead to different results, which is an indication that you should argue for and make relatively strong assumptions (i.e. moderately informative priors, for example priors that avoid zero or infinity on the between-schools standard deviation). Classical statisticians seem to see this as a flaw, whereas I see this as connecting the model to reality: we know that there will be variation among schools, tau = 0 is not realistic, neither is variation orders of magnitude larger than between-student variation, so we can choose a gamma type prior that places the high probability region on something like 1/10 to 10 x the student standard deviation and get some kind of reasonable estimates… as Phil said (and I’m assuming he’s correct, because I have never actually done numerical experiments on the 8 schools example)

So the point of the example is that *real world prior information is very important* and it shows you how to use it.

As far as that goes, I think it’s a useful example.

PS: I think the gamma(4, 3/4) prior gives you a most likely value of 4 and a 95% range from about 1.4 to 11.7 which seems reasonable.
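Those numbers are easy to check without any special libraries, assuming the prior is parameterized by shape 4 and rate 3/4 (the comment does not say which parameterization is meant, but a mode of 4 is consistent with that reading). Because the shape is an integer, the CDF has a closed form and the quantiles can be found by bisection. The helper names are made up for this sketch.

```python
import math


def erlang_cdf(x, shape, rate):
    """CDF of a gamma distribution with integer shape (an Erlang)."""
    if x <= 0.0:
        return 0.0
    z = rate * x
    partial = sum(z ** k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-z) * partial


def erlang_quantile(p, shape, rate, lo=0.0, hi=1000.0, tol=1e-10):
    """Invert the CDF by bisection on the bracket [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


shape, rate = 4, 0.75  # assumed shape/rate reading of gamma(4, 3/4)
mode = (shape - 1) / rate
lower = erlang_quantile(0.025, shape, rate)
upper = erlang_quantile(0.975, shape, rate)
print(f"mode={mode:.1f}, central 95% interval=({lower:.2f}, {upper:.2f})")
```

Under that reading the mode is exactly 4 and the central 95% interval runs from roughly 1.45 to 11.69, in line with the figures quoted.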

That’s a really good point: although highly sensitive to the prior, *reasonable* choices of priors produce similar results.

Real world knowledge is also why I like my baseball case study (and why it fails for most readers worldwide): we have 100 years of real-world data to act as an informal but very strong prior.

Dan:

You write, “But it’s real data! Firstly, no it isn’t. You grabbed it out of a book.”

No. I put it into a book.

In this, as in many things Andrew, you are the exception :p

Dan:

Another thing. You characterize the 8-schools data as “smooth and lovely.”

Actually, the 8 schools data are smooth and lovely only in retrospect!

Here’s the background. Throughout the 1970s, lots of research was done on hierarchical regression modeling, often in education examples. Standard practice for both theory and application was to use point estimates for the hyperparameters. The 8 schools example (published in 1981, based on an experiment conducted in the late 1970s) was special because it was *difficult*, as the natural point estimate of the group-level variance was zero. This motivated Rubin to perform a full Bayes analysis.

So, I think that to describe the 8 schools as smooth and lovely is like saying that Hamlet sucks cos it’s full of cliches.

And it’s relevant that the 8 schools was a real application. Yes, Rubin (or someone else) could’ve run a simulation study of hierarchical modeling where the group-level variance was low enough that the point estimate would often be zero; indeed, Yeojin, Sophia, Jingchen, Vince, and I did this as part of our development of zero-avoiding priors for stable Bayesian point estimation of hierarchical variance parameters. But (a) I don’t know that Rubin’s example would’ve been so compelling with fake data, and (b) Let’s face it: Rubin published the 8 schools in 1981 and we did our work in 2013, and this work was very much inspired by struggles we’d had with the 8 schools over the years.
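A simulation study of that kind might look something like the sketch below. It is not the study referred to above: the equal observation standard deviations, the grid of candidate values, and the use of the marginal maximum likelihood estimate as “the point estimate” are all assumptions made for illustration.

```python
import math
import random


def profile_loglik(tau, ys, sigmas):
    """Marginal log-likelihood of tau with mu profiled out.

    Integrating out the mu_j gives y_j ~ Normal(mu, sqrt(sigma_j**2 + tau**2)).
    """
    variances = [s ** 2 + tau ** 2 for s in sigmas]
    weights = [1.0 / v for v in variances]
    mu_hat = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (y - mu_hat) ** 2 / v
        for y, v in zip(ys, variances)
    )


def tau_point_estimate(ys, sigmas, grid):
    """Grid value of tau that maximizes the profiled marginal likelihood."""
    return max(grid, key=lambda tau: profile_loglik(tau, ys, sigmas))


def fraction_zero(true_tau, sigmas, n_reps=200, seed=0):
    """Share of simulated data sets whose estimate lands on the grid point 0."""
    rng = random.Random(seed)
    grid = [0.25 * i for i in range(161)]  # candidate tau values from 0 to 40
    zeros = 0
    for _ in range(n_reps):
        ys = [rng.gauss(rng.gauss(0.0, true_tau), s) for s in sigmas]
        if tau_point_estimate(ys, sigmas, grid) == 0.0:
            zeros += 1
    return zeros / n_reps


sigmas = [10.0] * 8  # hypothetical: eight groups, observation sd 10
frac_low = fraction_zero(true_tau=2.0, sigmas=sigmas)
frac_high = fraction_zero(true_tau=20.0, sigmas=sigmas)
print(f"true tau = 2:  estimate is zero in {frac_low:.0%} of data sets")
print(f"true tau = 20: estimate is zero in {frac_high:.0%} of data sets")
```

When the true group-level standard deviation is small relative to the measurement error, the point estimate sits on the boundary in a large share of data sets, even though the true value is not zero.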

Let me put it another way. Remember that famous passage from Boswell:

The 8 schools is a stone that has been used to refute some methods. It’s not the only stone out there, that’s for sure. But a useful stone it’s been.

Thanks – I had remembered it being an example selected by Don to provide a good contrast between full Bayes and empirical Bayes.

Now, the problem I have with it for students is it makes things too simple and straightforward. The data is all in hand, there is no sense in varying quality between the studies, no sense of studies being missed, sigma_j is taken as known without discussion, etc. So my sense is they think they will understand something about meta-analysis and doing realistic hierarchical analyses when they don’t yet.

Now, the simple bootstrap is worse as it seems so straightforward and easy and works well for toy problems. I have seen countless people (often statisticians) doing silly bootstrapping on real problems thinking they should be simple too.

The haters are missing the point. Missing all the points!

I run into real-world problems all the time in which a multi-level model would be a great approach, and I always have to explain to someone, or several people, or a bunch of people, what that means. The 8 schools problem is great for this. The measurements are subject to substantial error, the true values are from some distribution with unknown mean and variance (and for that matter unknown distribution), the classical estimate is unrealistic, and I’m trying to estimate both the distribution of the true values and the true values associated with the individual observations. This general structure comes up all the time. The fact that it isn’t hard to compute…who tf cares? If anything that’s a feature, not a bug.

And Bob and Dan agreeing that its specificity makes it “very challenging for students as a first example of a hierarchical model”, well, I don’t know what students you’re talking to. When I was a student I didn’t find it “very challenging”, I found it extremely clear…and it’s not just me: I have used it to explain hierarchical models to many people and they have all understood it, or at least claimed to.

Echoing Tom P (above), one of the things I like is that the example invites pondering the two extremes — complete pooling and no pooling — and the fact that both of them seem unreasonable. It’s great for walking people through options that might seem tempting, like testing each school’s estimate against the mean, and if it differs by a ‘statistically significant’ amount then let it stand alone, otherwise give it the mean effect, and illustrating how crazy that would be in practice….and, by extension, how crazy it is to do it in many other real-world problems where that is in fact what some people do.
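The continuum between the two extremes can be sketched in a few lines. The data below are invented for illustration (they are not the 8-schools numbers), and the calculation conditions on a fixed value of tau with the overall mean plugged in, so it is a cartoon of partial pooling and not a full Bayesian analysis.

```python
def partial_pooling(ys, sigmas, tau):
    """Conditional means of the mu_j for a fixed tau.

    mu is replaced by its precision-weighted estimate given tau, so
    this is a plug-in calculation, not a posterior summary.
    """
    weights = [1.0 / (s ** 2 + tau ** 2) for s in sigmas]
    mu_hat = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    if tau == 0.0:
        return [mu_hat] * len(ys)  # complete pooling
    estimates = []
    for y, s in zip(ys, sigmas):
        # Weight on the overall mean: close to 1 for small tau, close to 0 for large tau.
        shrink = (1.0 / tau ** 2) / (1.0 / s ** 2 + 1.0 / tau ** 2)
        estimates.append(shrink * mu_hat + (1.0 - shrink) * y)
    return estimates


# Invented estimates and standard errors for four hypothetical groups.
ys = [20.0, 5.0, -4.0, 9.0]
sigmas = [12.0, 9.0, 14.0, 10.0]

complete = partial_pooling(ys, sigmas, tau=0.0)
middle = partial_pooling(ys, sigmas, tau=5.0)
no_pooling = partial_pooling(ys, sigmas, tau=1e6)

for name, est in [("complete", complete), ("partial", middle), ("none", no_pooling)]:
    print(f"{name:>8}: " + ", ".join(f"{e:6.2f}" for e in est))
```

Setting tau to zero gives every group the same estimate, a very large tau hands back the raw estimates, and anything in between shrinks each group toward the overall mean by an amount that depends on how noisy its own measurement is.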

Dan and Bob, you both seem really focused on how it’s not a great _computational_ example, and given what you work on maybe that’s understandable, but you seem to be completely missing the fact that it is a great _conceptual_ example.

This came up because you asked a couple of days ago about something I said on a post about a computational method. So the reason and the context was computational. Also a few modelling reasons I listed above. I hope that clears up any lingering confusion.

Dan,

If you hate it as a computational example but like it in other ways, that didn’t come across to me. And maybe that’s not true either, maybe you just plain hate it. And that’s fine. If you claimed it’s a _bad_ example, that would be a factual claim with which I would take issue. But if you’re just saying you hate it, hey, that’s also a factual claim and it’s not like I think you’re lying about it. And it’s certainly fine with me: De gustibus non est disputandum, and all that. (Why on earth does autocorrect try to ‘correct’ that to xustibus. WTF?)

Me, mainly: I’m the student. But I’m reasoning partly by theory that there are several moving parts (measurement error and hierarchical modeling) which can be tackled separately and then combined. I also find the first hierarchical model example in BDA (rats in chapter 5) very challenging because it assumes a lot of background in the way it manipulates Jacobians for priors, plugs in moment-based estimates, etc. I get it now, but it’s a tough first example for someone without a strong math-stats and applied-stats background.

By the way, I do wish the raw data for the 8 schools experiments were available. It would complete the circle to be able to fit the model to all the data at once, rather than working with these derived quantities.

One of the difficulties with the 8 schools example as it stands is that it’s not so easy for students to directly emulate, as in real-life examples you typically don’t want to treat the sigma_j’s as known.

Unfortunately, I don’t have any idea how to track down these data. About 15 or 20 years ago I asked Rubin if he had the raw data, and he looked and couldn’t find anything.

It’s true that one typically doesn’t want to treat the sigma_j as known, but it so happens that I am working with some data right now, tonight, in which I need, or at least strongly want, to treat the sigma_j as known. And I’ve run into it before, too. It may be more common than you think.

It’s often done as a convenience in meta-analysis especially as the raw data is seldom available, but even with just summaries like mean and variance you can treat sigma_j as UNknown in the modelling. Now, if the samples were large it will make little difference but if they were small you can get multi-modality in the likelihoods and all sorts of numerical problems can occur.

Rather off-topically and unhelpfully for your metaphor, it crosses my mind that a taxidermist in late 19th century England, especially one with friends who kill walruses, would be very likely to have seen _photographs_ of walruses. So I don’t buy that excuse. Like translators (we get paid by the word), he was probably paid by the volume of the work produced.

(Photography was a major pain* in the late 19th century, but it was possible and lots of people put up with said pain, especially folks doing expedition sorts of things.)

*: Plates were only replaced by film (invented in 1884) in the very late 19th century. It really was a pain.

I have been waiting for this post for years, but I did not know it until now.

Now that I know that it is not just me that finds the 8 schools example unsatisfying, I don’t need to blame it on my lack of statistical erudition.

Paradoxically, I like the 8 school example *more* after reading the post and the comments.

That walrus is hilarious, but my favourite piece of taxidermy is this lion:

https://www.atlasobscura.com/places/the-lion-of-gripsholm-castle-strangnas-sweden

I kind of like the Horniman Walrus. Seems like a pretty good first approximation. When you have no knowledge about walruses, it advances your knowledge significantly. After seeing the Horniman Walrus, I would have no problem picking out a real walrus from a lineup that included a polar bear, a penguin, a manatee, and a shark.

Real walrus for comparison: https://www.dkfindout.com/uk/animals-and-nature/seals-sea-lions-and-walruses/walrus/

The Horniman Walrus: https://www.horniman.ac.uk/collections/stories/horniman-highlights-tour

+1, certainly not an example of the worst taxidermy …

as opposed to this:

https://www.badtaxidermy.com/viewpost/326/Quaaaaaaaaaack

Minor correction: ‘a modest fee’ is actually ‘no fee’.

Tangent (which I feel is ok on a post by the #1 tangent writer):

I go to Horniman all the time with the kids and have a couple of friends who have worked there. BTW they say the walrus is soaked in cyanide and that’s why you’re not allowed on the (also embarrassingly unrealistic – it’s not even cold!) plastic iceberg… I digress. The walrus is an embarrassment to them but it’s also the only reason people across the world (even on this blog) have heard of a small museum in South London. So they can’t take it away and say ‘Remember when we had that ridiculous overstuffed walrus there for 100 years – weird!’ because they would lose visitors. The good thing is their other exhibitions are superb. They don’t want people who visit to remember the visit for the walrus (and they don’t). They’re starting from below 0 so can’t put on something weak. I guess what I’m saying is that if people are aware that it’s embarrassing to use the eight schools, they’re starting from below 0 and might have to work harder on everything else they do, like thinking about how to simulate data.

PS – apparently no museum staff have ever taken photos of themselves riding the walrus on their last day.