## Prior predictive, posterior predictive, and cross-validation as graphical models

I just wrote up a bunch of chapters for the Stan user’s guide on prior predictive checks, posterior predictive checks, cross-validation, decision analysis, poststratification (with the obligatory multilevel regression up front), and even bootstrap (which has a surprisingly elegant formulation in Stan now that we have RNGs in trnasformed data).

Andrew then urged me to look at figure 1 of Gelman, Meng, and Stern’s 1996 paper on posterior predictive assessement (I would’ve included a diagram if the journal didn’t own the copyright). I looked at figure 1 and got really confused about why the y and y_rep were on the same side of theta. Then Andrew said it was like a probabilistic program. Which I took to mean you could write in BUGS directed graphical modeling language. Below, I write down how I’d code prior and posterior predictive checks and cross-validation in a graphical modeling language.

Instead of arrows and boesLet me present all this in a unified way the way I think it’s clearest, which does away with arrows and boxes and just uses graphical modeling notation (like BUGS or Stan supports). Then Andrew can fill all of us in on what I’m missing

A simple regression example

Nothing interesting here, just a simple regression of a univariate outcome given a univariate predict. I’ll use explicit indexing to make it clear where there are multivariate quantities. I’m also just throwing down an arbitrary prior for completeness.

```a ~ normal(0, 2)
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y[1:N] ~ normal(a + b * x[1:N], s)
```

The variables a, b, and s are parameters, whereas y and x are the observed outcomes and predictors. y and x will be known, and we’ll run something like Stan to get posterior draws

```a, b, s ~ p(a, b, s | x, y)
```

Posterior predictive checks

To get posterior predictive draws, we add a single line to the graphical model,

```a ~ normal(0, 2)
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y[1:N] ~ normal(a + b * x[1:N], s)
y_rep[1:N] ~ normal(a + b * x[1:N], s)
```

Here y_rep is declared as a parameter in Stan, because it’s not observed. Also note that the same x values are used for both y and y_rep.

We observe y and x as before, but now get posterior draws for y_rep in addition to the regression parameters,

```a, b, s, y_rep ~ p(a, b, s, y_rep | x, y)
```

We just throw away the draws for the parameters and get draws from the posterior predictive distribution

```y_rep ~ p(y_rep | x, y)
```

Monte Carlo methods are so much easier than calculus.

Prior predictive checks

This one just drops the line with the data, but continues to use the same predictor vector x for the replications. The graphical model is

```a ~ normal(0, 2)
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y_rep[1:N] ~ normal(a + b * x[1:N], s)
```

Our posterior draws in a system like Stan now look like

```a, b, s, y_rep ~ p(a, b, s, y_rep | x)
```

and we throw away the parameters again to get prior predictive draws,

```y_rep ~ p(y_rep | x)
```

Held-out evaluation and cross-validation

Suppose we divide our N data items up into a training set of size M and test set of size N – M. We’ll train on the training set then predict for the test set.

```a ~ normal(0, 2)
b ~ normal(0, 2)
s ~ lognormal(0, 1)
y[1:M] ~ normal(a + b * x[1:M], s)
y[M + 1:N] ~ normal(a + b * x[M + 1:N], s)
```

We’ll provide y[1:M] and x[1:N] as data (that is, the training set of y and all of x). Then we get draws from the posterior predictive distribution:

```a, b, s, y[M + 1:N] ~ p(a, b, s, y[M + 1:N] | x[1:N], y[1:M])
```

and we again just drop the parameters to get posterior predictive draws for evaluation

```y[M + 1:N] ~ p(y[M + 1:N] | x[1:N], y[1:M])
```

For cross-validation, you just provide different slices. Or random slices. I show how to do that in the forthcoming user’s guide chapters. I also show how to use the generated quantities block to make the predictive draws pure Monte Carlo draws and also cut down on computation time compared to using MCMC. But that’s just an implementation efficiency detail.

What about Gelman, Meng, and Stern’s diagram?

I’m still confused. But now my confusion is more about why there are multiple y_rep. I only use one in my approach. Then we get simulations for it which characterize the relevant predictive distribution. Also, I don’t understand why there’s only one theta in the posterior predictive diagram (1a), whereas there are multiples in the prior predictive diagram. To me, the only difference is that the edge for y doesn’t show up in the prior predictive check. It’s something you have, but not something that’s in the model. I think what Gelman, Meng, and Stern are doing here is trying to include y in the prior predictive model. My guess is that Andrew’s going to say that they know y at the point the prior predictive check is performed and all data should be on the table. Unfortunately, that requires what’s essentially a BUGS-style cut in the graphical model where you don’t let information from y bleed into theta. The multiple theta is an attempt to render it without cut. At least that’s my guess. Let’s see what Andrew says. (I could just ask via email, but it’s more fun to do my homework in front of a live audience.)

1. Phil says:

Oooh oooh oooh, teacher, teacher, I know the answer! Let me, let me!

I think you got the wrong end of the stick right from the start.

I feel stupid weighing in if Andrew is going to answer, even if my answer is right, which of course it might not be. But I’m going to answer anyway, because there’s worse things than feeling stupid.

Don’t divide up your data. Train on the entire dataset (all N points), then generate your draws for the entire dataset.

You fit your model and get posterior predictions for a and b and s. Great, now use those to simulate some fake data y_rep, a complete set. If y_rep doesn’t look a lot like y, there’s probably a problem. How could this happen, though, since you’re fitting the model to the data and then using the result to generate y_rep? Well, it could happen if your model is lousy.

Let’s take your example. Your model assumes y is normally distributed around a + bx. Suppose in fact the distribution is longer-tailed than a normal, like a t_3 or something. You’ll find that the distribution of y is narrower than the distribution of y_rep.

With your simple model it’s hard to think of issues with the model that you can’t discover just as well by looking at the data themselves, e.g. you’ve assumed the distribution of y about a + b*X is iid normal; if in fact there’s correlation between y[k] and y[k+1], well, you can test that in the raw data, you don’t need to look at the posterior predictive checks. But in a lot of models it’s really hard to anticipate all the ways your model can be wrong, and how to check for them, e.g. in a multi-level model with a bunch of different levels, parameters and hyperparameters, etc.

By the way, your [M + 1:N] needs to be [ (M + 1):N ] everywhere.

• Bob Carpenter says:

I just broadcast my ignorance on a high-traffic blog. It’s an effective, if sometimes embarassing, way to course correct :-)

I just added seven or so new chapters to the user’s guide, including one on prior and posterior predictive checks. It’ll be out in the next doc release, which is scheduled for April 20. I included detailed examples of how to code the Gelman, Meng and Stern examples of prior and posterior checks with a mixed check for hierarchical models. The final section of that chapter goes over the graphical modeling notation.

@Phil — what you suggest in paragraphs 4-5 is a posterior predictive check. As a graphical model, it looks like this:

```theta ~ p(theta)  // prior
y ~ p(y | theta)  // likelihood
y_rep ~ p(y | theta) // posterior predictive replications
```

where only y is observed (so we get posterior draws for theta and y_rep). The BDA-style density notation overload is meant to say that y and y_rep are conditionally i.i.d. given theta. In clean probability theory notation, we’d just say p_{Y_rep | theta} = p_{Y | theta}.

I go through an example of how to code this in Stan with y data, theta paramters, and y_rep generated quantities. Implemented that way in Stan, the procedure is exaclty as you outline in paragraphs 4-5. And that’s how it’s presented. I only went back and added the graphical modeling notation at Andrew’s urging. It did help me understand the mixed case of a hierarchical model where the group level parameters may either be replicated or not depending on whether you want to check predictions for existing groups or new groups.

The example I use is overdispersed count data, and it does exactly as described in paragraph 6. It’s a simple example where if you just took the data mean and data variance, you’d see the problem. But it’s just a hello-world type textbook example, like the rest of the manual. The point is to show how the technique works, not how it would apply to a complicated real-world example. Culturally, computer scientists do everything in this hello-world way, whereas I find statisticians prone to dive right into the gory details of real data.

P.S. [M + 1:N] parses as [(M + 1):N] in Stan. The colon has lower precedence than arithmetic operators. The colon isn’t technically an operator in Stan in that 1:N doesn’t have a meaning as a container all on its own like it does in R. But I should still make it clearer in the manual.

• Andrew says:

Bob:

To be super-clear, rather than referring to “posterior predictive check” and “prior predictive check,” we should refer to “predictive check” and have the conditioning be one of the arguments to the operator.

For example, consider an experiment with data y1,…,y_N, a parameter vector alpha, and hyperparameters phi. We fit the model and do inference to get p(alpha, phi | y).

Usually we’d talk about the “posterior predictive check” in which we simulate from p(y^rep | alpha, phi, y), using the posterior simulations of alpha and phi, or the “prior predictive check” in which we simulate from p(y^rep | y), drawing new alpha and phi (equivalently, we can simulate p(y^rep, alpha^rep, phi^rep | y) and discard alpha^rep and phi^rep. (In the prior predictive chekc the conditioning on y is unnecessary but we can keep it just to be consistent with our other notation.) Or we could do an “intermediate predictive check” and simulate from p(y^rep, alpha^rep | phi, y).

In the unified “predictive check” notation, we’d write something like: y^rep = check(y | …).
So, in the above example,
posterior predictive check is check(y | theta, phi, N)
prior predictive check is check(y | N)
intermediate predictive check is check(y | phi, N)

Setting up a predictive check is like setting up a Stan program in that you have to decide what to condition on.

• Bob Carpenter says:

Yes, that’s exactly how I phrased it in the user’s guide at your earlier urging. I coded all three versions using both graphical modeling notations and explicitly in terms of densities as in your reply. And as you say, in both cases, you need to decide what to condition on. It really helped writing this all out.

There’s also an entire section on intermediate checks for hierarchical models following your paper.

2. Andrew says:

Bob:

Yes, in retrospect showing y^rep_1, y^rep_2, etc. was probably not a good idea. This was before probabilistic programming was a “thing,” and I wasn’t always thinking so clearly about how to express posterior uncertainty in a graph.

From a Bayesian perspective, the key is that there’s a joint distribution, p(y, y_rep, theta | M) = p(theta | M) p(y | theta, M) p(y_rep | theta, M). That’s for the posterior predictive check. For the prior predictive check, it’s p(y, y_rep, theta, theta_rep) = p(theta | M) p(theta_rep | M) p(y | theta, M) p(y_rep | theta, M). More generally, different replications have different “graphical models.” I’ve been interested in this for a long time and remain interested in writing something more formal about it.

They way I think about it is that you start with some graphical model, a big joint distribution of observed data, latent data, and parameters, and you can do inference on that graphical model. Think of that model as being on some flat plane. Then there are auxilary models connected to our main model. The auxiliary models, which can be visualized as extending the model into a third dimension, represent different versions of SBC and PPC, and there are rules for how the different models connect to each other.

It’s all related to that “topology of statistical models” thing that we’ve been talking about, the formalization of Bayesian workflow.