You’ll sometimes see discussions of the difference between two approaches to classical statistical null hypothesis testing.

In the Fisher approach, you set up a null hypothesis and then you compute the p-value, which you use as a measure of evidence against the hypothesis.

In the Neyman-Pearson approach, you define the p-value as a function of data and then work out its distribution under the null and alternative hypotheses.

Both approaches become more complicated when the null hypothesis is “composite”—that is, when it includes parameters that need to be estimated from data.

One difference between the two approaches is how you evaluate the p-value. In the Fisher approach it’s defined as the probability of seeing something more extreme than the data; in the Neyman-Pearson approach it’s defined as any function of data that has a uniform distribution under the null hypothesis. (In both frameworks, when you’re working with a composite null hypothesis you can add the word “approximate” to account for uncertainty in the parameters.) To use the terminology of section 2.3 of my article from 2003, Fisher’s all about p-values and Neyman-Pearson is working with u-values.
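To make the u-value idea concrete, here's a minimal sketch in Python for a simple (non-composite) null hypothesis, where the two definitions coincide. The model (y_i ~ Normal(0, 1)), the test statistic (the sample mean), and the function names are all my own assumptions for illustration; none of this comes from the 2003 article.

```python
import random
from statistics import NormalDist


def one_sided_p_value(y):
    """Pr(mean(y_rep) >= mean(y)) under the assumed point null y_i ~ Normal(0, 1)."""
    n = len(y)
    z = (sum(y) / n) * n ** 0.5  # standardized sample mean
    return 1.0 - NormalDist().cdf(z)


def fraction_below(p_values, cutoff):
    """Share of p-values at or below the cutoff."""
    return sum(p <= cutoff for p in p_values) / len(p_values)


# Simulate many datasets from the null and collect their p-values.
rng = random.Random(0)
p_values = [
    one_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(20)])
    for _ in range(5000)
]

# If the p-value is also a u-value, each fraction should be close to its cutoff.
for cutoff in (0.05, 0.25, 0.5):
    print(cutoff, round(fraction_below(p_values, cutoff), 3))
```

With a point null like this one, the "probability of something more extreme" has a uniform distribution under the null, so the Fisher p-value is also a u-value. The two notions only come apart once there are unknown parameters.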

This doesn’t concern me so much anymore because I’m not really so interested in p-values except in a “sociological” sense, to understand how they’re used and misunderstood (for example, the problems with p-values are not just with p-values). But my misunderstanding of the distinction between the Fisher and Neyman-Pearson approaches got in my way a few years ago when I was trying to communicate a new idea to the statistics community.

Here’s what happened.

Back in 1991 I went to a conference of Bayesians and I was disappointed that the vast majority seemed not to be interested in checking their statistical models. The attitude seemed to be, first, that model checking was not possible in a Bayesian context, and, second, that model checking was illegitimate because models were subjective. No wonder Bayesianism was analogized to a religion.

This all frustrated me, as I’d found model checking to be highly relevant in my Bayesian research in two different research problems, one involving inference for emission tomography (which had various challenges arising from spatial models and positivity constraints), the other involving models for district-level election results.

I was so frustrated with the attitudes of many of the prominent Bayesians that I decided it would make more sense to communicate my perspective to the mainstream of statistics. My key idea had to do with p-values for composite null hypotheses: instead of computing some sort of minimax p-value, you can average the p-value over the posterior distribution of the parameters. Thus, p-value(y) = E(p-value(y|theta)), averaging over the posterior distribution of theta, p(theta|y).

At this point the notation is starting to get confusing, so I defined y^rep as the random variable over which the p-value is calculated. Thus, the p-value for some test statistic T(y) is p-value(y|theta) = Pr(T(y^rep) >= T(y) | theta), and the posterior predictive p-value is p-value(y) = Pr(T(y^rep) >= T(y) | y) = integral Pr(T(y^rep) >= T(y) | theta) p(theta | y) d theta.
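In practice the integral is usually approximated by simulation: draw theta from the posterior, draw y^rep given theta, and count how often T(y^rep) >= T(y). Here's a minimal sketch in Python. The model is an assumption I'm making purely for illustration (y_i ~ Normal(theta, 1) with a flat prior, so theta | y ~ Normal(mean(y), 1/n)), and the function name and the example data are hypothetical.

```python
import random
import statistics


def posterior_predictive_pvalue(y, test_stat, n_sims=2000, seed=0):
    """Monte Carlo estimate of Pr(T(y_rep) >= T(y) | y).

    Assumed model (illustration only): y_i ~ Normal(theta, 1) with a flat
    prior on theta, which gives theta | y ~ Normal(mean(y), 1/n).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    t_obs = test_stat(y)
    exceed = 0
    for _ in range(n_sims):
        # Draw theta from the posterior p(theta | y).
        theta = rng.gauss(ybar, (1.0 / n) ** 0.5)
        # Draw a replicated dataset y_rep from p(y_rep | theta).
        y_rep = [rng.gauss(theta, 1.0) for _ in range(n)]
        if test_stat(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_sims


# Made-up data with one value far from the rest.
y = [0.1, -0.4, 0.3, 0.8, -1.1, 0.2, -0.6, 0.5, 0.0, 8.0]
print(posterior_predictive_pvalue(y, max))               # T(y) = largest observation
print(posterior_predictive_pvalue(y, statistics.fmean))  # T(y) = sample mean
```

With T(y) = max(y), the check flags the outlier: almost no replicated dataset has a maximum that large. With T(y) = mean(y), under this assumed model the posterior predictive p-value is near 0.5 whatever the data, because the replicated mean is centered at the observed mean. That's one way to see that a posterior predictive p-value need not be a u-value.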

So I wrote up my ideas on posterior predictive checking, but from a classical, Fisherian perspective. The idea was that you were going to do a chi-squared test (or, more generally, some hypothesis test) and you wanted to get something close to the right p-value, accounting for the fact that you had a bunch of unknown parameters, and, what with nonlinearity and nonnormality of the model, you couldn’t just adjust the degrees of freedom by subtracting the number of parameters to be fit. The Bayesian posterior calculation was a device for computing a reasonable p-value.

But this argument didn’t really work, in the sense that non-Bayesians pretty much entirely ignored my paper. I’m not complaining—they had no obligation to read it—I’m just saying the message didn’t get through.

Upon reflection, I think my mistake was to present the idea within the Fisher rather than the Neyman-Pearson perspective. Rather than giving a plan for computing a good p-value, I should’ve demonstrated that a hypothesis test calculated under this approach—compute the posterior predictive p-value and reject if it’s less than 0.05—would reject at approximately a 5% rate if the true parameter value is anything within a reasonable range of parameter space. This won’t work under all circumstances, but there are conditions where you will get close to nominal rejection rates.
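The Neyman-Pearson-style demonstration can be sketched as a simulation: fix a true parameter value, generate many datasets, compute the posterior predictive p-value for each, and record how often it falls below 0.05. The code below is a rough sketch under assumptions of my own choosing, not the analysis from the paper: the model is y_i ~ Normal(theta, 1) with a flat prior, and the test statistic is the sample variance. I picked that statistic because its sampling distribution does not depend on theta in this model, so this is one of the favorable cases.

```python
import random


def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def ppp_value(y, rng, n_sims=200):
    """Monte Carlo posterior predictive p-value for T(y) = sample variance.

    Assumed model: y_i ~ Normal(theta, 1), flat prior on theta.
    """
    n = len(y)
    ybar = sum(y) / n
    t_obs = sample_variance(y)
    exceed = 0
    for _ in range(n_sims):
        theta = rng.gauss(ybar, (1.0 / n) ** 0.5)          # theta ~ p(theta | y)
        y_rep = [rng.gauss(theta, 1.0) for _ in range(n)]  # y_rep ~ p(y_rep | theta)
        exceed += sample_variance(y_rep) >= t_obs
    return exceed / n_sims


def rejection_rate(true_theta, n=10, n_datasets=400, alpha=0.05, seed=1):
    """Share of datasets, simulated at true_theta, with p-value below alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_datasets):
        y = [rng.gauss(true_theta, 1.0) for _ in range(n)]
        rejections += ppp_value(y, rng) < alpha
    return rejections / n_datasets


# Check the rate at a few true parameter values.
for theta in (-2.0, 0.0, 5.0):
    print(theta, rejection_rate(theta))
```

In this setup the rates should land near 0.05 for every theta, up to simulation noise. With a test statistic whose distribution depends strongly on theta (the sample mean is the extreme case), the rejection rate can fall well below nominal, which is the sort of circumstance where the argument won't work.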

I think the Fisher approach is how p-values are typically used, but it’s the Neyman-Pearson version that is dominant in statistics textbooks and journal articles.