Confidence intervals, compatibility intervals, uncertainty intervals

“Communicating uncertainty is not just about recognizing its existence; it is also about placing that uncertainty within a larger web of conditional probability statements. . . . No model can include all such factors, thus all forecasts are conditional.” (us, 2020)

A couple years ago Sander Greenland and I published a discussion about renaming confidence intervals.

Confidence intervals

Neither of us likes the classical term, “confidence intervals,” for two reasons. First, the classical definition (a procedure that produces an interval which, under the stated assumptions, includes the true value at least 95% of the time in the long run) is not typically what is of interest when performing statistical inference. Second, the assumptions are wrong: as I put it, “Confidence intervals excluding the true value can result from failures in model assumptions (as we’ve found when assessing U.S. election polls) or from analysts seeking out statistically significant comparisons to report, thus inducing selection bias.”
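To make the long-run definition concrete, here is a minimal simulation sketch. It is my own toy setup, not anything from the published discussion: normal data with known standard deviation, the usual 95% z interval for the mean, and a hypothetical `bias` parameter standing in for nonsampling error. All names and numbers are illustrative assumptions.

```python
import random

def coverage(bias, true_mean=0.0, sigma=1.0, n=25, n_sims=4000, seed=1):
    """Fraction of simulated 95% intervals that contain true_mean.

    `bias` is a hypothetical nonsampling error: the data are centered at
    true_mean + bias, but the interval is built as if bias were zero.
    """
    rng = random.Random(seed)
    se = sigma / n ** 0.5  # standard error of the mean, sigma treated as known
    hits = 0
    for _ in range(n_sims):
        y = [rng.gauss(true_mean + bias, sigma) for _ in range(n)]
        est = sum(y) / n
        # Classical 95% z interval: estimate +/- 1.96 standard errors
        if est - 1.96 * se <= true_mean <= est + 1.96 * se:
            hits += 1
    return hits / n_sims

ok = coverage(bias=0.0)   # assumptions hold
off = coverage(bias=0.4)  # bias equal to two standard errors
print(f"coverage, assumptions hold: {ok:.3f}")
print(f"coverage, biased data:      {off:.3f}")
```

When the assumptions hold, the simulated coverage should land close to the nominal 95%. With a bias of two standard errors it should fall to roughly half, even though each interval is computed and labeled exactly as before.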

Uncertainty intervals

I recommended the term “uncertainty intervals,” on the grounds that the way confidence intervals are used in practice is to express uncertainty about an inference. The wider the interval, the more uncertainty.

But Sander doesn’t like the label “uncertainty interval”; as he puts it, “the word ‘uncertainty’ gives the illusion that the interval properly accounts for all important uncertainties . . . misrepresenting uncertainty as if it were a known quantity.”

Compatibility intervals

Sander instead recommends the term “compatibility interval,” following the reasoning that the points outside the interval are outside because they are incompatible with the data and model (in a stochastic sense) and the points inside are compatible with our data and assumptions. What Sander says makes sense.

The missing point in both my article and Sander’s is how the different concepts fit together. As with many areas in mathematics, I think what’s going on is that a single object serves multiple functions, and it can be helpful to disentangle these different roles. Regarding interval estimation, this is something that I’ve been mulling over for many years, but it did not become clear to me until I started thinking hard about my discussion with Sander.

Purposes of interval estimation

Here’s the key point. Statistical intervals (whether they be confidence intervals or posterior intervals or bootstrap intervals or whatever) serve multiple purposes. One purpose is to express uncertainty in a point estimate; another is to (probabilistically) rule out values outside the interval; yet another is to tell us that values inside the interval are compatible with the data. The first of these goals corresponds to the uncertainty interval; the second and third correspond to the compatibility interval.

In a simple case such as linear regression or a well-behaved asymptotic estimate, all three goals are served by the same interval. In more complicated cases, no interval will serve all these purposes.

I’ll illustrate with a scenario that arose in a problem I worked on, a bit over 30 years ago, and discussed here:

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously has no precision at all. (With an empty interval, you’d say that the model rejects and thus you can say nothing based on the model. You wouldn’t just say your interval is, say, [3.184, 3.184] so that your parameter is known exactly.)

For our discussion here, the relevant point is that, if you believe your error model, this is a fine procedure for creating a compatibility interval: as your data becomes harder and harder to explain from the model, the compatibility interval becomes smaller and smaller, until it eventually becomes empty. That’s just fine; it makes sense; it’s how compatibility intervals should be.

But as an uncertainty interval, it’s terrible. Your model fits worse and worse, your uncertainty gets smaller and smaller, and then suddenly the interval becomes empty and you have no uncertainty statement at all—you just reject the model.
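The shrinking-then-empty behavior can be sketched with a toy example. This is not the problem referred to above but a stand-in I am assuming for illustration: ten observations modeled as normal with unknown mean theta and known standard deviation 1, with the interval formed by inverting a chi-squared test that checks theta and the fit of the model at the same time. The function names, the residual pattern, and the hard-coded critical value (the tabulated 0.95 quantile of a chi-squared distribution with 10 degrees of freedom) are all part of the sketch.

```python
import math

CHI2_95_DF10 = 18.307  # tabulated 0.95 quantile, chi-squared with 10 df

def inverted_test_interval(y, sigma=1.0, crit=CHI2_95_DF10):
    """Set of theta values not rejected by the chi-squared test.

    The test statistic is sum((y_i - theta)^2) / sigma^2, which splits into
    misfit + n * (ybar - theta)^2 / sigma^2. Returns (lower, upper), or None
    when every theta is rejected. `crit` must match len(y) degrees of freedom.
    """
    n = len(y)
    ybar = sum(y) / n
    misfit = sum((yi - ybar) ** 2 for yi in y) / sigma ** 2
    slack = crit - misfit
    if slack < 0:
        return None  # the model itself is rejected: empty interval
    half_width = sigma * math.sqrt(slack / n)
    return (ybar - half_width, ybar + half_width)

# Hypothetical residual pattern (mean zero), rescaled to hit a target misfit
RESIDUAL_PATTERN = [-4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5]

def make_data(misfit, center=3.0):
    ss = sum(r * r for r in RESIDUAL_PATTERN)
    scale = math.sqrt(misfit / ss)
    return [center + scale * r for r in RESIDUAL_PATTERN]

for misfit in [5.0, 10.0, 15.0, 18.0, 18.3, 19.0]:
    interval = inverted_test_interval(make_data(misfit))
    if interval is None:
        print(f"misfit {misfit:5.1f}: empty set, model rejected")
    else:
        lo, hi = interval
        print(f"misfit {misfit:5.1f}: [{lo:.3f}, {hi:.3f}]")
```

As the misfit approaches the critical value, the interval around 3 narrows toward a single point; once the misfit exceeds it, no value of theta survives the test. Read as a statement of compatibility, that sequence is coherent. Read as a statement of uncertainty, it gets more precise exactly as the model fits worse.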

At this point Sander might stand up and say, Hey! That’s the point! You can’t get an uncertainty interval here so you should just be happy with the compatibility interval. To which I’d reply: Sure, but often the uncertainty interval is what people want. To which Sander might reply: Yeah, but as statisticians we shouldn’t be giving people what they want, we should be giving people what we can legitimately give them. To which I’d reply: in decision problems, I want uncertainty. I know my uncertainty statements aren’t perfect, I know they’re based on assumptions, but that just pushes me to check my assumptions, etc. Ok, this argument could go on forever, so let me just return to my point that uncertainty and compatibility are two different (although connected) issues.

All intervals are conditional on assumptions

There’s one thing I disagree with in Sander’s article, though, and that’s his statement that “compatibility” is a more modest term than “confidence” or “uncertainty”; all of the statements are mathematically valid within their assumptions, and none are in general valid when the assumptions are false. When the assumptions of model and sampling and reporting are false, there’s no reason to expect 95% intervals to contain the true value 95% of the time (hence, no confidence property), there’s no reason to think they will fully capture our uncertainty (hence, “uncertainty interval” is not correct), and no reason to think that the points inside the interval are compatible with the data and that the points outside are not compatible (hence, “compatibility interval” is also wrong).

All of these intervals represent mathematical statements and are conditional on assumptions, no matter how you translate them into words.

And that brings us to the quote from Jessica, Chris, Elliott, and me at the top of this post, from a paper on information, incentives, and goals in election forecasts, an example in which the most important uncertainties arise from nonsampling error.

All intervals are conditional on assumptions. Calling your interval an uncertainty interval or a compatibility interval doesn’t make that go away, any more than calling your probabilities “subjective” or “objective” absolves you from concerns about calibration.