Jake Hofman, Dan Goldstein, and Jessica Hullman write:

Scientists presenting experimental results often choose to display either inferential uncertainty (e.g., uncertainty in the estimate of a population mean) or outcome uncertainty (e.g., variation of outcomes around that mean). How does this choice impact readers’ beliefs about the size of treatment effects? We investigate this question in two experiments comparing 95% confidence intervals (means and standard errors) to 95% prediction intervals (means and standard deviations). The first experiment finds that participants are willing to pay more for and overestimate the effect of a treatment when shown confidence intervals relative to prediction intervals. The second experiment evaluates how alternative visualizations compare to standard visualizations for different effect sizes. We find that axis rescaling reduces error, but not as well as prediction intervals or animated hypothetical outcome plots (HOPs), and that depicting inferential uncertainty causes participants to underestimate variability in individual outcomes.

These results make sense. Sometimes I try to make this point by distinguishing between *uncertainty* and *variation*. I’ve always thought of these two concepts as distinct (we can speak of uncertainty in the estimate of a population average, or variation across the population), but then I started quizzing students, and I learned that, to them, “uncertainty” and “variation” were not distinct concepts. Part of this is wording (there’s an idea that these two words are roughly synonyms), but I think part of it is that most people don’t think of these as being two different ideas. And if lots of students don’t get this distinction, it’s no surprise that researchers and consumers of research also get stuck on this.
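To make the distinction concrete, here is a minimal sketch in Python, not taken from the paper. It assumes a normal approximation (z = 1.96 rather than a t quantile), and the simulated data and the `intervals` helper are made up for illustration. The confidence interval reflects uncertainty about the population mean; the prediction interval reflects variation of individual outcomes plus that uncertainty.

```python
import math
import random
import statistics

def intervals(sample, z=1.96):
    """Return (mean, CI half-width, PI half-width) under a normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)             # variation across individuals
    se = sd / math.sqrt(n)                    # uncertainty in the estimated mean
    ci_half = z * se                          # 95% confidence interval half-width
    pi_half = z * sd * math.sqrt(1 + 1 / n)   # 95% prediction interval half-width
    return mean, ci_half, pi_half

# Hypothetical data: 400 outcomes with true mean 100 and true SD 15.
rng = random.Random(0)
sample = [rng.gauss(100, 15) for _ in range(400)]

mean, ci_half, pi_half = intervals(sample)
print(f"mean = {mean:.1f}")
print(f"95% CI: +/- {ci_half:.2f}  (uncertainty about the population mean)")
print(f"95% PI: +/- {pi_half:.2f}  (range for one new outcome)")
```

With n = 400 the prediction interval is about 20 times wider than the confidence interval (the ratio is sqrt(n + 1)), which is why the two kinds of error bars leave such different impressions.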

I’m reminded of the example from a few months ago where someone published a paper including graphs that revealed the sensitivity of its headline conclusions to some implausible assumptions. The question then arose: what if the paper had not included the graphs? Then maybe no one would’ve realized the problem. I argued that, had the graphs not been there, I would’ve wanted to see the data. But a lot of people would just accept the estimate and standard error and not want to know more.

Andrew:

Somewhat related:

> Sometimes I try to make this point by distinguishing between uncertainty and variation.

Daniel Kahneman talks here about the distinction/confusion among noise, biases, and variance.

https://www.schwab.com/resource-center/insights/content/choiceology-season-5-episode-1

At about 19:30 in.

He doesn’t say anything particularly profound…but you might find it worth listening to in terms of clarifying these concepts with students.

To paraphrase Bill Spight from the other thread[1]:

“The question comes down to this. The uncertainty of what?”

Uncertainty and variability are properties of (abstract) objects, that may or may not intersect depending on which objects are being referred to and the specific context. It’s like saying ‘her pen’ and ‘her writing device’. Are they the same? Well, I don’t know: are they referring to the same ‘her’, who is she, and how does she like to do her writing? They’re not exactly the same context, but their overlap/similarity/synonymity (is that a word?) is contextual.

The variability of the time when I look at the clock is generally larger than the uncertainty in that time. But the variability in the next number produced by my random number generator is typically the same as my uncertainty in the outcome. I suspect whether one identifies these concepts as separate may depend on the specific examples one is thinking of when asked about it, which is probably highly dependent on the person’s own context.

I mean, sure, we can talk about the uncertainty in the estimate of a population average. But we can also talk about the uncertainty in the age of the next person picked at random from a population.

[1]https://gelmanstatdev.wpengine.com/2020/05/09/standard-deviation-standard-error-whatever/

> we can speak of uncertainty in the estimate of a population average, or variation across the population

We can also speak of variation of the estimator when we repeatedly calculate sample averages, or uncertainty about the value for a single measurement or individual.

I agree with Kyle, the question is “uncertainty about what”. It may be better to be explicit than to leave the object implicit in the choice of words (which are to some extent synonymous).

“I agree with Kyle, the question is “uncertainty about what”. It may be better to be explicit than to leave the object implicit in the choice of words (which are to some extent synonymous)”

Ditto.

> animated hypothetical outcome plots (HOPs)

Cool – if I ever get around to brewing beer again, that will be the label for it.

One of the things I have been thinking about is expanding on the meaning of probability into what would repeatedly happen in a chosen fake world that hopefully is like ours. And of course, animating that via simulation.

This is a really difficult concept. But it comes up in MCMC and hence in Stan all the time. MCMC standard error on the estimate of a posterior mean for a quantity is the posterior standard deviation of that quantity divided by the square root of the effective sample size. Adding more MCMC draws reduces the MCMC standard error, but the posterior standard deviation doesn’t change. Your uncertainty in the quantity is best characterized by the posterior standard deviation. But it’s hard to get people off their obsession with lots of iterations to get a really tight estimate of a posterior mean.

It’s why David MacKay stated that you only need an effective sample size of a dozen for MCMC inference; it gets your MCMC standard error to about a quarter of the standard deviation.

We usually recommend aiming for a minimum ESS of 400 with Stan, because that lets you run 4 chains to help diagnose non-convergence and lets you run enough iterations per chain to reduce the noise in the effective sample size estimate itself. But it’s overkill in that we don’t really need to know the posterior mean to within a twentieth of the posterior standard deviation.
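The arithmetic behind these numbers is easy to check. A minimal sketch, assuming only the relation stated above (MCMC standard error = posterior standard deviation / sqrt(ESS)); the function name `mcse_over_sd` is made up for illustration and is not part of Stan.

```python
import math

def mcse_over_sd(ess):
    """MCMC standard error of a posterior mean, in units of the posterior SD."""
    return 1.0 / math.sqrt(ess)

for ess in (12, 16, 100, 400):
    print(f"ESS {ess:>3}: MCSE = {mcse_over_sd(ess):.3f} posterior SDs")
```

An ESS of 12 gives roughly 0.29 posterior SDs, a bit over a quarter; 16 gives exactly a quarter, 100 gives a tenth, and 400 gives a twentieth. In every case the posterior SD itself is untouched: more draws only shrink the first number.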

The efficient frontier I care about is wall time to effective sample size 100 or even 50 vs. power usage (to adjust for multiple cores, GPUs, etc.). For Stan, that means working on better adaptation.

This is a good example of what Kyle says above “uncertainty of what?”

Large effective sample size reduces your uncertainty in your uncertainty (cue Xzibit meme here).

There are several questions:

1) At the highest level, which model should we use? (Model choice, or Bayesian mixture models)

2) Next level: what do we know about the world through our model? (posterior distribution of the parameters)

3) At the next level, what can we numerically calculate from the posterior (MCMC standard errors)

Big MCMC runs just make 3 converge towards 2… but 2 can only converge towards exact knowledge by extensive observations of real world data, not by more draws from your MCMC sampler.

This issue is depressingly common. I mean, just look at the graphs – why on earth do we as statisticians have the convention where the same “error bar” formulation is used to display both completely different quantities?

For that matter, why on earth do we use such similar terms for such distinct concepts: standard deviation vs. standard error vs. sampling error…sampling distribution vs. sample distribution…I can’t tell you how many times I’ve read that an author computed the sampling error from the sample.

Yes, standard deviation is descriptive. It does not decrease as you get more data.
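A quick simulation shows the point. This is only a sketch with made-up numbers (a normal population with SD 10, which is an assumption of the example): as n grows, the sample standard deviation settles near 10 while the standard error of the mean keeps shrinking.

```python
import math
import random
import statistics

rng = random.Random(1)
POPULATION_SD = 10.0  # assumed for the simulation

def sd_and_se(n):
    """Draw n values and return (sample SD, standard error of the mean)."""
    sample = [rng.gauss(50.0, POPULATION_SD) for _ in range(n)]
    sd = statistics.stdev(sample)
    return sd, sd / math.sqrt(n)

results = {n: sd_and_se(n) for n in (100, 1_000, 10_000)}
for n, (sd, se) in results.items():
    print(f"n = {n:>6}: SD = {sd:5.2f}   SE of mean = {se:.3f}")
```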

The biggest example of this confusion is all the people trying to decrease the uncertainty in the climate sensitivity. They’ve been publishing papers about this for decades

Sorry, accidentally submitted the above on mobile before it was complete.

The point of that post was that the biggest example of this confusion is the climate sensitivity. A more precise estimate is said to be worth $10 trillion, but the spread around the central value has changed little in decades. That is actually because what they are calling “uncertainty” is a descriptive statistic that will not decrease in response to more data:

https://gelmanstatdev.wpengine.com/2015/12/10/28302/#comment-254706

Basically, the climate sensitivity to CO2 is not a constant. The effect of CO2 varies depending on what is going on with the rest of the Earth/Sun system. This variation is not an uncertainty that decreases as more data is collected.

Frequentism is naturally going to make people equate uncertainty with variation, which is a recipe for endless befuddlement, especially since the uncertainty *is* sometimes numerically similar to the variation.

In academic areas outside statistics this befuddlement is a field killer. All of theoretical finance seems based on the assumption that uncertainty = variation, and as an academic field, it’s the epitome of “publish thousands of papers and get absolutely nowhere”.

To see what problems this causes in physics look at that Jaynes paper that contained the famous dice problem:

https://bayes.wustl.edu/etj/articles/stand.on.entropy.pdf

Look around page 64 under “fluctuations”.

To see the crux of the issue, let x1, …, xn be a sequence of events. Think of it as the prototypical “repeated trial” of frequentist lore. Then there are several distinct entities:

(1) the Uncertainty around each xi computed from the marginal distributions P(xi)

(2) the Uncertainty around the entire sequence x1,…,xn which is something like the entropy computed from P(x1,…,xn)

(3) the actual variation (1/n)(\sum_i(xi – mean)^2)

(4) the estimate of the variation E[(1/n)(\sum_i(xi – mean)^2)]

(5) the uncertainty of the estimate of the variation

Sometimes these are numerically similar, often they’re not. In essence, Frequentism starts with the diktat that many of these are identical always and forever.
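One way to see that these are different numbers is to compute all five in a toy case. The sketch below is an illustration under assumptions not in the comment: x1, …, xn are independent normal draws with known SD `sigma`, “mean” in (3) and (4) is taken to be the sample mean, the joint uncertainty in (2) is measured by differential entropy in nats, and (4) and (5) are approximated by simulating many sequences.

```python
import math
import random
import statistics

rng = random.Random(2)
sigma, n, reps = 2.0, 20, 5000  # assumed values for the illustration

def draw_sequence():
    """One 'repeated trial': n independent normal draws."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]

# (1) uncertainty around each xi: SD of the marginal distribution
marginal_sd = sigma
# (2) uncertainty around the whole sequence: entropy of the joint distribution
joint_entropy = n * 0.5 * math.log(2 * math.pi * math.e * sigma**2)
# (3) actual variation of one realized sequence, (1/n) * sum((xi - mean)^2)
actual_variation = statistics.pvariance(draw_sequence())
# (4) expected variation, approximated by averaging over simulated sequences
variations = [statistics.pvariance(draw_sequence()) for _ in range(reps)]
expected_variation = statistics.fmean(variations)
# (5) uncertainty of that estimate: spread of the variation across sequences
sd_of_variation = statistics.stdev(variations)

print(f"(1) marginal SD         = {marginal_sd:.2f}")
print(f"(2) joint entropy       = {joint_entropy:.2f} nats")
print(f"(3) actual variation    = {actual_variation:.2f}")
print(f"(4) expected variation  = {expected_variation:.2f}")
print(f"(5) SD of the variation = {sd_of_variation:.2f}")
```

In this toy case (4) is close to sigma^2 (n - 1)/n = 3.8, while (3) is whatever the one realized sequence happened to give, and (5) says how far (3) typically lands from (4).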

Isn’t this just precision vs. accuracy, as commonly visualized using a target and a group of shots?

The questions the values/graphs are supposed to answer are:

a) what is the probability that we make the better choice? (error)

OR what is the probability that the better-looking choice leads to a better outcome on average?

b) what is the probability that the better choice leads to a better individual outcome?

I feel that the blog illustration is misleading: because the deviation bars are so long and have the same thickness throughout, they’ll intuitively look like linear distributions and not normals if you’re not trained.

It could be super misleading if values are correlated. For example, if I’m graphing how much money people on the street have in their wallets before and after I interact with them, you could get a graph similar to the blog post’s illustration, with high variability, and you’d conclude that the probability that someone is better off after that interaction is greater than 50%, but not by much. That is, until you learn that I gave everyone $5.

And that would be clear from an error graph: that it’s a small, but certain improvement. The fact that it is a small improvement needs to be concluded from my knowledge of the domain and the actual values.

I’m not sure you’ve understood the point of the post, which is that people have poor intuition about the difference between a descriptive measure and an inferential one. It has nothing to do with making the right choice.

As to your other points:

Error bars are fairly universally based on a Gaussian distribution, not a uniform one. Lines aren’t the only choice, but they are conventional. If anything, most people learn about standard deviation around the same time they learn about normal distributions, so they likely implicitly link the two – or don’t understand either.

In your example, you’d presumably (or at least you should) use a paired design rather than treating the ‘before’ and ‘after’ as two samples (at which point all your differences would lie exactly at $5, so you’d have a standard deviation of 0 anyway). If you didn’t, then sure, you might miss the evidence.
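The wallet example can be sketched in a few lines. The data are hypothetical (whole-dollar amounts drawn at random, everyone receiving exactly $5), and the variable names are made up for illustration. Treated as two independent samples, the $5 shift is buried in wallet-to-wallet variation; treated as paired differences, it is exact.

```python
import math
import random
import statistics

rng = random.Random(3)
n = 50
before = [rng.randint(0, 200) for _ in range(n)]  # hypothetical wallets, in dollars
after = [b + 5 for b in before]                   # everyone is handed $5

# Unpaired view: compare two group means, ignoring who is who.
diff_of_means = statistics.fmean(after) - statistics.fmean(before)
se_unpaired = math.sqrt(
    statistics.variance(before) / n + statistics.variance(after) / n
)

# Paired view: one difference per person.
diffs = [a - b for a, b in zip(after, before)]
sd_paired = statistics.stdev(diffs)

print(f"difference of means: {diff_of_means:.2f} (unpaired SE {se_unpaired:.2f})")
print(f"paired differences:  mean {statistics.fmean(diffs):.2f}, SD {sd_paired:.2f}")
print(f"share better off:    {sum(d > 0 for d in diffs) / n:.0%}")
```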

Besides, if you have outright evidence of causation, that’s not a matter for statistical uncertainty.

No, really it is the distinction between a measure of variation for the sample mean (which is a function of outcomes) versus the measure of variation for a single outcome. Both can be regarded as inferential, the former typically using a confidence interval and the latter using a prediction interval. There is, of course, a third frequentist interval commonly used, the tolerance interval for a proportion of the outcomes in the population (or equivalently a prediction interval for any number of future outcomes). And yes, confidence intervals for the sample mean can be made vanishingly small, while the corresponding prediction intervals for a single outcome, or the tolerance interval for a proportion of outcomes, remain large. But I do not see the distinction as descriptive versus inferential.

You’re right, a prediction interval is inferential. I didn’t think about that.

Calculating a prediction interval generally includes more assumptions about the distribution of values than a simple standard deviation, no? A standard deviation is just a descriptive measure of variation, which you’re right, can be used to make inferences.

I guess a better description of the distinction (or lack thereof) is that people confuse ‘x is better than y, on average’ with ‘x will be better than y in this case specifically’ when really we should care about both.

Finance/trading struggles with this as well. “Risk is measurable uncertainty while uncertainty is immeasurable risk” Hillson (2004)

David:

No, that’s a different distinction. You’re talking about two sources of uncertainty. I’m talking about uncertainty vs. variation.

Not to mention our uncertainty about variation is generally greater than our uncertainty about means, innit?

Which leads us to the question: What does the mean mean?

I would say the graphic doesn’t demonstrate that “people confuse a well-estimated mean with a certain outcome” as much as it demonstrates that “people can be tricked by bad graphics into mistaking a well-estimated mean for a certain outcome”

Also see aleatory vs. epistemic uncertainty.

“Uncertainty about what” is the crux of it.

Here is an easier explanation for the general public -> https://www.epa.gov/expobox/uncertainty-and-variability

A simple example of what to use to express uncertainty, using a medical example:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487226/

Usually, if one doesn’t understand the difference b/w the descriptive and inferential part of their output, they’ll go with the narrower SEM bars (convenient), missing out on what SEM stands for, compared to SD.

BTW, I don’t think ‘error’ is an appropriate term at all. In the case of SEM, nobody makes an ‘error’ intentionally. It should be called a ‘standard uncertainty estimate of the population mean’, or similar.

One easy way to improve these plots would be to plot dotplots or boxplots of the entire dataset, rather than just plotting the mean ± SD or mean ± SE.

Boxplot is misleading when visualising uncertainty

https://ctg2pi.wordpress.com/2015/02/24/principles-of-posterior-visualization/