Richard Born writes:

The practice of arbitrarily thresholding p values is not only deeply embedded in statistical practice, it is also congenial to the human mind. It is thus not sufficient to tell our students, “Don’t do this.” We must vividly show them why the practice is wrong and its effects detrimental to scientific progress. I [Born] offer three teaching examples I have found to be useful in prompting students to think more deeply about the problem and to begin to interpret the results of statistical procedures as measures of how evidence should change our beliefs, and not as bright lines separating truth from falsehood.

He continues:

Humans are natural born categorizers. We instinctively take continuous variables and draw (often) arbitrary boundaries that allow us to put names to groups. For example, we divide the continuous visible spectrum up into discrete colors like “red,” “yellow,” and “blue.” And the body mass index (BMI) is a continuous measure of a person’s weight-to-height ratio, yet a brief scan of the Internet turns up repeated examples of the classification [into three discrete categories].

In some cases, such as for color, certain categories appear to be “natural,” as if they were baked into our brains (Rosch, 1973). In other cases, categorization is related to the need to make decisions, as is the case for many medical classifications. And the fact that we communicate our ideas using language—words being discrete entities—surely contributes to this tendency.

Nowhere is the tendency more dramatic—and more pernicious—than in the practice of null hypothesis significance testing (NHST), based on p values, where an arbitrary cutoff of 0.05 is used to separate “truth” from “falsehood.” Let us set aside the first obvious problem that in NHST we never accept the null (i.e., proclaim falsehood) but rather only fail to reject it. And let us also ignore the debate about whether we should change the cutoff to something more stringent, say 0.005 (Benjamin et al., 2018), and instead focus on what I consider to be the real problem: the cutoff itself. This is the problem I refer to as “black/white thinking.”

Because this tendency to categorize using p values is (1) natural and (2) abundantly reinforced in many statistics courses, we must do more than simply tell our students that it is wrong. We must show them why it is wrong and offer better ways of thinking about statistics. What follows are some practical methods I have found useful in classroom discussions with graduate students and postdoctoral fellows in neuroscience. . . .

It’s worth commenting that his logic is not sound. We break up the ‘continuous electromagnetic spectrum’ not arbitrarily, but significantly because we have discrete color rods, three. That’s not arbitrary. It may be centered on our own biology, but it is very real. Black and white are also not arbitrary. There is a hysteresis curve in detection.

People with rigid linear thinking are not considering the non-linearity inherent in biology, or much of nature.

Yeah, that’s not how color perception works. The fact that we have three types of color-detecting cells, each of which, by the way, has a continuous response curve, does of course not explain 7 color names.

It does explain why three dimensions are enough to describe all colors humans perceive, though.

The number of distinguished colors, as judged by the number of color names, is a cultural variable. There are groups that distinguish three, or even only two, colors. And the “red” and “green” pigments barely differ from one another in their absorption curves, and neither is in the wavelength range that could be called “red”.

So yes, there is a large amount of arbitrariness in how we partition the spectrum.

As a non-statistician who ventured to write an article on the teaching of statistics (namely, the one from which Dr. Gelman quotes above; https://www.eneuro.org/content/6/6/ENEURO.0456-19.2019), my deepest fear was being criticized for the statistics. So imagine my surprise when I accidentally came upon this blog and found the first thing being discussed was vision–a topic I actually know something about! For my entire professional career, I have studied the primate visual system, and all I can say is that Tom Passin gets it right, and BenK gets it completely wrong. The fact that we have 3 discrete types of cone photoreceptors completely misses the point. As Mr. Passin points out, the classes have very broad and largely overlapping absorption spectra. (In fact, referring to the cone classes as “red,” “green” and “blue” is a complete misnomer, and most vision scientists refer to them more correctly as “long,” “middle” and “short” wavelength cones. Cone spectra here: https://upload.wikimedia.org/wikipedia/commons/9/94/1416_Color_Sensitivity.jpg). Color perception is actually very complex and tricky, but there is no disputing that our categories are both limited and completely arbitrary (even if, as primates we might largely agree among ourselves–a fact which is interesting). And even something as seemingly simple as lightness perception (“black/white”) is complex, and binary distinctions like B/W arbitrary. To convince yourselves of this, take a look at this now-classic brightness illusion created by Ted Adelson (https://www.illusionsindex.org/ir/checkershadow).

Color vision is actually another good way to teach students about the pitfalls of black/white thinking. I start by asking them whether dogs are colorblind. Almost everyone says, yes, of course, everyone knows that. But dogs, like human deuteranopes, are simply missing the middle wavelength cone, so they can distinguish many of the same colors that we humans can. That is, it’s not a matter of “seeing color” vs. being “color blind”; it’s a matter of dimensionality. As Ewout ter Haar correctly points out, three dimensions are enough to describe the color perception of most humans. You’d only need two for dogs. But some species of marine shrimp might require 7 dimensions!

Those are some nice words you are using (y).

I’d argue that arbitrary vs. significant is unnecessarily binary.

I think the larger point is that whether distinctions being drawn are arbitrary or significant necessarily depends on context, and then even on perspective. Sometimes black/white distinctions are useful, sometimes they’re not. Sometimes their usefulness is subjectively determined.

I think what’s most important to remember is that where we draw the lines, between black and white, or between arbitrary and significant, is not merely a function of the phenomena we’re assessing but also of our orientation and our priors/biases. It depends on how we view variance and nose versus signal.

Maybe the key is to remain flexible and not too attached to our judgements.

… and noise versus signal also.

This depends on what your definitions of “noise” and “signal” are.

Martha –

No doubt.

As for (below), re “arbitrary”…

> I think one can give a fairly good definition of “arbitrary”,

It’s interesting to me that arbitrary can have two quite different meanings. One is basically “random” and the other is as determined by subjective preferences. Not sure which one you’re referring to…but yes, I agree with the “false dichotomy” aspect – which is what I was going for with “unnecessarily binary.”

Joshua said,

“I’d argue that arbitrary vs. significant is unnecessarily binary”

I think one can give a fairly good definition of “arbitrary”, but not of “significant” (in particular, “significant” depends on context). So I see “arbitrary vs significant” as a false dichotomy.

I have trouble thinking binarily. I force myself to do so to call attention to the fallacies that may be entailed in comparisons/contrasts made. It’s been frustrating for me.

+1

It can be frustrating to be compelled to constantly distinguish shades of gray. It makes it hard to feel settled sometimes.

On the other hand, it can be liberating to realize that black versus white is sometimes an illusion, and that the confidence derived from seeing black and white is fragile.

I like to think that seeking shades of gray is anti-fragile. I’ll let you know if I ever determine if that’s true. 🙂

“It can be frustrating to be compelled to constantly distinguish shades of gray. It makes it hard to feel settled sometimes.”

Sounds like you place a high value on “feeling settled” (which is a term you have not defined, so I really don’t know what you mean by it).

Martha –

> Sounds like you place a high value on “feeling settled” (which is a term you have not defined, so I really don’t know what you mean by it).

Not necessarily a high value. I like to feel like I know the answer in an unambiguous way. Uncertainty often feels uncomfortable. So sometimes a binary construction makes it easier to feel certain – even if the binary construction might not stand up to closer scrutiny.

Is that any clearer?

Yes, I think you have clarified what you are saying — but what you say is disappointing. It sounds like you put things like “feeling like you know the answer in an unambiguous way”, avoiding the uncomfortableness of uncertainty, and things that make it “easier to feel certain” above accepting uncertainty. I refer you to https://web.ma.utexas.edu/users/mks/statmistakes/uncertainty.html for my views on (accepting the reality of) uncertainty.

Martha –

> but what you say is disappointing. It sounds like you put things like “feeling like you know the answer in an unambiguous way”, avoiding the uncomfortableness of uncertainty, and things that make it “easier to feel certain” above accepting uncertainty.

Not “above.” I just recognize that I have that tendency, sometimes.

White: “measures of how evidence should change our beliefs”

Black: “bright lines separating truth from falsehood”

While that may sound facetious, the epistemology remains very murky for me. No one seems to be able to criticize p values in purely general terms, instead always referring to improper ways they are used. Ultimately, we want to know whether the peak rises above the noise by some objective measure that everyone can agree upon.

Anybody care to take a stab at p values in purely general terms? Specifically, can you show that no matter how stringently I define my null, it still cannot encompass all potential sources of noise for purely theoretical reasons? As near as I can tell, that is a high bar that no one to date has cleared.

Well, p-values are mathematical constructs, just like triangles or numbers, so they cannot be good or bad in themselves.

But statistics is a sort of applied mathematics, and the “applied” part matters here. If a scientific method requires you to balance a pyramid on its tip (which is totally possible in the math world), maybe the method is not good in practice.

+1

I doubt such a general demonstration exists, although much depends on the specifics and definitions. Loosely speaking if there’s a single instance in which p-value reasoning is “ok” then there can’t be a general proof they’re “bad”. Such a demonstration would prove too much.

The situation is somewhat similar to Fermat’s Last Theorem. The fact that there exist non-trivial solutions to x^n+y^n=z^n for n greater than 2 over the real numbers guarantees there can’t (loosely speaking again) be an “algebraic” proof that solutions don’t exist. If such a demonstration did exist it would apply to the real numbers as well and hence prove too much. Somehow the proof has to rely on properties that integers have that real numbers in general don’t.

This is a general problem when dealing with Frequentism. If Frequentism was simply wrong you could prove it and be done. But that’s not the issue with Frequentism. A fuzzier description of the problem with “Frequentism” is that it’s a highly special case which classical statisticians decided by fiat must be the general case. So there can’t be a general demonstration it’s “bad” since that would prove too much (i.e. it would prove that Frequentism is bad in that “special” case as well).

That in part is what makes the debate intracranial.

Exactly. In fact, we can go further. Per Martin-Löf proved that the definition of a “random bit sequence” as a bit sequence which has close to maximal Kolmogorov complexity (i.e. a sequence where you can’t write a computer program to output the sequence that’s substantially shorter than the sequence itself) is mathematically equivalent to the definition where the sequence doesn’t have a small p value in some “most powerful computable test” that tests for random sequences.

So, if you want to construct a random sequence, all you have to do is construct some extremely rigorous testing program, and then use some data filtering procedure on a source of electrical noise, and then test to see if that very long sequence is accepted or rejected by your test of randomness. If it’s accepted, and if your test of randomness is sufficiently good, then the sequence is random.
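
To make the two definitions concrete, here is a rough sketch and nothing more. It assumes zlib's compressed size as a crude upper-bound stand-in for Kolmogorov complexity and a simple bit-frequency ("monobit") test as a deliberately weak test of randomness; neither is the "most powerful computable test" of the theorem, and the function names are made up for illustration.

```python
import math
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size / original size.

    Only a crude upper-bound proxy for Kolmogorov complexity:
    a ratio near 1 means zlib found no shorter description.
    """
    return len(zlib.compress(data, 9)) / len(data)


def monobit_p_value(data: bytes) -> float:
    """Two-sided p-value for 'half the bits are 1' (normal approximation)."""
    n_bits = 8 * len(data)
    ones = sum(bin(byte).count("1") for byte in data)
    z = (2 * ones - n_bits) / math.sqrt(n_bits)
    return math.erfc(abs(z) / math.sqrt(2))


# Pseudo-random bytes from a seeded generator (stand-in for filtered noise).
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(10_000))

# An obviously patterned sequence: the byte 10101010 repeated.
patterned = bytes([0b10101010]) * 10_000

for name, seq in (("noisy", noisy), ("patterned", patterned)):
    print(name, compression_ratio(seq), monobit_p_value(seq))
```

The patterned sequence has exactly half its bits set, so the weak frequency test does not reject it, yet it compresses to almost nothing. That is the sense in which the test has to be "sufficiently good". Note also that the seeded generator's output looks incompressible to zlib even though a short program produces it, so the compression proxy can only ever overstate complexity.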

Proving that this doesn’t work would be proving that MCMC procedures don’t work at approximating expectation integrals. But they do work. Sure, you can probably define a definition of the word “work” which makes your proofs come out ok. But it won’t correspond to what anyone else would accept as the definition of “work”.

Basically, p values are mathematical facts about random number generators *by definition of what a random number generator is*. The problem with them isn’t that they’re wrong mathematical facts, it’s that in science we don’t deal with random number generators.

> in science we don’t deal with random number generators.

Agree we don’t deal with them but we think in terms of them.

Matt said,

“the epistemology remains very murky for me. No one seems to be able to criticize p values in purely general terms, instead always referring to improper ways they are used. Ultimately, we want to know whether the peak rises above the noise by some objective measure that everyone can agree upon.”

We, paleface? (See https://ordinary-times.com/2011/03/29/what-do-you-mean-we-paleface/ if you don’t know the reference.)

> “Banishing ‘Black/White Thinking’

A deeply ironic clause.

I think I disagree with Born here. When I was a statistics undergrad not too long ago, that p-values are continuous in their domain was well-observed. In those Stat 101 courses where we talked about NHST, it was frequently mentioned and I think generally appreciated that any set threshold is arbitrary. The discretization of a continuous scale is neither the main problem with NHST or the way it’s commonly taught, in my opinion. The less-appreciated problems of NHST (but that are well-known on this blog) that if widely understood would be a death sentence to it are the issues of measurement error, priors, and that the framing of the null hypothesis is itself not very meaningful. More generally, NHST just doesn’t give you anything that meaningful.

Michael J said,

“The discretization of a continuous scale is neither the main problem with NHST or the way it’s commonly taught, in my opinion. ”

Regrettably, it *is* the way NHST is very frequently taught.

There are often times when we have to make a dichotomous decision (should I give this Covid-19 patient Remdesivir?).

I appreciate the enthusiasm of the educator who wrote this article but his examples don’t really argue against dichotomisations. Rather, they show that there are alternate ways to dichotomise. (Well the first two examples do. I can’t make head or tail out of the third example, which is often the case when someone pulls the Bayes factor rabbit out of the hat without defining what they mean by a Bayes factor).

Nick said,

“There are often times when we have to make a dichotomous decision (should I give this Covid-19 patient Remdesivir?).”

Yes, but that decision needs to be made taking more than just p-values into account. A “one-size-fits-all” approach to using p-values is all too often an excuse not to think.

Martha, I agree entirely.

If one reads the medical literature as it is written, and in particular the abstracts and press releases produced from them, then it would appear that decisions are made entirely by dichotomising p-values. But in every academic hospital, every department will have a monthly journal club where 2 or 3 articles are studied in detail. What is usually taken into account is:

1. The plausibility of the claim based on previous knowledge (a sort of vague prior)

2. Possible biases inherent in the design and conduct of the trial

3. Conflict of interest

4. Unmentioned potential adverse effects of any proposed intervention

5. The effect size, confidence intervals and yes, p-values

After all this is taken into account then we dichotomise (or more likely, decide we are still not sure!)

As an example consider the use of hydroxychloroquine in covid19:

1. Early reports of efficacy were based on in vitro activity against the virus – we know that this is a poor predictor of in vivo activity

2. All trials so far have been observational and at high risk of bias

3. At least some trialists have staked their reputation on the drug or stockpiled it.

4. I have seen a few cases of chloroquine overdose – it is not pretty.

5. By the time we get here to the summary statistics, very little faith is put in their ability to guide us to a dichotomous decision.

Conclusion: more data needed.

I understand all of your 10 points, but I am not sure why the conclusion is that more data are needed. Do you think that dichotomization would somehow be less utilized if more data were available? My apologies, I may be misunderstanding your post.

In my experience, those who have decried dichotomization also use it often without realizing that they are relying on it to make their case. Dichotomization is not so useless in simple cases, but in complex variable cases it seems unhelpful as a tool.

Nick,

The journal club discussions that you described for teaching hospitals are definitely a good thing — but the question remains: Do the results of these discussions “trickle down” to the rest of the hospitals? If there is no written record (or at least none that is disseminated widely) of the discussions in the teaching hospitals, I doubt that they have much effect outside of that narrow community.

The article also contains the word “behooves” which I arbitrarily and dichotomously declare as unacceptable.

Nick:

I’m reminded of when a blog commenter chewed me out for calling something “shoddy,” and I replied, accurately, that this was a word that I never use. I think I can speak for guilty people everywhere when I say that the absolute worst thing is being slammed for something I didn’t even do.

Hah! Unlike Andrew’s “shoddy,” I am guilty as charged. But I stand by “behooves”: it is a perfectly good word with a perfectly clear meaning, AND it is intrinsically humorous. It always reminds me of an old Saturday Night Live commercial in which (I think) Laraine Newman declares, “It’ll behoove ya to take care of your uvula.” https://www.youtube.com/watch?v=pRoGEWj6ka4

My favorite part of Born’s article:

“I close off this example by displaying the title of the article by Gelman and Stern (2006), and I encourage the students to repeat the title as a mantra each night before they go to bed and each morning when they awake for the next 2 months. And I add in a favorite quote from Rosnow and Rosenthal (1989): “That is, we want to underscore that, surely, God loves the 0.06 nearly as much as the 0.05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?” (emphasis added). As this quote is bang on and moderately funny, it puts a memorable cap on the exercise.”

Martha:

I don’t like that last quote. For reasons discussed in my article with Stern, I think the 0.04 vs. 0.06 thing is a distraction, as it leads people to think that the main problem with a p-value cutoff rule is arbitrariness. As Stern and I discuss, the difference between, say, p=0.20 and p=0.01 is not statistically significant, despite 0.20 being considered no evidence at all (or, worse, evidence in favor of the null) and 0.01 being considered very strong evidence against the null.
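
A minimal sketch of that comparison, assuming two independent estimates that each have standard error 1 and a normal sampling distribution. These are assumptions made here for illustration, not numbers taken from the article with Stern.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under a standard normal null."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def z_from_two_sided_p(p: float) -> float:
    """Positive z statistic that yields the given two-sided p-value."""
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


# Hypothetical studies: each estimate has standard error 1,
# so the estimate equals its z statistic.
z_a = z_from_two_sided_p(0.01)  # "very strong evidence"
z_b = z_from_two_sided_p(0.20)  # "no evidence at all"

# The difference of two independent estimates with SE 1 has SE sqrt(2).
z_diff = (z_a - z_b) / sqrt(2.0)
p_diff = two_sided_p(z_diff)

print(f"z_a={z_a:.2f}, z_b={z_b:.2f}, z_diff={z_diff:.2f}, p_diff={p_diff:.2f}")
```

Under these assumptions the p-value for the difference between the two studies comes out well above 0.05, even though one study sits far below the conventional cutoff and the other far above it.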

I think I see your point.

Well, to be fair, the example that this quote “closes off” is exactly the point that you raise in your article with Stern. The reason I like to throw it in is that it is highly memorable and it reminds the students about the continuous nature of p-values. I also sometimes use the image of the spooky Halloween Jack-o-Lantern that has “p = .06” carved into it. I think humor is under-rated as a teaching tool. https://www.patheos.com/blogs/tippling/2017/10/31/spooky-punkins-and-statistical-correlation/

The fuzzy logic folk deal with some of these questions by looking at how well one thing resembles another rather than asking what is the probability they are the same (or different). For example, is a woman whose height is 5 ft 7 inches tall (170.2 cm) “tall”? There can be no probability that she is tall. Rather there can be a measure as to how well she matches a template for tallness in women. A woman 160 cm tall would not be a good match, one 180 cm tall would.

The template may (or may not) be arrived at by measuring probabilities that various people would call that height short, medium, tall, etc. But that’s a different thing.
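
One hedged way to picture such a template is a membership function that returns a degree of match rather than a probability. The linear ramp and its endpoints (160 cm and 180 cm, borrowed from the two examples above) are assumptions for illustration; a real template could have any shape.

```python
def tall_membership(height_cm: float, low: float = 160.0, high: float = 180.0) -> float:
    """Degree (0 to 1) to which a height matches the 'tall' template.

    Assumed shape: 0 at or below `low`, 1 at or above `high`,
    and a straight-line ramp in between.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


for height in (160.0, 170.2, 180.0):
    print(height, round(tall_membership(height), 2))
```

With these assumed endpoints, the woman at 170.2 cm is about a half match to the template: neither "tall" nor "not tall", and no probability is involved.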

“Nowhere is the tendency more dramatic—and more pernicious—than in the practice of null hypothesis significance testing (NHST), based on p values…”

Clearly this dude doesn’t work for a tech company or crime analytics company, where probability prediction problems are turned into utterly bogus “classification” problems every day and then used every day to determine whether people get loans, go out on bail, or get involuntarily terminated from their jobs.
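
A toy sketch of what gets lost, with everything hypothetical: the 0.5 cutoff, the predicted probabilities, and the dollar amounts are invented for illustration and do not describe any real lender or model.

```python
def classify(p_default: float, cutoff: float = 0.5) -> str:
    """Collapse a predicted default probability into a hard label."""
    return "deny" if p_default >= cutoff else "approve"


def expected_loss_of_approving(
    p_default: float, loss_if_default: float, gain_if_repaid: float
) -> float:
    """Expected cost of approving: default losses minus expected repayment gains."""
    return p_default * loss_if_default - (1.0 - p_default) * gain_if_repaid


# Hypothetical predicted probabilities of default for three applicants.
applicants = {"A": 0.49, "B": 0.51, "C": 0.99}

for name, p in applicants.items():
    label = classify(p)
    loss = expected_loss_of_approving(p, loss_if_default=1000.0, gain_if_repaid=200.0)
    print(name, p, label, round(loss, 1))
```

Applicants A and B are nearly indistinguishable yet receive opposite labels, while B and C receive the same label despite very different predicted risk. The probability, combined with explicit costs, carries information that the label throws away.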

This is an excellent point. If you could direct me to some specific examples of how this plays out, I’d love to use them in my teaching. Talk about “pernicious.” (And your inference was correct: this “dude” is a neurobiologist.)

»Humans are natural born categorizers.«

What I love about this statement is how perfectly self-referential it is.