
“When Should Clinicians Act on Non–Statistically Significant Results From Clinical Trials?”

Javier Benítez pointed me to this JAMA article by Paul Young, Christopher Nickson, and Anders Perner, “When Should Clinicians Act on Non–Statistically Significant Results From Clinical Trials?”, which begins:

Understanding whether the results of a randomized clinical trial (RCT) are clinically actionable is challenging. Reporting standards adopted by JAMA and other leading journals lead to relative uniformity of presentation of RCT findings that help simplify critical appraisal. Such uniform reporting also means that the conclusion of the trial may be dichotomized as “positive” or “no difference” based on the statistical significance of the primary outcome. Dichotomization based on the statistical significance of the primary outcome variable reflects the correct, albeit narrow, interpretation of the experiment that the RCT represents. It also reflects decisions made by the investigators in the design of the study and highlights findings in relation to prespecified assumptions. However, there are situations in which a broader appreciation of the results may suggest that non–statistically significant results in the primary outcome of a clinical trial could influence and perhaps change practice. This includes consideration of the outcome in terms of effect size and accompanying CIs, placing the findings from the trial in the context of the totality of the existing relevant evidence.

My reaction: This article is not nearly radical enough for me.

For example, they write, “Dichotomization based on the statistical significance of the primary outcome variable reflects the correct, albeit narrow, interpretation of the experiment that the RCT represents.”

Huh? In what way is the dichotomization “correct”? This is pseudo-rigor. Null hypothesis significance testing has the aura of rigor, thus the unexamined assumption that it is in some sense correct. The so-called statistical significance is based on a conditional probability, Pr(T(y_rep) > T(y) | H0). I can’t see any world in which a decision based on this number is a “correct interpretation of an experiment.” At best, it’s a conventional interpretation.
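To make that conditional probability concrete, here is a toy Monte Carlo sketch. Everything in it (the data-generating numbers, the null model, the choice of the sample mean as test statistic) is hypothetical and for illustration only; it just shows that Pr(T(y_rep) > T(y) | H0) is a tail probability computed entirely under an assumed null model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: n = 50 measurements; test statistic T = sample mean.
y = rng.normal(loc=0.3, scale=1.0, size=50)
t_obs = y.mean()

# Null hypothesis H0: each observation ~ Normal(0, 1). Simulate replicated
# datasets y_rep under H0 and estimate Pr(T(y_rep) > T(y) | H0) by Monte Carlo.
t_rep = rng.normal(0.0, 1.0, size=(100_000, 50)).mean(axis=1)
p_value = (t_rep > t_obs).mean()

print(p_value)  # one-sided p-value: tail area under the null distribution of T
```

Note that nothing in this computation refers to the effect size one cares about, the costs of decisions, or any other evidence; it is purely a statement about where the observed statistic falls in a hypothetical null reference distribution.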

I cced the usual suspects, and they responded:

Sander Greenland:

I’m sure we all agree with your comment. But for it to appear in JAMA is remarkable, given that JAMA has been the leading bastion of the superstition that the magical alpha=0.05 demarcates no association from association (in the form of whether the 95% CI includes the null). Now if they’d only understand how alpha and beta are supposed to be set based on costs of Type-I and Type-II errors.:)

I don’t think it makes much sense to talk about the “costs of Type-I and Type-II errors.” The cost of the error depends on how large the effect is and on what decision you might make. I’d rather just go straight to talking about costs and benefits of actual decisions, and I think that 2 x 2 table of decisions and errors is more of a hindrance than a help in setting this up.

Valentin Amrhein:

How can they start their piece with sentences that could be understood as if “no difference” based on statistical nonsignificance reflects the correct interpretation? If they write such a paper, they should be aware of the literature of the last 100 years saying that is incorrect. I could imagine though that the editors pushed them to start with some positive remarks about the status quo. I agree the rest of the paper is remarkable for a journal like JAMA.

To underline what I just wrote with respect to the JAMA editors maybe pushing the authors to be less radical in their written text: In the audio interview on the JAMA homepage, lead author Paul J. Young says:

“The notion that one dichotomizes clinical trials into either being positive or negative / neutral based on the statistical significance of the primary outcome variable is clearly an incorrect notion.” (minute -26:45)

“An individual RCT just forms part of the evidence base. And so if one conducts an individual RCT and finds a non-statistically-significant difference, but in the context of all of the other evidence there’s overwhelming suggestion that there is benefit or harm, then looking at that trial in isolation makes no sense.” (minute -23:50)

That sounds good! They should have written it in the paper.

Yes. Let’s move away from the paradigm of “causal identification + statistical significance = discovery.”

Blake McShane:

Yes that “correct, albeit narrow, interpretation” comment made me bristle. So too did the first FLASH example (although I might retract that if I knew anything about the prior HES evidence referenced; it might really sing as an example if I did). I did not really get much from the second ANDROMEDA-SHOCK example, although the p=0.06 made me chuckle (and I suspect the 0.06 may have been the reason it was chosen). However, I thought starting with the third COACT example going forward, the piece started to really hum along.

Sander, I don’t know about Type I and Type II errors, but how about if they’d understand about variation in treatment effects ;-)?

In response, Greenland points to this recent article, by Richard Riley et al., “Individual participant data meta-analysis to examine interactions between treatment effect and participant-level covariates: Statistical recommendations for conduct and planning.”

Sander followed up with further thoughts on Type I and Type II errors:

Prologue – I [Sander] am not here to defend NP hypothesis testing (which is often conflated with NHST, but isn’t nullistic) and I’m no fan of NP theory in general; it was just the first one I learned, and I got it from its semi-namesake. I think it’s an extraordinarily opaque if occasionally useful tool that the stat community vastly overembraced and oversold for practice far beyond what is appropriate. That helped give frequentism a bad name among some Bayesians; the disgust even created Bayesian converts (e.g., I think Frank Harrell). So, just as an aside, I’d say it’s pretty easy to see it as a perfectly reasonable decision procedure under some very narrow circumstances where Neyman’s behavioral set-up applies; it’s just that exactly none of the problem areas we deal with come close to that set-up. And the pretense that it applies widely has led to a travesty of statistical reporting as seen (for example) in JAMA.

On the other hand, I am a fan of neo-Fisherian (NF) ideas – at least in their information-theoretic incarnation – and thus I could be labeled semi-frequentist within the awfully narrow scope of common statistics teaching (semi-Bayesian being my other half within that scope). I think NF ideas prevented some frequentists from Bayesian conversion (e.g., I think Wasserman, Efron, and Cox) and motivated others to seek frequentist justifications of Bayesian procedures rather than slip into pure Bayes (e.g., I think Bayarri & Berger). Thus, my reply will focus on what has been at the core of rejection of Andrew’s work by a coterie of notable theoretical statisticians (as in the PPP controversy)…

Now, I think framing that foundation as a matter of Type-1 error control (as Neyman did and Mayo does) is very bad, since it leads right back into the dichotomania that characterizes the NP system (really Neymanian, as E. Pearson backed off it quite a bit in his later years and his father Karl never bought it).

Nonetheless, to say “Type-I error is controlled at all levels” is the same as saying the random P corresponding to the observed (Fisher) p is uniformly distributed (or, if going “conservative”, that P is dominated by a uniform variate; but I will set that aside because once Type-2 error is considered we’ll end up getting as close to uniform as possible under the test hypothesis). I presume you can see that equivalence (the proof is maybe two lines, depending on notation), because…

This means that without uniformity, inference is miscalibrated in every precise frequentist sense. That is so even if (as I do) I reject Neymanian testing and instead endorse an informationalist vision of frequentism, in which statistics are supposed to be summarizing the information about models for how the data were generated. In this task, uniformity is needed to interpret p as showing the percentile at which the test statistic fell in the distribution implied by the sampling model (which includes the test hypothesis and all auxiliary assumptions about the observation-selection process). This percentile is analogous to the percentile at which one fell in a standardized admissions or achievement test (where the distribution is that of the whole population of those tested). For me, this percentile interpretation is the most direct and concrete reason to consider frequentist P-values and CI. This percentile is the entirety of the syntactic (context-free) information provided by the test statistic about the total tested model (hypothesis+assumptions), as measured precisely by its Shannon transform.
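Sander’s equivalence is easy to check by simulation: when the test hypothesis (together with all the auxiliary assumptions of the sampling model) is true and the test statistic is continuous, the random P is uniform on (0, 1), which is exactly what makes the percentile reading of p legitimate. A minimal sketch, with an arbitrary one-sample t-test standing in for the generic test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate many datasets under a true test hypothesis (mean really is 0)
# and compute a p-value for each. Under the tested model, P ~ Uniform(0, 1).
pvals = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, size=30), popmean=0.0).pvalue
    for _ in range(5000)
])

# Check approximate uniformity: each decile should hold about 10% of p-values.
counts, _ = np.histogram(pvals, bins=10, range=(0.0, 1.0))
print(counts / len(pvals))  # each entry should be close to 0.10
```

This is the calibration that the percentile interpretation relies on: observing p = 0.20 means the test statistic landed at the 80th percentile of the reference distribution, a reading that is meaningless if P is not uniform under the tested model.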

That percentile is all there is that is correct without a lot of side conditions, unlike subtly false verbal descriptions like “p tells us whether a given pattern in data could easily occur by chance alone.” Those subtly deficient descriptions tempt instructors as well as researchers far beyond such bare information, and degrade quickly over repetitions (“alone” being the first to go) into blatantly false descriptions like “whether a given pattern could have occurred by chance” (of course it could, as long as it’s possible under the test distribution). They also hide the fact that the reference distribution is just that: A reference distribution, nothing more, usually wildly hypothetical in our work because most of those auxiliary assumptions are wildly uncertain.

Chance is but one of many sources of uncertainty in our analyses and those traditional descriptions (which everyone from Gelman and Greenland to Berger and Robins have trotted out at times) focus all attention on chance alone. But the humble P-value p is something much less, and better than nothing only to the extent it is uniform under the reference distribution. A P-value is just where the model scored according to one test of its fit to the data. Any contextual elements have to be embedded in the model as they are ignored in the computation of this percentile score. When we “unroll” this score into the additive-information (negative-log) Shannon scale, it becomes a distance from the test model to the data, one that stands regardless of how many other models are tested (a fact to which rabid multiple-comparisons adjusters are cognitively blind, just as rabid anti-adjusters are blind to the need to include model selection in the reference distribution).
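The “Shannon transform” Sander refers to is the surprisal or s-value, s = -log2(p), which re-expresses the percentile score in bits of information against the tested model. A tiny illustration:

```python
import math

# The s-value ("Shannon transform") of a P-value: s = -log2(p), the
# information against the tested model measured in bits.
for p in (0.5, 0.05, 0.005):
    s = -math.log2(p)
    print(f"p = {p}: s = {s:.2f} bits")
```

On this scale p = 0.05 carries about 4.3 bits, i.e., the data are no more surprising under the tested model than getting all heads in roughly four tosses of a fair coin, which is one way to see how modest a degree of evidence the conventional threshold represents. And because -log2 turns products into sums, the scale is additive, which is what makes the “distance that stands regardless of how many other models are tested” reading natural.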

It follows that, in my view, to reject uniformity or Type-1 error control defines the rejector as anti-frequentist, no matter what they say. That would include many who call themselves frequentist-Bayesians or calibrated Bayesians: no uniformity, no frequency interpretation. I say that not because I favor framing this all in terms of types of error, but because to claim error control under the tested model is logically equivalent to the information description that I prefer. I see the latter description as honestly limited and properly abstract, with no pretense of separating real-world forces and problems about which we have overwhelming uncertainty, like chance and procedural shortcomings of data collection. A true frequentist-Bayesian will incorporate this kind of calibration into statistics based on compound (hierarchical) sampling models, as done in frequentist random-effects and empirical-Bayes modeling.

OK, I expect at this point you’ve had enough, but as a lengthy coda I’ll add:

a) Although I always hope it is implicit, I welcome questioning and counterarguments if you can tell me what I’m missing.

b) In some ways NP connects more closely (via Wald’s theorems) to Bayesian decision theory, whereas NF connects more (very) closely to reference-Bayes estimation theory. That’s another long essay, since statistical decision theory adds in the complexity of loss functions/utilities.

c) In discussing data information I’ve been ignoring the subtleties distinguishing discrete and continuous data as I don’t see them as more than a math sidetrack about using continuous data models to approximate discrete data – all real data are discrete and hence so are statistics (the issue for parameters is far more subtle as most of our contexts provide no natural discretization for parameter scales, and this is one reason why the Bayesian version of information stats is more difficult).

d) I emphasize that the issue I’m addressing about uniformity of P and frequentism is distinct from issues of dichotomania and nullism, but very close to the problem of model reification, which we only begin to discuss by noting that “all models are false, but some are useful.” OK, another long essay, which starts with:

“Just because we test a model doesn’t mean that we should have taken it seriously to begin with; and just because it passes statistical tests doesn’t mean we should take it seriously in the end.”

– I’m guessing that statement will be obvious to you, and further, that you will join me in lamenting how it is nonetheless ignored in the vast majority of stat teaching and published research applications (in our fields).

We have something like 500 posts a year, so it’s only fair for me to let someone else rant from time to time.
