
Wow, just wow. If you think Psychological Science was bad in the 2010-2015 era, you can’t imagine how bad it was back in 1999

Shane Frederick points us to this article from 1999, “Stereotype susceptibility: Identity salience and shifts in quantitative performance,” about which he writes:

This is one of the worst papers ever published in Psych Science (which is a big claim, I recognize). It is old, but really worth a look if you have never read it. It’s famous (like 1400 citations). And, mercifully, only 3 pages long.

I [Frederick] assign the paper to students each year to review. They almost all review it glowingly (i.e., uncritically).

That continues to surprise and disappoint me, but I don’t know if they think they are supposed to (a politeness norm that actually hurts them given that I’m the evaluator) or if they just lack the skills to “do” anything with the data and/or the many silly things reported in the paper? Both?

I took a look at this paper and, yeah, it’s bad. Their design doesn’t seem so bad (low sample size aside):

Forty-six Asian-American female undergraduates were run individually in a laboratory session. First, an experimenter blind to the manipulation asked them to fill out the appropriate manipulation questionnaire. In the female-identity-salient condition, participants (n = 14) were asked [some questions regarding living on single-sex or mixed floors in the dorm]. In the Asian-identity-salient condition, participants (n = 16) were asked [some questions about foreign languages and immigration]. In the control condition, participants (n = 16) were asked [various neutral questions]. After the questionnaire, participants were given a quantitative test that consisted of 12 math questions . . .

The main dependent variable was accuracy, which was the number of mathematical questions a participant answered correctly divided by the number of questions that the participant attempted to answer.

And here were the key results:

Participants in the Asian-identity-salient condition answered an average of 54% of the questions that they attempted correctly, participants in the control condition answered an average of 49% correctly, and participants in the female-identity-salient condition answered an average of 43% correctly. A linear contrast analysis testing our prediction that participants in the Asian-identity-salient condition scored the highest, participants in the control condition scored in the middle, and participants in the female-identity-salient condition scored the lowest revealed that this pattern was significant, t(43) = 1.86, p < .05, r = .27. . . .

The first thing you might notice is that a t-score of 1.86 is not usually associated with “p < .05”—in standard practice you’d need the t-score to be at least 1.96 to get that level of statistical significance—but that’s really the least of our worries here. If you read through the paper, you’ll see lots and lots of researcher degrees of freedom, and also lots of comparisons of statistical significance to non-significance, which is a mistake—even more so here, given that they’re giving themselves license to decide on an ad hoc basis whether to count each particular comparison as “significant” (t = 1.86), “the same, albeit less statistically significant” (t = 0.89), or “no significant differences” (they don’t give the t or F score on this one). This is perhaps the first time I’ve ever seen a t score less than 1 included in the nearly-statistically-significant category. This is stone-cold Calvinball, of which it’s been said, “There is only one permanent rule in Calvinball: players cannot play it the same way twice.”
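The arithmetic here is easy to check yourself. Here’s a minimal sketch (mine, not from the paper or the post; a one-line call to scipy.stats.t would do the same job) that Monte Carlo-estimates the tail probabilities of a t distribution with 43 degrees of freedom using only the standard library. The point: t = 1.86 squeaks under .05 only as a one-sided test.

```python
import math
import random

# A t(df) variate is Z / sqrt(V / df), where Z is standard normal
# and V is chi-squared with df degrees of freedom (sum of df squared
# standard normals).
def t_variate(df, rng):
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(v / df)

def tail_probs(t_obs, df, n=200_000, seed=1):
    """Monte Carlo estimate of one- and two-sided p-values for t_obs."""
    rng = random.Random(seed)
    draws = [t_variate(df, rng) for _ in range(n)]
    one_sided = sum(t > t_obs for t in draws) / n
    two_sided = sum(abs(t) > t_obs for t in draws) / n
    return one_sided, two_sided

one_sided, two_sided = tail_probs(1.86, 43)
print(f"one-sided p ~ {one_sided:.3f}")  # about 0.035: under .05
print(f"two-sided p ~ {two_sided:.3f}")  # about 0.070: not under .05
```

So the paper’s “p < .05” holds only if you read the contrast as a one-tailed test, which the paper doesn’t flag; the conventional two-sided p-value is around .07.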

Here’s the final sentence of the paper:

The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures.

Huh? They could’ve saved themselves a few bucks and not run any people at all in the study, just rolled some dice 46 times and come up with some stories.

But the authors were from Harvard. I guess you can get away with lots of things if you’re from Harvard.

Why do we say this paper is so bad?

There’s no reason to suspect the authors are bad people, and there’s no reason to think that the hypothesis they’re testing is wrong. If they could do a careful replication study with a few thousand students at multiple universities, the results could very well turn out to be consistent with their theories. Except for the narrow ambit of the study and the strong generalizations made from just two small groups of students, the design seems reasonable. I assume the experiments were described accurately, the data are real, and there were no pizzagate-style shenanigans going on.

But that’s my point. This paper is notably bad because nothing about it is notable. It’s everyday bad science, performed by researchers at a top university, supported by national research grants, published in a top journal, cited 1069 times when I last checked—and with conclusions that are unsupported by the data. (As I often say, if the theory is so great that it stands on its own, fine: just present the theory and perhaps some preliminary data representing a pilot study, but don’t do the mathematical equivalent of flipping a bunch of coins and then using the pattern of heads and tails to tell a story.)

Routine bad science using routine bad methods, the kind that fools Harvard scholars, journal reviewers, and 1600 or so later researchers.

From a scientific standpoint, things like pizzagate or that Cornell ESP study or that voodoo doll study (really) or Why We Sleep or beauty and sex ratio or ovulation and voting or air rage or himmicanes or ages ending in 9 or the critical positivity ratio or the collected works of Michael Lacour—these miss the point, as each of these stories has some special notable feature that makes it newsworthy. Each has an interesting story, but from a scientific standpoint each of these cases is boring, involving some ridiculous theory or some implausible overreach or some flat-out scientific misconduct.

The case described above, though, is fascinating in its utter ordinariness. Scientists just going about their job. Cargo cult at its purest, the blind peer-reviewing and citing the blind.

I guess the Platonic ideal of this would be a paper publishing two studies with two participants each, and still managing to squeeze out some claims of statistical significance. But two studies with N=46 and N=19—that’s pretty close to the no-data ideal.

Again, I’m sure these researchers were doing their best to apply the statistical tools they learned—and I can only assume that they took publication in this top journal as a signal that they were doing things right. Don’t hate the player, hate the game.

P.S. One more thing. I can see the temptation to say something nice about this paper. It’s on an important topic, their results are statistically significant in some way, three referees and a journal editor thought it was worth publishing in a top journal . . . how can we be so quick to dismiss it?

The short answer is that the methods used in this paper are the same methods used to prove that Cornell students have ESP, or that beautiful people have more girls, or embodied cognition, or all sorts of other silly things that the experts used to tell us we “have no choice but to accept that the major conclusions of these studies are true.”

To say that the statistical methods in this paper are worse than useless (useless would be making no claims at all; worse than useless is fooling yourself and others into believing strong and baseless claims) does not mean that the substantive theories in the paper are wrong. What it means is that the paper is providing no real evidence for its theories. Recall the all-important distinction between truth and evidence. Also recall the social pressure to say nice things, the attitude that by default we should believe a published or publicized study.

No. This can’t be the way to do science: coming up with theories and then purportedly testing them by coming up with random numbers and weaving a story based on statistical significance. It’s bad when this approach is used on purpose (“p-hacking”) and it’s bad when done in good faith. Not morally bad, just bad science, not a good way of learning about external reality.
