My thoughts on “What’s Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers”

Chetan Chawla and Asher Meir point us to this post by Alvaro de Menard, who writes:

Over the past year, I [Menard] have skimmed through 2578 social science papers, spending about 2.5 minutes on each one.

What a great beginning! I can relate to this . . . indeed, it roughly describes my experience as a referee for journal articles during the past year!

Menard continues:

This was due to my participation in Replication Markets, a part of DARPA’s SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out 10 survey rounds and 6 out of 10 market rounds.

The studies were sourced from all social sciences disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 . . .

Then he waxes poetic:

Actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers. As you walk up to the diving platform, the deformed attendant hands you a pair of flippers. Noticing your reticence, he gives a subtle nod as if to say: “come on then, jump in”.

To this, I’d like to add the distress I feel when I see bad research reflexively defended, not just by the authors of the offending papers and by Association for Psychological Science bureaucrats, but also by celebrity academics such as Steven Pinker and Cass Sunstein.

There’s also just the exhausting aspect of all of this. You’ll see a paper that has obvious problems, enough that there’s really no reason at all to take it seriously, but it will take lots and lots of work to explain this to people who are committed to the method in question, and that’s related to the publication/criticism asymmetry we’ve discussed before. So, yeah, I imagine there’s something satisfying about reading 2578 social science papers and just seeing the problems right away, with no need for long arguments about why some particular interaction or regression discontinuity doesn’t really imply what the authors are claiming.

Also, I share Menard’s frustration with institutions:

Journals and universities certainly can’t blame the incentives when they stand behind fraudsters to the bitter end. Paolo Macchiarini “left a trail of dead patients” but was protected for years by his university. Andrew Wakefield’s famously fraudulent autism-MMR study took 12 years to retract. Even when the author of a paper admits the results were entirely based on an error, journals still won’t retract.

Recently we discussed the University of California’s problem with that sleep researcher. And then just search this blog for Lancet.

But there are a few places where I disagree with Menard. First, I think he's naive about sample sizes. Regarding "Marketing/Management," he writes:

In their current state these are a bit of a joke, but I don’t think there’s anything fundamentally wrong with them. Sure, some of the variables they use are a bit fluffy, and of course there’s a lack of theory. But the things they study are a good fit for RCTs, and if they just quintupled their sample sizes they would see massive improvements.

The problem here is the implicit assumption that there's some general treatment effect of interest. Elsewhere Menard criticizes studies that demonstrate the obvious ("Homeless students have lower test scores, parent wealth predicts their children's wealth, that sort of thing"), but the trouble is that effects that are not obvious will vary. A certain trick in marketing might increase sales in some settings and decrease them in others. That's fine; it just pushes the problem back one step, so that the point is not to demonstrate an effect but rather to find out the conditions under which it's large and positive and the conditions under which it's large and negative. My point here is just that quintupling the sample size won't do much in the absence of theory and understanding. Consider the article discussed here, for example. Would quintupling its sample size do much to help? I don't think so. This is covered in the "Lack of Theory" section of Menard's post, so I don't see why he says he doesn't think there's "anything fundamentally wrong" with those papers.
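To make that point concrete, here's a minimal simulation sketch. It's my own toy illustration with made-up numbers (settings, effect sizes, noise levels are all invented, not taken from Menard's post or from any of the papers he reviewed): if the true effect is +1 in half the settings and -1 in the other half, quintupling the per-setting sample size just gives a more precise estimate of an average effect near zero, which is not the quantity anyone cares about.

```python
# Toy sketch of heterogeneous treatment effects: a pooled RCT estimate
# converges on an average that hides the sign flips across settings.
# All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def pooled_estimate(n_per_setting):
    # 10 settings: the "true" effect is +1 in half of them, -1 in the other half
    effects = np.array([+1.0] * 5 + [-1.0] * 5)
    diffs = []
    for eff in effects:
        control = rng.normal(0.0, 5.0, n_per_setting)
        treated = rng.normal(eff, 5.0, n_per_setting)
        diffs.append(treated.mean() - control.mean())
    return np.mean(diffs)  # the pooled "average treatment effect"

for n in (100, 500):  # quintupling the per-setting sample size
    est = np.mean([pooled_estimate(n) for _ in range(200)])
    print(f"n per setting = {n}: average pooled estimate is about {est:.3f}")
# Both averages sit near zero: more data sharpens the estimate of an
# average that doesn't tell you where the effect is positive or negative.
```

The bigger sample buys precision, but precision about the wrong question; to learn which conditions push the effect one way or the other, you need to model the settings, which is exactly the theory that's missing.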

I also don’t see how Menard reconciles his statement that “They [researchers] Know What They’re Doing” when they do bad statistics with his later claim (in the context of a discussion of possible political biases) that “the vast majority of work is done in good faith.”

Here’s what I think is happening. I think that the vast majority of researchers think that they have strong theories, and they think their theories are “true” (whatever that means). It’s tricky, though: they think their theories are not only true but commonsensical (see item 2 of Rolf Zwaan’s satirical research guidelines), but at the same time they find themselves pleasantly stunned by the specifics of what they find. Yes, they do a lot of conscious p-hacking, but this does not seem like cheating to them. Rather, these researchers have the attitude that research methods and statistics are a bunch of hoops to jump through, some paperwork along the lines of the text formatting you need for a grant submission or the forms you need to submit to the IRB. From these researchers’ standpoint, they already know the truth; they just need to do the experiment to prove it. They’re acting in good faith on their terms but not on our terms.
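As a concrete illustration of how those "hoops" play out statistically, here is a toy sketch of my own (not a description of any particular study or of Menard's analysis): a researcher who tests several outcomes on pure noise and reports whichever one clears p < 0.05 will get a "finding" far more often than 5% of the time, with no fraud anywhere in the process.

```python
# Toy sketch: multiple outcomes tested on pure noise, report the best p-value.
# Illustrates why "run the study to confirm what you already know" inflates
# false positives even when every individual step feels honest.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_outcomes, n = 2000, 5, 50
hits = 0
for _ in range(n_sims):
    # treatment and control groups with NO true effect on any outcome
    treated = rng.normal(size=(n, n_outcomes))
    control = rng.normal(size=(n, n_outcomes))
    pvals = [stats.ttest_ind(treated[:, j], control[:, j]).pvalue
             for j in range(n_outcomes)]
    hits += min(pvals) < 0.05
print(f"Share of null studies with at least one p < 0.05: {hits / n_sims:.2f}")
# With 5 independent outcomes this is roughly 1 - 0.95**5, about 0.23, not 0.05.
```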

Menard also says, “Nobody actually benefits from the present state of affairs”—but of course some people do benefit.

And of course I disagree with his recommendation, “Increase sample sizes and lower the significance threshold to .005.” We’ve discussed this one before, for example here and here.

In short, I agree with many of Menard’s attitudes, and it’s always good to have another ranter in the room, but I think he’s still trapped in a conventional hypothesis-testing framework.
