Rik de Kort writes:

This morning I stumbled across a very interesting blog post, dissecting some drama related to a non-replicating paper in computer science land. The question the paper tries to answer is whether some programming languages are more error-prone than others. For a paper in computer science I would expect all the data and code to be open to begin with, but they weren’t! As it turns out, once people got their hands on the data and code, they found severe flaws and wrote a rebuttal paper. And then there was a rebuttal rebuttal, etc.

I think this story is worth sharing for the following reasons:

– It’s a very recognizable example from outside the social sciences.

– There is some indication the conclusions might be true even if the methodology to reach them was totally flawed.

It’s interesting. One thing, though. The linked post, by Hillel Wayne, starts off:

I love science. Not the “space is beautiful” faux-science, but the process of doing science. Hours spent calibrating equipment, cross-checking cites of cites of cites, tedious arguments about p-values and prediction intervals, all the stuff that makes science Go.

“Tedious arguments about p-values . . .”: sure, that *can* be part of the process of doing science, but it doesn’t have to be! There’s a whole world of serious science where the statistics goes straight to the scientific questions, and you don’t have to argue about p-values at all. Just for example see here or here or a zillion other applied papers here. I’m not saying you can’t do good science using p-values, nor am I saying that arguing about p-values can’t be a part of good science, nor am I trying to argue with Wayne’s personal experience doing science—I’m just saying it doesn’t have to be that way. As Don Rubin liked to say, one aspect of a good statistical method is that it allows you to spend less time talking about the statistics and more time talking about the science.

That linked post features this definition: “The P-value is the probability that you would have seen the same result if the hypothesis wasn’t true, purely due to other factors.” This is wrong. That’s fine. The author of the post is not a statistician. If I tried to describe some computer science term in my own words, without cheating and looking it up first, I’d probably get it wrong too. But I think this does add support to my idea that scientists and engineers have better things to do than argue about p-values. Sometimes it seems that the only thing more common than a practitioner misunderstanding a p-value is someone misunderstanding p-values in the process of trying to explain them to others! Later he describes the p-value as “the probability that a ‘coincidence’ could explain the data.” That’s wrong. It’s true that the p-value is a probability, and it does have something to do with coincidences, and it also has something to do with data, but that’s about it.

Also, Wayne writes:

Don’t trust anything that’s not replicated.

If an effect exists, it should be observable by anyone in similar conditions. This is one of the core foundations of empirical science: all claims should be replicated by a third party. If something doesn’t replicate, then the effect is more limited than we expected, or there was an implicit assumption in the original experiment, or somebody made a mistake somewhere.

Sometimes. Not always. There are observational sciences such as political science and economics, where you can’t just run 10 more elections or start 10 more civil wars in order to get new data. I mean, sure, maybe the appropriate response to this is to not trust economics or political science, and maybe that’s right—but that just pushes things back one step, and we have to decide how to make decisions and understand the world given science that can’t be trusted.

And one other thing. Near the end of the post is the following advice:

Be diligent with your threats to validity. Check all of your assumptions. Don’t blindly trust automation. Don’t dox other scientists. The usual.

I agree with most of this. But what’s with “Don’t dox other scientists”? First off, if doxing is a bad idea, we shouldn’t dox anyone, right? But was there any doxing involved? I scrolled through the post and found that someone linked to someone else’s Facebook account. I’m not on Facebook so I’m not at all sure about this, but isn’t a Facebook account public? I’m confused about how it is doxing to link to that. I do agree that it’s better to keep science discussions scientific rather than personal; I just feel like I’ve lost the thread somewhere along the line regarding the doxing.

But let me conclude with a point that really resonates with me.

Wayne writes:

Is this even a good idea?

We’ve just spent several thousand words on methodology to show how the original FSE paper was flawed. But was that all really necessary? Most people point to a much bigger issue: the entire idea of measuring a language’s defect rate via GitHub commits just doesn’t make any sense. How the heck do you go from “these people have more DFCs” to “this language is more buggy”? I saw this complaint raised a lot, and if we just stop there I could have skipped rereading all the papers a dozen times over.

Jan Vitek emphasizes this in his talk: if they just said “the premise is nonsense”, nobody would have listened to them. It’s only by showing that the FSE authors messed up methodologically that TOPLAS could establish “sufficient cred” to attack the premise itself. The FSE authors put a lot of effort into their paper. Responding to all that with “nuh-uh your premise is nonsense” is bad form, even if the premise is nonsense. Who is more trustworthy: the people who spent years getting their paper through peer review, or a random internet commenter who probably stopped at the abstract?

It all comes back to science as a social enterprise. We use markers of trust and integrity as heuristics on whether or not to devote time to understanding the claim itself. This is the same reason people are more likely to listen to a claim coming from a respected scientist than from a crackpot. Pointing out a potential threat to validity isn’t nearly as powerful as showing how that threat actually undermines the paper, and that’s what TOPLAS had to do to be taken seriously.

Setting aside all the technical details here, which I’m not in any position to evaluate, I think Wayne is on to something here, and it reminds me of the asymmetries between publishing new ideas and publishing criticisms, which we’ve discussed here, here, and here, among other places. That last article has the title, “It’s too hard to publish criticisms and obtain data for replication,” which pretty much says it all.

It’s indeed frustrating to see bad work, work that’s clearly bad, and to feel that the only way to criticize it is to first put in a ton of effort to figure out exactly what went wrong.

It’s like if someone was trying to sell you a perpetual motion machine and the only way you could get him off your doorstep is to figure out exactly where’s the connection to the hidden power supply, and then once you did that, other people jumped out of the bushes to say that it’s a really good perpetual motion machine and the people who created it are promising scholars and why are you being so picky what is it you don’t want to save the world with free power etc etc. Enough to make you want to drift into stream of consciousness, at the very least.

**P.S.** The post concludes with this:

I shared a first draft of this essay on my newsletter, all the way back in November. It’s pretty much the same content as my blog except weekly instead of monthly. So if you like my writing, why not subscribe?

That makes me think that I could charge people money to read my blog posts 6 months ahead of time . . .

“There are observational sciences such as political science and economics, where you can’t just run 10 more elections…maybe the appropriate response to this is to not trust economics or political science, and maybe that’s right—but that just pushes things back one step”

You don’t need to run the exact same experiment to create a “replication” in an observational science. Instead, you replicate by observing many different incidences of the same phenomena and try to match the model.

It’s not an issue of “trust”. If you only observe one incidence of a phenomenon, surely you’ll overfit your model! :) So you need to “replicate” by applying your model to multiple instances of the phenomenon, rather than by rerunning the same experiment over and over.

Probably a better word than “replicated” is “validated”. Findings that cannot be validated probably should not be called scientific.

Sure. I’ll buy that.

> You don’t need to run the exact same experiment to create a “replication” in an observational science. Instead, you replicate by observing many different incidences of the same phenomena and try to match the model.

This doesn’t qualify as a replication. The original paper will typically try to identify and fit to all incidences of the same phenomenon to begin with. For example, someone testing a hypothesis about U.S. GDP will backtest their model at each point in the US GDP timeseries to test validity, then fit the final estimates to the entire timeseries, and all of that will go in the original paper. How do you “replicate” then? Try it on a different country? Maybe the hypothesis is true in a different country and not in the United States. Trust the original paper as its own replication? What if they were noise mining, what about forking paths in the analysis, etc. It really isn’t that simple and easy.

“The original paper will typically try to identify and fit to all incidences of the same phenomenon to begin with.”

I was thinking about this yesterday after I wrote the comment, but in my mind I was thinking about elections rather than GDP. Elections are discrete events, so the obvious answer is to build the forecast from a subset of elections. I guess the answer with GDP is really the same: select a random subset of periods and work with that. If it’s a sound model shouldn’t it work with a reasonable subset of periods?

“Try it on a different country?” This is a reasonable approach.

“Maybe the hypothesis is true in a different country and not in the United States.” Or maybe not, you don’t know until you try. What’s the hypothesis? If this is some general principle of economics then it would hold true in market economies, accounting for variations in policy. But if you’re only applying it to one country then the model is implicitly dependent on policy assumptions, right? So to apply it to another country you have to make those assumptions explicit and address them.
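The subsetting idea can be sketched in a few lines. This is a toy illustration of my own, with an invented linear-trend series standing in for GDP and a hand-rolled least-squares fit; nothing here comes from the papers under discussion. A random half of the periods plays the "original study" and the held-out half plays the "replication":

```python
# Toy sketch of "replicate by subsetting periods" (invented numbers,
# not from any real analysis).
import random

random.seed(1)

# Hypothetical annual series: a true linear trend plus noise.
years = list(range(60))
series = [2.0 + 0.5 * t + random.gauss(0, 3.0) for t in years]

def fit_line(xs, ys):
    """Hand-rolled ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# A random half of the periods plays the role of the "original study"...
train = set(random.sample(years, 30))
a, b = fit_line([t for t in years if t in train],
                [series[t] for t in years if t in train])

# ...and the held-out half plays the role of the "replication".
errs = [series[t] - (a + b * t) for t in years if t not in train]
rmse = (sum(e * e for e in errs) / len(errs)) ** 0.5
print(f"fitted slope: {b:.2f} (truth: 0.50), held-out RMSE: {rmse:.2f}")
```

If the model is sound, the slope recovered from the training half should predict the held-out half about as well as it fits in-sample; a large gap between the two is the time-series analogue of a failed replication.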

Jim:

Just to clarify, I’m only using the word “trust” because that was used in the blog post that I was quoting.

Re: “I think Wayne is on to something here, and it reminds me of the asymmetries between publishing new ideas and publishing criticisms, which we’ve discussed here, here, and here, among other places.”

…someone (cough, cough) could start a journal…

Hi, author here. I just added a disclaimer saying I got the definition of p-values wrong and will aim to revise that section. One thing I’ll say in my defense, though:

“That makes me think that I could charge people money to read my blog posts 6 months ahead of time…”

I don’t charge for my newsletter. I mean _you_ still could do this, but I’m not and don’t intend to.

Hillel:

Thanks. Just to clarify, if you did charge for the newsletter, I wouldn’t think there was anything wrong with doing that. Columbia charges students to take my classes!

Is the definition of the p-value really that bad? If we take the Wikipedia definition (if that’s not correct, surely the very notion of truth crumbles :) ), we get:

“The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.”

Contrasting that with:

“The P-value is the probability that you would have seen the same result if the hypothesis wasn’t true, purely due to other factors.”

If the hypothesis is that A correlates (positively, say) with B, then the null is that it doesn’t (or correlates negatively). If we interpret “other factors” as the error term, the second seems a good faith restatement of the first. I guess technically I can spot two differences:

1) “Hypothesis is not true” is not the same as “the null hypothesis is true.” There are many other hypotheses.

2) “The same result” is different from “results at least as extreme as”.

The first seems not so relevant in the situation where the hypothesis is about the sign of a correlation (as in a regression). The second seems a bit nitpicky, given that the probability of seeing exactly a given result is usually 0.

Am I missing another distinction, or misunderstanding the importance of these points?

Yes, those are both of the errors, but I think they’re both worse than you make them. First, even trying to discuss the truth of the hypothesis, instead of the rejection of the null, is what people *want* to do instead of what NHST actually *does* when it is correctly performed. Even gigantic rejections of the null (ie tiny p values) don’t necessarily tell you much about the hypothesis tested… unfortunately.

Second, the “results more extreme than” is critical in just understanding what the p-value is, and often leads to tremendous mischief in discrete problems in which the probability of a particular result is nonzero. I’ve seen people use it in the discrete case to argue that the null is unlikely (which it is, but the discrete probability of the given result isn’t much of any proof one way or another.)
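The discrete case can be made concrete with a standard coin-flip example (toy numbers of my own, not from the thread): the point probability of the observed result and the tail probability of "at least as extreme" results can land on opposite sides of 0.05.

```python
# "At least as extreme" matters in discrete problems: 8 heads in 10
# flips of a putatively fair coin (a standard toy example).
from math import comb

n, k = 10, 8
point = comb(n, k) / 2**n                               # P(X == 8)
tail = sum(comb(n, j) for j in range(k, n + 1)) / 2**n  # P(X >= 8)

print(f"P(X == {k}) = {point:.3f}")  # 45/1024, below 0.05
print(f"P(X >= {k}) = {tail:.3f}")   # 56/1024: the one-sided p-value, above 0.05
```

Quoting the point probability as if it were the p-value would (wrongly) look "significant" here, which is exactly the mischief described above.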

Mathijs said,

“The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct, ” … “Am I missing another distinction, or misunderstanding the importance of these points?”

Indeed, something important is missing from the first quote: Namely, that the assumption (under which the p-value is calculated and conclusions are considered) also includes model assumptions. If the model assumptions don’t fit the reality, then the calculated p-value is meaningless.

Your (1) is the most important. It’s not just that “your hypothesis is not true”. It’s that A SPECIFIC NULL HYPOTHESIS *IS* TRUE.

The first *IS* very relevant in the situation where the hypothesis is about a sign of a correlation. Because the null hypothesis is something like “the data is random draws from a non-correlated bivariate soandso distribution”.

This will inevitably fail in multiple ways:

1) The data is NOT from that distribution

2) Often it’s not a good assumption for the data to be from *any* random distribution (i.e., the data generating process doesn’t meet tests of being a stationary high-complexity sequence)

Let’s try that again… there are **computational random number generators originally designed for RNG usage** which fail to meet (2) (and so are not used as RNGs). Why should “measure some people’s foo and bar measurements” meet (2)?

So basically, whatever your null hypothesis is, it’s pretty much guaranteed to be false, which will cause you to reject your data as coming from whatever your null hypothesis is… which then in the utterly flawed logic of NHST will “allow you to conclude that your favorite hypothesis has some support, or maybe is even true” [sic]

Well, we had this discussion before… I agree that no null hypothesis will be true, but this is not what testing is about. Testing is about finding out whether the data could be *compatible* with a model, i.e., a thought construct. This may well be true, meaning that the data cannot *distinguish* reality from the model. So no, the fact that the null hypothesis is not literally true will very often *not* cause the data to reject it, and that’s all fine. (Which is why some people have to try very hard to find the significances for which they hope and that could grant them publications, when the data don’t deliver these significances if treated correctly.)

And this is why I think the most useful kind of significance test is the one that doesn’t reject the hypothesis. That tells you that the tested RNG is sufficiently like the data that it is compatible with being a hypothetical data generating process.

Even though a given RNG isn’t how the data came about, it’s sufficiently compatible with the data that we don’t have evidence otherwise. That’s useful.

+1. This is related to another oft-raised point, that rejection at some significance level is essentially a referendum on sample size.
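The sample-size point is easy to see in an idealized calculation (my own sketch, using the usual normal-approximation p-value): hold a tiny true effect fixed and imagine the sample mean landing exactly on it; only n changes, yet the p-value goes from "nothing here" to "overwhelming rejection."

```python
# Idealized sketch: the p-value you would get if the sample mean hit a
# tiny true effect of 0.02 standard deviations exactly, at various n.
from math import erfc, sqrt

delta = 0.02  # a tiny, practically negligible standardized effect
p_at = {}
for n in (100, 10_000, 1_000_000):
    z = delta * sqrt(n)        # z-score if the estimate equals the truth
    p_at[n] = erfc(z / sqrt(2))  # two-sided normal p-value
    print(f"n = {n:>9,}: z = {z:6.2f}, p = {p_at[n]:.3g}")
```

Nothing about the effect changed between the rows; only the referendum on sample size did.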

> The first seems not so relevant in the situation where the hypothesis is about the sign of a correlation (as in a regression).

Beta = 1 is categorically different from beta = 100; yet, null hypothesis significance testing allows you to operate under the logic of “Beta = 0 is incompatible with the data, therefore, we reject it in favor of the alternative hypothesis where beta = 100”. Maybe under the null hypothesis beta = 1, p = 0.1. Then we can’t reject beta = 1, but beta = 1 makes for a much less dramatic and categorically different conclusion. The paper’s hypothesis is never REALLY about the sign of a regression.
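To illustrate the beta = 1 vs. beta = 100 point with numbers (invented for illustration, not from any real analysis): an estimate can decisively reject beta = 0 while remaining equally compatible with a modest effect and a much larger, categorically different one.

```python
# Rejecting beta = 0 does not pin down beta (invented numbers).
from math import erfc, sqrt

beta_hat, se = 2.5, 1.0  # hypothetical estimate and standard error

def p_value(null):
    """Two-sided normal-approximation p-value against a point null."""
    z = abs(beta_hat - null) / se
    return erfc(z / sqrt(2))

print(f"p vs beta=0: {p_value(0):.3f}")  # ~0.012: "significant!"
print(f"p vs beta=1: {p_value(1):.3f}")  # ~0.134: can't reject
print(f"p vs beta=4: {p_value(4):.3f}")  # ~0.134: can't reject this either
```

The data reject zero but are symmetric evidence for effects of very different substantive import, which is why "we reject the null, therefore our favored alternative" is a non sequitur.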

Give me a break. This is not a computer science paper; it’s a paper about the psychology or biology or whatever of programmers. Admittedly, language design is something computer scientists do. But how humans behave is not computer science. I could go on, but I won’t.

Bob76

PS. I’ve had job titles like “Professor of Computer Science” and I’ve contributed to the design of about 1.3 languages—none of which got any market share.

This comment from Bob76 is, to my mind, quite hurtful. Unfortunately, as a professor and senior member of some national and international research communities in CS, I have seen similar narrow, exclusionary views often; I think it’s one reason why our graduates produce so many computing projects which are dismally bad for the people around them. What is sometimes called “human-centered computing” including software engineering, human-computer interaction, computing education, etc are often left to separate academic departments (eg ‘i-schools’). But I believe we all benefit from a broader view of the field, that really considers the people in the systems. Fortunately, things are changing: in particular, some of the AI research community seems to have really understood that usability, fairness etc are important issues for CS to consider. The database community is moving on this too (with human-centered workshops and keynotes at the top conferences). I’m not sure yet about the programming languages community; and I fear that other subfields (such as algorithms and operating systems) have quite a way to go.

Thanks guys, for the thoughtful comments! I should have mentioned, although it should be obvious, that I am not a statistician, nor even a data scientist, but I am very curious about these issues. I learn a lot from these comments.

I just want to make sure I understand: model specification matters for the point of the many hypotheses. My suggestion that if the hypothesis is corr(A,B)>0, then the null is just corr(A,B)<=0 is not sufficient, because to assess Pr(result|null) we need more assumptions than that. In particular, we need A and B to have some distribution (with corr(A,B)<=0). If the null specifies a bad distribution, then we will find that Pr(result|null) is low for reasons that have nothing to do with the correlation between A and B. Is this a fair representation of your points?

I see the logic of this argument, but I would like to dig a little deeper to see how compelling it is. I see how by making extreme choices for the null distribution, you could get Pr(result|null) low. (My favourite: have the null distribution have a support that does not contain all the data points. That way you'll get a p-value (and an R^2, but who cares about that?) of 0. Someone should definitely try to get published with this strategy!) But I imagine that usually the null distribution will have some free parameters (mean, variance etc…) determined by the data.

Suppose that the data generating process (DGP) has corr(A,B)=0 and A and B have some super weird distribution with finite moments. Suppose the null hypothesis assumes that A and B are normal (and uncorrelated), but with the correct mean and variance. Can the p-value under this null be that different from Pr(result|DGP)? Are there theoretical estimates for this difference? Examples? I've been trying to come up with examples where this goes very wrong, but I haven't got there yet. For example, making Pr(result|DGP) large by making the DGP degenerate results in both that and the p-value being 1 (because the variance is correct in the null).

My intuition was that it should be fairly easy to see whether the null distribution fails to match the data because it is incorrect about the correlation, or because it is incorrect about other features of the distribution. And that the null would be sufficiently flexible on the second front that it would not be rejected only on that ground. (And that the phrase "other factors" is not such an unreasonable summary of that. :) )

Sorry, this was meant to be a reply to Jonathan, Martha and Daniel.

Often it’s assumed that you can say that although you don’t know what distribution your data comes from, it’s at least true that it’s IID draws from *some* distribution.

But this is already a REALLY STRONG assumption. Note that in the Frequentist testing regime, this is an assumption about the behavior of the universe. In the Bayesian conception, instead this is an assumption about *how much you know* about the universe.

Suppose that someone gives you a set of data… it’s 35 measurements of length… You graph some histogram of it, and it’s got some sort of shape with a center and some spread about it…

The person asks you what’s your best guess of what the average of the next 35 measurements will be… and you tell them it’s the mean +- the standard error…

They say thank you, and then you say “by the way what are these measurements?”.

“Oh, this is the height of my potted ficus tree that germinated a few months ago.”

With that simple sentence, all the Frequentist statistics in the world just went out the window. We all know that the ficus tree is going to grow and grow and get bigger through time. In fact not only will the average be bigger next time you take 35 measurements, but *every single measurement* will be bigger than any measurement taken so far.

The assumption that these repeated measurements are IID samples from some distribution is strongly strongly violated. And this happens much more than you might at first expect given how often such assumptions appear in stats textbooks.
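The ficus story can be sketched in a few lines (toy numbers of my own; I use bounded noise so the growth is guaranteed to dominate it). The first 35 measurements look like a perfectly nice batch with a center and a spread, but the IID-style forecast for the next batch is hopeless:

```python
# A growing plant is not an IID sample (toy numbers of my own).
import random
from statistics import mean, stdev

random.seed(7)

# Hypothetical daily heights, in cm: steady growth plus small bounded noise.
height = [10 + 1.0 * day + random.uniform(-0.4, 0.4) for day in range(70)]
first, second = height[:35], height[35:]

m = mean(first)
se = stdev(first) / len(first) ** 0.5
print(f"IID-style guess for the next batch's mean: {m:.1f} +/- {se:.1f}")
print(f"actual mean of the next 35 measurements:   {mean(second):.1f}")

# Not just the mean: *every* later measurement beats every earlier one.
print(min(second) > max(first))  # True
```

The histogram of the first batch tells you nothing about the trend, and mean plus-or-minus standard error misses the next batch by a mile.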

Daniel, Bayesians are in the same boat though. In particular, Bayesians typically assume that past and future data are exchangeable, which in practice is tantamount to making the i.i.d. assumption (which is why Bayesians and frequentists tend to use the same statistical models). It’s true that, as a Bayesian, you can always retreat to subjectivism and say that the assumptions you are making are about your beliefs rather than about the world, but if you want your inferences to say anything about the world (which I assume is most people’s goal) you do need to make the further assumption that your beliefs actually reflect the way the world is. And if that assumption is seriously violated, then the Bayesian inference goes out the window together with the frequentist one.

Good point. Actually, if you believe that any violation of i.i.d. with meaningful implications for the order of the data is possible in reality, then as a Bayesian you cannot use an exchangeability model anymore either. Donald Gillies made the point (which can probably be challenged in general terms, though I’d say it is almost always true in practice) that using a Bayesian exchangeable model actually implies something equivalent to believing that the reality is i.i.d.

Why do frequentists and Bayesians make some untenable assumptions like these? Because (a) they allow tremendous simplification, which is partly what modelling is about, and (b) statistics requires us to infer something from certain observations about other, later, newer observations, which always requires assumptions of this kind, even if at a higher level (such as about the residuals or innovations in a time-dependent model).

> you do need to make the further assumption that your beliefs actually reflect the way the world is

Actually you need to test *many* models of the world and see which ones win out. I am not aware of any body of Frequentist statistics as taught in typical undergraduate or even graduate courses for scientists that emphasizes constructing multiple meaningful models and comparing them against each other. The best you’re likely to get is some of those “robustness checks” where you specify a regression with and without some predictor to see if the other coefficients change p values dramatically.

It can be done, but it would be rare to emphasize. I feel like that’s not the case at all in Bayesian practice where an emphasis is placed on describing a generative process through scientific assumptions (rather than mathematical ones) and comparing alternative scientific assumptions is a relatively straightforward and obvious portion of building the model in the first place.

Thanks, I like the example, it’s very clean and simple. I’m not sure, though, if it is the best fit for this question. There is not really a hypothesis being tested. If the hypothesis being tested concerned the average, then, if the null is flexible in the way I described – in particular, if its mean is based on the data – then the p-value will be 1. The mean will match the data exactly under the null. If the hypothesis concerned the correlation between time and height, then the null of no correlation (whether because height is IID or otherwise uncorrelated with time) will be soundly and correctly rejected. (I don’t know much about ficusses (fici?), but I assume the thing is growing at this point.)

I am really looking for an example where the null is rejected for reasons that have nothing to do with the correlation we are interested in, even when the null is flexible (mean and variance are chosen to match the data). This seems like an interesting statistical question. I think the example at most does the opposite, it might give a situation where the null is not rejected when it should be.

(Just two sidenotes: first, I am defending (rhetorically, I expect to be wrong, but I want to understand why) a definition of p-values, not frequentism itself. I guess one is turning into the other because the problem with that definition is that it implicitly paints too rosy a picture of hypothesis testing? Second, I may not be a statistician, but I am most certainly Bayesian. :) I am a game theorist (economic theorist), we are dogmatically Bayesian. While we game theorists sometimes think about what it would be like to not be Bayesian, I think it is fair to say that we cannot conceive of frequentism.)

The point is that the “null hypothesis” includes many assumptions, not just ones about the distribution but about how observations are drawn from that distribution. People get in trouble when they forget this, whether they do hypothesis tests or Bayes. We just happen to have a longer history of people making this error in the context of NHST than in Bayes.

In any event, this error can be circumvented by directly modeling the sampling process as part of the null—including researcher intentions regarding sampling. John Kruschke has a few fun examples of this (see, e.g., https://jkkweb.sitehost.iu.edu/articles/Kruschke2010WIRES.pdf). This avoids relying on asymptotics that depend on iid assumptions that are unlikely to be true.

But an appropriately defined null hypothesis is just about setting up the analysis, not what you do with it afterwards. Speaking for myself, I sometimes care about rejecting a bad model, and this is what we can do with NHST. But I’m often more interested in understanding how uncertain I should be about a model or its parameters, and that’s something I can do with Bayes. Either way, we should worry about whether our models contain assumptions that just don’t work, but the complaints about NHST are more to do with the outcome of the analysis than the models per se.
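The sampling-intention point above can be made concrete with the classic fixed-n versus stop-at-k-failures calculation (a standard textbook illustration of the stopping-rule issue, not necessarily the example in the linked paper): the same data, 9 heads and 3 tails with a tail last, gets a different p-value depending on what the researcher intended.

```python
# Same data, different sampling intention, different p-value
# (the classic stopping-rule illustration).
from math import comb

# Intention 1: "flip exactly 12 times" (binomial sampling).
# One-sided p-value: P(X >= 9 heads out of 12) under a fair coin.
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Intention 2: "flip until the 3rd tail" (negative binomial sampling).
# P(exactly j heads before the 3rd tail) = C(j+2, 2) * (1/2)^(j+3);
# the one-sided p-value is P(9 or more heads before the 3rd tail).
p_negbin = sum(comb(j + 2, 2) / 2 ** (j + 3) for j in range(9, 400))

print(f"fixed-n p-value:         {p_binom:.4f}")  # ~0.073
print(f"stop-at-3-tails p-value: {p_negbin:.4f}") # ~0.033
```

Under one intention the result is "not significant," under the other it is, with identical flips on the table; that is exactly why the sampling process (including intentions) belongs in the null if you want the p-value to mean what you think it means.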

Daniel: Obviously it’s a fairy tale to think that frequentists should analyse any data making assumptions out of thin air without considering the context and meaning of the data. That’s just childish as a slag against frequentism. Frequentists can use growth models, you know. Ask a stupid enough Bayesian to analyse these data without context, and they will assume exchangeability (because “I don’t know about any dependence”), use a default prior, and will end up in the very same mess.

The point of the example is that often there are ways in which the frequentist conception of IID draws from a distribution is strongly violated, and it can be without us knowing about it.

The reason why that is important to remember is the way Frequentist methods are sold. They’re pretty much sold as:

1) Everything is IID draws from a distribution

2) By pure mathematical logic, some statistics calculated from data of type (1) are mathematically guaranteed to be sampled from approximately a certain distribution regardless of the distribution in (1)

3) Therefore inference about the future is mathematically guaranteed to be within some small bounds of the truth.

(2) is a correct mathematical statement about high complexity random sequences. (3) would in fact be true if (1) were true… But (1) ISN’T TRUE

The sequence (1) and (2) therefore (3) is strongly predicated on (1) being not just approximately true, but really quite true. And yet, it’s easy to think up situations such as the growing ficus tree where it’s violated.

And stuff is analyzed without properly “considering the context and meaning of the data” ALL THE TIME. This is more or less the entire field of Machine Learning. It’s also common enough in Medicine and Biology, where people simply do t-tests or run ANOVA procedures because “that’s what you do”.

Bayesian methods are sold as “if you make this assumption about what you know, then you should think that xyz is true”… This makes a “GIGO” interpretation fairly obvious. But too many people take my step (1) as simply god-given, and then my step (2) seems like magic that makes meaningless data turn into inference through mathematical attractors like the Central Limit Theorem.

Frequentist statistics can be done right. But FAR too often they’re done like this cartoon of mine.

Who sells frequentism like this? Certainly not me.

You’re right that frequentist methods are often misused in cartoonish ways. But what about Bayesians who assume exchangeability without discussing it, use default or convenience priors without exploring the implications, discussing the subject matter background (or discussing it in a single sentence regarding a single aspect of their prior so that they could claim it “encodes information”) or doing any kind of sensitivity analysis etc.?

Appropriate and correct application of statistics is no Bayes vs. frequentistm issue.

I would say that textbooks appear to sell Frequentism like this. They discuss things like CLTs all the time but it would be exceedingly rare to see them discuss whether the idea of a random process is even a meaningful model of say a medical intervention. The random process is always assumed on the front end in my experience. perhaps I’m not up to date on the latest in stats textbooks. I’d love to see an alternative.

PS: If I draw 10 applied Bayesian papers at random, I’m pretty sure that max 2 will explain in any detail how the chosen prior reflects their belief/knowledge.

Can you show me a textbook that says the things explicitly that you claim they say?

I agree that they don’t usually discuss such issues often enough or in enough depth; however, that’s different from saying “it is sold as if everything is iid draws from a distribution”.

Well, I can see what you mean, thinking a bit about how examples are usually presented. I should certainly not try to defend a majority of textbooks that I don’t like for such reasons.

The thing is, they cater for an audience that wants cookbook recipes and doesn’t want to be told that what they are taught comes with problems, requirement for conscientious thought, decisions, and responsibility. Both frequentists and Bayesians who want to teach their stuff well have a hard time reaching such an audience. I know what I’m talking about.

“…I should certainly not try to defend a majority of textbooks that I don’t like for such reasons.

The thing is, they cater for an audience that wants cookbook recipes and doesn’t want to be told that what they are taught comes with problems, requirement for conscientious thought, decisions, and responsibility. Both frequentists and Bayesians who want to teach their stuff well have a hard time reaching such an audience. I know what I’m talking about.”

Yes — so we need to choose textbooks carefully, and prepare to reach the audience we have rather than the one we’d like to have. This typically involves supplementing the textbook with additional materials that improve on what’s in the textbook. And we need to emphasize repeatedly the points that the audience doesn’t want to hear, and back them up with examples.

I’ve posted (at https://web.ma.utexas.edu/users/mks/) some materials from courses I’ve taught. See in particular the following links there for some examples/ideas:

May 2016 SSI Course

Handouts from a Summer Master’s Course on Probability and Statistics for Teachers

M358K Instructor Materials

Common Mistakes in Statistics

Well, I’ve mostly gotten rid of textbooks that I think are fairly terrible. I keep Heiberger and Holland’s Statistical Analysis and Data Display (2004) around because I think of it as “typical” of a master’s-level textbook (I happened to buy it as one of the first advanced textbooks I tried to teach myself statistics with, back in 2005 or so, shortly before finding BDA2).

After some very short initial stuff, Chapter 3 begins on page 21 with “A Brief Introduction to Probability”, in which they jump directly into probability and random variables. They then describe standard frequentist ideas about probability and random variables and datasets, and some means of plotting them (like histograms).

There is no discussion of *why* other than the intro sentences “The quality of inferences are commonly conveyed by probabilities. Therefore, before discussing inferential techniques later in this chapter, we briefly digress to discuss probability…”

They discuss “Sampling Distributions” and talk about the Central Limit Theorem, all this BEFORE getting to Section 3.6.1, “Statistical Models”, which they describe as:

“A key component of statistical analysis involves proposing a statistical model. A statistical model is a relatively simple approximation to account for complex phenomena that generate data. A statistical model consists of one or more equations involving both random variables and parameters. The random variables have stated or assumed distributions. The parameters are unknown fixed quantities. The random components of statistical models account for the inherent variability in most observed phenomena.”

So, while acknowledging the simplifications, they pretty much just state that statistical models are RNGs.
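Read literally, that definition is a recipe for a random number generator. A minimal sketch of my own (not from the book): fixed unknown parameters, an equation, and a random component with a stated distribution, and “the data” are whatever the generator emits.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "statistical model" in the textbook's sense: parameters (unknown fixed
# quantities) plus an equation plus a random component with a stated
# distribution -- i.e., literally a random number generator for data.
def model(beta0, beta1, x, sigma, rng):
    return beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)

x = np.linspace(0.0, 1.0, 50)
y = model(2.0, 3.0, x, sigma=0.5, rng=rng)

# "Inference" then just inverts the generator's assumed form:
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
```

Nothing in this framing asks whether the phenomenon being studied is plausibly described by such a generator; that question has to come from outside the formalism.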

I did buy Diggle and Chetwynd’s Statistics and Scientific Method: An Introduction for Students and Researchers (2011) for my wife to read, but I can’t find that copy. It was a textbook recommended on this blog, and I remember it being OK. However, in the introduction (which I can see via Amazon’s “look inside” feature) they say:

“It follows that the key idea in the statistical method is to understand variation in data and in particular to understand that some of the variation which we see in experimental results is predictable, or systematic, and some unpredictable, or random. Most formal treatments of statistics tend, in the authors’ opinion, to overemphasize the latter, with a consequential focus on the mathematical theory of probability…Even worse, many service courses in statistics respond to this by omitting the theory and presenting only a set of techniques and formulae, thereby reducing the subject to the status of a recipe book.”

So they’re calling out the thing that I’m claiming is common and saying that they wrote their book specifically to counter that trend.

I loved this comment:

“There are observational sciences such as political science and economics, where you can’t just run 10 more elections or start 10 more civil wars in order to get new data.”

The problem with this field, in general, is that they’re pretty much mining the Internet (incl. GitHub etc.). Endless stream of new data. Endless options & variables. Endless forking paths. So no wonder that there is no consensus.

I also wonder whether the question examined is actually worth the trouble of replications…

I’m not quite sure I agree that “It’s only by showing that the FSE authors messed up methodologically that TOPLAS could establish ‘sufficient cred’ to attack the premise itself.” Sure, you can’t *just* say that the premise of their design/method is nonsense, but you can say it and then make an argument to that effect, rationally and/or empirically. People publish papers like that all the time. Are people really so quick to reject debunking of relatively new methods on the basis that they’re unsound? Or is it just that debunking requires evidence, but the blogger thinks it ought to have been enough just to call it nonsense?