I originally gave this post the title, “Stigler: The Changing History of Robustness,” but then I was afraid nobody would read it. In the current environment of Move Fast and Break Things, not so many people care about robustness. Also, the widespread use of robustness checks to paper over brittle conclusions has given robustness a bad name.

This 2010 article by Stigler is excellent. I came across it while doing reading for a research project, and then I got to see all these cool bits:

[In a paper from 1953, George] Box wrote of the “remarkable property of ‘robustness’ to non-normality which [tests for comparing means] possess,” a property that he found was not shared by tests comparing variances. He directed his fire particularly toward Bartlett’s test, which some had suggested as a preliminary step, to check the assumption of equal variances before performing an ANOVA test of means. He summarized the results this way:

To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!

After this dissection, Bartlett’s test, much like a frog in a high school biology laboratory, was never the same again.

You go, George! I then went and read Box (1953), and it’s good stuff. It’s a funny thing, though: reading it, I get the feeling that Box was constrained—boxed in, as it were—by having to work within a hypothesis-testing framework. It’s all about testing for equality of means, testing for equality of variances, testing for normality—even though the ultimate purpose of these methods is not to test hypotheses that we know ahead of time are false, but to learn from data.

The idea of robustness is central to modern statistics: it’s all about the idea that we can use models even when their assumptions are not true—indeed, an important part of statistical theory is to develop models that work well under realistic violations of these assumptions. But, due to historical circumstances, Box was forced to develop some of those ideas within a more constricted theoretical framework.

Back to Stigler, who continues:

[In his 1960 article,] Tukey called attention to the fact that in estimating the scale parameter of a normal distribution, the sample standard deviation ceases to be more efficient than the mean deviation if you contaminate the distribution with as little as 8-tenths of a percent from a normal component with three times the standard deviation. This took most statisticians of that era by surprise . . .

0.008—that’s interesting. This also seems like an excellent homework assignment for a theoretical statistics class: ask the students to perform a simulation study evaluating the performance of these two estimators (and others) as a function of the size of the second component and its scale.
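For what it’s worth, here’s a minimal sketch of such a simulation in Python. The sample size, number of replications, and the choice of the squared coefficient of variation as the efficiency measure are all my choices for illustration, not anything from Tukey or Stigler:

```python
import numpy as np

def compare_scale_estimators(eps, n=20, reps=20000, scale=3.0, seed=0):
    """Squared coefficient of variation of two scale estimators under
    the contaminated normal (1 - eps) N(0, 1) + eps N(0, scale^2)."""
    rng = np.random.default_rng(seed)
    # Each observation comes from the wide component with probability eps.
    is_contaminated = rng.random((reps, n)) < eps
    x = rng.standard_normal((reps, n))
    x = np.where(is_contaminated, scale * x, x)
    sd = x.std(axis=1, ddof=1)  # sample standard deviation
    md = np.abs(x - x.mean(axis=1, keepdims=True)).mean(axis=1)  # mean deviation
    # The two statistics estimate different scale functionals, so compare
    # their relative variability (smaller = more stable).
    cv2 = lambda t: t.var() / t.mean() ** 2
    return cv2(sd), cv2(md)

for eps in [0.0, 0.002, 0.008, 0.05]:
    cv_sd, cv_md = compare_scale_estimators(eps)
    winner = "sd" if cv_sd < cv_md else "mean deviation"
    print(f"eps={eps:.3f}  cv2(sd)={cv_sd:.4f}  cv2(mean dev)={cv_md:.4f}  -> {winner}")
```

With these settings you can watch the ranking of the two estimators flip as eps grows; the exact crossover point depends on n and on how you measure efficiency, which is part of what makes it a good exercise.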

Stigler continues:

Any history is a product of its time; it must necessarily take the present view of the subject and look back, as if to ask, how did we get here? My 1973 account was just such a history, and it took the 1972 world of robustness as a given. Huber had brought attention to Simon Newcomb and his use in the 1880s of scale mixtures of normal distributions as ways of representing heteroscedasticity; I enlarged and extended that to other works. I noted that Newcomb had used an early version of Tukey’s sensitivity function, itself a forerunner of Hampel’s influence curve. I reviewed a series of early works to cope with outliers, and I trumpeted my discovery of Percy Daniell’s 1920 presentation of optimal and efficient weighting functions for linear functions of order statistics, and of Laplace’s 1818 asymptotic theory for least deviation estimators. I found M estimates in 1844 and trimmed means (called “discard averages”) in 1920.

I wonder *where* he found those trimmed means? I only ask because sometimes I’ve found gems in old psychometrics articles. But maybe there’s nothing special about psychometrics; maybe just about any applied field has great statistical ideas in the old literature, if you just know where to look.

And then Stigler usefully rounds out his discussion:

None of this, I [Stigler] hasten to say, is recounted to undercut the striking originality of Tukey and Huber and Hampel—to the contrary. I mean it in the spirit of Alfred North Whitehead’s famous statement that, “Everything of importance has been said before by somebody who did not discover it”; that is, to provide historical context, where one might now see that, for example, it was not the M estimates that were new in 1964, it was what Huber proved about them that was revolutionary.

But also this:

[L]east squares will remain the tool of choice unless someone concocts a robust methodology that can perform the same magic, a step that would require the suspension of the laws of mathematics.

I don’t think so! In the ten years since Stigler’s article appeared, regularization has taken over. Instead of least squares, we do lasso, or regularized logistic regression, or deep learning. Even little rstanarm uses weak but proper priors by default. Lasso and Bayes and machine learning and modern computing have moved regularization toward default status. Sure, lots of people use least squares, and I’m sure they always will. But at this point I’d call it a legacy method more than a tool of choice. And no suspension of mathematical law was required.

“Instead of least squares, we do…”

I still run a lot of ordinary regressions. It’s true that I am happy to have alternatives at my disposal — I’m writing this comment as I wait for a Stan model to finish running — but if someone said least squares is still _my_ tool of choice I wouldn’t squawk.

Just to piggyback on your comment: the thing I took from Stigler’s remarks was his emphasis on the “magic” aspect, which I take to mean the feeling that you’re getting something for nothing. With regularization or Bayes, you have to make choices and justify them, and it feels like you have to make more of an investment.

Another “magic” to least squares is that it tells you immediately whether your model is “good” (low squared error) or “bad” (high squared error). No need to do model comparisons, cross validation, or predictive checks, all of which (again) require choices and so lose their magic.

Anyway, like you I end up using most of these methods at various times depending on my goals—and am also currently waiting for a Stan run to finish!

I copied out this passage of Stigler’s on OLS for my quotation page at http://www.rasmusen.org/a/quotations.txt (a kind of page each of us should have for personal use):

“Ever since the statistical world fully grasped the nature of what Fisher created in the 1920s with the analysis of multiple regression models and the analysis of variance and covariance—ever since about 1950—we have seen what that analysis can do and seen the magic of the results it permits. The perfection of that distribution theory, the ease of assessing additional variables and partitioning sums of squares as related to potential causes—no other set of methods can touch it in these regards. Least squares is still and will remain King for these reasons—magical properties—even if for no other reason. In the United States many consumers are entranced by the magic of the new iPhone, even though they can only use it with the AT&T system, a system noted for spotty coverage—even no receivable signal at all under some conditions. But the magic available when it does work overwhelms the very real shortcomings. Just so, least squares will remain the tool of choice unless someone concocts a robust methodology that can perform the same magic, a step that would require the suspension of the laws of mathematics.”

Interesting — seems like essentially a comment on human nature.

Speaking of regularization, have you read this piece by Stigler on it?

https://projecteuclid.org/euclid.ss/1177012274

Zad:

I don’t know if I read that particular article, but I read something else that Stigler wrote on this topic around the same time, and I liked it a lot.

Re your long-standing principle in statistics, Stigler notes in that paper that the origins of shrinkage estimation are in psychometrics, in particular in Truman Kelley’s work in the 1920s:

They are discussed in the Percy Daniell article that Stigler mentions:

Daniell, P. J. (1920). Observations weighted according to order. American Journal of Mathematics, 42, 222-236.

I thought using least squares meant you were being robust!

Well, ordinary least squares, at least. It’s simple, and not as fancy as explicit maximum likelihood, and more robust than instrumental variables. Lasso and machine learning are better for some things—model-less prediction—but if you’ve got underlying theory, are they better? Aren’t they vulnerable to the sample being special, since they’re so data-driven?

Lasso and L2 regularization are equivalent to putting, respectively, a double-exponential (Laplace) or normal prior centered at zero on the parameters. So if you’re in a multidimensional setting and your theory precludes lots of covariates all having very strong influences that just happen to cancel out just right and add up to something reasonable—which I think is generally correct—regularization is better.

Lasso and ridge are perhaps more akin to empirical Bayes than to standard Bayes, since the lambda parameter in lasso/ridge (which corresponds to the standard deviation of the normal/Laplace prior in the Bayesian context) is typically determined empirically rather than from background information.
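To make the correspondence concrete, here is a small numpy sketch (simulated data and a penalty value I picked arbitrarily): the ridge estimate with penalty lambda coincides with the posterior mean of a Bayesian linear regression whose normal prior has variance sigma^2/lambda. The “empirical” step described above would then be choosing lambda from the data (e.g., by cross-validation) rather than fixing it in advance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.5, 0.0])
sigma = 1.0
y = X @ beta_true + sigma * rng.standard_normal(n)

lam = 3.0  # ridge penalty, fixed here; empirically one would tune it by CV

# Ridge estimate: minimize ||y - X b||^2 + lam * ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian posterior mean under y ~ N(X b, sigma^2 I), b ~ N(0, tau^2 I),
# with the prior variance tau^2 = sigma^2 / lam encoding the penalty
tau2 = sigma**2 / lam
beta_bayes = np.linalg.solve(X.T @ X + (sigma**2 / tau2) * np.eye(p), X.T @ y)

print(np.allclose(beta_ridge, beta_bayes))  # the two solutions coincide
```

The same identity is why shrinking lambda toward zero recovers ordinary least squares: it corresponds to letting the prior variance go to infinity.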

Olav:

What used to be called “empirical Bayes” is hierarchical Bayes. And hierarchical Bayes is standard Bayes.

Huh? Has there been a shift in terminology? I thought the distinction was that standard Bayes was “you pre-specify the priors, possibly by some elicitation process, crank the handle, and get the posterior,” whereas with empirical Bayes you are allowed to go back and change the prior afterwards, for example using cross-validation or minimizing some data-based criterion like AIC or BIC.

Indeed, but how else will “we all be Bayesians by 2020” as Lindley famously predicted?

Of course, he also said “there is no one less Bayesian than an empirical Bayesian,” so one must choose. I’m glad to see that Andrew has opted for the less doctrinaire viewpoint; if robustniks can sneak under the Bayesian tent there is hope for all of us.

> not as fancy as explicit maximum likelihood

it is in fact explicit maximum likelihood under a normal model

By “explicit” I meant, setting up a likelihood function and maximizing it. I’d say it “happens to be” maximum likelihood. But you can also justify it other ways.

sure I can see “doesn’t require tuning an optimizer” as a kind of robustness

That’s part of the magic.

Thank you for recommending the Stigler robustness article—it’s great!

The painting he uses for illustration, taken from Hansen and Sargent, is The Cheat, about some card cheats. That makes me wonder if statisticians have thought about another dimension of robustness: robustness to cheating. That is, what techniques are good if you don’t trust the analyst? (In turn, this might be subdivided into an analyst who makes up data points vs. an analyst who uses deceptive techniques.) This is the old How to Lie with Statistics, but more sophisticated. In economics, this is a major drawback of structural modeling—it gets so complicated that people are suspicious that some accidental glitch or purposeful choice along the way has created the results.

Methinks the most problematical type of “cheating” may be when the “cheater” doesn’t realize they’re really cheating — they’re just doing what they were taught, without questioning “authority.”

Or maybe not even formally “taught”. Just reusing some code from similar analyses that a colleague or someone on Github shared, changing the variable names and a few starting parameters. I mean hey, if the analysis was good enough for that guy to get four publications out of it must be right.

Here’s a non-gated version for the hoi polloi:

http://home.uchicago.edu/~lhansen/changing_robustness_stigler.pdf

Thanks for the link.

Thanks! That $100+ price tag on the article was a bit much for me.

Wait, is a regularised fit really different enough from a least squares fit not to count as least squares in this context? Even with lasso etc. there is still a lot of “assuming sub-Gaussian errors” going on.

As somebody who might have skipped an article called “Stigler: The Changing History of Robustness”, I think this is the first time I am thankful for a clickbait title.

I was a graduate student at University of Wisconsin in the late 1970s, when George Box was a force there. Like most students there, I just assumed that everybody knew Box’s work on robustness, including his work showing how non-robust Bartlett’s test was, and his use of the rowboat analogy. (George was relatively kind to Bartlett. Not so much for some authors with whom he had a disagreement.)

Also, Hampel’s work on the influence function was still new back then, as was Huber’s work, and I recall at least one of Box’s students wrote a dissertation showing that Bayesian near-equivalents existed for at least some of the M estimators.

Regarding your note above with the play on words about George being boxed in by testing: he was actually not a particular fan of testing. In fact he would make fun of people who thought that hypothesis testing was the be-all and end-all, imitating how he imagined they would talk to a client: “Well, the hypothesis test was significant at alpha equals .05. Have a nice day!”

Also, Steve Stigler was still on the UW faculty until 1977 or so!