
Advice for a Young Economist at Heart

Shoumitro Chatterjee, who sent me that paper we discussed yesterday, writes:

I [Chatterjee] recently finished my PhD in economics from Princeton and am starting as junior faculty at Penn State. I do applied work on development using observational and administrative data, and I have a few questions:

1. Is there a difference between multiple comparisons and multiple hypothesis testing?

2. Your examples of multiple comparisons are mostly from experimental settings. I use observational data in my work. Here is my concern: think of any large-scale household survey dataset, such as the Demographic and Health Surveys, the PSID, or the National Sample Surveys of India. Many papers use the same data, and different papers test different hypotheses: paper A might run a Mincer regression using the NSS, while paper B might look at the effect of neighborhood disease environment on heights. Since paper A and paper B use the same underlying dataset, are they also subject to the multiple comparisons problem? (May I also request that in your blog you sometimes mention examples from papers that use large sample surveys, for exposition.)

3. I was writing a paper with my co-author using the DHS. We had an economic model in mind, which had a few implications that we tested using the DHS. We could not reject the null, so we gave up that idea. Next, we came up with another idea (using the DHS), again with an economic model that we wanted to test, and this time it “worked.” We learn by examining the data. I know you write about this in your statistical crisis in science paper. Your recommendation is to “more fully analyze existing data” and analyze all possible comparisons. May I request you to please elaborate on this, maybe with respect to my own paper?

We were interested in the effect of economic growth on fertility, and we were using the DHS, so the main regression is fertility on economic growth with a bunch of fixed effects. Should we have looked at the effect of economic growth on things other than fertility, such as education or children’s heights? (But we were not interested in those questions.) We did explore heterogeneity in the relationship between growth and fertility and found that deep recessions had a significant relationship to fertility but booms did not. We also found that particular countries were driving the results. Finally, we also looked at how long-term growth was related to long-term changes in fertility. What else could we have done?

First-year statistics taught to econ graduate students could be much better if taught with specific examples of applied work. Is there a book or set of lecture notes you’d recommend that teaches the important concepts with examples from applied work? Please let me know.

My reply:

1. I’m not really interested in multiple comparisons or multiple hypothesis testing. To put it another way: the classical multiple comparisons question arises when you see a pattern in data and want to know whether it’s “statistically significant,” but you need to adjust for the fact that it’s only one of N possible comparisons you could’ve done. (The appropriate N is not the number of comparisons you did, it’s the number you could have done had the data been different; see further discussion of this point here.)

The whole deal with multiple comparisons is the selection problem. My solution to the selection problem is to include all possible comparisons of interest, ideally using a multilevel model (as discussed here) or, if you want to stick with classical approaches, a multiverse analysis.
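To make the partial-pooling idea concrete, here is a minimal sketch of the classic normal-normal shrinkage that a multilevel model performs, using hypothetical subgroup estimates (the numbers, group structure, and method-of-moments variance estimate are all illustrative, not from any paper discussed here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: J subgroup estimates, each with a standard error,
# as might come from running the same regression in J country subsamples.
J = 8
true_effects = rng.normal(0.0, 0.5, J)   # unknown true subgroup effects
se = np.full(J, 0.3)                     # per-subgroup standard errors
y = true_effects + rng.normal(0.0, se)   # noisy raw estimates

# Method-of-moments estimate of the between-group variance tau^2:
# Var(y) is roughly tau^2 + mean(se^2), so subtract the sampling variance.
mu = y.mean()
tau2 = max(y.var(ddof=1) - np.mean(se**2), 0.0)

# Partial pooling: each raw estimate is shrunk toward the grand mean,
# with more shrinkage for noisier estimates.
if tau2 > 0:
    w = (1 / se**2) / (1 / se**2 + 1 / tau2)   # weight on the raw estimate
else:
    w = np.zeros(J)                            # complete pooling if tau^2 = 0
pooled = w * y + (1 - w) * mu

for j in range(J):
    print(f"group {j}: raw {y[j]:+.2f} -> pooled {pooled[j]:+.2f}")
```

The point is that all J comparisons get estimated at once, and the noisiest ones are pulled hardest toward the overall mean, which is what tames the selection problem without any explicit multiplicity correction.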

2. My colleagues and I do lots of multilevel analysis with big surveys; see for example here. Regarding your question about how to think about multiple analyses of the same dataset: I think the best approach would be to conduct a single analysis looking at all the comparisons of potential interest. I’m not saying that’s easy, but I think it’s the way to go. Here’s an econ paper by Rachael Meager that does Bayesian multilevel modeling. It’s not quite what you’re asking, but I think the same principles can apply to the sorts of problems you are interested in.

3. Regarding the questions for your own research: I’m not quite sure what I would’ve done—to answer that question would require some thought! When I see a crappy paper, it’s easy for me to think of a million things I could’ve done better. It’s more of a challenge to make useful contributions to existing careful work. I have lots of confidence that our methods could make a difference, but I guess it could take some effort.

Speaking generally, here are some tips:

(a) Forget about what’s “significant” and what’s not. Make a table/graph of estimates of everything you might care about. Where an estimate is, relative to some “significance” border, is pretty much irrelevant. A statement such as, “deep recessions had a significant relationship to fertility but not booms,” can mislead. Better to just estimate all these things and accept the uncertainty that results.

(b) Consider lots of interactions. Again, though, expect that most things will not be statistically significant—remember 16—but that doesn’t mean they’re not important. Instead of thinking of your study as establishing definitive truths, think of it as a step forward in our imperfect understanding.
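Tip (a) can be as simple as reporting every estimate with its uncertainty interval instead of sorting results into “significant” and “not significant.” A minimal sketch, with entirely hypothetical comparisons and numbers:

```python
# Hypothetical (estimate, standard error) pairs for several comparisons
# from the same analysis; none of these numbers come from a real paper.
results = {
    "deep recession x fertility": (-0.042, 0.015),
    "mild recession x fertility": (-0.018, 0.014),
    "boom x fertility":           (-0.011, 0.016),
}

# Print every estimate with a +/- 2 SE interval, with no significance stars.
for name, (est, se) in results.items():
    lo, hi = est - 2 * se, est + 2 * se
    print(f"{name:30s} {est:+.3f}  [{lo:+.3f}, {hi:+.3f}]")
```

Laid out this way, a reader sees directly that the recession and boom estimates are not sharply distinguishable from each other, which is the substantive point that “significant for recessions but not booms” obscures.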

4. Finally, are there books I can recommend with examples? I like Angrist and Pischke; I like my own book with Jennifer and Aki; I think there must be a lot more out there, maybe readers can help?

Chatterjee adds:

One quick clarification.

If a data set has N variables, then the number of possible comparisons is literally N_C_2. Is your suggestion that I include all N_C_2 comparisons? Some of them might not even make economic sense ex ante. If not, then the question is how to choose. Suppose I was interested in the relationship between income and infant mortality; how would I choose what other comparisons to include? The basis of that must be some economic theory, right?

My response:

I don’t think that all these comparisons are of interest, but in any case you can implicitly estimate all of them by estimating the vector of all N effects; see the examples in this article.
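A small illustration of the point that the pairwise comparisons come for free once the full vector of effects is estimated (the effect values below are hypothetical placeholders):

```python
from itertools import combinations

# Once you have estimates for all N effects, every N_C_2 pairwise
# comparison is implied; nothing extra needs to be separately "tested".
effects = {"A": 0.10, "B": 0.25, "C": -0.05, "D": 0.15}

pairs = list(combinations(effects, 2))   # all N choose 2 comparisons
print(len(pairs))                        # 4 choose 2 = 6

for g1, g2 in pairs:
    print(f"{g1} - {g2}: {effects[g1] - effects[g2]:+.2f}")
```

Each difference inherits its uncertainty from the joint estimate of the effect vector, so “which comparisons to report” becomes a presentation question rather than a testing question.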

As to the question of what comparisons to study, yes, I’d expect this to be guided by theory. In my response above, I was assuming that you already had theoretical reasons for studying the things you were looking at.


  1. Anonymous says:

    “the appropriate N is not the number of comparisons you did, it’s the number you could have done had the data been different”

    What if you’re a geneticist and the source of the data is your own DNA? How many comparisons could you do if your DNA were different?

    • Joe Nadeau says:

      See the old Nat Genet paper by Lander and Kruglyak. Basically, it’s the number of markers (now SNPs) that you test (genotype), or could have. Actually, the number is somewhat less because of linkage disequilibrium: simply counting the number of markers assumes they provide independent information, and they don’t because of LD. For example, if two SNPs are in perfect LD, knowing one fully informs the other. But these methods are well established in GWAS. This is the ‘genetic architecture.’
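      A toy illustration of the LD point (with simulated genotypes, not real data): when two SNPs are in perfect LD, the genotype correlation matrix is rank-deficient, so counting raw markers overstates the number of independent tests.

```python
import numpy as np

rng = np.random.default_rng(1)
snp1 = rng.integers(0, 3, size=500).astype(float)   # genotypes coded 0/1/2
snp2 = snp1.copy()                                  # perfect LD with snp1
snp3 = rng.integers(0, 3, size=500).astype(float)   # independent marker

# Correlation matrix across the 3 markers; rows are variables.
G = np.corrcoef(np.vstack([snp1, snp2, snp3]))
eigvals = np.linalg.eigvalsh(G)

# 3 markers, but only 2 carry independent information, which shows up
# as one (numerically) zero eigenvalue.
n_independent = int(np.sum(eigvals > 1e-8))
print(n_independent)   # 2
```

      Published GWAS corrections (e.g., effective-number-of-tests methods) build on exactly this kind of eigenvalue structure, though the details differ.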

      The harder question involves the phenotypic architecture. If the genotype-phenotype test involves only a single trait, then there’s no problem. But if two or more traits are tested, then the relationship between the traits becomes critical: are they independent? Unfortunately, there’s no easy answer. The correlation structure can be obtained, but the methods usually assume linear relations. There are ways around this, but I’m not confident that the underlying assumptions are satisfied, and generally there’s no independent way to validate them.

      Even if one does not want to declare significance, one usually wants to say something about relative contributions. Effect sizes help. This is a neglected aspect of genotype-phenotype analysis. We obsess about getting the genotype (sequence) right, and usually worry less about aspects like which trait to measure, reliability of measurements, heterogeneity, context-dependence, non-additivity, non-linearity, etc.

  2. Dale Lehman says:

    There is a gap between Chatterjee’s question and your response, at least for me. I’ll use one of the examples from your linked paper: comparing test scores across the 50 states. I believe the kind of analysis Chatterjee is thinking of would include a fairly large number of potential predictors from each state: e.g., income levels, income distribution, ethnicity, religion, school spending, average class size, teacher salaries, etc. When you say to model and report ALL effects and interactions, it does seem infeasible to include all of these. And while two-way interactions (e.g., different effects of educational spending by state) are relatively clear, why not 3- and 4-way interactions? Perhaps the combination of religious background, ethnicity, per capita income, and state affects test scores in an interactive fashion. When there is a theoretical reason to be interested in a particular interaction effect, I understand how to model that, but when your advice is to model them all, I have trouble understanding what “all” means. The examples in your paper do not include nearly the number of variables that I believe Chatterjee (and I) think of. So, can you clarify this further?

    • Dale, shortly after Trump’s election, when all the talk was about racism and Nazis marching in NC while “the economy” was booming and unemployment was plummeting, and I was busy driving past homeless encampment after homeless encampment on my way to my wife’s office, and whatnot, I had an economic theory. My theory was that the standard measures used by economists to understand the “health” of the economy were broken, and probably had been for a long time. My theory was that the experience of everyday people was actually rather terrible, in the sense that they couldn’t really afford a decent quality of life due to the burden of excessive debt and the inflation of core costs of living, which weren’t showing up in the CPI and the other normal ways of measuring the economy that are more or less taken for granted.

      In any case, I decided to find out what I could from the data I thought were relevant, using methods I thought were the right ones. I began by collecting data… (I can tell at this point this is going to be a long saga of a comment…) I downloaded *all* of the American Community Survey, and I mean *all* of it, every single microdata entry… I discovered that my RAM chips weren’t up to it: downloading that many gigabytes, I was getting errors in the data caused by bit-flips… So I bought new RAM. I bought a multi-terabyte hard drive to store the data and to load the CSV files into MariaDB… I discovered ways to read the CSV files directly into MariaDB as if they were already tables. Soon I had every single ACS entry in SQL-queriable form.

      And then I needed more… It’s no good to have income information, which the ACS gives you, because you can only spend *after-tax* income… So I started combing the IRS website and found datasets on taxes… But I needed more. I downloaded the Consumer Expenditure Survey… all of it… Using summary data is no good, because summary data eliminates all the *spatial* information, and the *spatial* information is critical, because my hypothesis was that costs of living vary dramatically in different parts of the country, and even dramatically within small regions; South Central LA is way different from Glendale…

      Anyway, obviously I wasn’t going to be able to fit models to the entire datasets, so I came up with ways to sample the microdata that would preserve the information I needed… The Public Use Microdata Area (PUMA) was the ideal thing, since they are already designed by the census to have approximately the same number of people in them… So I started sampling things by PUMA, with just a few entries for each year for each PUMA. I believe in the end I had a sample of about 200,000 ACS microdata rows covering the whole country across 10 years or so.

      But, we still have issues… There’s spatial structure here we need to account for… and there’s time-structure we need to account for. So I started building a model…

      The model used a radial basis function (RBF) expansion to fit a cost-of-living surface to the entire continental United States, so I had to do some simple cartographic projections.

      But that wasn’t enough, because I had ACS data going back 10-15 years… so I had to put time-varying coefficients on the RBF expansion. So now I have a smooth surface over the United States that undulates in time, with a Bayesian posterior distribution on the entire surface at each point in time… It was a thing of beauty in my mind. It was like flower petals rippling in the wind (I’m working up to my Oscar speech).

      And the RBF is really just a way to get the smooth structure… stuff like “along the coast in CA is different from the Central Valley, and both of those are different from Las Vegas, and those are considerably different from either Boise, Idaho, or Pittsburgh…”

      But there are something like 15 million people who live in the greater Southern California area, and their conditions vary considerably driving just a few tens of miles…

      So the RBF expansion was just basically the structure you needed to *set the priors* on the conditions in each PUMA… remember PUMAs are around 100k people, so in SoCal alone, for example, you’ve got something like 150 PUMAs, and they vary dramatically between, say, Compton and Manhattan Beach… In Southern California you can walk through an entire PUMA on your way to meet your friend for brunch… So we expect *long tails* in the results. So I set up a model where the results in each PUMA were t-distributed relative to the smooth predicting RBF surface.
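      The structure described here can be sketched in a few lines. This is only a toy version of the idea (made-up centers, weights, bandwidth, and locations, with no fitting): a smooth RBF surface sets the regional trend, and PUMA-level values scatter around it with heavy tails.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy RBF surface over a 10 x 10 "map": centers, weights, and bandwidth
# are arbitrary stand-ins for fitted quantities.
centers = rng.uniform(0, 10, size=(5, 2))
weights = rng.normal(0, 1, size=5)
bandwidth = 3.0

def rbf_surface(xy):
    """Smooth regional cost-of-living trend at map location xy."""
    d2 = np.sum((centers - xy) ** 2, axis=1)
    return weights @ np.exp(-d2 / (2 * bandwidth**2))

# PUMA-level values: smooth trend plus a heavy-tailed local deviation,
# matching the "t-distributed relative to the RBF" idea above.
puma_xy = rng.uniform(0, 10, size=(20, 2))
trend = np.array([rbf_surface(xy) for xy in puma_xy])
local = trend + 0.5 * rng.standard_t(df=3, size=20)
```

      The real model adds time-varying coefficients and a full Bayesian treatment, but the prior-from-smooth-surface, heavy-tails-locally structure is the core of it.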

      But costs of living aren’t the same for every household. Next, we needed to account for family/household structure. The cost of living is quite different between, say, a single college-educated adult age 25 and a married couple with two master’s degrees age 39 with three kids, and different again for an immigrant family from Mexico with less than a high school education and 5 children, all of whom live practically within stone-throwing distance of each other… So I decided to model the need for housing in terms of square footage for adults and children, and caloric intake as a function of age. So it was off to the USDA to get information on food prices and the calorie content of different foods…

      And it was spectacular… And I could even sort of, kind of, fit it. Oh sure, it would compile. It would even run. Getting it into the typical set was a bit of a challenge. Stan’s warm-up algorithm could get stuck; it helped a lot to use some kind of optimization step to find initial conditions, otherwise it would take time-steps so small it didn’t go anywhere. And even though I could only devote maybe an overnight run to it and get a few hundred samples with a few tens of effective samples, it still made sense! It would produce very reasonable-looking results, with an entire distribution of the ratio of after-tax income to expected cost of living in a given PUMA, with expected cost of living calculated in terms of a fixed square footage for each family size, a fixed number of calories depending on age and household makeup, and a fixed cost of transportation depending on gas consumption and typical car fuel efficiencies varying through time, with distances traveled determined in part by population density in the region of residence…

      And it was *spectacular* and it made total sense (OK, now I feel like I understand Dan Simpson’s blog posts better), but it had a dark underbelly… It took hours to run Stan on this model, and it had divergences, and doing this work was work… I mean, real work; I was spending 6-8 hour days on this at one point, and I realized *I’m not getting paid to be a professor of economics* and furthermore *I’m not getting paid at all*, because I run a small business and I need to get clients, but I’m spending all my time being an academic… and I mean a Platonic academic: I’m doing this **because I really want to know the answer** and because I believe in the core ideas of my model of what’s wrong with the country, but I want that confirmation, that evidence…

      Furthermore, *this is totally unpublishable*, and I know it would be spectacularly spurned by the econ profession, and besides, I couldn’t care less about “publications.” I could publish this on my blog, no problem… But then it’d just be one more blog entry that no one reads. So I stopped. And the whole project is sitting there in a directory somewhere, and MariaDB still runs on my desktop computer, the entire dataset spinning spinning spinning at 5400 RPM, just waiting to die, waiting to be put out of its misery… But also calling to me… “Dan… come back… find out what’s wrong with your country… show people what the minimum basic standards of the econ profession should look like…” But it was not to be… I saw the best minds of an entire generation move to India or New Zealand because the United States was failing them. They skipped out on the student loans, they left Bumstead, Indiana to rot in a cesspool of thousands of hypodermic needles discarded in the woods behind the Walmart… they moved to the Bay Area to try to suck down the government rent created by printing 3 trillion dollars in cash and handing it over to banks and hedge funds who “invested” it in companies that spy on Americans in a way the NSA could never have dreamed of, only to sell advertising to enormous government-protected media companies that produce meaningless superhero movies…

      So, to answer the question you never asked: it’s possible, and it’s the *bare minimum* required to really understand what is going on in the economy, just like running proper weather simulations on supercomputers is the bare minimum required to produce forecasts of where hurricanes will hit land.

      I’d be happy if someone could refer me to the chapter in Angrist and Pischke where they say “here… this is how you understand what is going on in the world… start by downloading some of the hundreds of gigabytes of data that the government has been collecting for decades, and then start thinking about what people do with their lives and see if your model of what is going on in their lives helps you understand the data we’ve collected”…

      I couldn’t find it.

      But then I only looked at the Amazon “Look Inside” feature. I’m sure it’s in there somewhere.

      • Dale Lehman says:

        Daniel, well, that was an entertaining read. I have spent many hours poring over the same data sources that you mention, although I stopped far short of fitting the kind of models you did (out of lack of capability, brainwise and otherwise). I probably differ somewhat on your economic premises (in fact, I believe the US economy is mostly healthy, but I am more concerned and pessimistic about its political and social well-being), but back to the question posed in the initial post. Are you really saying that unless you model all the possible interactions, the models are no good? If that is the case, then virtually all economic research (that I am aware of) is largely worthless (it may be for other reasons, but I would not have pegged it on the failure to consider all the potential interaction effects). I believe Chatterjee is asking about this: with N variables in a model, N_C_2 interactions are possible (I think his notation was to indicate the number of ways 2 things out of N could be combined). I would go further, since there is no reason why we should be limited to 2-way interactions. So, as entertaining as your comment is (I’d support some kind of Oscar, though I’m not sure what category), how about a more direct answer to the question?

        • Dale, email me some time and we’ll talk about working on that model some more. Maybe someone who I respect who actually does economics and has a different perspective is just what that research needs…

          Anyway, how about a more direct answer… Sure.

          You can’t just “model all the interactions” because that’s insane. Yes, the motion of a gram of mass a light-year away perturbs the paths of individual molecules in a balloon so that after a few seconds they’re unrecognizable from the predictions made without that gram of mass, but we don’t include every gram of mass in Alpha Centauri in our models of bicycle mechanics. So what do we consider? The most important mechanisms by which things happen on the bicycle: brake pads have coefficients of friction; the hill you’re climbing has a slope, as calculated by imagining a line between the regions of contact of the front and back wheels; we ignore the fractal nature of the rough ground surface… The rider has a center of mass which is higher when standing on the pedals and lower when seated and hunched down; the gear ratios are whatever they are; the rider has a VO2 max and a weight and is in a certain state of cardiovascular fitness, had a certain thing for breakfast, and has been training for this ride with certain climb times in the recent past…

          So, if you want to know how neighborhood disease environments affect height of children… What are the important factors? This is a social question but more than that it’s a *biological* question, and it involves communicable diseases, immune system activity, nutrition, and bone growth.

          To think that you can understand how height comes about and the factors that influence it without having *any* idea about how the biology of growth works… without *any* mechanism in your model at all… well that’s insane. And that right there is what’s wrong with social science today, it seems to *insist* that it can understand the world without understanding *anything* about physics, biology, chemistry, even other aspects of social sciences like criminology, psychology, or politics… It isn’t sufficient to simply abstract them a little… it’s basically taken for granted that “modeling mechanism” is impossible.

          So, what would a rough mechanistic model of disease and height look like? First let’s identify some factors that are important:

          1) Which diseases are active in the region? Cholera is different from dengue fever, etc… You can become immune to viruses, but you can be chronically infected by parasites… so let’s look at the types of diseases that cause the most morbidity.

          2) What kinds of healthcare/treatments are available in the region? Places where people get diseases and then get cures are different from places where people get diseases and then become chronically ill or die… And death obviously is a big censoring issue for surveys: not too many people who are dead answer questions about their height. The same may be true for chronically ill people. So maybe there are problems with the data related to that, in addition to direct effects of healthcare on healthiness.

          3) Nutrition is important: growing bones requires calories, and apparently there are hormones released by the bones that lead children to strongly prefer sweet things, so probably glucose calories specifically are important. So the composition of the diet should matter. Also, immune system activity takes energy, so fighting infection will compete with bone growth for energy; there’s no way to get around this.

          4) Furthermore, bone growth involves a bunch of immune cell activity: osteoclasts are immune-system-derived cells that break down bone, and osteoblasts are derived from a different lineage and build up bone. But that’s kind of a bone-maintenance thing; bones mainly grow in length at growth plates. I don’t know enough about the cells involved, but it wouldn’t surprise me if immune system cells were important and excessive infections could lead to reduced growth, by essentially attracting cells away from the important growth regions to fight infection. I’d want to talk to some biologists.

          5) “Social” variables like income, education, and race will primarily affect these biological and physical factors through their effect on the choices that people make, the availability of resources, cultural biases, and so forth. They are important to understand, but not necessarily strongly predictive, and we shouldn’t expect them to always have the same effect; they will interact with the biological mechanisms, which we should understand and get data on.

          So, if you want to understand how social processes affect biological and physical processes, at least make your model include some plausible, explicit interaction between the processes, and then collect social, biological, and physical data and see whether your model makes sense of how the data fit together.

      • Terry says:

        There are a lot of economists who care very much about understanding this type of data. I suggest you contact some of them to see if any are interested in building on your foundational work. This could be a big opportunity for them. I have seen many research reputations built on understanding the data better than anyone else. And you only need one collaborator; the indifference of many others should not discourage you.

  3. Peter Dorman says:

    I’ve used the DHS, MICS, and LSMS in the past, so my interest was piqued by these questions. My sense is that multiple uses of the same data set for different purposes is not a problem. What sometimes happens is that different people will use the same data set for overlapping purposes. A useful question to consider is: how should a research team handle such a (partial) reuse? The current standard is to report the earlier use and then show why yours is “better.” But the implicit assumption of replacement is not necessarily such a good approach. I’ll leave this for others.

    As for the very relevant question of what to do about the myriad of potential comparisons, I’ll give a response based on my experience. I’ve hung out a bit with business-school types and have come to see the benefits of summarizing uncertainty in the form of a few scenarios. I gravitate toward three: low, medium, and high (for the effect of interest). Then theory comes in: you think of modeling choices that ought to be either favorable or unfavorable to the effect (which you can check independently), then clump some of the unfavorables in a “low” cluster, favorables in a “high” cluster, and throw in a cluster of mixes. You estimate all the models in each cluster and average, then report the low, medium, and high averages and use all of them. Of course, in an appendix you report everything, every step.
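    The clustering-and-averaging step is simple enough to sketch directly. The specifications and estimates below are entirely hypothetical; the point is just the mechanics of grouping model variants by their expected bias direction and averaging within each cluster:

```python
# Hypothetical effect estimates from many model specifications, grouped
# by whether their modeling choices should be unfavorable ("low"),
# mixed ("medium"), or favorable ("high") to the effect of interest.
clusters = {
    "low":    [0.8, 1.1, 0.9],
    "medium": [1.4, 1.6, 1.3],
    "high":   [2.0, 1.8, 2.2],
}

# Report the within-cluster averages as the three scenarios.
scenarios = {name: sum(vals) / len(vals) for name, vals in clusters.items()}
for name, avg in scenarios.items():
    print(f"{name}: {avg:.2f}")
```

    The full set of specification-level estimates would still go in the appendix, as described above; the three averages are just the reader-facing summary.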

    I realize some of the detail of a full/everything report is lost, but the three scenarios are easy to work with and convey to the reader (and I have written mostly for nontechnical practitioner audiences) the nature, scope and importance of the uncertainty. Whether this could ever get published in a “good” journal is another question, of course. (I’ve never bothered.)

    • Dale Lehman says:

      Scenarios are easy to communicate and a step in the right direction, but we can, and should, do much better. I’ve published 2 books on simulation modeling, and I far prefer probabilistic outcomes to discrete scenarios. One problem with the latter is that there is no way to assess how likely those scenarios are, so they get misinterpreted (just think about “worst case” and “best case” scenarios). On the other hand, some risks (e.g., black swans) are not amenable to probabilistic modeling, and in those cases scenarios may be the only or best way to handle them. But, in my experience, businesses overuse scenarios in situations where probabilistic simulations are quite possible (granted, that is still better than just using the mean or expected value).

      • Peter Dorman says:

        Yes, and there are usually ways to summarize probabilistic outcomes for nontechnical people, although they lean toward scenarios, e.g., picking out a few points on the distribution and expanding on them. In the areas I’ve worked in, however, I’ve felt the underlying probabilities were not well known. Yes, if you trust a particular modeling framework, but I guess my degree of trust was always limited. For instance, in work involving child labor, I’ve always felt that the factors measured (with error) by surveys, like household income and assets, the size and frequency of shocks, school accessibility, health, and demographics, only operated in the context of local norms and shared expectations. But subjective factors are not only harder to measure; they are also subject to change without notice. Perhaps this applies more to behavioral studies than other kinds; I don’t know.

        Maybe I should read your books though!
