
“They adjusted for three hundred confounders.”

Alexey Guzey points to this post by Scott Alexander and this research article by Elisabetta Patorno, Robert Glynn, Raisa Levin, Moa Lee, and Krista Huybrechts, and writes:

I [Guzey] am extremely skeptical of anything that relies on adjusting for confounders and have no idea what to think about this. My intuition would be that, because control variables are usually noisy, controlling for 300 confounders would just result in complete garbage, but somehow the effect they estimate ends up being 0?

My reply: It’s a disaster to adjust (I say “adjust,” not “control”) for 300 confounders with least squares or whatever; can be ok if you use a regularized method such as multilevel modeling, regularized prediction, etc. In some sense, there’s no alternative to adjusting for the 300 confounders: once they’re there, if you _don’t_ adjust for them, you’re just adjusting for them using a super-regularized adjustment that sets certain adjustments all the way to zero.
Do I believe the published result that does some particular adjustment? On that, I have no idea. What I’d like to see is a graph with the estimate on the y-axis and the amount of adjustment on the x-axis, thus getting a sense of what the adjustment is doing.
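The kind of plot described here can be sketched in a few lines. Below is a hypothetical simulation (not the study’s data or method): a ridge penalty on the confounder coefficients plays the role of “amount of adjustment,” from essentially none (huge penalty) down to full least-squares adjustment (zero penalty), and we trace the treatment estimate across that range. All numbers are made up for illustration.

```python
# Toy sketch: simulate a treatment effect with many confounders, then trace
# the treatment estimate as the amount of adjustment varies -- here via a
# ridge penalty on the confounder coefficients, from "no adjustment"
# (huge penalty) to full least-squares adjustment (zero penalty).
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 300                      # observations, confounders
X = rng.normal(size=(n, p))           # hypothetical confounders
# treatment depends on the first few confounders, which also affect y
t = (X[:, :5].sum(axis=1) + rng.normal(size=n) > 0).astype(float)
y = 0.5 * t + 0.3 * X[:, :5].sum(axis=1) + rng.normal(size=n)

def treatment_estimate(alpha):
    """Ridge-penalize the confounder coefficients but not the treatment."""
    Z = np.column_stack([t, X])
    penalty = np.diag([0.0] + [alpha] * p)    # leave treatment unpenalized
    beta = np.linalg.solve(Z.T @ Z + penalty, Z.T @ y)
    return beta[0]

for alpha in [1e6, 1e3, 1e1, 0.0]:    # decreasing penalty = more adjustment
    print(alpha, round(treatment_estimate(alpha), 3))
```

In this toy setup the fully adjusted end recovers the true effect (0.5) while the essentially unadjusted end is badly biased; plotting the estimate against the penalty gives the estimate-versus-adjustment curve described above.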

P.S. In googling, I found that I already posted on this topic. But I had something different to say this time, so I guess it’s good to ask me the same question twice.


  1. Dale Lehman says:

    The good news is that they managed to get published a study that failed to find a significant effect. The not-so-good news is that there was indirect funding from the pharma industry, and the study’s conclusions serve to address some fears that these drugs have severe side effects – since the mortality seems very similar for the two (retrospective) groups. The even-less-good news is that the data are proprietary, so the propensity score based on 300 factors (which I believe has been the subject of concerns raised on this blog before) just has to be taken on trust.

    • Dale Lehman says:

      One additional comment – a question. I looked at Scott Alexander’s critique of this study and he refers to the study data as “using a New Jersey Medicare database.” But when I go to the article, they say “We collected data from Optum Clinformatics Datamart (OptumInsight, Eden Prairie, MN), a large US commercial insurance database covering more than 14 million people annually from all 50 US states.” They also describe their outcome measure as follows: “Main outcome measure All cause mortality, determined by linkage with the Social Security Administration Death Master File.” Since the SSA administers the Medicare system, this may be where the reference to a “Medicare” database comes from – although I don’t see any particular reference in the study to New Jersey (in fact, they specifically say “all 50 US states”).

      My question is why the discrepancy between the study and the critique regarding the source of the data? Perhaps I am just not understanding something here. Can anyone clarify?

      • Edward Drozd says:

        The Optum data used in the study I believe are from United Healthcare (the large insurer), not Medicare, and I *believe* do not include their Medicare Advantage (Medicare HMOs and PPOs) plans. These data certainly cover more states than just New Jersey. Maybe the “New Jersey Medicare” thing refers to one of the other studies mentioned in his blog post?

        Also, being a little pedantic, SSA does not administer Medicare; the Centers for Medicare & Medicaid Services (CMS) does. CMS was once part of SSA, but it’s been a separate agency under DHHS for decades (at least three, to my knowledge). It used to be the Health Care Financing Administration (HCFA), but the name was changed in 2001. Why does CMS have only one M rather than two? I used to joke that they left out one M for a reason…

        • Dale Lehman says:

          Thanks for the clarification regarding SSA and CMS – I actually thought when I posted my comment that SSA was not the administrator (although I believe they collect the money for Medicare). In any case, I did wonder whether the “New Jersey Medicare” reference might have pertained to one of the other studies mentioned, but when I read what is written, it doesn’t read that way.

  2. Dimitriy Masterov says:

    Given the large sample size, what is the problem with 300 controls and no regularization?

    • Anonymous says:

      This is outside my wheelhouse, but I’d guess the concern is post-treatment bias. If the drug increases death rate, there’s a good chance it would do so through a measured intermediary, such as increasing blood pressure. If you adjust for blood pressure as a confound, you’ll find the drug does nothing, regardless of data volume.
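      A tiny simulation (entirely made-up numbers, nothing from the study) makes this concrete: if the drug affects the outcome only by raising blood pressure, then adjusting for blood pressure makes the drug look harmless no matter how much data you have.

```python
# Hypothetical post-treatment-bias simulation: the drug raises blood
# pressure, and blood pressure drives the outcome. Adjusting for the
# mediator (blood pressure) erases the drug's true total effect.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
drug = rng.integers(0, 2, size=n).astype(float)   # randomized exposure
bp = 1.0 * drug + rng.normal(size=n)              # mediator: blood pressure
y = 0.8 * bp + rng.normal(size=n)                 # outcome acts through bp only

def drug_coefficient(covariates):
    """OLS coefficient on the drug, with the listed covariates included."""
    Z = np.column_stack([np.ones(n)] + covariates)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[1]                                # drug is the first covariate

total = drug_coefficient([drug])          # recovers the real total effect, ~0.8
adjusted = drug_coefficient([drug, bp])   # effect vanishes once bp is adjusted for
print(round(total, 2), round(adjusted, 2))
```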

      • (Another) Anonymous says:

        I’m also quite confused about this, and I’m not sure that post-treatment bias is the answer. I’m not sure why regularization would be more likely to shrink the coefficients on post-treatment variables than pre-treatment variables.

        If you have collinearity, regularization would help you deal with the model form, but if all you’re concerned about is model *predictions*—which, as far as I can tell, is all you really need to worry about in this case with propensity score matching—then regularization shouldn’t affect the propensity scores very much. (I suppose if you’re training the model on one portion of the data and then predicting on another portion, you could run into issues with collinearity that only exists in the training set, but my impression is that there’s no training / test split here.)

        • Anonymous says:

          After thinking more, I agree. Since predictions are the only interest, and the dataset is huge, the number of variables included shouldn’t hurt. And any prior will be swamped unless there are serious identification issues.

  3. That you don’t have a good idea how more than 2 or 3 confounding variables actually impact the phenomenon you are studying?

  4. High dimensional propensity scores are an interesting beast; they are not adjusting for 300 confounders but rather using the 300 variables to develop a model for the probability of exposure (and then using that to break the confounding arrow between the confounders and the exposure). So it is a two-stage model, although I suppose you could do more parsimonious models for the PS and look at how much that changes the estimate. In my experience the influence is very minimal after the first dozen strongest variables, but this might be an exception.

    It’s a little tricky because you don’t want to adjust for pure predictors of exposure but you do want to adjust for risk factors (here’s a paper I did when using PS to develop MSMs). In drug safety studies this is done all of the time, including with the Canadian consortium CNODES, mostly because they don’t measure all of the confounders but have a lot of variables weakly correlated with the confounders and use them instead. They also use matching, to make it easier not to worry about the parametric fit of the PS.

    It can be a pretty big rabbit hole to argue this. I am not the biggest fan of HDPS because it shifts toward automated models, but I have seen them work well in practice with a competent statistician, which Robert Glynn certainly is.
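    A rough sketch of that two-stage structure on simulated data (this is not the HDPS algorithm or the study’s variables, just the general idea): stage 1 fits a propensity model for exposure from all the covariates; stage 2 uses the fitted scores, here through inverse-probability weighting rather than matching, to estimate the exposure effect.

```python
# Two-stage propensity-score sketch on simulated data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 20_000, 300
X = rng.normal(size=(n, p))
# Exposure is driven by a handful of covariates; the outcome is confounded
# by those same covariates.
logit = 0.5 * X[:, :5].sum(axis=1)
exposed = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)
y = 1.0 * exposed + X[:, :5].sum(axis=1) + rng.normal(size=n)

# Stage 1: model the probability of exposure from all 300 covariates.
ps = LogisticRegression(max_iter=1000).fit(X, exposed).predict_proba(X)[:, 1]

# Stage 2: inverse-probability-weighted difference in mean outcomes.
w = exposed / ps + (1 - exposed) / (1 - ps)
effect = (np.sum(w * exposed * y) / np.sum(w * exposed)
          - np.sum(w * (1 - exposed) * y) / np.sum(w * (1 - exposed)))
print(round(effect, 2))   # should land near the true effect of 1.0
```

    Matching on the fitted score, as CNODES does, would replace the weighting step; the two-stage logic is the same.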

  5. Michael Nelson says:

    After comparing this post with the earlier one on the same study, I conclude that Andrew’s statistical views have high test-retest reliability.

    I also like the point he makes about plotting the data after making adjustments/transformations/exclusions, etc. And then reporting on it.

  6. Ariel says:

    How would you graph “the amount of adjustment on the x-axis”? What does that mean?

  7. paper says:

    One of the adjusted variables was a mediator OOOOOOOOOOOO.
