
How should those Lancet/Surgisphere/Harvard data have been analyzed?

As you will recall, the original criticism of the recent Lancet/Surgisphere/Harvard paper on hydro-oxy-whatever was not that the data came from a Theranos-like company that employs more adult-content models than statisticians, but rather that the data, being observational, required some adjustment to yield strong causal conclusions—and the causal adjustment reported in that article did not seem to be enough.

As James “not the racist dude who assured us that cancer would be cured by 2000” Watson wrote:

This is a retrospective study using data from 600+ hospitals in the US and elsewhere with over 96,000 patients, of whom about 15,000 received hydroxychloroquine/chloroquine (HCQ/CQ) with or without an antibiotic. The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people. . . .

The most obvious confounder is disease severity . . . The authors say that they adjust for disease severity but actually they use just two binary variables: oxygen saturation and qSOFA score. The second one has actually been reported to be quite bad for stratifying disease severity in COVID. The biggest problem is that they include patients who received HCQ/CQ treatment up to 48 hours post admission. . . . This temporal aspect cannot be picked up by a single severity measurement.

In short, seeing such huge effects really suggests that some very big confounders have not been properly adjusted for. . . .

Setting aside data problems of the sort that caused the article to be retracted, what should they have done?

I don’t have a specific answer for this particular study, but the general ideas are discussed in various textbooks on causal inference, for example chapter 20 of Regression and Other Stories. The basic idea is to start with the comparison of treated and control groups in the observed data, then compare the two groups with respect to demographics and pre-treatment health status, then do some combination of matching and regression to estimate the treatment effect among the subset of people who had a chance of getting either option, then see how the estimate changes as you adjust for more things, then consider the effects of adjustment for important but unmeasured pre-treatment predictors. It’s all assumption-based, but if you do it carefully and make your assumptions clear, you can learn something. It’s not a button that can be pushed in a statistics package.
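To make the compare-then-adjust logic concrete, here is a minimal sketch on simulated data (all numbers are made up; this is the general idea, not the analysis the paper should have run): sicker patients are more likely to get the drug and more likely to do badly, so the naive treated-vs-control comparison shows a large spurious "effect," while adjusting for the measured confounder recovers the true effect of zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated pre-treatment severity (the confounder): sicker patients
# are both more likely to be treated and more likely to have bad outcomes.
severity = rng.normal(size=n)
treated = rng.random(n) < 1 / (1 + np.exp(-2 * severity))
# True treatment effect on this (linear) outcome scale is zero.
outcome = 1.5 * severity + rng.normal(size=n)

# Step 1: naive comparison of treated vs. control -- badly confounded.
naive = outcome[treated].mean() - outcome[~treated].mean()

# Step 2: regression adjustment for the measured confounder.
X = np.column_stack([np.ones(n), treated, severity])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted = beta[1]  # coefficient on the treatment indicator

print(f"naive difference:    {naive:+.2f}")     # large spurious 'effect'
print(f"adjusted difference: {adjusted:+.2f}")  # near the true value, 0
```

In the real problem the confounder is not fully measured (two binary severity variables, as Watson notes), which is exactly why the sensitivity-analysis step, seeing how the estimate moves as you adjust for more, matters.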

See here for further discussion of the challenges of adjustment for causal inference in observational studies.

P.S. The editor of Lancet described this as “a shocking example of research misconduct in the middle of a global health emergency.” Had there been no research misconduct, this would’ve been a not-at-all-shocking example of bad research getting publicity and influence.


  1. It’s hard for me to understand the meaning of a question like “what should they have done” when the data may not even exist. What you should do to analyze a dataset is fundamentally linked with the process that created the data set: measurement errors, confounding, sampling biases, and physical mechanisms all enter into it.

    When the process that created the data is “someone made that crap up” then the answer of what to do with it is “nothing”.

    If the data are somehow real but illegally exfiltrated from hospitals… then it’s another problem, but you still need to think about things like “what caused this data” in order to get at some reasonable model of what’s going on. If different hospitals had different ways of deciding who got what kind of treatment etc, then it’s really difficult to know what to make of the data.

    This is the fallacy of big data, that if you just throw enough of it together you’ll learn about the world… Sure you’ll learn about the world, but all you may learn is that “there are all kinds of human biases in who gets what treatment”. Of course you may misinterpret that as “we know the true answer as to what treatments should be performed”.

    I think especially in medicine, “big data” approaches will be limited to hypothesis generation.

    • Andrew says:


      Sure, but suppose your goal is hypothesis generation. The same principles apply: you want to compare treated and control groups and adjust for pre-treatment differences.

      • Yes, but the methods whereby you do that should inherently encode aspects of how the data came about… a “generative model” in some sense. So when the data come about by fiction, it’s impossible to actually model anything. But if they didn’t come about by fiction, then you have a hard problem where you have to start asking individual hospitals what they did to recruit or how they made decisions, or you need to make a lot of assumptions.

        So, I agree with you in principle, but since fiction seems to be a plausible explanation it’s hard to go further than just reiterating the basic principles.

        • Anoneuoid says:

          The “data” almost surely came about via some complicated machine learning model that filled in all the individual missing data points (the missingness rate is unreported but probably at least 50% for most of their features). Then when they took the average over a lot of these it was just the average of whatever data they did have.

          Anyway, without enough description of the methods for anyone to replicate it the study was dead on arrival. It should have taken reviewers less than a minute to spot this.

          • Andrew says:


            Good point. I hadn’t thought of that. I’d been thinking about the data being real or being fabricated, not about the data being put together based on some model.

            • Anoneuoid says:

              Yes, also I suspect a lot of the “covid-19” patients were hospitalized for something else but tested positive due to nosocomial transmission or were simply one of the 80+% mild/asymptomatics.

              But we have no way of telling from the methods they provided. There is no excuse for reviewers not to ask these questions and reject the paper when no answer was forthcoming.

  2. yyw says:

    The apparent misconduct gave Lancet an easy way out. My fear is that the wrong lesson will be learned here (rare bad apples versus systemic flaw.)

    • Brent Hutto says:

      Like if a bank kept getting robbed because they never lock their vault. The police just caught the latest set of thieves, therefore problem solved. No need to start locking the vault or anything like that.

      I’m not putting any of my money there.

      • Anoneuoid says:

        And meanwhile…

        HCQ: Study after study reports results in patients hospitalized for covid. No one ever claimed this would work, and from multiple lines of evidence we’ve seen that hospitalization occurs weeks after infectious virus peaked

        Vitamin C: FBI raiding clinics giving out free IV vitamin C. FCC sending threatening letters to others. Sorry, not enough patients available to run a study, but doctors are allowed to give 1/10 to 1/20 the reportedly required dose (~20 g/day IV depending on how sick they are), only after hospitalization when a lot of damage has already been done. And not a single measurement of blood vitamin C levels has been reported.

        HBOT: Again, FCC sending out threatening letters and various bureaucratic excuses are given to limit the number of patients. Not “statistically significant”, sorry.

        Smoking: People hospitalized for covid who smoke have worse outcome, we will totally ignore how rarely smokers are hospitalized, test positive on PCR, and test positive for antibodies later.

        This is what people should be rioting about. There is a systematic suppression of cheap and relatively safe prophylaxis and treatments, while obvious BS like this paper gets max press and leads to stopping other trials on HCQ until public outrage makes them retract.

        • Esbee says:

          In a way I agree with the gist of your argument. However, one of the authors of the first HCQ clinical trials to come out, after the Recovery trial came out, tweeted that the only missing link for HCQ now is whether it works given as an early treatment. I think based on all the studies done so far, no one can definitively answer that. Does HCQ prevent Covid-19? Based on the UMN studies, it looks like it doesn’t. Does it treat severe hospitalized cases? The UK Recovery Trial answers that too. Thus, the missing link here seems to be whether it works when given early. Also, I have seen some papers coming out that hypothesize about vitamin D deficiency or insufficiency being correlated with severe outcomes. Unfortunately, on this too, we do not have something that can definitively answer the question. Coming from a developing country whose economy has now been devastated by Covid-19, I would at least have expected the experts here to test these theories.

          • Anoneuoid says:

            The original idea was to take it in the initial phase of the illness to slow viral replication. Why have dozens of papers been published studying what happens when given at a theoretically inappropriate time, but not a single one testing the actual proposal?

        • Adede says:

          Are you proposing smoking as a prophylaxis against covid? That’s….novel.

  3. Tom Passin says:

    It’s very hard to answer even an apparently simpler question: Is country/state A doing better than country/state B in terms of controlling the spread of the coronavirus? If we can’t treat this in a satisfactory way, how can we hope to answer harder questions about medication, etc.? So consider:

    We can’t use total number of cases – even per capita – unless both A and B’s cases have leveled off. In the case where one of them is the US, for example, this isn’t happening. In addition, the actual number of cases depends strongly on the number of initial ones (the ones that got the state on a strongly exponential curve), plus when they occurred relative to the start of control measures.

    A better measure would seem to be the curve of the daily case rate per capita. This can be *very* spikey, especially for areas with a small population, like many US counties (which can have populations as low as 4000, and even 1000). With very spikey data, least squares and many smoothing methods can be unduly influenced by the spikes, via the square of the deviations. Also, if one of the regions to be compared has not leveled off yet, how can you choose some place on the curve to do comparisons?

    In addition, it appears that population density has an important effect, not to mention other social and cultural issues that can strongly affect R0. If you start from a high value of R0 for whatever reasons, you may end up being successful in controlling the pandemic but still end up with higher daily rates and more cases than a country that was less effective but started with a lower R0.

    In the earlier period of strong exponential growth, I have found that the fractional change in the daily case rate seems to be a pretty good measure (better than per capita), and seems to be pretty consistent across states and countries as long as there is a clear period of strongly exponential growth before control measures come into effect. Later, after the daily case rate levels off and drops, this measure becomes less and less useful.
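    A toy illustration of why the fractional change works during the exponential phase (the numbers here are invented; a real series would be noisy and need smoothing): during clean exponential growth the day-over-day fractional change is a constant that does not depend on population size, and it collapses toward zero once the curve levels off, which is when the measure stops being useful.

```python
import numpy as np

# Made-up daily case counts: clean exponential growth for 30 days,
# then flat after control measures take effect (no noise, for clarity).
days = np.arange(60)
growth = 0.15  # illustrative per-day exponential growth rate
cases = np.where(days < 30,
                 5 * np.exp(growth * days),
                 5 * np.exp(growth * 29))

# Fractional change in the daily case rate: (today - yesterday) / yesterday.
frac_change = np.diff(cases) / cases[:-1]

# During the exponential phase this is constant, exp(growth) - 1,
# regardless of population size -- the appeal of the measure.
print(frac_change[:5])   # ~0.162 each day while growth is exponential
print(frac_change[-5:])  # ~0 once the curve has leveled off
```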

    Of course, there’s the question of the data itself, how the number of cases or deaths are determined, whether they have been influenced by politics, etc.

    So it’s really hard just to compare different regions on a “simple” measure of case number or death number. The case of evaluating medications and protocols is much, much harder. It will take a lot of convincing for me to put much faith in the results of comparisons between different medications.

  4. Carlos Ungil says:

    > Watson wrote:
    >> The big finding is that when controlling for age, sex, race, co-morbidities and disease severity, the mortality is double in the HCQ/CQ groups (16-24% versus 9% in controls). This is a huge effect size! Not many drugs are that good at killing people. . . .

    I think it’s worth repeating that this is incorrect. The increase in mortality that they reported was not 100% but a third of that.

    HR: 1.33 [1.22 1.46] hydroxychloroquine alone
    HR: 1.45 [1.37 1.53] hydroxychloroquine + macrolide
    HR: 1.36 [1.22 1.53] chloroquine alone
    HR: 1.37 [1.27 1.47] chloroquine + macrolide

    This is not so shocking compared to the results of other observational studies. The results are not statistically significant for the most part but the 95% confidence intervals are wide enough to include huge effect sizes.

    (Med, June 5) VA Hospitals
    HR: 1.83 [1.16 2.89] hydroxychloroquine alone
    HR: 1.31 [0.80 2.15] hydroxychloroquine + azithromycin

    (JAMA, May 11) 25 NY hospitals
    HR: 1.08 [0.63 1.85] hydroxychloroquine alone
    HR: 1.35 [0.76 2.40] hydroxychloroquine + azithromycin

    (NEJM, May 7) Columbia University Irving Medical Center
    HR: 1.04 [0.82 1.32] (death or intubation) hydroxychloroquine with or without azithromycin

    Of course all of those studies may be biased to different extents, depending on the quality of the data and the analysis. Surely there are many more opportunities to get things wrong with an aggregation of data from heterogeneous sources (especially if you get it from a back alley data-provider who might have cut it with God knows what random imputed garbage) than with data from a single hospital. While the statistical analysis may be better in some cases than in others, probably all of them could be considered examples of bad research getting published if you have high expectations.
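    To put numbers on why a HR of ~1.3–1.5 is nowhere near a doubling: if you assume proportional hazards over the whole follow-up (a simplification I’m imposing; the paper itself only reports the HRs and the raw rates) and take the ~9% control mortality as baseline, the reported HRs imply treated-group mortality of roughly 12–13%, far below the 16–24% raw rates.

```python
def mortality_under_hr(control_mortality, hazard_ratio):
    """Implied treated-group mortality if hazards are proportional
    over the whole follow-up period (a simplifying assumption)."""
    return 1 - (1 - control_mortality) ** hazard_ratio

control = 0.09  # ~9% control-group mortality reported in the paper
for label, hr in [("HCQ alone", 1.33), ("HCQ + macrolide", 1.45),
                  ("CQ alone", 1.36), ("CQ + macrolide", 1.37)]:
    # Each prints an implied mortality around 12-13%.
    print(f"{label}: HR {hr:.2f} -> ~{mortality_under_hr(control, hr):.1%}")
```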

    • James Watson says:

      True. That effect size was actually based on their propensity matching analysis. If I understand the sup tables correctly, they estimate about 100% increase in mortality for the different treatment groups. That’s what made me initially think that the adjustment for disease severity wasn’t very good.

  5. Carlos Ungil says:

    You didn’t like that the authors of the Lancet paper said in their retraction that “Several concerns were raised” and didn’t give enough credit to the concern raisers. It could have been worse, they could have thanked The Guardian! From the link in your PS:

    Retraction made after Guardian investigation found inconsistencies in data

    The Lancet paper that halted global trials of hydroxychloroquine for Covid-19 because of fears of increased deaths has been retracted after a Guardian investigation found inconsistencies in the data.

    A Guardian investigation had revealed errors in the data that was provided for the research by US company Surgisphere. These were later explained by the company as some patients being wrongly allocated to Australia instead of Asia. But more anomalies were then picked up. A further Guardian investigation found that there were serious questions to be asked about the company itself.

    In its investigation, the Guardian put a detailed list of concerns to Desai about the database, the study findings and his background.

  6. Malcolm says:

    I’m going to repeat and amplify comments from Joseph Delaney and me in an earlier installment of this saga.

    Pharmacoepidemiologists will tell you that confounding by indication is a massive, probably insurmountable problem. This analysis was never going to be useful as a study of efficacy. It could have provided some information about safety, assuming the arrhythmia endpoint was measured without too much bias.

    On 5 May the International Society for Pharmacoepidemiology published “Considerations for pharmacoepidemiological analyses in the SARS‐CoV‐2 pandemic” by Anton Pottegård et al. (in Pharmacoepidemiology & Drug Safety)

    Their “Consideration #2” is:

    Observational studies to assess efficacy are, in the context of an ongoing pandemic, unlikely to add significant value in the short term.

    • Anoneuoid says:

      This seems to imply some sort of specialized knowledge/experience is required to spot red flags with this study. You don’t need to be a pharmacoepidemiologist to spot it.

      Anyone who knows that a methods section should contain enough info to replicate a study could spot it very quickly. Or anyone who knows that comorbidity and treatment rates typically vary pretty substantially in different places. Or anyone who just knows you should compare those rates to what has been previously published as a sanity check.

      Pretty sure the reviewers/editors just went “big sample size, wow!”. Small sample size is the first critique that people who don’t know what they are doing focus on these days.

      • Malcolm says:

        If The Lancet can’t find a pharmacoepidemiologist to review a pharmacoepi article, then there is no hope for them whatsoever.

        Sure there are plenty of other red flags that the thing was fishy, but this post was about how it should have been analyzed assuming the data were real.

        The answer from anyone who paid attention in pharmacoepi 101 was: as a study of excess arrhythmias among people taking hydroxychloroquine for COVID-19. Not as an all-cause mortality, “does hydroxychloroquine work for COVID-19” study, which anyone who paid attention in pharmacoepi 101 will tell you is a waste of time.

        • Anoneuoid says:

          If The Lancet can’t find a pharmacoepidemiologist to review a pharmacoepi article, then there is no hope for them whatsoever.

          They apparently didn’t find anyone versed in any form of scientific research to review it. Seems like a much more egregious failure to me than not finding a pharmacoepidemiologist, so we should focus on that.

  7. Anoop says:

    Surprised nobody commented on Mandeep Mehra’s tweet above!!

    This is what he had to say after the retraction: because Surgisphere would not transfer the primary data to the Bethesda, Maryland, institute, “I no longer have confidence in the origination and veracity of the data, nor the findings they have led to.”

    What a joke!

  8. Dale Lehman says:

    At present, the Surgisphere website has been gutted. The home page is still there, but everything else is gone. No research reports, no COVID tools, linkedin and twitter pages, gone, etc. While this is probably a reasonable development – and expected – it is not quite an admission of problems and appreciation for mistakes and their discovery. Perhaps statements will be coming that tell us more about what happened, but possibly this is as far as we’ll ever see.

  9. Anoneuoid says:

    Editor in chief of the Lancet claimed this paper appeared “methodologically perfect” in a secret meeting:

    These people are totally incompetent.

    • Carlos Ungil says:

      >  Editor in chief of the Lancet claimed this paper appeared “methodologically perfect” in a secret meeting

      Really? You refer to a link that doesn’t say that, which in turn refers to a video that doesn’t say that either, in which Douste-Blazy is apparently talking about a meeting that happened years ago.

      • Anoneuoid says:

        Don’t know what video you just watched. It refers to a recent meeting this year.

          • Anoneuoid says:

            Can’t get google translate to work on that page, can you quote the part you think is relevant?

          • Anoneuoid says:

            Ok, managed to get this:

            According to the former Minister of Health, “The boss of The Lancet said: ‘Now we will no longer be able, basically, if this continues, to publish data from clinical research, because the pharmaceutical companies are so strong, financially, and manage to have such methodologies to make us accept papers which apparently, methodologically, are perfect, but which basically say what they want.’”

            What actually happened: a conference took place in London, but on April 1 and 2, 2015, so five years ago and not “the other day”. It was organized by the British Academy of Medical Sciences, the British Council for Medical Research, the British Council for Biotechnology and Biological Sciences and the Wellcome Trust, a British charitable foundation focused on medicine. Its theme was “the reproducibility and reliability of biomedical studies”.

            Ten days later, on April 11, 2015, the editor of The Lancet, Richard Horton, echoed the discussions held there in an editorial. He clarified in passing that the participants were committed to respecting the Chatham House rule – very common in international meetings and conferences – according to which the participants undertake to respect the confidentiality of their exchanges. Not really a “leak” therefore. And nothing “top secret” either.

            “This symposium, writes Richard Horton, addressed one of the most sensitive problems in science today: the idea that something fundamentally went wrong with one of our greatest human creations. The argument against science is simple: a large part of scientific literature, perhaps half, can simply be false. ”

            The boss of the Lancet, whose job it is to examine scientific publications before printing them, deplores “studies with small samples, tiny effects, invalid exploratory analyzes”, but also “flagrant conflicts of interest” and the “obsession with pursuing fashionable trends”. “As one participant said, ‘Bad methods work badly,’” he writes again. And to conclude, a bit bitter after this symposium: “The good news is that science is starting to take some of its worst flaws very seriously. The bad news is that no one is ready to take the first step to clean up the system.”

            In his editorial, the editor of The Lancet never mentions any lobbying by pharmaceutical companies. This mention of “Big Pharma” appears in the interpretation made by the conspiratorial author Frederick William Engdahl in a text published on the New Eastern Outlook site, as spotted by the Observatory of conspiracy. This falsification of Richard Horton’s words was translated into French by the Stop Mensonges site. It was then picked up in 2016 by Egalité & Réconciliation by Alain Soral, Agoravox, and even Criigen, the anti-GMO association founded by Corinne Lepage. It resurfaced on May 23 in the Facebook group entitled “Didier Raoult Vs Coronavirus”.

            Doesn’t sound like they are talking about the same thing at all. I guess it comes down to whether that quote exists somewhere in 2016.

          • Anoneuoid says:

            They say it came from this, which mentions nothing about “apparently methodologically perfect” studies:

            And they seem to call the claims there a conspiracy theory when it is stuff we all know is going on.

            So I don’t see how this debunks the quote. The health minister should tell us where to find this leaked audio.

            • Carlos Ungil says:

              Even if he’s talking about something else, that only he knows about, the point stands that he makes no reference to “this paper”.

              • Anoneuoid says:

                Well, he made it sound like it had just happened, when this paper came out and needed some excuse for being published. But correct.
