
Is causality as explicit in fake data simulation as it should be?

Sander Greenland recently published a paper with a very clear and thoughtful exposition on why causality, logic and context need full consideration in any statistical analysis, even strictly descriptive or predictive analysis.

For instance, in the concluding section – “Statistical science (as opposed to mathematical statistics) involves far more than data – it requires realistic causal models for the generation of that data and the deduction of their empirical consequences. Evaluating the realism of those models in turn requires immersion in the subject matter (context) under study.”

Now, when I was reading the paper I started to think about how these three ingredients are, or should be, included in most or all fake data simulation. Whether one is simulating fake data for a randomized experiment or a non-randomized comparative study, the simulations need to adequately represent the likely underlying realities of the actual study. One only has to substitute simulation into this excerpt from the paper: “[Simulation] must deal with causation if it is to represent adequately the underlying reality of how we came to observe what was seen – that is, the causal network leading to the data”. For instance, it is obvious that sex is determined before treatment assignment or selection (and should be in the simulations), but some features may not be so obvious.

Once someone offered me a proof that the simulated censored survival times they generated, where the censoring time was set before the survival time (or some weird variation on that), would meet the definition of non-informative censoring. Perhaps there was a flaw in the proof, but the assessed properties of repeated trials that we wanted to understand were noticeably different from those obtained when survival times were generated first and censoring times were then generated and applied. Generated in that order, simulations likely better reflect the underlying reality as we understand it, and others (including future selves) are more likely to raise criticisms about it.
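To make the ordering concrete, here is a minimal sketch of the "survival time first, then censoring time, then apply" version. The exponential distributions, the rates and the function name are assumptions chosen only for illustration; they are not taken from the study in the anecdote.

```python
import random


def simulate_censored_trial(n, event_rate, censor_rate, seed=0):
    """Simulate right-censored survival data in the order reality is assumed to run.

    Hypothetical setup: exponential survival and exponential censoring times.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Step 1: the (latent) survival time is generated first.
        survival_time = rng.expovariate(event_rate)
        # Step 2: the censoring time is generated without looking at the
        # survival time, which is what makes the censoring non-informative here.
        censoring_time = rng.expovariate(censor_rate)
        # Step 3: censoring is applied to produce what would actually be observed.
        observed_time = min(survival_time, censoring_time)
        event_observed = survival_time <= censoring_time
        records.append((observed_time, event_observed))
    return records


trial = simulate_censored_trial(n=2000, event_rate=0.10, censor_rate=0.05, seed=1)
event_fraction = sum(event for _, event in trial) / len(trial)
print(f"fraction of subjects with an observed event: {event_fraction:.3f}")
```

Because each step is a literal stand-in for something believed to happen, anyone who doubts the independence of censoring can see exactly where to make step 2 depend on earlier variables.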

So I then worried about how clear I had been in my seminars and talks on using fake data simulation to better understand statistical inference, both frequentist and Bayes. At first, I thought I had, but on further thought I am not so sure. One possibly misleading footnote on the bootstrap and cross-validation I gave likely needs revision, as that did not reflect causation at all.

The example of the “flawed” simulation of censoring times does point out the risks of accepting a faulty proof (or one that does not apply), but there is something more subtle and perhaps more important to consider. That is, the use of a less transparent means to simulate the correct distributions rather than a direct and literal representation of what is understood to be the reality that produced the observations.

That is, if one could somehow simulate the correct distribution of the data of the represented reality (model assumptions), but that method did not explicitly involve the same pathways as what led to the data being observed, it hides what is being represented to be going on. With great transparency comes doubt, doubt acquired with much less difficulty and greater value.

Now as for being clear about this in presentations on fake data simulation, I had been using diagrammatical reasoning to enable non-statisticians and early career statisticians to better grasp statistical reasoning, that is, reasoning that is mostly about what to make of analysis results and how they should change one’s thinking and future actions. I argued that what to make of analysis results should primarily be based on discerning what would repeatedly happen given a (model) representation of how the results came about. Today, such assessments can be carried out with simulation (fake data simulation), but only if it is understood as simply discerning what happens in a “realistic” but idealized abstraction (mathematical) or fake (possible) world, which then needs to be _transported_ to actual studies in hand.

The footnote was – “As an aside, the Frequentist approaches based on  cross-validation and bootstrapping are a form of mathematically degenerate simulation as they use finite populations to mechanically extrapolate to and from representations to realities.”

In an email to a colleague afterwards I explained – “Approaches based on  cross-validation and bootstrapping somewhat thoughtlessly define the fake world (model) as either the hold out samples or the data set in hand.  For the bootstrap, an automatic and often thoughtless choice of fake world – just the data in hand (but then if say the x were chosen rather than sampled – don’t resample x values.) So there is a model and like you point out usually assumption of iid sampling. So in my mind, they are a form of mathematically degenerate probability as they use finite populations to mechanically extrapolate to and from those representations to realities. But both can be embedded in [more flexible] probability models to be assessed under those (perhaps more realistic) assumptions.”
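The point about not resampling x when the x values were chosen can be sketched as two different fake worlds for the same data. This is only an illustration: the straight-line model, the made-up design of five x levels with four replicates, and all function names are assumptions, and the residual bootstrap shown is one common way (not the only one) to hold x fixed.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def case_bootstrap_slopes(x, y, n_boot, rng):
    """Fake world 1: (x, y) pairs are resampled, as if x had been sampled too."""
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # a resample with a single x value has no slope
        slopes.append(fit_line(xs, ys)[1])
    return slopes


def residual_bootstrap_slopes(x, y, n_boot, rng):
    """Fake world 2: x is held at its chosen values; only residuals are resampled."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        y_star = [fi + rng.choice(residuals) for fi in fitted]
        slopes.append(fit_line(x, y_star)[1])
    return slopes


def std_dev(values):
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5


rng = random.Random(2020)
x = [1.0, 2.0, 3.0, 4.0, 5.0] * 4  # designed (chosen) x values, made up
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]  # made-up responses

slope_hat = fit_line(x, y)[1]
case_slopes = case_bootstrap_slopes(x, y, 500, rng)
resid_slopes = residual_bootstrap_slopes(x, y, 500, rng)
print(f"fitted slope: {slope_hat:.3f}")
print(f"bootstrap SD of slope, resampling pairs:     {std_dev(case_slopes):.3f}")
print(f"bootstrap SD of slope, resampling residuals: {std_dev(resid_slopes):.3f}")
```

Either choice is a model of how the data came about; the code just makes that choice visible instead of automatic.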

But the bootstrap and cross-validation completely disregard causality. So doubly degenerate?





  1. Zad says:

    The discussion in the concluding section also reminds me of Nelder’s JRSS paper where he discusses how statistical science differs from mathematical statistics or as he preferred to call it, “statistical mathematics”:

    “One of our biggest problems is the word `statistics’ itself. We need a new term, and that term should be, I believe, `statistical science’. It is the name of a journal, and it also the title of the new professorship in the University of Cambridge. It shows that statistics belongs with science (and hence technology) and not with mathematics. If the new name is accepted several changes follow.

    First ‘applied statistics’ becomes a tautology, for statistics is nothing without its applications. The phrase should be abandoned. It has arisen to distinguish it from ‘mathematical statistics’. However, this is also a misnomer, because it should be ‘statistical mathematics’, as A. C. Aitken entitled his book many years ago.

    To make this change does not in any way diminish the importance of mathematics. Mathematics remains the source of our tools, but statistical science is not just a branch of mathematics; it is not a purely deductive system, because it is concerned with quantitative inferences from data obtained from the real world.

    Bertrand Russell said `mathematics is a subject in which we do not know what we are talking about, nor do we care whether what we say is true’. As statisticians, we should know what we are talking about and should care that what we say is true, in the sense of agreeing with phenomena in the real world. If we statisticians are to become statistical scientists we must become thoroughly familiar with the processes of science.”

  2. Sander Greenland says:

    In papers by Robins and colleagues where they simulate to study causal methods, they enforce causal restrictions by sampling sequentially over time branching out from initial (baseline) causes through intermediates on to end effects – that is, simulation based on the g-computation algorithm. The class of ‘causally coherent’ distributions that can be generated this way can be much narrower than the class of all possible joint distributions for the observed variables. In line with your censoring story, that means that if one ignored the causal-sequence restriction, one could inadvertently generate data from a distribution that contradicted basic background information like temporal sequencing and absence of certain effects. Conversely it also displays how the independencies assumed by common methods (such as partial likelihood for fitting the Cox model) can be far too strict given what isn’t known. To get more realistic data-generation simulations, one has to add observation-selection variables (e.g., selection and censoring indicators) to the sequence.

    On an interesting related note, one can view Bell’s inequality as a restriction forced by our familiar type of potential-outcomes (counterfactual) causal model; experiments showing its violation thus refute that type of model. See Robins Vanderweele & Gill
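A minimal sketch of the time-ordered sampling described in this comment, moving from a baseline cause through treatment and an intermediate to an end effect, with an observation-selection indicator added last. All variable names and probabilities are hypothetical, and this toy is far simpler than the simulations in the papers mentioned.

```python
import random


def simulate_subject(rng, randomized=True):
    """Draw one subject by sampling each variable only from its earlier causes."""
    # Baseline cause: fixed before treatment assignment or selection.
    female = rng.random() < 0.5
    # Treatment: randomized, or selected in a way that depends on baseline.
    if randomized:
        p_treat = 0.5
    else:
        p_treat = 0.7 if female else 0.3
    treated = rng.random() < p_treat
    # Intermediate: depends on baseline and treatment only.
    responded = rng.random() < 0.2 + 0.3 * treated + 0.1 * female
    # End effect: depends on baseline and the intermediate. Treatment has no
    # direct effect here, an assumed "absence of effect" restriction.
    outcome = rng.random() < 0.1 + 0.4 * responded + 0.1 * female
    # Observation-selection: dropout depends on earlier variables, and the
    # outcome is recorded only for subjects who remain under observation.
    observed = rng.random() >= 0.1 + 0.2 * treated
    return {
        "female": female,
        "treated": treated,
        "responded": responded,
        "observed": observed,
        "outcome": outcome if observed else None,
    }


rng = random.Random(7)
cohort = [simulate_subject(rng, randomized=True) for _ in range(5000)]
treated_fraction = sum(s["treated"] for s in cohort) / len(cohort)
observed_fraction = sum(s["observed"] for s in cohort) / len(cohort)
print(f"treated: {treated_fraction:.3f}, still observed: {observed_fraction:.3f}")
```

Since no variable is ever drawn using something that comes after it, temporal sequencing cannot be contradicted by construction, and each assumed arrow (or missing arrow) is a line of code that can be challenged.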

  3. Yuling Yao says:

    “Fake data simulation”, or in other words, likelihood, to me is just another name for “prediction”. Causality = potential outcome/ prediction under covariate shift so I would say fake data simulation does play a role there. Indeed we always need to take care of covariate shift in bootstrap and cross-validation anyway.

    • Ben says:

      What does the phrase “covariate shift” mean? Is it just a change in covariates or something weirder?

    • Yuling:

      > “Fake data simulation”, or in other words, likelihood,
      The more commonly accepted definition of likelihood is something like the probability of re-observing the same observation as a function of the parameters. So it’s a restricted prediction of just that – the same observation given a point in the parameter space.

      > covariate shift in bootstrap and cross-validation
      Can you elaborate on this in the bootstrap and cross-validation?

  4. Andrew(not Gelman) says:

    Dunno. I’m with Russell. “The law of causality, I believe, like much that passes muster among philosophers, is a relic of a bygone age, surviving, like the monarchy, only because it is erroneously supposed to do no harm.”

    • Sander Greenland says:

      Andrew-not-Gelman: Thanks for bringing up the Russell quote…

      One of my themes is that we should not bow to “great men” of the past because however brilliant they were at times, they also made great errors. I am a huge fan of Russell as a general philosopher. But he had no research experience in “soft sciences” or their statistics, and here he made a huge mistake – not unlike Kelvin at the same time denying the Earth could be billions of years old, or later, Jeffreys denying continents could drift or Fisher’s intransigent pseudoskepticism about cigarettes causing lung cancer.

      To err is human, and unfortunately it is as human to repeat errors and invent fallacious rationales for the errors just because venerated authorities made them. We see this today with ongoing defenses of significance tests as truth tests and “confidence intervals” as uncertainty intervals from those who should know better – but never will because they have based their careers on repeating the errors. Sadly, this behavior refutes the notion that science progresses funeral by funeral: Bad ideas and methods seem more like undead zombies that keep rising from the grave to eat the living flesh of science, as you demonstrated by resurrecting Russell’s quote – which needs a stake through its heart.

      Speaking of stakes, Russell had no idea of what is at stake today in research or of how modern causal concepts supply tools to address those. He may have been referring to the fact that there is no singular “law of causality” (perhaps apart from those arising from the 2nd law of thermodynamics or from relativistic constraints). There are however causal models which (as Pearl explains) incorporate our qualitative information about time order of events and the mechanisms generating data in order to better predict consequences of actions. It is tragic when statisticians try to avoid learning and teaching these tools, because causal forecasting (prediction of potential data outcomes of different design choices) is as crucial to study design and conduct as it is to analysis and decisions based on results. To reveal assumptions and invite their criticism, such forecasting needs to be done as transparently as possible. Due to their acausal nature, pure probability models do not begin to provide the transparency needed for observational research (which for example is crucial for postmarketing surveillance of drug and device safety) whereas causal graphs can quickly show bias avenues and fallacies that are routinely overlooked in reports of conventional regression analyses (e.g., see

  5. Ron Kenett says:

    Sander – Causality is a powerful generalisation argument. The reason statisticians should be interested in causality is that this enhances the generalizability of the claims derived from their analysis. A related idea is the use of pseudo variables and simple model simulated data to better understand the generalisability properties of your analysis. Given that a pseudo variable is self generated random noise, any effect above it indicates an effect beyond noise.

    A related issue is the representation of findings. You first present findings in one way or another and then you should discuss their generalisability. A causality argument certainly does that.

    For more on some of this see

    • Sander Greenland says:

      Thanks Ron. True that causal arguments are at the heart of generalizations. See my response above to Andrew-not-Gelman about how causality is also at the heart of study design, conduct, and analysis including narrow interpretation specific to the study or data source under analysis – even for descriptive surveys such as for voter preferences.

      Thus I’m with Pearl in arguing that causality is more fundamental than probability for sound statistical science. The view of probability as a sufficient foundation has been a massive conceptual error in applied statistics, apparently stemming from the missteps of K. Pearson, Russell and other authorities in the early 20th century who were notably writing before modern models for causation had been elaborated and deeper questions about relativistic limits emerged (I have read that, ironically, it was Pearson’s The Grammar of Science that was one of young Einstein’s inspirations to explore those questions). I advise those quoting them to take heed of the resurgence of causal notions not only in soft sciences but also in physics. Here’s some quick PBS coverages of that:

      • Andrew says:


        I think that Don Rubin would agree with you on this. He never framed probability as being fundamental. In his take on things, you should first define what you’re interested in (using latent variables where necessary), and then probability modeling is just a convenient tool for performing statistical inference. Yes, we use probability all the time, but probability is not fundamental. Similarly, we use math all the time and we use computers all the time, but math and computers are not fundamental to scientific inference; they are just very useful tools.

        • Sander Greenland says:

          Cool if Don agrees, thanks for pointing that out – I presume you then agree too that we should start off theory for stat with causality and models for it, then derive probabilistic consequences of those models.

          That would just leave the sticking point about graphs as useful tools, especially for tracing out bias sources such as that from conditioning on colliders. It seems Don was unable to ‘get’ that back in the exchanges with Schrier, Sjolander and Pearl in Stat Med 2008-2009 when they each pointed out the dangers of tossing all measured treatment predictors in a propensity score; there he thought the objections involved cancelations when it is just the opposite – like confounding, collider bias happens unless there are perfect cancelations (unfaithfulness). Can we teach an old Don new tricks?

      • Ron Kenett says:

        Sander – The Grammar of Science has been quite controversial. In a paper titled with embedded British humour, David Cox gives a well rounded and sound review of where statistics stands, including a full section dedicated to causality: I had interesting discussions with him on causality, at Oxford, 2 years ago.

        Regarding Judea Pearl. His repeated view that statistics has ignored causality is not factual. I am not sure what drives this “anger” but it certainly does not entice constructive discussions. On the pragmatic side, Structural Causal Models have not been used to address COVID challenges. You would think that someone would have tried to establish causality in disease contagion data but, unless I missed something, I have not seen it. SCM would have been great to support generalisability (transportability) of findings from area A to area B – a much needed information supporting decision makers.

        It seems that causality is addressed by a wide range of options. The challenge is to operationalise it but, this is more complex than the toy examples you find in journals.

        • Sander Greenland says:

          Sorry Ron, but my point is not about Pearl. Whatever historical and political mistakes he’s made are irrelevant to my point that for sound analysis we have to delineate what caused the data – e.g., what caused our observations to show various features. SCMs are just one of many, many classes of models in our toolkit; focusing on them misses the general point that applied statistics rests on causality and thus needs to include basic causal ideas and tools in basic training and beyond – whether SCMs are worth including depends on the field and application.

          I also think you err grievously in describing the literature:
          1) There are plenty of causal contagion models, e.g., look up the work of Halloran, Longini and colleagues.
          2) There are now many articles on transporting results using causal models, e.g., search on transportability and authors like Barenboim, Stuart, Cole, Hernan.
          3) There are now decades of articles in epidemiologic journals applying modern causal models to real, complex data. Many have Robins as a coauthor so you should be able to find some by going to his online publication listing at Harvard.

          • > Whatever historical and political mistakes he’s made are irrelevant
            Definitely worth guarding against those sorts of things getting in the way of understanding what is of real value.

          • Ron Kenett says:

            Sander – the book by Hernán and Robins is indeed one of the best treatments on causality. Also the software developed by Elias and his team works very well. My point was that in the COVID19 related long list of publications, I did not see any reference to SCM and generalizability (also called transportability). Did you see any such applications?

            • Sander Greenland says:

              No but I am unclear how that relates to the general topic here. Covid epidemiology is not even a year old and is like nothing I’ve ever seen, disastrous from the start thanks to innumerable problems including lack of reliable or consistent survey data and (here in the U.S. at least) appalling politics. Meanwhile media coverage and publication has bordered on chaos. As an example, at the editor’s invitation we wrote the following lament in March and it was not even published until July:
              Basic issues remain.

              • Ron Kenett says:

                Sander – thank you for your response. Three comments/thoughts:
                1. The paper you wrote with a very impressive list of co authors does not mention generalisation of findings. What are the implications of a study in Taiwan, to Italy? Can we handle this question methodologically or is it only left to expert opinion interpretation?
                2. The premise of the paper you referred to, and many others, is that statistical analysis leads to claims presented in a certain way. There is no mention of methods to present findings. They can be directional or magnitude related, leading to an S type or M – type error evaluation. Gelman and Carlin introduced this. John Carlin is one of your co-authors and, for some reason, your paper does not bring this up. I wrote about all this in
                3. You would think that the COVID case, with messy data but also with significant stakes, would lend itself to some high level analysis, including attempts at causality assessments. Are kids attending schools driving infection rates? Is the application of ventilators detrimental to health outcomes etc etc. Some of these claims are made. Again, no attempts to establish causality with SCM or other methods seem to have been made. The SEIR models are not actually addressing causality as they only rely on health outcomes and no driving factors. Moreover, they often are using aggregate data which is quite nonsensical when you consider localised patterns. Look for example at the heterogeneity in Italian provinces. Here in Israel, the patterns in various groups like the ultra orthodox are very different from those in other groups. My observation is that applying SEIR at the country level, in Italy or Israel, is not very useful to policy makers and possibly much misleading.

                To help assess data driven analysis I have been suggesting checklists tailored to specific application domains. For checklists in industrial statistics see One might want to develop such a checklist in epidemiology.

                So, as you write “Basic issues remain”. Would be very interested in getting the inputs of Andrew to all this.

              • Martha (Smith) says:

                Ron Kenett said,
                “My observation is that applying SEIR at the country level, in Italy or Israel, is not very useful to policy makers and possibly much misleading.”

                Makes sense to me.

              • Sander Greenland says:

                Ron: You seem to want to go far off the present post’s topic (getting causal foundations integrated into basic probability and stat education) into sophisticated issues of modeling for covid-19 research. Given the distance of your questions from basics of education and data collection (at least, I haven’t seen you draw a clear connection), I suggest you might ask Andrew if you can open up a new post/page for your topic.

            • Hi Ron,

              For a paper using SCM to address Covid challenges, see Victor Chernozhukov, Hiroyuki Kasahara, Paul Schrimpf “Causal Impact of Masks, Policies, Behavior on Early Covid-19 Pandemic in the U.S.” .

              I’m not associated in any way with this research.

  6. Bill Harris says:

    Causality involves time. When I hear you describing what you all seek, I hear something that could be aligned with (Jay Forrester’s) field of “system dynamics”: the solving of problems by means of the creation of a generative simulation model using systems of often nonlinear ODEs operating over time. PBPK models and epidemiological models such as an SIR model fit in the category of a system dynamics model. John Sterman’s /Business Dynamics/ is one current and somewhat encyclopedic text in the field.

    In doing system dynamics modeling, there is emphasis on model testing that includes both validating whether the model represents the causal structure of the “problem” being addressed as well as how the statistics of the model hold up. Admittedly, many treat such models deterministically, for often the underlying problem that’s being addressed has a strong enough deterministic component that one can learn how to change the model and then the real-world system to eliminate the problem without worrying about its stochastic aspects.

    In other cases, the statistics are vital: one may want to determine model parameters based on data, or one may want to optimize the performance of the model to achieve a certain goal which one hopes one can achieve in the real world.

    If you’re interested in thinking about making causality explicit in generative models, it might be worth reading at least section 2.5 and chapter 3 of /Business Dynamics/ on the modeling process. It might also be worth reading chapter 21 on model testing to see what system dynamics calls for.

    Whether one does system dynamics in the Vensim, STELLA, Powersim, or another classic system dynamics simulator or by simulating ODEs in Stan or MCSim, I think there’s opportunity for system dynamicists and statisticians to learn from each other.

  7. Ron Kenett says:

    Keith – the reflection of bootstrapping and crossvalidation approaches to the structure of the problem at hand is indeed unexplored territory. Gelman addresses it in when he talks about hierarchical data. My paper with colleagues addresses this in the context of designed experiments with replicates

    The classical fallacy is in the application of crossvalidation in fitting a neural network to data from a designed experiment. The models you fit are all over the place. A better approach is the Bayesian bootstrap suggested by Don Rubin and recently investigated as a fractionally weighted bootstrap by Gotwalt and Meeker

    • Keith O’Rourke says:

      Thanks (always liked the Bayesian bootstrap). By the way, when Rob Tibshirani presented the bootstrap in the lab course when he was a post doc, I asked if it was not just the method of moments and something to be wary of. Of course it is, but as Peter Hall explained to me years later, not one of the “valid” ways to express it that people like. (Only?) With great expertise, it can work responsibly.

      ( By the way, I once had to review a neural network that predicted outcomes based on < 100 noisy observations :-( )

  8. Hey Keith,

    You might be interested in this paper by Max Little and Reham Bedawy (2020). From their introduction:

    We augment the classical bootstrap resampling method [Efron and Tibshirani, 1994] with information from the causal diagram generating the observational data. This leads to a simple weighted bootstrap which can be used to generate new data faithful to an interventional distribution of interest. Any standard, complex nonlinear machine learning predictor can then be applied to the new data to construct interventional predictors, rather than associational predictors. This method is applicable to most interventional distributions which can be derived from observational causal models using the rules of do-calculus, according to the general identification algorithm of Shpitser and Pearl [2008].

    We develop several bootstrap algorithms for common causal inference scenarios including general back-door and front-door deconfounding, tailored to supervised classification or regression machine learning methods. We demonstrate the effectiveness of this technique for synthetic data and real-world, practical causal inference problems.

  9. Austin says:

    I know that one-way causation can be simulated by simulating the causal terms first. But what if two items each have some influence on each other, in some feedback loop? Is this the sort of thing where you do multiple rounds of simulation and hope that it converges?
