
My reply: Three words. Fake. Data. Simulation.

Kash Ramli writes:

I am planning on running an experiment to determine whether an adaptive treatment approach to behaviour change interventions could be effective at reducing the heterogeneous treatment effects currently observed in the field.

The context of the experiment is providing households with social-norms-based feedback on their consumption (i.e. comparing your consumption with that of your average neighbour). We know this intervention works (~2% reduction in consumption) but the effect is inconsistently heterogeneous when looking at baseline consumption as a moderator.

I am therefore planning on running an experiment for which I have created 3 versions of an intervention, each designed, in theory, to motivate people based on their level of consumption (relative to others). But rather than simply testing the three versions, I would like to use an adaptive method, known as a Sequential Multiple Assignment Randomised Trial (SMART). Basically, every 2 months, each household gets assessed, and if they haven’t reduced their consumption by a set amount, they get re-randomised into a new treatment, while those that have reached the set amount stick with the same treatment.

My question:
I’ve done calculations to determine the required sample size for the minimum detectable effect (MDE) of the initial split between treatment and control. But I am not sure how to ensure that the subsequent sequences/re-randomisations will be sufficiently powered. In theory, I will have a very strong prior on the effectiveness of treatment 1. So could I somehow use this?

I’ve tried looking through the literature on SMART designs, but they usually talk about it being exploratory and therefore not in need of proper sample size calculations.

My reply: Three words. Fake. Data. Simulation.

Saying “fake data simulation” doesn’t give you the answer—you still have to figure out what to simulate—but that’s kinda the point. To figure out what to simulate you need to construct some model of the process, and that’s necessary for the design analysis.
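Here is a minimal sketch in Python of what such a simulation could look like for the second-stage question. Nearly everything in it is an assumption made up for illustration, not something from Kash's design: three baseline-consumption tiers, a version that "matches" each tier, a 4% reduction when the version matches and 1% when it doesn't (which averages to the roughly 2% quoted above), a noise standard deviation of 10%, a 2% response threshold, no control arm, and no household-level persistence between periods. The function names `simulate_once` and `power` are hypothetical too. The contrast is, among non-responders who started on a mismatched version, the period-2 change for those re-randomised to their matched version versus those re-randomised to the other mismatched version.

```python
import numpy as np

def simulate_once(n, rng, matched=-0.04, mismatched=-0.01, sd=0.10, threshold=-0.02):
    """Simulate one two-stage SMART; return (estimate, standard error) of the stage-2 contrast.

    All default values are illustrative assumptions, not estimates from real data.
    """
    tier = rng.integers(0, 3, size=n)   # baseline consumption tier: 0=low, 1=mid, 2=high
    v1 = rng.integers(0, 3, size=n)     # stage-1 version, assigned at random
    # Proportional change in consumption over the first 2 months
    y1 = rng.normal(np.where(v1 == tier, matched, mismatched), sd)
    nonresp = y1 > threshold            # did not reduce by the set amount
    # Non-responders are re-randomised to one of the two other versions;
    # responders keep their stage-1 version.
    shift = rng.integers(1, 3, size=n)  # 1 or 2
    v2 = np.where(nonresp, (v1 + shift) % 3, v1)
    y2 = rng.normal(np.where(v2 == tier, matched, mismatched), sd)
    # Randomised comparison: non-responders who started mismatched, split by
    # whether re-randomisation happened to give them their matched version.
    eligible = nonresp & (v1 != tier)
    a = y2[eligible & (v2 == tier)]
    b = y2[eligible & (v2 != tier)]
    est = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return est, se

def power(n, n_sims=1000, seed=0, **kwargs):
    """Share of simulated experiments in which the 95% interval excludes zero."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        est, se = simulate_once(n, rng, **kwargs)
        hits += abs(est) > 1.96 * se
    return hits / n_sims

if __name__ == "__main__":
    for n in (300, 600, 1200, 3000):
        print(n, power(n))
```

The useful part is less the number that comes out than the list of things you had to decide in order to write `simulate_once`. A strong prior on treatment 1 would enter here as the assumed effect sizes, and you could rerun `power` over a range of plausible values rather than a single guess.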


  1. gec says:

    Maybe I’m just being pedantic, but couldn’t the reply be just a single word, “simulation”?

    It seems to me that the production of “fake data” is inherent in what simulation is, so my real question is what rhetorical purpose the two words “fake data” serve.

    On the one hand, I think saying “fake data” is good because it emphasizes that the goal of this particular simulation is to produce data that looks like the type of data you expect to address rather than just any old data. I also like that it is not jargony.

    But on the other hand, I think the phrase “fake data” puts a lot of folks off, either because it sounds nefarious (some critics of Bayesian methods claim that it is inherently misleading because it relies on “fake data” in the prior) or because it sounds like it is trivializing the problem (making “fake” stuff is a lot easier than “real” stuff, right?).

    I’m not proposing any kind of solution here, but my overall point is that I agree 1) that simulation is a fundamental aspect of modeling; and 2) people don’t do it nearly enough. I guess I just don’t know how to “market” simulation effectively so more people embrace it as part of the modeling process.

    • Dzhaughn says:

      +1 that the name is problematic. The expression also leaves one wondering whether the data or the simulation is fake.

    • Andrew says:


      One reason I don’t just say “simulation” is that we use simulation to get inference from the posterior distribution. So if I just say, “Do simulation,” it could be interpreted as to simulate from the posterior distribution.

    • Another +1 for gec’s comment. Simulating data is immensely valuable, and I’m amazed that it’s not more common, and routinely taught. I also agree, though, that calling it “fake data” doesn’t help convince people. Like it or not the connotation of “fake” is important. “Simulated data” is far better.

      • Andrew says:


        I dunno. I like “fake” because I like to emphasize that the data are fake. To me, that’s a big deal! I agree that my fake data aren’t intended to fool anybody, though. I guess that’s your point: “fake” means constructed and not real, but “fake” can also imply deceptive, and I don’t want to imply that.

        To put it another way, I want my fake data to be openly fake. I don’t want to do a Michael LaCour or Brian Wansink.

        • Ben says:

          I had a collaborator get bugged by fake as well. We switched to saying something like simulated. We always end up defining the whole process anyway so I’m not too concerned with confusion.

          I remember a presentation once where someone was doing computational stuff. An old experimental person in the audience asked a question beginning with “We all know the value of virtual experiments…”

          At the time I just thought it was a funny/retro way of saying computation, but it seems like another way of getting at this without saying fake.

        • When teaching students about using simulation studies to investigate behaviour of models and estimation methods, I tend to use “synthetic” as a more neutral alternative to “fake”.

          • Martha (Smith) says:

            I wonder if perhaps “simulated plausible data” might avoid the negative connotations of “fake”, while saying that the data are not “real”.

            • Martha and Andrew:

              Doesn’t “simulated” already imply “not real, but plausible,” without the negative connotations of “fake” and without the need for “plausible?” If I say “here’s a simulated brain,” no one expects me to hand them a real brain. If I say, “here’s a simulated image of bacteria,” it’s obvious that it’s not an image of bacteria that was actually captured by a microscope and camera.

              • Andrew says:


                Yes, but “Simulated-data simulation” doesn’t sound right!

                Maybe “Simulated-data experimentation” or “Simulated-data exploration” would be a better term?

              • Martha (Smith) says:

                Raghu said,
                “Doesn’t “simulated” already imply “not real, but plausible,” without the negative connotations of “fake” and without the need for “plausible?””

                I think this applies to most people reading this blog, but I think that there are many “users” of statistics to whom it doesn’t apply.

        • Andrew says:

          Ben, Finn:

          Yes, “fake data” is the computational version of the physicists’ term, “thought experiment.”

  2. rm bloom says:

    Are you distinguishing it from (if there is — in your view — such a category as) “real data simulation” ?

    I am confused.

    [1] A controlled experiment is an exercise in “real data elicitation”.
    [2] A thought-experiment is an “elicitation of the logical consequences of one’s assumptions”
    [3] A simulation is an elicitation of the numerical consequences of one’s assumptions, usually in combination with some particular set (or reduced description thereof — parameterization) of empirical data.
    [4] Real data simulation is : ???

  3. I think the distinction is that we are simulating inputs to inference (data) rather than outputs of inference (predictions).

  4. Of the many terms, I agree with Andrew, that “fake” works well – fake data simulation from a fake world seems to shock people to the most appropriate degree to keep the abstract model (that is just an attempted representation of reality) separate from the reality we have to deal with.
