Fake data simulation: Why does it work so well?

Someone sent me a long question about a complicated social science problem involving intermediate outcomes, errors in predictors, latent class analysis, path analysis, and unobserved confounders. I got the gist of the question, but it didn’t quite seem worth chasing down all the details, which involved certain conclusions to be drawn if certain effects disappeared in the statistical analysis . . .

So I responded as follows:

I think you have to be careful about such conclusions. For one thing, a statement such as saying that effects “disappear” . . . they’re not really 0, they’re just statistically indistinguishable from 0, which is a different thing.
One way to advance on this could be to simulate fake data under different underlying models and then apply your statistical procedures and see what happens.

That’s the real point. No matter how tangled your inferential model is, you can always just step back and simulate your system from scratch. Construct several simulated datasets, apply your statistical procedure to each, and see what comes up. Then you can see what your method is doing.
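Here is a minimal sketch of that loop, to make it concrete. Everything in it is an assumption made for illustration and has nothing to do with my correspondent's actual problem: the model is a toy linear regression with one predictor, the function names (`simulate_dataset`, `fit_slope`, `fraction_disappeared`) are hypothetical, and "disappears" is taken to mean "the estimate is less than 2 standard errors from 0." It simulates many fake datasets under a known true effect, fits the regression to each, and counts how often the effect would be declared to have disappeared.

```python
import math
import random


def simulate_dataset(n, true_effect, noise_sd, rng):
    """Simulate fake data from the assumed model y = true_effect * x + noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [true_effect * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y


def fit_slope(x, y):
    """Return the least-squares slope of y on x and its standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares, used to estimate the noise variance.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def fraction_disappeared(n_sims, n, true_effect, noise_sd, seed=0):
    """Fraction of simulated datasets where the slope is within 2 se of 0."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_sims):
        x, y = simulate_dataset(n, true_effect, noise_sd, rng)
        slope, se = fit_slope(x, y)
        if abs(slope) < 2 * se:
            count += 1
    return count / n_sims


if __name__ == "__main__":
    # Hypothetical settings: three true effects, 50 observations per dataset.
    for effect in (0.0, 0.1, 0.5):
        frac = fraction_disappeared(
            n_sims=1000, n=50, true_effect=effect, noise_sd=1.0
        )
        print(f"true effect {effect}: indistinguishable from 0 "
              f"in {frac:.0%} of simulations")
```

With these made-up settings, a small but real effect should "disappear" in most of the simulated datasets, nearly as often as a true zero does. That's the earlier point about statistically indistinguishable from 0 being a different thing from 0. For a real problem you'd replace `simulate_dataset` with your own data-generating story and `fit_slope` with whatever procedure you actually plan to run.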

Why does fake-data simulation work so well?

The key point is that simulating from a model requires different information from fitting a model. Fitting a model can be hard work, or it can just require pressing a button, but in any case it can be difficult to interpret the inferences. But when you simulate fake data, you kinda have to have some sense of what’s going on. You’re starting from a position of understanding.