Skip to content

“To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.”

Alex Konkel writes on a topic that never goes out of style:

I’m working on a data analysis plan and am hoping you might help clarify something you wrote regarding missing data. I’m somewhat familiar with multiple imputation and some of the available methods, and I’m also becoming more familiar with Bayesian modeling like in Stan. In my plan, I started writing that we could use multiple imputation or Bayesian modeling, but then I realized that they might be the same. I checked your book with Jennifer Hill, and the multiple imputation chapter says that if outcome data are missing (which is the only thing we’re worried about) and you use a Bayesian model, “it is trivial to use this model to, in effect, impute missing values at each iteration”. Should I read anything into that “in effect”? Or is the Bayesian model identical to imputation?

A second, trickier question: I expect that most of our missing data, if we have any, can be considered missing at random (e.g., the data collection software randomly fails). One of our measures, though, is the participant rating their workload at certain times during task performance. If the participant misses the prompt and fails to respond, it could be that they just missed the prompt and nothing unusual was happening, and so I would consider that data to be missing at random. But participants could also fail to respond because the task is keeping them very busy, in which case the data are not missing at random but in fact reflect a high workload (and likely higher than other times when they do respond). Do you have any suggestions on how to handle this situation?

My reply:

to paragraph 1: Yes, what we’re saying is that if you just exclude the missing cases and fit your model, this is equivalent to the usual statistical inference, assuming missing at random. And then if you want inference for those missing values, you can impute them conditional on the fitted model. If the outcome data are missing not at random, though, then more modeling must be done.

to paragraph 2: Here you’d want to model this. You’d have a model, something like logistic regression, where probability of missingness depends on workload. This could be fit in a Stan model. It could be that strong priors would be needed; you can run into trouble trying to fit such models from data alone, as such fits can depend uncomfortably on distributional assumptions.

Konkel continues:

For the Bayesian/imputation part: Is there a difference between what you describe, which I think would be what’s described in 11.1 of the Stan manual, and trying to combine the missing and non-missing data into a single vector, more like 11.3? We wouldn’t be interested in the value of the missing data, per se, so much as maximizing our number of observations in order to examine differences across experimental conditions.

For the workload part: I understand what you’re saying in principle, but I’m having a little trouble imagining the implementation. I would model the probability of missingness depending on workload, but the data that’s missing is the workload! We’re running an experiment, so we don’t have a lot of other predictors with which to model/stand in for workload. Essentially we just have the experimental condition, although maybe I could jury-rig more predictors out of the details of the experiment in the moment when the workload response is supposed to occur…

My reply:

Those sections of the Stan model are all about how to fold in missing data into larger data structures. We’re working on allowing missing data types in Stan so that this hashing is done automatically—but until that’s done, yes, you’ll need to do this sort of trick to model missing data.

But in my point 1 above, if your missingness is only in the outcome variable in your regression, and you assume missing at random, then you can simply exclude the missing cases entirely from your analysis, and then you can impute them later in the generated quantities block. Then it’s there in the generated quantities block that you’ll combine the observed and imputed data.

For the workload model: Yes, you’re modeling the probability of missingness given something that’s missing in some cases. That’s why you’d need a model with informative priors. From a Bayesian standpoint, the model is fit all at once, so it can all be done.

To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.


  1. Mike Maltz says:

    I’m a little skeptical about modeling missing data, in particular in cases where the data may come from a variety of sources. Specifically, I deal with crime data, which is collected by the FBI from over 17,000 agencies throughout the US. While the FBI calls its program “Uniform Crime Reports,” there is hardly uniformity in the missingness. In some cases it’s due to a computer breakdown (almost all of Kentucky, for a few years), in some cases it’s due to the FBI deciding that the data looks fishy, in some cases the police chief is pressured to reduce crime and does it the old-fashioned way (this happened in a bunch of cities over the years), in some cases because the officer in charge of submitting crime reports to the FBI retires and the newbie takes a while to learn the ropes. Etc. etc.

    And a friend of mine who works for the Federal Reserve says that there are similar stories about missingness in bank reports – who’da thunk? So I’m a bit leery when I hear talk about modeling missing data, especially when the data comes from an organization that may be evaluated using the data. Astronomical observations, not a problem – political science, un uh. AmIrite, Andrew?

    • You don’t want to use a missing-at-random assumption when the data is not missing at random. As Andrew qualified above, “and you assume missing at random”. If your data isn’t missing at random, you need to model the missingness. This is very well described in Andrew’s books.

  2. The manual chapter is more about missing predictors than missing outcomes. But the idea’s roughly the same—you need a model for the missingness, even if it’s just “missing at random”. For outcomes, you already have a model, but in may regression problems where predictors are missing, there’s no model built into the regression for the predctors ($latex x$ in Andrew’s notational conventions).

    Modeling missing data and fitting it jointly will not provide the same result as multiple imputation. Multiple imputation is an approximation that pipes the output of the imputation into the model, ratehr than modeling the missing data and everything else jointly.

    If the outcomes are missing at random, what Andrew’s suggesting is factoring the posterior with known outcomes $latex y$ and missing outcomes $latex y’$, with parameters $latex \theta$ as

    $latex p(y’, \theta \mid y) = p(\theta \mid y) \cdot p(y’ \mid \theta).$

    Then in Stan, you can code the $latex p(y’ \mid \theta)$ component in the generated quantities block very efficiently.

  3. Alex says:

    Thanks Andrew! This is all essentially what we ended up doing.

Leave a Reply