It is probably more than that: the case-control invariance mentioned by Thomas holds only for the logit model, not probit and not normal-normal, whereas changing the distribution does not change the data collection or the variables included. So there must be some other principle at work. Indeed, a case-control logit model does not model the selection explicitly at all.

At a high level, in many problems we encounter (y,x) with data selection on either y or x. Most Bayesian models address the conditional model y|x directly (discriminative model), as long as the likelihood is invariant to the sampling and observation process. The weighted approach can be viewed as a special case where we model p(x,y) jointly (generative model), in which p(x) is modeled through inverse probability weighting.

The naive weighting approach can be horrible. Nevertheless, there is no universal answer to whether the discriminative or the generative approach is better for all problems; otherwise the alternative would not have been invented in the first place. For some problems such as measurement error, modeling X is necessary and inevitable. For some other problems such as the case-control logit model, the conditional model is as good as the joint model even without taking the selection into account. The bottom line is that there should at least be a robustness-efficiency tradeoff between these two approaches.

]]>Thank you for the explanation!

]]>I couldn’t understand Lumley’s blog post, which Lauren pointed me to. I then couldn’t understand Lauren’s attempts to clarify. So I brought it up during the Stan meeting. But I couldn’t follow Andrew’s explanation there, either. I’m afraid the language barrier was high enough that I had to get Andrew to write down a concrete example, then work through the two modeling approaches. I find Ben’s explanation easier to follow because he kindly paraphrased “endogenous” and “exogenous” and talked about the core problem.

]]>The way all this works is buried deep in the Stan reference manual. What happens is that a sampling statement using ~ will drop all constant terms in the log density. In this case, as Peter notes, that means the whole thing gets dropped and it has no effect. I was just writing out the full model for symmetry, but now see how that’s confusing. Alternatively if you write

target += bernoulli_logit_lpmf(1 | y_obs);

Then you get a constant contribution to the log density. So you’ll see different values for `lp__` (the unnormalized target density defined by the Stan program), but the sampling will be the same because they’re constant. We’re going to fix this going forward in the Stan 3 parser so that we’ll be able to control whether or not to include constant terms in both sampling and log probability density/mass functions.
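To see why the constant cannot matter, here is a minimal sketch in Python rather than Stan. It is not how Stan samples; it uses a plain random-walk Metropolis step, which only ever looks at differences of the log density, and a made-up standard normal target. `Y_OBS`, `lp_dropped` and `lp_kept` are names invented for the illustration. The added term is log(inv_logit(y_obs)), which is what `bernoulli_logit_lpmf(1 | y_obs)` evaluates to when `y_obs` is data.

```python
import math
import random

def metropolis(log_density, n_draws, seed, step=1.0, init=0.0):
    """Random-walk Metropolis; only differences of log_density are used."""
    rng = random.Random(seed)
    theta = init
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_density(proposal) - log_density(theta)
        # Accept with probability min(1, exp(log_ratio)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        draws.append(theta)
    return draws

Y_OBS = 0.7  # hypothetical observed value: data, not a parameter

def log_inv_logit(x):
    return -math.log1p(math.exp(-x))

def lp_dropped(theta):
    # Toy target: standard normal log density, constants dropped.
    return -0.5 * theta * theta

def lp_kept(theta):
    # Same target plus a term that does not depend on theta.
    return lp_dropped(theta) + log_inv_logit(Y_OBS)

draws_dropped = metropolis(lp_dropped, 1000, seed=1)
draws_kept = metropolis(lp_kept, 1000, seed=1)

# The reported log density differs by a constant, the draws do not.
shift = lp_kept(0.3) - lp_dropped(0.3)
max_gap = max(abs(a - b) for a, b in zip(draws_dropped, draws_kept))
print(shift, max_gap)
```

With the same seed the two chains make the same accept/reject decisions, so only the reported log density shifts.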

You’re right the model’s unrealistic. It’s just an illustration of how hard this is even if you know the missingness pattern perfectly. As others pointed out, normally you have to estimate this.

]]>Yes, way too strong. I didn’t mean this to be a realistic example, just one that illustrates the problem when inclusion weights relate to outcomes. As a computer scientist, I like to build from very simple examples, but I’ve seen that statisticians like to introduce problems with more realistic examples.

In a realistic setting, we’d have to estimate the effect of incentives on inclusion probability.

]]>> But he wanted to see it in a simulation …

Hmm having been scolded as a student for stooping to simulation to check my work or get a better grasp of something, I am worried some might read this as a similarly disparaging remark.

Simulation is an experiment on probability models to learn “anything” you want about them, and if it is designed and analysed well, no one should be criticized for making use of it.

In fact, I would suggest anyone (no accusations here) who would refrain from such stooping is possibly being overly certain of their math skills, putting elegance above function (avoiding questions for which they cannot get analytical results) and more concerned about seeming to be right than actually being right. For instance, it likely is still true, as Rob Tibshirani reported a couple of years ago, that it is impossible to publish methods in stats journals that do not have analytical results, even when extensive simulation has shown them to be superior.

OK, my rant is over ;-)

]]>Thanks. I fixed it.

]]>I meant a sample of items y[n[1]], …, y[n[M]]. But otherwise, I just meant that you automatically get calibration with Bayes if the model’s well specified.

]]>That wasn’t Andrew writing the post, it was me. As I was trying to extract a simple workable example from them, Lauren and Andrew kept emphasizing that everything’s unknown and estimated, usually from misspecified models. The hard lesson for me was realizing that simple inverse probability weighting, even when normalized to the number of actual items, doesn’t get the right posterior uncertainty.

]]>There’s a practical problem in official surveys in that the design variables are often variables about a geographic area, and these variables may not be available to the analyst. For example, in NHANES, the public-use data sets include information about race/ethnicity for sampled individuals, but the sampling also involves proportion of minority individuals in small areas (?census blocks — I don’t remember exactly) and these are not part of the data set.

If you’re analysing your own survey, you always have access to the actual design variables, but if you’re analysing someone else’s survey you may not.

]]>That’s true, and important, but for me the fun point about the case-control example is that sampling *is not* ignorable, but that you still get unweighted logistic regression as the likelihood or Bayesian analysis because all the bias ends up in the intercept. Since sampling is really not ignorable, all the rules of thumb for relative efficiency of weighted analysis under ignorable sampling are wrong.

]]>Thanks for these pointers.

Just by happenstance I was earlier – before reading this blog post – reading an article which you had co-authored and discussed those dog-shocking experiments… that somehow I think were related to that/a paper by Mosteller. At least there was an earlier post in your blog. Ah, tiredness makes me confused!

]]>Ben:

Yes, what you say in your first paragraph is exactly what I told Bob at the meeting! But he wanted to see it in a simulation . . .

]]>Daniel:

Or, to put it more formally, your inferences will be sensitive to (a) your model for who is included in the sample, and (b) your prior distribution for groups that are unrepresented or underrepresented in the sample.

An example where this arose was in some state polls in 2016 that did not poststratify on education.

]]>For poststratification to work you need each sub-population that you define in your sample to be represented in your sample in a representative way, or you need to know how the covariates relate to the probability to be included so you can correct the sample bias…

If thin truck drivers with beards are largely censored from your sample, and you aren’t aware of this, then you’ll conclude that overall truck drivers with beards are like your sample of truck drivers with beards, rather than the true population which has more thin ones, and your extrapolation will extend this bias.

Of course, this is true in general, if you are systematically seeing very few of some kind of thing and you aren’t aware of the fact, so you can’t try to correct for it through some model… you’ll get the wrong answer. But it’s worth mentioning because the complexity of something like this can make it look like magic.

Another way to say this is the generative model needs to be at least approximately correct, so you had better think through if there are “unknown unknowns” affecting your sample.

]]>These weighting schemes are derived from a frequentist perspective where if you think about the sampling distribution of some estimator over datasets, instead of conditioning on the outcomes in a particular dataset, the estimator can be consistent or sometimes even unbiased.

]]>EDIT: in the strength equation, “response[j]” should really be a 0/1 indicator, “response[j] == k”.

]]>In most staircasing procedures, the operative thing is not so much time per se as it is the history of prior responses. As a result, if it is assumed that each trial depends only on the stimulus presented on that trial, as many measurement models assume (like how the Binomial likelihood assumes order of events doesn’t matter), then there is no role for past responding and it doesn’t enter into the model.

I’m not necessarily advocating this, just saying that this is probably the reason why staircase designs aren’t modeled with explicit temporal components—it is assumed that there is no trial-by-trial learning/adaptation. And time itself might play a role if there are variations in arousal/vigilance over the course of the experiment, though in my experience these changes have negligible effects on psychophysical experiments.

To me, modeling effects of trial history seems like the most useful route. Two forms of history effects seem particularly important: first, response adaptation (a participant is more likely to make responses they have made before), and second, stimulus adaptation (depending on the design, a participant may “shrink” their perception of a stimulus on one trial toward or away from their perception of previous stimuli).

If we keep things simple and assume that these history effects have exponential forms, we have four history parameters to estimate: a rate of decay for response attraction (alpha_r) representing bias *toward* past responses; a rate of decay for response repulsion (beta_r) representing bias *away* from past responses; a rate of decay for stimulus attraction (alpha_s) representing a tendency to blend toward previous stimuli; and a rate of decay for stimulus repulsion (beta_s) representing a tendency to differentiate away from previous stimuli. We also assume that stimuli can be represented along a single underlying continuum “mu”. We associate each possible response category k on trial i with a “strength” A_{ik}:

A_{ik} = mu[Stimulus on trial i] + sum_{j=1}^{i - 1} ((exp(-alpha_r * (i - j)) - exp(-beta_r * (i - j))) * response[j] + (exp(-alpha_s * (i - j)) - exp(-beta_s * (i - j))) * stimulus[j])

Then we use softmax/Luce’s choice to convert these strengths to probabilities, e.g., Pr(response k on trial i) = exp(A_{ik}) / sum_{l = 1}^{Num. responses} exp(A_{il})

To be clear, the exponential is an assumption for convenience, though it is not an unreasonable model for drift/adaptation effects. I also have made no attempt to determine if this is identifiable at all. But it seems to me a reasonable starting point for modeling sequential effects in psychophysical experiments.
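A minimal Python sketch of the strength equation and the softmax step, for anyone who wants to play with it. It makes two assumptions that are not settled in the comment: the response term uses the 0/1 indicator “response[j] == k”, and “stimulus[j]” is read as the position mu of the stimulus shown on trial j. The function names, the 0-based trial index and the parameter values are made up for the illustration.

```python
import math

def strengths(i, mu, stimuli, responses, n_resp,
              alpha_r, beta_r, alpha_s, beta_s):
    """Strength A_{ik} on trial i (0-based) for each response category k.

    mu        : positions on the stimulus continuum, indexed by stimulus id
    stimuli   : stimulus id shown on each trial
    responses : response category given on each earlier trial
    """
    out = []
    for k in range(n_resp):
        total = mu[stimuli[i]]
        for j in range(i):
            lag = i - j
            resp_kernel = math.exp(-alpha_r * lag) - math.exp(-beta_r * lag)
            stim_kernel = math.exp(-alpha_s * lag) - math.exp(-beta_s * lag)
            # Assumption: indicator of "response on trial j was k".
            total += resp_kernel * (1.0 if responses[j] == k else 0.0)
            # Assumption: stimulus[j] means mu of the stimulus on trial j.
            total += stim_kernel * mu[stimuli[j]]
        out.append(total)
    return out

def softmax(a):
    """Luce's choice rule on exponentiated strengths, computed stably."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

mu = [0.0, 1.0]        # hypothetical stimulus positions
stimuli = [0, 1, 1]    # stimulus shown on trials 0, 1, 2
responses = [1, 2]     # responses given on trials 0 and 1
params = dict(alpha_r=0.5, beta_r=2.0, alpha_s=0.5, beta_s=2.0)

probs_first = softmax(strengths(0, mu, stimuli, responses, 3, **params))
probs_third = softmax(strengths(2, mu, stimuli, responses, 3, **params))
print(probs_first, probs_third)
```

One thing the sketch makes visible: as written, the mu term and the stimulus-history term do not depend on k, so they shift every strength equally and cancel in the softmax. Only the response-history term moves the probabilities, which is relevant to the identifiability caveat.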

]]>Thanks Andrew, that’s helpful context. This exercise did help me better understand simulation-based calibration and how missingness can have such a strong impact on standard errors.

]]>Peter:

In some way, the reason this model is so awkward to build is that it’s so artificial. In a typical survey, you’ll model the probability of inclusion in the sample as depending on some background variables of the respondents such as sex, ethnicity, age, and education. You can then fit a regression model for the outcome of interest, including sex, ethnicity, age, and education as predictors and get inferences for the general population using poststratification, and no survey weights are necessary. Bob’s model (or the model that Yajuan, Natesh, and I fit here) is complicated because it is contextless, with abstract “weights” that appear out of nowhere. This sort of exercise can be useful to help us understand why it is typically a good idea to model survey inclusion as a function of demographic variables of interest.
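The poststratification step described above can be sketched in a few lines of Python. This is a toy version under stated assumptions: a single background variable with two cells, raw cell means standing in for a fitted regression, and population cell counts that are known exactly. The cell names, counts, outcome rates and inclusion probabilities are all invented for the illustration.

```python
import random

rng = random.Random(0)

# Hypothetical cells: (population count, true outcome rate, inclusion prob.)
cells = {
    "no_degree": (60_000, 0.40, 0.01),
    "degree": (40_000, 0.70, 0.05),
}

# Inclusion depends only on the cell, so degree holders are overrepresented.
sample = {name: [] for name in cells}
for name, (count, p_yes, p_incl) in cells.items():
    for _ in range(count):
        if rng.random() < p_incl:
            sample[name].append(1 if rng.random() < p_yes else 0)

n_pop = sum(count for count, _, _ in cells.values())
true_mean = sum(count * p_yes for count, p_yes, _ in cells.values()) / n_pop

# Raw sample mean ignores who ended up in the sample.
n_sample = sum(len(v) for v in sample.values())
raw_mean = sum(sum(v) for v in sample.values()) / n_sample

# Poststratified estimate: estimate each cell, then weight by population size.
post_mean = sum(
    cells[name][0] * (sum(v) / len(v)) for name, v in sample.items()
) / n_pop

print(true_mean, raw_mean, post_mean)
```

No survey weights appear anywhere: the correction comes entirely from conditioning on the variable that drives inclusion and then reweighting the cell estimates by the known population counts.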

]]>1 ~ bernoulli_logit(y_obs);

All its components are observed or stipulated, so I don’t understand what Stan is doing with it probabilistically under the hood, and when I run the code without that line I get the same results. Apologies if my question is very ignorant.

As an aside, as a more applied researcher, this exercise made me feel pretty hopeless. To get this model “right” you had to know an incredibly unrealistic amount of information about the function generating missing responses. I’m left with more sympathy for those who don’t even try to build the generative model.

]]>Tanja:

To start with, I’d try a linear or logistic regression with time as a predictor. Another option is a sequential learning model such as the one in Bush and Mosteller (1954).

]]>0 ~ bernoulli(y_miss[n] | inv_logit(y_miss[n]))

should be

0 ~ bernoulli(inv_logit(y_miss[n])) ]]>

How would one go about doing this? Where to start when trying to grasp the idea behind this in simple toy examples?

In psychophysics it’s customary to use sequential designs in which stimulus selections depend on observed responses. For example if the participant correctly identifies the location of a signal twice in a row, signal level is decreased; or in more modern approaches signal level would depend on what would minimize the expected entropy of the posterior distribution.

I don’t think I’ve ever seen anyone taking sequentiality into account when analyzing data from such designs. What would be the starting point for adding “time” to designs like these?

]]>Thomas:

This is related to the principle that your data collection model is ignorable if inclusion depends only on variables included in the model, and it has applications: for example, if you have a sequential data collection rule, you should include time in your model. Many Bayesians seem to miss this point and discuss sequential data collection and other selection rules too glibly.

]]>The model is Y_i as Bernoulli(p_i) with logit(p)=Xb, and you sample everyone with Y=1 (‘cases’) and a small fraction f of the people with Y=0 (‘controls’). The weighted analysis uses weights 1/f for the controls.

A likelihood-based or full Bayesian analysis would apparently have to model the distribution of X in the unmeasured controls. What makes it interesting is that the maximum likelihood estimator turns out to be *unweighted* logistic regression with an offset to correct the log(f) bias in the intercept. The maximum likelihood estimator is the same as ignoring the sampling and not trying to model the distribution of the unobserved X. There is a series of papers in Biometrika showing that Bayesian versions of this also work (eg, https://www.jstor.org/stable/29777153)

It’s a useful check for principled approaches to deriving weighted estimation to see what they do with case-control logistic regression.
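A small simulation of the case-control setup, as a sketch rather than anything from the papers cited: one covariate, a hand-rolled Newton fit for logistic regression using only the standard library, and made-up values for the intercept, slope and control fraction f. It fits both the unweighted model and the weighted one with weights 1/f on the controls.

```python
import math
import random

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, weights=None, n_iter=25):
    """Newton's method for logit(p) = a + b*x, with optional case weights."""
    if weights is None:
        weights = [1.0] * len(xs)
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, wt in zip(xs, ys, weights):
            p = inv_logit(a + b * x)
            v = wt * p * (1.0 - p)
            g0 += wt * (y - p)          # score for the intercept
            g1 += wt * (y - p) * x      # score for the slope
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

rng = random.Random(3)
A_TRUE, B_TRUE, F = -4.0, 1.0, 0.03   # hypothetical population values

# Keep every case (y = 1) and a fraction F of the controls (y = 0).
xs, ys = [], []
for _ in range(100_000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < inv_logit(A_TRUE + B_TRUE * x) else 0
    if y == 1 or rng.random() < F:
        xs.append(x)
        ys.append(y)

# Unweighted fit: slope is fine, intercept is off by log(1/F).
a_unw, b_unw = fit_logistic(xs, ys)
a_corrected = a_unw + math.log(F)

# Weighted fit: weight 1/F on the sampled controls.
w = [1.0 if y == 1 else 1.0 / F for y in ys]
a_w, b_w = fit_logistic(xs, ys, w)

print(b_unw, a_unw, a_corrected, b_w, a_w)
```

In runs like this the unweighted slope should land near the population slope, and adding log(f) to the unweighted intercept should bring it back near the population intercept, which is the sense in which all the selection bias ends up in the intercept.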

]]>Interesting, now that I think about it,

As someone who is always doing Bayesian analyses, my intuitive thought when you said “unbiased” was zero average error across the posterior predictive distribution…

but of course you mean zero average error across repeated data collection.

]]>Makes sense… I guess my real point is that estimate variance can be so high that the unbiased property is pretty useless as whatever value you wind up with in your actual data set is very likely to be far from correct.

and yes to the comment about arbitrary functional forms, which is very relevant to many kinds of real-world data, where for example test scores rise a lot with a bit of education, but saturate with a lot of education or similarly for things like amount of money spent on say rent as a function of income or such like.

]]>You can’t simultaneously fix the observed number of people with income under 10k and talk about bias. Bias is a property of sampling distributions, not of single numbers. You can (reasonably) argue that being unbiased isn’t a strong enough property to be useful, but that’s different.

And obviously it won’t work for completely arbitrary functional forms.

]]>Is there a typo here, or am I missing something?

]]>Suppose there are 1M people in a population. You sample 50 and try to regress, say, test score on income. The probability of sampling people with income under 10k is nonzero but small, but the number of actual people in the sample with income under 10k is zero. The response function is nonlinear. How could you *possibly* get an unbiased estimate of the shape of this function between 0 and 10k?

]]>If the probability of being sampled is known for people in the sample and bounded away from zero for people in the population, you can get unbiased estimation (and consistent estimation, under reasonable asymptotic embeddings). You can’t if any of the probabilities are zero. A Bayesian version that gets a posterior distribution centered at the right place is easy.

Similarly, if pairwise sampling probabilities are known for all pairs in the sample and bounded away from zero for pairs in the population, you can get unbiased estimation of the variance (and consistent estimation, under reasonable asymptotic embeddings). You can’t if any of the pairwise probabilities are zero. A Bayesian version that gets the right posterior variance is non-trivial.

The big problem in many (but, pace Andrew, not all) real applications is that you don’t actually know the probabilities; they have to be estimated.
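The first claim, unbiasedness when the inclusion probabilities are known and bounded away from zero, can be checked with a repeated-sampling sketch. The assumptions here are mine: a made-up finite population, independent inclusion of each unit (Poisson sampling), and the Horvitz-Thompson form of the inverse-probability-weighted estimator. It says nothing about the harder variance question in the second paragraph.

```python
import random

rng = random.Random(42)

# Hypothetical finite population: outcome y and known inclusion probability.
population = []
for _ in range(2000):
    y = rng.gauss(50.0, 10.0)
    # Inclusion depends on the outcome but is bounded away from zero.
    pi = min(0.9, max(0.05, 0.02 * (y - 25.0)))
    population.append((y, pi))

n_pop = len(population)
true_mean = sum(y for y, _ in population) / n_pop

n_reps = 300
ht_means, naive_means = [], []
for _ in range(n_reps):
    sampled = [(y, pi) for y, pi in population if rng.random() < pi]
    # Horvitz-Thompson: weight each sampled unit by 1 / pi, divide by N.
    ht_means.append(sum(y / pi for y, pi in sampled) / n_pop)
    # Naive: unweighted mean of whoever was sampled.
    naive_means.append(sum(y for y, _ in sampled) / len(sampled))

avg_ht = sum(ht_means) / n_reps
avg_naive = sum(naive_means) / n_reps
print(true_mean, avg_ht, avg_naive)
```

The average is taken over repeated samples from a fixed population, which is the sampling-distribution sense of “unbiased” being argued about in this thread; any single replicate can still be well off.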

]]>It’s from Doug Bates’ computational formulation of mixed model loglikelihood and REML as penalised least squares problems. ]]>

Can you elaborate? I think you’re saying that you can learn the shape of a function Y vs X in an unbiased way even if you never see data in a certain region of the domain? That seems obviously false. Are you talking about a pure linear function?

]]>I mean suppose for example you have 1 million people, you poll them in random order until you get 1000 respondents, and you have a model in which their age, sex, weight, height, income, education level, and zip-code determine say a score on a brief test of problem solving…

You don’t know how many people didn’t answer… you don’t know what their age, sex, weight, height, income, education level, or zip code was… and you don’t know what their problem solving score was.

All you know is that there is probably some nonresponse probability function which is *also* a function of all the demographic variables.

One suspects this is shockingly common in the world of “big data”. Like, it describes pretty much every opt-in survey anyone’s ever done right?

The usual thing is just to model the behavior of “people who answered” and pretend it’s the same as “everyone else”

]]>