Also well put.

All in all, it would be great if clinicians and healthcare policy experts would follow Andrew’s blog.

]]>Hey, I actually understood your commentary. Well put.

]]>Jim:

I agree.

For example, suppose we characterize the current standard approach as:

Approach 0: Compute classical confidence intervals and then report YES THERE’S AN EFFECT if the interval clearly excludes zero and report MAYBE THERE’S A SMALL EFFECT if the endpoint of the interval is very close to zero and report THERE’S NO EFFECT if zero is well within the interval.
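That classification rule can be caricatured in a few lines (a sketch; the function name and the “close to zero” tolerance are my own, purely illustrative):

```python
def classify_effect(lower, upper, tol=0.05):
    """Caricature of the standard thresholding rule applied to an
    interval (lower, upper); tol is an arbitrary 'close to zero' cutoff."""
    if lower > tol or upper < -tol:
        return "YES THERE'S AN EFFECT"          # interval clearly excludes zero
    if lower > 0 or upper < 0:
        return "MAYBE THERE'S A SMALL EFFECT"   # excludes zero, but only barely
    return "THERE'S NO EFFECT"                  # zero is inside the interval

print(classify_effect(0.2, 0.8))    # clearly excludes zero
print(classify_effect(0.01, 0.6))   # endpoint very close to zero
print(classify_effect(-0.3, 0.4))   # zero well within the interval
```

The point of the caricature is that three qualitatively different headlines come out of one continuous input, which is exactly the “extracting certainty from uncertainty” problem discussed below.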

Now consider the following reform:

Approach 1: Use the same classification rule as above but with Bayesian posterior intervals. I think this approach would be an improvement, because it lets us include prior information. But it still has major problems.

Then we can move to:

Approach 2: Do Approach 1, but instead of looking at comparisons or estimates one at a time, look at all of them at once, if possible embedding them in a hierarchical model. I think this would be a further improvement, because it uses more information and helps us avoid selection bias relating to forking paths. But it still has the problem that it’s extracting certainty from uncertainty.

So this moves to some sort of:

Approach 3: Do good modeling and report uncertainty intervals conditional on the model, but don’t use overlap-with-zero as a way of making strong deterministic-sounding statements.

]]>I was thinking the other day: it’s great that there is a group of people with extraordinary statistical expertise who can identify problems with NHST and suggest alternatives; but if any method is going to trickle down into daily use and standard practice, it’s going to be used by a much broader group of people with substantially less statistical expertise. Under those conditions, there will always be people who just want to put guts in the machine and get sausage out, and not worry too much about what happens inside. What will the shortcomings of the alternative methods be under those circumstances?

Not defending NHST by any means. But the more widely any method is used, the more widely it will be abused. So that’s something to consider.

]]>‘Then change the design to not do interactions.’ I’m curious to see how you see that playing out in most real applied cases. Also, what’s the problem with a prior on an interaction? Is making some assumptions about the shape of a parameter of interest egregiously worse than the myriad of other assumptions made in your favorite stats? Or is that just the one you think is rhetorically easiest to pick on? Why is this assumption worse than, say, setting an utterly arbitrary alpha level? Subjectivity is all around us.

]]>Justin, it is all over the place in applied research. You routinely see papers thresholding results as follows:

1. Run a bunch of analyses on everything

2. Report which ones give p < 0.05

3. Report LS means or sample statistics for the responses passing #2.

I see a lot of sophisticated stat types getting on Andrew’s case for “strawman NHST”, but what he is describing is rampant and widespread. Besides, I have never seen a rigorous research program out in the wild cleaving to a Neyman-Pearson decision framework consistently for long enough for type I error rates to matter…

]]>> just by clinicians and researchers trying different things and publishing their qualitative findings?

If researchers had zero personal incentives to do this, then sure… But in the presence of career incentives to publish stuff… then the literature would be totally polluted with bullshit.

hey wait…

]]>I don’t agree that that is how significance testing is typically used. The wording is odd to me. If I said there is evidence from a well-designed experiment (or experiments) to suggest a coin is unfair, I am not stating that as capital-T Truth, but as evidence for unfairness at a certain alpha level, and I allow for errors and discuss any assumptions.

“A quick calculation finds that it takes 16 times the sample size to estimate an interaction as a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain the near-certainty regarding interactions”

Then change the design to not do interactions, and/or get a larger sample (may need to save up some $). That still might be preferable to using a prior to get at the interaction. And did that 16 come from replacing uncertainty with certainty? ;)

Justin

]]>I once spent ~2 weeks trying to find a definition for “housing unit” (roughly a house or apartment, maybe?). It appeared that in the USA & Canada it was a case of “I know one when I see it”.

]]>Nick:

The question is, would this benefit have occurred without RCTs, just by clinicians and researchers trying different things and publishing their qualitative findings? I have no idea (by which I really mean I have no idea, not that I’m saying that RCTs have no value).

]]>Same for social science. OMG, I fear the day when there is a precise definition of “food desert”; when we know what “quality preschool” is and what it does; when it’s known that we’ve become an “equal” society.

Thousands – nae! Tens of thousands! – would be out of work!

Save NHST! Save the economy!

]]>The amelioration of symptoms and prognosis of almost every common disease has improved since I started clinical medicine in 1987; progress built on very many RCTs, none of them perfect but together forming a tapestry of overlapping evidential strands that can be read. ]]>

Sander:

Fair enough. All such advice is context dependent.

]]>OK, we agree on the math (mere arithmetic once you plug in the numbers). But it’s a pet peeve of mine when anyone tosses out context-sensitive numbers that are unmoored from context. In my biz sometimes the interactions are bigger than the average effects, occasionally to the point of effect reversal (as with my own dissertation’s real-example data!). No surprise, as there are treatments that kill some patients and save others, especially in the Wild West of real clinical practice (which includes off-label and even contraindicated usage) as opposed to the carefully selected patients and protocols in the refined world of RCTs. In that kind of reality, saying it takes 16 times the sample size is destructive: not only is it wrong in general, but it makes it sound like there is no point in proposing to examine interactions – and if examining them is not in the protocol, then some will scream “data dredging!” when you do look at them. So yeah, I think tossing off a number like 16 (and repeatedly) is very bad nonsense, really statistical numerology (like most decontextualized “applied statistics”).

]]>Zad, what you’re talking about is ways to make your data more informative if your model is correct (that is, you’ve blocked or stratified on properly meaningful variables and at reasonable values of those variables). So you can learn more with smaller sample sizes. But within any group you’re still randomizing, and within that group the probabilistic independence from unknown confounders is still limited by sample size.

Like for example, suppose you know women are different from men, and body weight is important in a medical treatment… So you split by women and men, and you put each into three weight groups: weight 1, weight 2, and weight 3. That gives you 3 × 2 = 6 strata, and you randomize within each stratum between drug A and drug B. Now suppose you want, say, 100 people in each of the 6 × 2 = 12 treatment-by-stratum cells: you need 1,200 people, which means you need to recruit somewhat more than that because you’re demanding balance among all the groups… maybe you need to see 2,000 people, sort them, and put them into the various cells. Now suppose your medical treatment costs $5,000 and you’ve got 1,200 people: $6M to run the trial. Sure, this is doable for some people; for others it’s two orders of magnitude more money than they have.
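The arithmetic in that example, spelled out (the per-patient cost is the commenter’s hypothetical, not a real figure):

```python
sexes = 2           # women, men
weight_strata = 3   # weight 1, weight 2, weight 3
arms = 2            # drug A vs drug B
per_cell = 100      # desired people per treatment-by-stratum cell

cells = sexes * weight_strata * arms   # 12 treatment-by-stratum cells
n_enrolled = cells * per_cell          # 1200 people randomized
cost_per_patient = 5_000               # hypothetical $5,000 treatment cost
total_cost = n_enrolled * cost_per_patient

print(cells, n_enrolled, total_cost)   # 12 1200 6000000
```

Recruitment overhead (screening ~2,000 people to fill 1,200 balanced slots) comes on top of that figure.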

]]>@Daniel Lakeland: “in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to those large enough N anyway due to cost constraints etc.”

I think this would most likely be a problem with trials using simple randomization, and would largely depend on the size of the study; it would also give you large standard errors that reflect the uncertainty. Then again, most experienced trialists and statisticians avoid simple randomization for exactly this reason (potential imbalances) and focus on blocking and stratifying based on prior knowledge of potential confounding variables.

]]>Note however that randomized != controlled experimental…

We can run an experiment where for example we use some prior knowledge and decision theory to choose a treatment and then observe the outcome and model the treatment response using known confounders. You can’t eliminate all confounders using large sample sizes with this method, but you can learn a lot, and in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to large enough N anyway, due to cost constraints etc.

]]>One word: Ouija board… Two words… Use a Ouija board… four words!

]]>Sander:

I agree that 16 is not a magic number; it’s the product of assumptions. The larger the interactions are, the smaller this number will be. I don’t think that my number of 16 is “B.S.”; it’s clearly derived from its assumps.

Just one thing: In your comment, you write, “in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial.” I agree. But that’s what I say too! The factor of 16 in sample size comes from two factors: the factor of 4 in sample size arising from the factor of 2 in SE that you mention, and my assumption that interactions are half the size of main effects. If you’re disagreeing with me on the factor of 16, it’s because you’re saying that your interactions of interest are more than half the size of main effects. It’s hard to know about this, but I agree that the number we get will depend on this assumption.
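The factor of 16 follows mechanically from those two ingredients, since the sample size needed to hit a given precision scales like (per-observation SE / effect size)². A sketch (the variable names are mine):

```python
# n required for a given precision scales as (unit-SE / effect size)^2.
se_ratio = 2.0      # interaction-contrast SE is 2x the main-effect SE
                    # (agreed by both commenters); alone this gives 4x the n
effect_ratio = 0.5  # Gelman's assumption: interactions are half the size
                    # of main effects

n_multiplier = (se_ratio / effect_ratio) ** 2
print(n_multiplier)             # 16.0

# Drop the effect-size assumption (effect_ratio = 1) and you recover the
# factor of 4 that Greenland describes:
print((se_ratio / 1.0) ** 2)    # 4.0
```

So the disagreement reduces to the plausibility of `effect_ratio` in a given application, not to the arithmetic.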

]]>Dale:

We seem to be learning via Mendelian Randomization that there are few meaningful subgroup effects in medicine (very few piranhas swim in biological systems).

See – Professor George Davey Smith – Some constraints on the scope and potential of personalised medicine https://www.youtube.com/watch?v=uiCd9m6tmt0&t=2467s

]]>Consider that in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial, meaning that only 4 times the sample size would be needed to get the interaction SE down to what the main effect SE was. For tests, I published some not-so-quick calculations long ago for binary-data settings of interest in my applications, which gave most sizes much less than 16 times those for main effects (Greenland S (1983). Tests for interaction in epidemiologic studies: a review and a study of power. Statistics in Medicine 2:243-251), similar to what others got in the same type of setting.
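The doubling of the SE in that 2×2 case is simple variance arithmetic (a sketch, assuming a balanced design with n per cell and equal within-cell variance σ²):

```python
import math

sigma2 = 1.0   # assumed common within-cell variance (arbitrary units)
n = 50         # people per cell, so 4n in the trial

# Main-effect contrast: mean of the two A+ cells minus mean of the two
# A- cells; each half averages 2n people.
var_main = sigma2 / (2 * n) + sigma2 / (2 * n)

# Interaction contrast: y11 - y10 - y01 + y00, built from four cell
# means of n people each.
var_inter = 4 * sigma2 / n

se_ratio = math.sqrt(var_inter / var_main)
print(se_ratio)   # 2.0 -> 4x the sample size for the same SE
```

The ratio is 2 regardless of σ² and n, which is why the 4× sample-size figure is exact for this design.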

]]>This week’s New England Journal of Medicine has an article on observational studies vs RCTs, emphasizing the relative value of the latter and the weaknesses of the former. It has some good recommendations on how RCTs can be made easier and less expensive to conduct. However, I think it paints an overly stark distinction between the types of studies. RCTs usually depend on two questionable assumptions. The first is that intention to treat, rather than actual treatment received, is the relevant randomization factor. The other is that the randomized groups are sufficiently large to reduce the sampling variability enough to be meaningful. For the latter, they do compare the randomized groups to check that they look similar (or have confounders which could be modeled), but given the number of omitted variables we can never be sure that the randomized groups are sufficiently similar. Large enough sample sizes can offset this, but RCTs are expensive and often do not have very large sample sizes. At the same time, as the amount of observational data increases (both in observations and in number of features), the performance of observational studies can get better.

I would not propose that observational studies are preferred to RCTs, but I do see these are on a continuum rather than stark alternatives. Both types of studies have practical limitations which make them more similar than the NEJM article suggests. I often (too often these days) find myself looking for evidence on a medical condition or treatment, only to find that there are no reasonably close RCTs (especially given Andrew’s point about the need to see the effects on particular subgroups rather than looking for average effects), and that the observational data I would like to see is simply unavailable (although, in theory, much more observational data could be made available, were it not for the insane private insurance model we use in the US, with little standardization or sharing of data).

]]>We already rely on RCTs when possible. The problem is it is often not possible. For example, investigating long-term effects of diet on chronic disease by RCTs is infeasible (due to compliance, for one thing), so we have to rely on observational studies.

]]>“suspension of belief ”

Do you perhaps mean “suspension of disbelief”?

]]>“For example telephone surveys during the 2016 election cycle should have had a nonresponse bias built in to their model… “we might be consistently seeing bias on the order of 5% in either direction” was a fairly safe bet given how polls work… But acknowledging it would have made poll output worthless, and so to make money pollsters ignored this issue.

I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.

Your grandmother could have told you that for free.”

Even if she died long before 2016? Even if she died long before I was born? By some miracle or prescience? ;~)

]]>> Gosset-Fisher debate

Also taken up as the question of whether blinding via randomization is more important than ensuring that imbalances in important confounders rarely occur.

However, randomization is the only known cure for ignorance, with the main side effect being loss of precision.

Its value will depend on the subject matter, but in medicine, Mendelian Randomization is making it clearer that for treatment/exposure comparisons it’s extremely important.

]]>To me randomization is a gadget that we can use to asymptotically eliminate correlations between stuff you’re doing and *anything at all*. This makes it an extremely useful gadget, but it’s just a gadget.

The key is finding out by repeatedly causing something what the downstream effects of causing that thing are. This information is valuable even if you don’t have asymptotically zero correlations. Ideally your model can include these correlations and correctly account for the size of your uncertainty.

For example telephone surveys during the 2016 election cycle should have had a nonresponse bias built in to their model… “we might be consistently seeing bias on the order of 5% in either direction” was a fairly safe bet given how polls work… But acknowledging it would have made poll output worthless, and so to make money pollsters ignored this issue.

I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.

Your grandmother could have told you that for free.
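The contrast between sampling error and a plausible nonresponse bias is easy to put in numbers (a sketch; the ±5% bias figure is the commenter’s guess, not an estimate of mine):

```python
import math

n = 2000     # poll sample size
p = 0.5      # assumed proportion; worst case for the sampling SE
bias = 0.05  # commenter's guess at possible nonresponse bias, either way

# Pure sampling margin of error at ~95% confidence
moe = 1.96 * math.sqrt(p * (1 - p) / n)
print(round(moe, 3))   # 0.022: the +/-2% pollsters like to report

# Crude total interval if the bias can go either way on top of sampling error
lo, hi = p - bias - moe, p + bias + moe
print(round(lo, 3), round(hi, 3))   # 0.428 0.572
```

The bias term dwarfs the sampling term, which is the commenter’s point: the honest interval is roughly 43–57%, not ±2%.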

]]>Let me see if I understand this. Randomization is useful for minimizing selection bias, but if there are constraints on sample size (as there typically are) stratification across expected confounders can be more helpful still. This was the issue (yes?) in the famous Gosset-Fisher debate, and when I taught stats to budding young field ecologists I reviewed the debate since so much of their data collection comes from small-n studies. Of course one can randomize within strata if you can get a large enough n.

Also, there are potential issues with randomization depending on how the sample frame is constructed: you could have a randomized selection procedure, but it might not be randomized wrt the full population of interest — e.g., the famous example of randomized dialing of landline numbers.

]]>In some fields/organizations there can be either suspension of belief or a lack of awareness. The subject-matter experts and overseeing bodies are not always experts in statistics, and statistical rigor is usually not the immediate motivating factor. As Andrew has voiced, I wish it could be considered obvious.

]]>Yes, “experimental” was intended to encompass that, though it might be better to phrase it explicitly, something like:

“reliance on evidence from controlled experiments with random assignment and blinding when possible”; in other words, a controlled experiment is essential, and random assignment and blinding are nice to have.

]]>Daniel:

“Controlled” is more important than “randomized,” I think.

]]>Modify it to “reliance on experimental evidence with random assignment and blinding when possible” would be a much better version of (1) IMHO.

]]>Av:

This all may be obvious to you, but unfortunately it’s not obvious to many researchers. Indeed, it wasn’t obvious to me until recently! The purpose of much of academic research and writing is to figure out and explore ideas, looking at them in enough different ways until the ideas seem obvious to us.

I will be very happy if we reach a time when the ideas of the above post are considered obvious by most statisticians, medical and social scientists, and quantitative analysts.
