The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments

There are lots of examples of Bayesian inference for hierarchical models or in other complicated situations with lots of parameters or with clear prior information.

But what about the very common situation of simple experiments, where you have an estimate and standard error but no clear prior distribution? That comes up a lot! In such settings, we usually just go with a non-Bayesian approach, or we might assign priors to varying coefficients or latent parameters but not to the parameters of primary interest. But that’s not right: in many of these problems, uncertainties are large, and prior information makes a difference.

With that in mind, Erik van Zwet has done some research. He writes:

Our paper is now on arXiv where it forms a “shrinkage trilogy” with two other preprints. It would be really wonderful if you would advertise them on your blog – preferably without the 6 months delay! The three papers are:

1. The Significance Filter, the Winner’s Curse and the Need to Shrink (Erik van Zwet and Eric Cator)

2. A Proposal for Informative Default Priors Scaled by the Standard Error of Estimates (Erik van Zwet and Andrew Gelman)

3. The Statistical Properties of RCTs and a Proposal for Shrinkage (Erik van Zwet, Simon Schwab and Stephen Senn)

He summarizes:

Shrinkage is often viewed as a way to reduce the variance by increasing the bias. In the first paper, Eric Cator and I argue that shrinkage is important to reduce bias. We show that noisy estimates tend to be too large, and therefore they must be shrunk. The question remains: how much?

From a Bayesian perspective, the amount of shrinkage is determined by the prior. In the second paper, you and I propose a method to construct a default prior from a large collection of studies that are similar to the study of interest.

In the third paper, Simon Schwab, Stephen Senn and I apply these ideas on a large scale. We use the results of more than 20,000 RCTs from the Cochrane database to quantify the bias in the magnitude of effect estimates, and construct a shrinkage estimator to correct it.
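
The claim that noisy estimates tend to be too large once they pass the significance filter can be checked with a small simulation. This is only a sketch under simple assumptions, not the analysis from the papers: the z-value is taken to be normal with mean equal to the true signal-to-noise ratio and standard deviation 1, and the function name `exaggeration_ratio` and the SNR values tried are illustrative choices.

```python
import random
import statistics


def exaggeration_ratio(snr, n_sims=50_000, z_crit=1.96, seed=0):
    """Mean |z| among significant results, divided by the true |SNR|.

    Assumption: z = estimate / SE is distributed N(snr, 1).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        z = rng.gauss(snr, 1.0)
        if abs(z) > z_crit:  # the "significance filter"
            significant.append(abs(z))
    return statistics.fmean(significant) / abs(snr)


# Illustrative SNR values: low, moderate, and roughly 80% power.
for snr in (0.5, 1.0, 2.8):
    print(f"SNR = {snr}: exaggeration ratio ~ {exaggeration_ratio(snr):.2f}")
```

Under these assumptions the ratio stays above 1 and grows as the SNR falls, which is why the amount of shrinkage needed depends on how noisy the study is.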

It’s all about the Edlin factor.


  1. Garnett says:

    I’m a couple of pages into the first paper and found this remarkable figure 1, which is “Recently, Barnett and Wren [2] collected over 968,000 confidence intervals extracted from abstracts and over 350,000 intervals extracted from the full-text of papers published in Medline (PubMed) from 1976 to 2019. We converted these to z-values and their distribution is shown in Figure 1.”

    Can anyone describe the conversion of confidence interval to z-value?
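
One standard way to do this conversion, sketched here as an assumption rather than as the exact procedure used in the paper, is to treat the interval as a symmetric Wald interval: the point estimate is the midpoint, the standard error is the half-width divided by the normal critical value, and z is their ratio. For ratio measures (odds ratios, hazard ratios) the bounds are log-transformed first. The function name `ci_to_z` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def ci_to_z(lower, upper, level=0.95, ratio=False):
    """Convert a confidence interval to a z-value.

    Assumes a symmetric Wald interval: estimate +/- crit * SE.
    Set ratio=True for ratio measures, whose intervals are
    symmetric on the log scale.
    """
    if ratio:
        lower, upper = math.log(lower), math.log(upper)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    estimate = (lower + upper) / 2
    se = (upper - lower) / (2 * crit)
    return estimate / se


# A 95% interval for a difference, and one for an odds ratio.
print(ci_to_z(0.2, 1.8))
print(ci_to_z(1.1, 2.9, ratio=True))
```

An interval whose lower bound sits exactly on the null value (0 for a difference, 1 for a ratio) maps to a z-value equal to the critical value.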

  2. Progress!

    As an aside, when I extracted similar data from Cochrane in 2001, I was informed of a legal restriction against distributing that data. Nice to see that’s gone.

  3. Nick Adams says:

    The low value (13%) for the median achieved power (paper #3) confirms my suspicion that the minimum clinically significant effect sizes used in sample size calculations are often wildly fudged in order to achieve a nominal “power” of 80%.
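
To see how far apart 13% and 80% power are, it helps to express power as a function of the signal-to-noise ratio. The sketch below assumes a two-sided z-test at the 5% level with z distributed N(SNR, 1); the helper names `power` and `snr_for_power` are illustrative and not taken from the papers.

```python
from statistics import NormalDist

_norm = NormalDist()


def power(snr, alpha=0.05):
    """Power of a two-sided z-test when z ~ N(snr, 1)."""
    crit = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(-crit - snr) + 1 - _norm.cdf(crit - snr)


def snr_for_power(target, alpha=0.05, lo=0.0, hi=10.0):
    """Find the SNR giving the target power, by bisection.

    Works because power increases with snr on [0, hi].
    """
    for _ in range(60):
        mid = (lo + hi) / 2
        if power(mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for target in (0.13, 0.80):
    print(f"power {target:.0%} corresponds to SNR ~ {snr_for_power(target):.2f}")
```

Under these assumptions, 80% power needs an SNR of roughly 2.8, while 13% power corresponds to an SNR below 1, so the effect assumed at the design stage would have to be several times larger than the effect that was actually there.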

  4. We developed a method to estimate mean power from an observed set of test statistics (z-scores).
    We also extended that method to look into publication bias.
    I am just curious whether the proposed method here assumes that there is no publication bias or allows for censoring at z = 1.96?

    • Erik says:

      I’m well aware of your work which is indeed related to ours, but also quite different. As I understand it, you estimate the distribution of the z-values by fitting a truncated normal distribution to the significant ones. The goal of the method is to produce a good estimate even if there is a lot of publication bias.

      Our goal is different. We’re focusing on collections of z-values where it’s reasonable to assume that there is *not* so much publication bias. Promising sources are replication studies, registered reports or careful meta-analyses that make an effort to include also unpublished studies.

      If (big “if”) we have such an “honest” collection of z-values, then we can model the distribution more carefully. In particular, we can model the central part and the tails separately. This has the important advantage that our shrinkage estimator adapts to the signal-to-noise ratio.

      • Ulrich Schimmack says:

        It is true that we have developed z-curve with psychology in mind, where publication bias is rampant and non-significant results are not trustworthy. However, z-curve can also be used without selection and seems to be working the same way as your method by using a mixture model to estimate mean power. It is also possible to examine whether publication bias is present by comparing estimated mean power (for all z-scores) to the observed mean power (i.e., the percentage of significant results). Anyhow, it would be interesting to apply z-curve to the Cochrane results.

        Best, Uli

      • > careful meta-analyses that make an effort to include also unpublished studies
        I doubt that effort improves things much (didn’t in past ~ 2010).

        If you haven’t, you might also want to read Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values. There are fewer difficulties in the data you have, but they have not been ruled out.

        • Erik says:


          1. I have no doubt that Cochrane is far from perfect, but if you compare figure 1 from paper #1 (z-values from all of Medline) to figure 1 from paper #3 (z-values of RCTs from Cochrane) there is a big difference!

          2. If Cochrane missed some small z-values (as I’m sure they did), then we should shrink *even more* than we are recommending now. The current practice is to not shrink at all.

          3. I agree with the Discussion of you and Andrew, but that’s about distinguishing (exact) null from non-null effects. We don’t concern ourselves with that.

    • Paul Owen says:

      It seems Ulrich and Erik have two very different goals.

      Ulrich is estimating the average power of a set of studies post-hoc and claiming this is a “replicability estimate.” The problem with that is twofold. First, even in the most ideal cases (no publication bias, no heterogeneity, no moderators, etc.), simple math-stat shows estimates of average power are super noisy unless there are an unrealistically large number of studies. Thus, post-hoc estimates of average power will in practice be highly inaccurate (i.e., because in real applications there will be some combination of publication bias, heterogeneity, moderators, etc. and few studies). Second, average power is just not a replicability estimate in any real sense! This was all discussed a few months ago here
      and in the paper linked there.

      See also:

      Erik, on the other hand, is taking a set of “untainted” study results and using those to construct a distribution. This distribution is then used as a prior to shrink estimates from future studies back to zero (or, I suppose, not to zero but wherever the untainted results center).

      So, Ulrich’s tool is inherently backward looking while Erik’s tool is inherently forward looking.

      The only thing I do not understand about Erik’s prior is that it is constructed on “unlike” studies. For example, in his psychology application, he uses studies from the Open Science Collaboration to construct his prior, but these studies were all of different phenomena (say, embodied cognition, social priming, anchoring, fluency or whatever specific effects they looked at in the OSC). And then he proposes using this prior for a new study on, say, power posing–yet another phenomenon. This does not make sense to me, unless I somehow assume all psychology studies are somehow “alike.” Obviously using a bunch of power posing studies to construct a prior for a new power posing study makes a good deal of sense but this is not what Erik seems to be doing. Maybe I could see it making sense if there were zero prior studies of power posing, but it seems silly to ignore prior studies of the power posing phenomenon in constructing a prior for a new power posing study and to instead use a prior that mixes apples and oranges like his does. On the other hand, shrinkage is good so if this is simply a route to achieve that, perhaps it makes pragmatic sense? Or maybe constructing the prior from all studies has bias/variance tradeoff advantages relative to constructing it only from power-posing studies because there are a much larger number of the former (especially if we restrict to relatively untainted studies)?

      • Erik says:


        Thanks for asking! We propose to estimate the prior from a large (but honest) collection of studies that have some set of criteria in common with the study of interest. Those criteria then represent the prior information; the more criteria, the more prior information.

        For example, a collection of phase III RCTs in oncology represents more information than a collection of RCTs in oncology, which represents more information than a collection of RCTs, which represents more information than a collection of bio-medical studies.

        Researchers are often hesitant to use specific prior information. Or they believe that there *is* no prior information because their study is unique; nobody has ever studied that particular intervention or exposure in that particular population with that particular outcome under those particular circumstances.

        We think that it is a mistake to think like that. At the highest level of aggregation, just knowing that you are doing another study in the domain of the life sciences already represents a lot of information. In particular, we know that the signal-to-noise ratio is often not that large in such studies.

        • Paul Owen says:

          Thanks for the response, Erik. I think I understand correctly and we are in agreement. In particular, I agree one should not hesitate to use specific prior information when it is available and that it is hard to imagine cases where there is literally *no* prior information available.

          I guess I just wonder about that last point about being at the highest level of aggregation:
          “[J]ust knowing that you are doing another study in the domain of the life sciences already represents a lot of information.”
          I mean I guess this ultimately is an empirical question which could be true for some domains and not for others. I guess in domains where it is not true, your prior would be looser so maybe it ends up being okay regardless.

          Plus, as I said above, we know shrinkage is good in general so insofar as the prior enforces this, that is a good thing.

          • Erik says:

            You can always embed your study of interest in some well-defined larger collection of studies, and then use those to estimate the distribution of the signal-to-noise ratio (SNR). The inclusion criteria of that collection represent your prior info, and your posterior will be calibrated with respect to that larger collection.

            Importantly, a larger collection (so less prior info) does not imply a more diffuse prior. In fact, a very large collection (i.e. few criteria) will typically include many silly studies with low SNR. If that is the case, the distribution of the SNR will have a lot of mass near zero.

  5. Peter says:

    This is really fascinating work, and seems very practical for our research center which designs and implements RCTs and frequently comes up against this issue of low SNR. Is there a way to code this prior information into Stan when modeling a new RCT (or simulating one), or alternatively, would you recommend literally dividing the estimates for our RCTs by the corresponding shrinkage factors you calculate in the third paper? I’d be grateful for some code showing how this prior information might be incorporated when working with a “real” RCT (or a simulated one), as my ability to follow mathematical notation is mediocre!

    • Erik says:

      Our approach is meant to help interpret the results of a trial that’s already been done. We know from the Cochrane database about how trials typically behave. The calculations could be done in Stan, but there’s really no need for that because they are very explicit (Appendix B of paper #2). Please email me if I can help!

      I don’t think our work directly applies to planning a completely new trial.
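
For readers who, like Peter, would rather see code than notation, here is a minimal sketch of the kind of explicit calculation involved. It is not the formula from Appendix B of paper #2. It assumes the prior on the signal-to-noise ratio is a mixture of zero-mean normals and that z given the SNR is N(SNR, 1), in which case the posterior mean follows from standard conjugate algebra. The function name `shrink` and the weights and standard deviations below are hypothetical placeholders, not values estimated from the Cochrane database.

```python
import math


def shrink(estimate, se, weights, sds):
    """Posterior mean of an effect under a zero-mean normal mixture prior on the SNR.

    Assumptions: SNR ~ sum_k weights[k] * N(0, sds[k]^2) and z | SNR ~ N(SNR, 1).
    Returns (shrunken_estimate, shrinkage_factor).
    """
    z = estimate / se
    # Marginally, z under component k is N(0, 1 + tau_k^2).
    post = []
    for w, tau in zip(weights, sds):
        var = 1.0 + tau ** 2
        density = math.exp(-0.5 * z * z / var) / math.sqrt(2 * math.pi * var)
        post.append(w * density)
    total = sum(post)
    post = [p / total for p in post]  # posterior component weights
    # Within component k, E[SNR | z] = z * tau_k^2 / (1 + tau_k^2).
    factor = sum(p * tau ** 2 / (1.0 + tau ** 2) for p, tau in zip(post, sds))
    return factor * estimate, factor


# Hypothetical two-component prior: mostly small SNRs, plus a wide tail.
WEIGHTS = [0.5, 0.5]
SDS = [0.5, 3.0]

for est in (1.0, 2.0, 5.0):
    shrunk, factor = shrink(est, 1.0, WEIGHTS, SDS)
    print(f"estimate {est}: shrunken {shrunk:.2f} (factor {factor:.2f})")
```

With more than one component, the shrinkage factor rises with |z|, so estimates that are large relative to their standard error are shrunk proportionally less. That is one way to see how a mixture prior lets the estimator adapt to the signal-to-noise ratio.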

  6. Wesley says:

    Am I missing something or isn’t this the whole point of empirical Bayes?
