Now it’s time for some statistical semantics. Specifically, how do we describe the prior that Pfizer is using for their COVID-19 study? Here’s a link to the report.

Way down on page 101–102, they say (my emphasis),

A **minimally informative** beta prior, beta (0.700102, 1), is proposed for θ = (1-VE)/(2-VE). The prior is centered at θ = 0.4118 (VE=30%) which can be considered pessimistic. The prior allows considerable uncertainty; the 95% interval for θ is (0.005, 0.964) and the corresponding 95% interval for VE is (-26.2, 0.995).
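The quoted summaries can be checked by hand, because a beta(a, 1) distribution has the closed-form CDF F(θ) = θ^a. The following is a minimal sketch under that observation; the function names are mine, not the protocol's, and VE is treated as a proportion rather than a percentage.

```python
# Assumption: for beta(a, 1) the CDF is F(x) = x**a, so the quantile
# function is p**(1/a). Function names here are illustrative only.
A = 0.700102  # first shape parameter quoted in the protocol


def theta_quantile(p, a=A):
    """Quantile of beta(a, 1) at probability p."""
    return p ** (1.0 / a)


def ve_from_theta(theta):
    """Invert theta = (1 - VE) / (2 - VE); VE is a proportion, not a percent."""
    return (1.0 - 2.0 * theta) / (1.0 - theta)


prior_mean = A / (A + 1.0)
lo, hi = theta_quantile(0.025), theta_quantile(0.975)

print(f"prior mean of theta: {prior_mean:.4f}")
print(f"95% interval for theta: ({lo:.3f}, {hi:.3f})")
# VE decreases as theta increases, so the interval endpoints swap.
print(f"95% interval for VE: ({ve_from_theta(hi):.1f}, {ve_from_theta(lo):.3f})")
```

Note that the VE interval is on the proportion scale, so its lower end of roughly -26 corresponds to -2600% efficacy.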

I think “VE” stands for vaccine efficacy. Here’s the definition from page 92 of the report.

VE = 100 × (1 – IRR). IRR is calculated as the ratio of first confirmed COVID-19 illness rate in the vaccine group to the corresponding illness rate in the placebo group. In Phase 2/3, the assessment of VE will be based on posterior probabilities of VE1 > 30% and VE2 > 30%.

VE1 represents VE for prophylactic BNT162b2 against confirmed COVID-19 in participants without evidence of infection before vaccination, and VE2 represents VE for prophylactic BNT162b2 against confirmed COVID-19 in all participants after vaccination.

I’m unclear on why they’d want to impose a prior on (1 – VE) / (2 – VE), or even how to interpret that quantity, but that’s not what I’m writing about. But the internet’s great and Sebastian Kranz walks us through it in a blog post, A look at Biontech/Pfizer’s Bayesian analysis of their COVID-19 vaccine trial. It turns out that the prior is on the quantity $latex \theta = \frac{\displaystyle \pi_v}{\displaystyle \pi_v + \pi_c},$ where $latex \pi_v, \pi_c \in (0, 1)$ are, in Kranz’s words, “population probabilities that a vaccinated subject or a subject in the control group, respectively, fall ill to Covid-19.” I’m afraid I still don’t get it. Is the time frame restricted to the trial? What does “fall ill” mean: a positive PCR test or something more definitive? (The answers may be in the report—I didn’t read it.)
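As a worked step connecting the two expressions for θ: assuming 1:1 randomization and equal follow-up, so that IRR can be identified with $latex \pi_v / \pi_c$ (my assumption, not something stated in the quoted text), and writing VE as a proportion rather than a percentage,

$latex \theta = \frac{\pi_v}{\pi_v + \pi_c} = \frac{\pi_v / \pi_c}{\pi_v / \pi_c + 1} = \frac{\textrm{IRR}}{1 + \textrm{IRR}} = \frac{1 - \textrm{VE}}{2 - \textrm{VE}}.$

Solving for VE gives $latex \textrm{VE} = \frac{1 - 2\theta}{1 - \theta},$ so θ = 0.4118 corresponds to VE ≈ 0.30, and θ = 0.5 corresponds to VE = 0.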

**What is a weakly informative prior?**

It’s the description “minimally informative” and subsequent write-ups calling it “weakly informative” that got my attention. For instance, in Ian Fellows’s post, The Pfizer-Biontech vaccine may be a lot more effective than you think (which Andrew summarized in his own post here), Fellows calls it “a Bayesian analysis using a beta binomial model with a weakly-informative prior.”

What we mean by weakly informative is that the prior determines the scale of the answer. For example a standard normal prior (normal(0, 1)), imposes a unit scale, whereas a normal(0, 100) would impose a scale of 100 (like Stan and R, I’m using a scale or standard deviation parameterization of the normal so that the two parameters have the same units).

**Weakly informative in which parameterization?**

Thinking about proportions is tricky, because they’re constrained to fall in the interval (0, 1). The maximum standard deviation achievable with a beta distribution is 0.5 as alpha and beta -> 0, whereas a uniform distribution on (0, 1) has standard deviation 0.29, and a beta(100, 100) has standard deviation 0.035.
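These standard deviations follow from the usual variance formula for the beta distribution, αβ / ((α + β)² (α + β + 1)). A small sketch for checking them (the helper name `beta_sd` is mine, and the limit alpha, beta -> 0 is approximated with a tiny positive value):

```python
import math


def beta_sd(alpha, beta):
    """Standard deviation of a beta(alpha, beta) distribution."""
    total = alpha + beta
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return math.sqrt(variance)


# 1e-6 stands in for the limit alpha, beta -> 0.
for a, b in [(1e-6, 1e-6), (1.0, 1.0), (100.0, 100.0)]:
    print(f"beta({a:g}, {b:g}): sd = {beta_sd(a, b):.3f}")
```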

It helps to transform using logit so we can consider the log odds, mapping a proportion $latex \theta$ to $latex \textrm{logit}(\theta) = \log \frac{\theta}{1 - \theta}.$ A uniform distribution on theta in (0, 1) results in a standard logistic(0, 1) distribution on logit(theta) in (-inf, inf). So even a uniform distribution on the proportion leads to a unit scale distribution on the log odds. A uniform distribution is thus weakly informative in the sense that we mean it when we recommend weakly informative priors in Stan. All on its own, it’ll control the scale of the unconstrained parameter. (By the way, I think transforming theta in (0, 1) to logit(theta) in (-inf, inf) is the easiest way to get a handle on Jacobian adjustments—it’s easy to see the transformed variable no longer has a uniform distribution, and it’s the Jacobian of the inverse transform that defines the logistic distribution’s density.)
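A quick simulation makes the change of variables concrete. This is only a sketch: it draws uniform values, applies logit, and compares the sample standard deviation with the standard logistic distribution's theoretical value of π/√3 (about 1.81). The seed and sample size are arbitrary choices of mine.

```python
import math
import random


def logit(u):
    """Log odds of a proportion u in (0, 1)."""
    return math.log(u / (1.0 - u))


rng = random.Random(1234)  # arbitrary seed, for reproducibility only
draws = []
while len(draws) < 100_000:
    u = rng.random()
    if 0.0 < u < 1.0:  # guard against the (very unlikely) exact 0.0
        draws.append(logit(u))

mean = sum(draws) / len(draws)
sd = math.sqrt(sum((x - mean) ** 2 for x in draws) / (len(draws) - 1))
logistic_sd = math.pi / math.sqrt(3.0)

print(f"sample mean of logit(u): {mean:.3f}")
print(f"sample sd of logit(u):   {sd:.3f}")
print(f"logistic(0, 1) sd:       {logistic_sd:.3f}")
```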

Fellows is not alone. In the post, Warpspeed confidence — what is credible?, which relates Pfizer’s methodology to more traditional frequentist methods, Chuck Powell says, “For purposes of this post I’m going to use a flat, uninformed prior [beta(1, 1)] in all cases.” Sure, it’s flat on the (0, 1) scale, but not on the log odds scale. Flat is relative to parameterization. If you work with a logistic prior on the log odds scale and then transform with inverse logit, you get exactly the same answer with a prior that is far from flat—it’s centered at 0 and has a standard deviation of pi / sqrt(3), or about 1.8.

**How much information is in a beta prior?**

It helps to reparameterize the beta with a mean $latex \phi \in (0, 1)$ and “count” $latex \kappa > 0,$

$latex \textrm{beta2}(\theta \mid \phi, \kappa) = \textrm{beta}(\theta \mid \phi \cdot \kappa, (1 - \phi) \cdot \kappa).$

The beta distribution is conjugate to the Bernoulli (and more generally, the binomial), which is what makes it a popular choice. What this means in practice is that it’s an exponential family distribution that can be treated as pseudodata for a Bernoulli distribution.

Because beta(1, 1) is a uniform distribution, we think of that as having no prior data, or a total of zero pseudo-observations. From this perspective, beta(1, 1) really is uninformative in the sense that it’s equivalent to starting uniform and seeing no prior data.

In the beta2 parameterization, the uniform distribution on (0, 1) is beta2(0.5, 2). This corresponds to pseudodata with count 0, not 2—we need to subtract 2 from $latex \kappa$ to get the pseudocount!

Where does that leave us with the beta(0.7, 1)? Using our preferred parameterization, that’s beta2(0.4117647, 1.7). That means a prior pseudocount of -0.3 observations! In other words, we start with negative pseudodata whenever the prior count parameter kappa is less than 2. Spoiler alert—that negative pseudocount is going to be swamped by the actual data.
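The conversion between the two parameterizations is simple enough to write down. The sketch below uses helper names of my own (`to_beta2`, `pseudocount`) and follows the convention in the text that the pseudocount is kappa minus 2.

```python
def to_beta2(alpha, beta):
    """Convert beta(alpha, beta) to the (mean phi, count kappa) form."""
    kappa = alpha + beta
    phi = alpha / kappa
    return phi, kappa


def pseudocount(alpha, beta):
    """Prior pseudo-observations relative to the uniform beta(1, 1)."""
    return alpha + beta - 2.0


for a, b in [(1.0, 1.0), (0.7, 1.0), (0.700102, 1.0)]:
    phi, kappa = to_beta2(a, b)
    print(f"beta({a}, {b}) = beta2({phi:.7f}, {kappa}); "
          f"pseudocount = {pseudocount(a, b):+.6f}")
```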

What about Pfizer’s beta(0.700102, 1) prior? That’s beta2(0.4118, 1.700102). If you plot beta(theta | 0.7, 1) vs. theta, you’ll see that the log density tends to infinity as theta goes to 0. That makes it look like it’s going to be somewhat or maybe even highly informative. There’s a nice density plot in Kranz’s post.

Of course, the difference between beta(0.700102, 1) and beta(0.7, 1) is negligible—less than 1/10,000th on prior mean and about 1/10,000th of a patient in prior pseudocount. They must’ve derived the number from a formula somehow and then didn’t want to round. The only harm in using 0.700102 rather than 0.7 or even 1 is that someone may assume a false sense of precision.

Let’s look at the effect of the prior in terms of how it affects the posterior. That is, the differences between beta(n + 0.7, N – n + 1) and beta(n + 1, N – n + 1) for a trial with n out of N successes. I’m really surprised they’re only looking at N = 200 and expecting something like n = 30. Binomial data is super noisy and thus N = 200 is a small data size unless the effect is huge.

Is that 0.000102 in prior pseudocount going to matter? Of course not. Is the difference between beta(1, 1) and beta(0.7, 1) going to matter? Nope. If we compare the posteriors beta(30 + 0.7, 170 + 1) and beta(30 + 1, 170 + 1), their posterior 95% central intervals are (0.106, 0.205) and (0.107, 0.206), respectively.
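For readers who want to reproduce this kind of comparison without a statistics library, here is a rough sketch. It computes the beta CDF with Simpson's rule and inverts it by bisection. That is my own choice of method, adequate here because both shape parameters exceed 1 and the density is bounded; it is not a general-purpose implementation.

```python
import math


def beta_pdf(x, a, b):
    """Density of beta(a, b); returns 0 at the endpoints (valid for a, b > 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x)
                    + (b - 1.0) * math.log1p(-x))


def beta_cdf(x, a, b, steps=2000):
    """Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * beta_pdf(i * h, a, b)
    return total * h / 3.0


def beta_quantile(p, a, b):
    """Invert the CDF by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


n, N = 30, 200  # the illustrative counts used in the text
intervals = {}
for prior_alpha in (0.7, 1.0):
    a, b = n + prior_alpha, N - n + 1.0
    intervals[prior_alpha] = (beta_quantile(0.025, a, b),
                              beta_quantile(0.975, a, b))
    lower, upper = intervals[prior_alpha]
    print(f"prior beta({prior_alpha}, 1): 95% interval ({lower:.3f}, {upper:.3f})")
```

Because beta(31, 171) puts slightly more mass on larger values than beta(30.7, 171), its interval sits slightly to the right, but the shift is on the order of 0.001.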

So I guess it’s like Andrew’s injunction to vote. It might make a difference on the edge if we impose a three-digit threshold somewhere and just manage to cross it in the last digit.

**Beta-binomial and Jeffreys priors**

I’ll leave it to the Bayes theory wonks to talk about why beta(0.5, 0.5) is the Jeffreys prior for the beta-binomial model. I’ve never dug into the theory enough to understand why anyone cares about these priors other than scale invariance.

> I’m afraid I still don’t get it. […] (The answers may be in the report—I didn’t read it.)

You’re right, the trial protocol happens to give some details on what they intended to do. :-)

> What we mean by weakly informative is that the prior determines the scale of the answer.

Who is “we”? Stan developers?

> For example a standard normal prior (normal(0, 1)), imposes a unit scale, whereas a normal(0, 100) would impose a scale of 100.

Are both equally weakly informative? What would be an example of a more or less informative prior? (By the way, Gelman et al. say that the prior can often only be understood in the context of the likelihood. This is an understatement, priors only ever make sense in the context of a model.)

> I’m really surprised they’re only looking at N = 200 and expecting something like n = 30.

Where are these numbers coming from?

> Binomial data is super noisy and thus N = 200 is a small data size unless the effect is huge.

A vaccine is useless unless the effect is huge :-)

> A vaccine is useless unless the effect is huge :-)

D’oh! That makes sense. You can tell I’m not an epidemiologist :-). And thanks for the clarification in the second note.

I should’ve been clearer—I meant the Stan developers’ notion of “weakly informative”, at least insofar as represented by our wiki of prior choice recommendations.

I’d go further and say that priors only make sense relative to a likelihood and data. For example, if we’re doing a regression on length, then normal(0, 1) might be a weakly informative prior for a regression coefficient for a predictor whose units are meters. But then keeping all else the same, if you convert the data to millimeters, you’d have to change the prior to normal(0, 1000) to have the same meaning, even without changing the model in any substantial way. Both might be weakly informative if the posterior has unit scale in the first case and scale 1000 in the second. Standardizing predictors lets us take a better stab at default priors.

I’ll grant that these things are not particularly well defined. My definition of weakly informative is that the prior has “limited” influence on the posterior over the range of expected outcomes. This is a very squishy definition. So for example the normal(0,1) prior has “limited” influence over the likelihood if the parameter is in the neighborhood of the unit scale.

In this case the prior has “limited” influence on the posterior if the efficacy is in the range of 0%-99%. A flat prior for theta, or the Jeffreys prior, or flat-on-efficacy will give you posteriors in the same ballpark. This is different than if the researchers had given their best estimate of the prior based on lab, phase I and phase II data. I could easily imagine them coming up with something like beta(1.5,10) truncated to <=.5 based on past data. That would be more opinionated, and philosophically more correct from a Bayesian standpoint.

Say what you will about Jeffreys priors, at least they provide a concrete definition of "non-informative" to argue against. Otherwise it's the "Stan developers'" definition versus the "Fellows Statistics developers'" definition versus …

> They must’ve derived the number from a formula somehow and then didn’t want to round.

From your first quote,

“A minimally informative beta prior, beta (0.700102, 1), is proposed for θ = (1-VE)/(2-VE). The prior is centered at θ = 0.4118 (VE=30%) which can be considered pessimistic.”

it would seem that they fixed the second parameter to one and solved for 30% efficacy.
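That guess is easy to check. With the second parameter fixed at 1, the prior mean is α / (α + 1), so solving for a target mean φ gives α = φ / (1 − φ). The sketch below assumes the target mean was the rounded value 0.4118; that rounding step is my assumption, since nothing in the quoted text says how the number was derived.

```python
def theta_from_ve(ve):
    """theta = (1 - VE) / (2 - VE), with VE as a proportion."""
    return (1.0 - ve) / (2.0 - ve)


def alpha_for_mean(phi, beta=1.0):
    """Solve alpha / (alpha + beta) = phi for alpha."""
    return beta * phi / (1.0 - phi)


exact_theta = theta_from_ve(0.30)      # 0.7 / 1.7
rounded_theta = round(exact_theta, 4)  # assumed rounding to four decimals

print(f"theta at VE = 30%: {exact_theta:.7f}")
print(f"alpha from exact theta:   {alpha_for_mean(exact_theta):.6f}")
print(f"alpha from rounded theta: {alpha_for_mean(rounded_theta):.6f}")
```

Under that assumption, solving with the unrounded mean gives 0.7, while solving with 0.4118 gives the six-decimal value quoted in the protocol.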

> Because beta(1, 1) is a uniform distribution, we think of that as having no prior data, or a total of zero pseudo-observations. From this perspective, beta(1, 1) really is uninformative in the sense that it’s equivalent to starting uniform and seeing no prior data.

That is debatable. As the Wikipedia page for the beta distribution emphasizes, the beta(1,1) is the “uninformative” prior distribution that conveys the information that both failure and success are possible. In this situation (and many others), it is totally reasonable to deny that a person is guaranteed to (not) get COVID as a result of the drug. That is in contrast to Haldane’s improper prior, beta(0+,0+), which is consistent with the possibility that the success probability might be exactly 0 or 1.

I’m glad Pfizer went with a more informative prior and maybe they can get more buy-in from a simple beta-binomial analysis, but I don’t like beta priors. It is one of many examples of a probability distribution that was constructed in the pre-computer era to have elementary expressions for its moments, which does not serve the analyst (or the regulators) well when they do not have a well-formed prior expectation, prior variance, etc.

> maybe they can get more buy-in from a simple beta-binomial analysis, but I don’t like beta priors

What kind of prior would you have liked better?

I did my StanCon presentation about such priors in August

https://github.com/bgoodri/StanCon2020

This is similar to stuff I naturally do, which is basically to choose a flexible family and then tweak the parameters until I get several quantiles the way I want them. I also often use the peak density and place that on a particular location. I don’t really care about the fact that it’s a “beta” or a “skew normal” or whatever, just whatever can set up an appropriate shape.

The Chebyshev idea is pretty good, but can result in not a real CDF (decreasing function).

You do have to check that the implied (inverse) CDF is increasing on (0,1) but the derivative is usually known, so it is pretty easy to check whether it has a root in (0,1) before you start sampling.

If I understand correctly, you propose taking prior elicitation using percentiles to the extreme: you take as many percentiles as you want as input and fit a distribution. This flexibility is interesting when you have detailed information to include in the prior, but for creating a weakly informative one it seems overkill.

There is not much difference between the prior they use and what you get using your method and the same median (41% vaccine efficacy). Fixing two quantiles instead of one (setting the middle third to be a vaccine efficacy between -27% and 74%), your method gives a prior that matches the beta(0.7,1) distribution pretty well.
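The quantiles mentioned here can be recovered directly from the beta(0.7, 1) prior, since its quantile function is p^(1/0.7) and VE is a decreasing function of θ. This is a sketch of that calculation only, not of the quantile-matching method being discussed; the helper names are mine.

```python
ALPHA = 0.7  # beta(0.7, 1), the rounded prior used in this comment


def theta_quantile(p, a=ALPHA):
    """Quantile of beta(a, 1): the CDF is theta**a."""
    return p ** (1.0 / a)


def ve_from_theta(theta):
    """VE as a proportion, from theta = (1 - VE) / (2 - VE)."""
    return (1.0 - 2.0 * theta) / (1.0 - theta)


def ve_quantile(p):
    """Quantile of VE; VE decreases in theta, so use the 1 - p quantile."""
    return ve_from_theta(theta_quantile(1.0 - p))


ve_median = ve_quantile(0.5)
ve_lower_third = ve_quantile(1.0 / 3.0)
ve_upper_third = ve_quantile(2.0 / 3.0)

print(f"median VE: {100 * ve_median:.0f}%")
print(f"middle third of VE: {100 * ve_lower_third:.0f}% to {100 * ve_upper_third:.0f}%")
```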

https://imgur.com/a/QyUaQSZ

I agree that you could construct a prior via the inverse CDF that is similar to this beta prior or any other beta prior. Also, if Pfizer wanted to put the prior on the vaccine effect and make the success probability a transformed parameter, that would be easy too (at least with Stan). But I don’t think it is good practice to make people start with prior expectations and work out what fully specified prior distribution is consistent with those prior expectations because I think that people don’t have well-thought out prior expectations. It seems to me that people are much more comfortable being pressed about prior quantiles than prior expectations.

Thanks, Ben. That’s an interesting perspective I hadn’t thought about. I usually try to stay away from “uninformative” as I’m not even sure what it’s supposed to mean.

I thought “consistent with 0” just meant having non-zero mass there? Or is this a problem because when we transform to log odds, we’re back to vanishing tails?

The problem I have with thinking about this is that it’s uniform on the probability scale, which seems about as “uninformative” as you can get, but of course, when you transform to log odds, it’s logistic(0, 1), which is no longer flat.

I’m also having trouble reconciling that beta(1, 1) is anything more than maybe weakly informative in the sense of not expecting 0 or 1 values, because any central inference is still very close to what you’d get with an improper beta(0, 0). When we fit something like a baseball ability, we wind up with something like a beta(100, 200) prior, which feels much more informative to me, primarily because it involves a high pseudocount.

How can you use an improper prior? Do you just take the limit as alpha, beta -> 0? Doesn’t that still lead to an improper posterior if you only ever see 0 or 1 outcomes? I still have never worked through the math on the Jeffreys beta(0.5, 0.5) prior.

Yeah, if you let alpha, beta -> 0+, you get Haldane’s improper prior. I don’t think it is a good idea to use it in Pfizer’s case or any other case that I can imagine someone modeling successes. But one can see why he argued it was less informative than the uniform. If you start with a beta(1,1) prior, for any finite number of observations, you never get all the posterior probability on 0 or 1 even if the outcomes are all failures or all successes. That’s why that prior conveys the information that both failure and success are possible. If you started with a beta(0+,0+) prior and got all failures or all successes in n trials, you would end up with a beta(0+, n) or beta(n, 0+), respectively, which are degenerate but do put all the mass on one endpoint or the other.

Ben, Bob:

Jouni’s article is relevant to this discussion. I blogged it back in 2011 and there were 0 comments!

I was just going to post on this; to my mind beta(1,1) represents 1 pseudocount of success and 1 of failure (with these counts representing the information that both failure and success are possible).

It might be amusing to try to track down the first instance of the prediction that if Bayesian statistics were to be used for high stakes decision making people would start arguing over which prior to use. It surely cannot have been long after Bayesian statistics was developed.

I meant this more as a discussion of semantics and a general discussion about how to formulate priors with general properties than an argument about which priors to use in this example. Sensitivity analysis shows these priors have no substantial effects on the conclusions. Same with the hyperpriors Andrew and I used in our seroprevalence analysis.

And of course, I’m obliged, being on Andrew’s blog, to note that those statisticians were probably still arguing about likelihoods if they’re the kind of statisticians who like to argue about models. For example, we could’ve had logit or probit or robit models here, there can be pooling of various kinds, different power calculations based on assumptions about effects (very much like a discussion of a prior), different criteria for removing “outliers” from data, different significance thresholds, different forms of parametric or non-parametric hypothesis test, etc.

This is just Bayesian NHST, I don’t see what advantage adding an arbitrary prior has over the usual NHST. Bayesian approaches are useful when you have a mechanistic model you want to check.

The real issue is the age/comorbidity dependence of the immune response. They need to compare that to what happens if you just get infected the normal way and do a cost-benefit analysis.

A vaccine with low effectiveness and high risk of ADE in the same people at risk of severe illness from infection isn’t worth it. The people at low risk would have been much better off holding covid parties back in March (then self quarantining for two weeks after the *known* exposure) and ended this then. Your immunity is going to be much more diverse to a real infection than just to one viral peptide.

> (…) beta(0.5, 0.5) is the Jeffreys prior for the beta-binomial model. I’ve never dug into the theory enough to understand why anyone cares about these priors other than scale invariance.

I assume “these priors” are Jeffreys priors and by “scale invariance” you mean invariance under reparametrizations. The invariance property addresses the issue you mention of flat in theta being beta(1,1) but flat in log-odds being beta(0,0). The Jeffreys prior is always beta(0.5,0.5).

In the single-parameter case, Jeffreys priors have another interesting property: they maximize the expected divergence between prior and posterior. They are in some sense the “least informative” alternative. These reference priors can also be defined in multi-parameter models but are no longer the same as Jeffreys priors.
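To spell out the reparameterization point as a worked equation (a standard change of variables, written in my own notation): with $latex \eta = \textrm{logit}(\theta),$ we have $latex \frac{d\eta}{d\theta} = \frac{1}{\theta (1 - \theta)},$ so a prior that is flat in $latex \eta$ becomes

$latex p(\theta) \propto \left| \frac{d\eta}{d\theta} \right| = \theta^{-1} (1 - \theta)^{-1},$

which is the kernel of the improper beta(0, 0). The Jeffreys prior instead starts from the Fisher information of the binomial likelihood with n trials, $latex I(\theta) = \frac{n}{\theta (1 - \theta)},$ giving

$latex p(\theta) \propto \sqrt{I(\theta)} \propto \theta^{-1/2} (1 - \theta)^{-1/2},$

which is the kernel of beta(0.5, 0.5). Because the Fisher information picks up the squared Jacobian under a change of variables, its square root transforms like a density, so the same rule yields the same distribution whichever parameterization you start from.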

The prior is beta (0.700102, 1), so the expected p (conditional prob for events in vaccine group conditional on the total events) is 0.700102/(0.700102+1) = 0.4118 (same as their θ). Beta is the prior distribution for p — simply the proportion of events in vaccine group out of total events.

From p, we can derive relative risk and vaccine efficacy (VE). The relative risk = 0.4118/(1-0.4118) * 1=0.700 (corresponding to a VE=30%), same as their number. That is RR= p/(1-p)*C, where p is the conditional prob of events in vaccine conditional on the total events, and C is the randomisation ratio.

The likelihood for p is binomial, so the posterior for p is also beta (0.700102+events in vaccine, 1+events in control) due to conjugacy. From posterior p, we can derive RR and VE.
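The whole recipe in this comment fits in a few lines. The sketch below uses made-up case counts purely for illustration (they are not trial results), assumes 1:1 randomization so that C = 1, and approximates the posterior of VE by sampling with the standard library's `random.betavariate`.

```python
import random

PRIOR_ALPHA, PRIOR_BETA = 0.700102, 1.0  # prior quoted from the protocol


def ve_from_theta(theta, ratio=1.0):
    """VE = 1 - RR, with RR = ratio * theta / (1 - theta)."""
    return 1.0 - ratio * theta / (1.0 - theta)


def posterior_draws(cases_vaccine, cases_control, n_draws=20_000, seed=0):
    """Draw theta from the conjugate posterior beta(a + x_v, b + x_c)."""
    rng = random.Random(seed)
    a = PRIOR_ALPHA + cases_vaccine
    b = PRIOR_BETA + cases_control
    return [rng.betavariate(a, b) for _ in range(n_draws)]


# Hypothetical counts, invented for illustration only (not trial data).
thetas = posterior_draws(cases_vaccine=10, cases_control=90)
ves = [ve_from_theta(t) for t in thetas]

post_mean_theta = sum(thetas) / len(thetas)
prob_ve_above_30 = sum(v > 0.30 for v in ves) / len(ves)

print(f"posterior mean of theta: {post_mean_theta:.3f}")
print(f"Pr(VE > 30% | data):     {prob_ve_above_30:.3f}")
```

The last quantity corresponds to the posterior probability of VE > 30% that the protocol uses as its success criterion.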

Stephen Senn briefly discusses the Pfizer study in his postscript to his guest post on my current blog.

https://errorstatistics.com/2020/11/12/s-senn-a-vaccine-trial-from-a-to-z-with-a-postscript-guest-post/

The uninformative prior is noted on p. 111 of the report which is in my comments, but I take it people here are aware of that already.

One feature that hasn’t been discussed enough is that theta, which is a proportion, in theory takes values between 0 and 1. However, values of theta between 0.5 and 1 imply negative efficacy – the vaccine actually did worse than the placebo. Vaccine efficacy as defined can implode to negative infinity but is upper bounded by +100%. So there are definitely scaling challenges.

pbeta(0.5,0.700102, 1, lower.tail=FALSE) => 38%

so the prior has 38% chance that the vaccine causes more cases than the placebo
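The same tail probability can be obtained without R, because the CDF of beta(a, 1) is θ^a. A one-function sketch (the function name is mine):

```python
def prob_theta_above(x, a=0.700102):
    """Upper tail of beta(a, 1): 1 - x**a."""
    return 1.0 - x ** a


# theta > 0.5 means more cases in the vaccine arm than in the placebo arm.
p_negative_ve = prob_theta_above(0.5)
print(f"Pr(theta > 0.5) under the prior: {p_negative_ve:.3f}")
```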

Is that a problem? That’s why they call the prior minimally informative. “The prior is centered at θ = 0.4118 (VE=30%) which can be considered pessimistic. The prior allows considerable uncertainty; the 95% interval for θ is (0.005, 0.964) and the corresponding 95% interval for VE is (-26.2, 0.995).”