(This post is by Yuling)

The likelihood principle is often phrased as an axiom in Bayesian statistics. It applies when we are (only) interested in estimating an unknown parameter $latex \theta$, and there are two data generating experiments both involving $latex \theta$, with observable outcomes $latex y_1$ and $latex y_2$ and likelihoods $latex p_1(y_1 \vert \theta)$ and $latex p_2(y_2 \vert \theta)$ respectively. If the outcome-experiment pairs satisfy $latex p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$ (viewed as functions of $latex \theta$), then the two experiments and the two observations provide the same information for inference about $latex \theta$.

Consider a classic example. Someone ran an A/B test and was only interested in the treatment effect. He told his manager that among all n=10 respondents, y=9 saw an improvement (assuming the metric is binary). It is natural to estimate the improvement probability $latex \theta$ with an independent Bernoulli trial likelihood: $latex y\sim binomial (\theta\vert n=10)$. Other informative priors can exist but are not relevant to our discussion here.

What is relevant is that later the manager found out that the experiment was not done appropriately. Instead of independent data collection, the experiment was designed to keep recruiting respondents sequentially until $latex y=9$ of them are positive. The actual random outcome is n, while y is fixed. So the correct model is $latex n-y\sim$ negative binomial $latex (\theta\vert y=9)$.

Luckily, the likelihood principle kicks in, because binomial_lpmf $latex (y\vert n, \theta) =$ neg_binomial_lpmf $latex (n-y\vert y, \theta)$ + constant. No matter how the experiment was done, the inference remains the same.
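The constant can be written out explicitly. As a sketch, assume the negative binomial is parameterized by the number of failures $latex n-y$ before the $latex y$-th success, each trial succeeding with probability $latex \theta$ (this is the convention of R's `dnbinom`; other software may parameterize it differently). The shorthand $latex \mathrm{Bin}$ and $latex \mathrm{NB}$ is introduced here only for this derivation:

$latex \mathrm{Bin}(y \vert n, \theta) = \binom{n}{y}\,\theta^{y}(1-\theta)^{n-y}, \qquad \mathrm{NB}(n-y \vert y, \theta) = \binom{n-1}{n-y}\,\theta^{y}(1-\theta)^{n-y},$

$latex \frac{\mathrm{NB}(n-y \vert y, \theta)}{\mathrm{Bin}(y \vert n, \theta)} = \binom{n-1}{n-y} \Big/ \binom{n}{y} = \frac{y}{n}.$

Both pmfs share the kernel $latex \theta^{y}(1-\theta)^{n-y}$, so the two log-pmfs differ by $latex \log(y/n)$, which does not involve $latex \theta$.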

At the abstract level, the likelihood principle says that information about the parameters can only be extracted via the likelihood, not from experiments that could have been done.

#### What can go wrong in model check

The likelihood is dual-purposed in Bayesian inference. For inference, it is just one component of the unnormalized posterior density. But for model checking and model evaluation, the likelihood function lets the generative model generate posterior predictions of y.

In the binomial/negative binomial example, it is fine to stop at the inference of $latex \theta$. But as soon as we want to check the model, we do need to distinguish between the two possible sampling distributions and which variable (n or y) is random.

Suppose we observe y=9 positive cases among n=10 trials, with the estimate $latex \theta=0.9$. The likelihoods under the binomial and negative binomial models are

    > y=9
    > n=10
    > dnbinom(n-y,y,0.9)
    [1] 0.3486784
    > dbinom(y,n, 0.9)
    [1] 0.3874205

Not really identical. But the likelihood principle does not require them to be identical. What is needed is a constant density ratio, and that is easy to verify:

    > dnbinom(n-y,y, prob=seq(0.05,0.95,length.out = 100))/dbinom(y,n, prob=seq(0.05,0.95,length.out = 100))

The result is a constant ratio, 0.9.

However, the posterior predictive check (PPC) will have different p-values:

    > 1-pnbinom(n-y,y, 0.9)
    [1] 0.2639011
    > 1-pbinom(y,n, 0.9)
    [1] 0.3486784

The difference between the PPC p-values can be even more dramatic with another $latex \theta$:

    > 1-pnbinom(n-y,y, 0.99)
    [1] 0.0042662
    > 1-pbinom(y,n, 0.99)
    [1] 0.9043821

Just very different!

Clearly, using the Bayesian posterior of $latex \theta$ does not fix the issue. The problem is that the likelihood principle only guarantees a constant ratio as a function of $latex \theta$, not as a function of $latex y_1$ or $latex y_2$.
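To see that averaging over a posterior does not rescue the check, here is a rough sketch that integrates the two tail probabilities against the posterior of $latex \theta$. It assumes a flat Beta(1, 1) prior, which is a choice made only for illustration; the names `post`, `p_binom`, and `p_nbinom` are likewise illustrative.

```r
# Posterior predictive tail probabilities for y = 9, n = 10.
# ASSUMPTION: flat Beta(1, 1) prior, so the posterior is Beta(y + 1, n - y + 1)
# under BOTH sampling models (their likelihoods are proportional in theta).
y <- 9
n <- 10
post <- function(theta) dbeta(theta, y + 1, n - y + 1)

# Binomial model: n is fixed, y is replicated. Pr(y_rep > y | data).
p_binom <- integrate(
  function(t) pbinom(y, n, t, lower.tail = FALSE) * post(t), 0, 1
)$value

# Negative binomial model: y is fixed, n is replicated. Pr(n_rep > n | data).
p_nbinom <- integrate(
  function(t) pnbinom(n - y, y, t, lower.tail = FALSE) * post(t), 0, 1
)$value

c(binomial = p_binom, neg_binomial = p_nbinom)
```

Under this prior the two posterior predictive p-values come out at roughly 0.26 and 0.46: the same posterior, but different checks, because the replicated quantity differs.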

#### Model selection?

Unlike the unnormalized likelihood in the likelihood principle, the marginal likelihood in model evaluation is required to be normalized.

In the previous A/B testing example, given data $latex (y,n)$, if we know that one and only one of the binomial or the negative binomial experiments was run, we may want to do model selection based on the marginal likelihood. For simplicity we consider a point estimate $latex \hat \theta=0.9$. Then we obtain a likelihood ratio test with ratio 0.9, slightly favoring the binomial model. In fact this marginal likelihood ratio is the constant y/n, independent of the posterior distribution of $latex \theta$. If $latex y/n=0.001$, then we get a Bayes factor of 1000 favoring the binomial model.

Except it is wrong. It is not sensible to compare a likelihood on y and a likelihood on n.
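The number itself is easy to reproduce, which is part of what makes the comparison tempting. A minimal sketch, assuming Beta priors on $latex \theta$ (the helper `marginal_ratio` and the particular priors are hypothetical, chosen only to show that the prior drops out):

```r
y <- 9
n <- 10

# Ratio of marginal likelihoods (negative binomial over binomial)
# under a Beta(a, b) prior on theta. ASSUMPTION: the Beta family is
# used purely for illustration.
marginal_ratio <- function(a, b) {
  prior <- function(t) dbeta(t, a, b)
  m_binom <- integrate(function(t) dbinom(y, n, t) * prior(t), 0, 1)$value
  m_nbinom <- integrate(function(t) dnbinom(n - y, y, t) * prior(t), 0, 1)$value
  m_nbinom / m_binom
}

marginal_ratio(1, 1)  # flat prior
marginal_ratio(2, 5)  # a different, arbitrary prior
```

Both calls return y/n = 0.9 up to numerical error. The arithmetic is fine; the objection is to reading that ratio as evidence, since one density lives on y and the other on n.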

#### What can go wrong in cross-validation

CV requires some loss function, and the same predictive density does not imply the same loss (L2 loss, interval loss, etc.). For concreteness, we adopt log predictive densities for now.

CV also needs some part of the data to be exchangeable, which depends on the sampling distribution.

On the other hand, the calculated LOO-CV of the log predictive density seems to depend on the data only through the likelihood. Consider two model-data pairs $latex M_1: p_1(\theta\vert y_1)$ and $latex M_2: p_2(\theta\vert y_2)$. We compute the LOOCV by $latex \text{LOOCV}_1= \sum_i \log \int_\theta \frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) } \left( \int_{\theta'} \frac{ p_\text{post} (\theta'\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta') }d\theta'\right)^{-1} p_1 (y_{1i}\vert\theta) d\theta,$ and replace all 1 with 2 in $latex \text{LOOCV}_2$.

The likelihood principle does say that $latex p_\text{post} (\theta\vert M_1, y_1)=p_\text{post} (\theta\vert M_2, y_2) $, and if there is some generalized likelihood principle ensuring that $latex p_1 (y_{1i}\vert\theta)\propto p_2 (y_{2i} \vert\theta)$, then $latex \text{LOOCV}_1= \text{constant} + \text{LOOCV}_2$.
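The step from the point-wise proportionality to the constant shift can be spelled out. As a sketch, write $latex c_i$ for the (hypothetical) proportionality constant, so that $latex p_1 (y_{1i}\vert\theta) = c_i \, p_2 (y_{2i} \vert\theta)$ for every $latex \theta$. The leave-one-out predictive density simplifies to

$latex p(y_{1i} \vert y_{1,-i}) = \left( \int_\theta \frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) } d\theta \right)^{-1} = c_i \left( \int_\theta \frac{ p_\text{post} (\theta\vert M_2, y_2)}{ p_2(y_{2i}\vert \theta) } d\theta \right)^{-1} = c_i \, p(y_{2i} \vert y_{2,-i}),$

where the middle equality uses both the equal posteriors and the point-wise proportionality. Taking logs and summing over $latex i$ gives $latex \text{LOOCV}_1 = \sum_i \log c_i + \text{LOOCV}_2$, so the constant is $latex \sum_i \log c_i$.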

Sure, but it is an extra assumption. Arguably the point-wise likelihood principle is such a strong assumption that it would hardly be useful beyond toy examples.

The basic form of the likelihood principle does not have the notion of $latex y_i$. It is possible that $latex y_2$ and $latex y_1$ have different sample sizes: consider a meta-polling with many polls. Each poll is a binomial model with $latex y_i\sim binomial(n_i, \theta)$. If I have 100 polls, I have 100 data points. Alternatively I can view the data as coming from $latex \sum {n_i}$ Bernoulli trials, and the sample size becomes $latex \sum_{i=1}^{100} {n_i}$.

Finally, just as with the marginal likelihood, even if all the conditions above hold and the identity goes through, it is conceptually wrong to compare $latex \text{LOOCV}_1$ with $latex \text{LOOCV}_2$. They are scoring rules on two different spaces (probability measures on $latex y_1$ and $latex y_2$ respectively) and should not be compared directly.
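To make the sample-size point concrete, here is a small sketch with three made-up polls and a flat Beta(1, 1) prior (the counts, the prior, and the helper `lbetabinom` are all assumptions for illustration, not real polling data). Both views lead to the same posterior for $latex \theta$, yet the exact leave-one-out log scores sum over different units.

```r
# HYPOTHETICAL poll data: successes y_i out of n_i respondents.
n_i <- c(10, 20, 30)
y_i <- c(7, 12, 20)
a <- 1  # ASSUMPTION: flat Beta(a, b) prior on theta
b <- 1

# Log pmf of the beta-binomial: predictive for y out of n under Beta(a, b).
lbetabinom <- function(y, n, a, b) {
  lchoose(n, y) + lbeta(a + y, b + n - y) - lbeta(a, b)
}

# View 1: each poll is one data point (3 terms).
loo_poll <- sum(sapply(seq_along(y_i), function(i) {
  lbetabinom(y_i[i], n_i[i],
             a + sum(y_i[-i]), b + sum(n_i[-i] - y_i[-i]))
}))

# View 2: each Bernoulli trial is one data point (sum(n_i) terms).
S <- sum(y_i)
N <- sum(n_i)
loo_trial <- S * log((a + S - 1) / (a + b + N - 1)) +
  (N - S) * log((b + N - S - 1) / (a + b + N - 1))

c(per_poll = loo_poll, per_trial = loo_trial)
```

The two totals differ by a wide margin even though the posterior is Beta(a + S, b + N - S) in both views, which is one way to see that the log score is tied to the chosen sampling space and not only to the likelihood of $latex \theta$.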

#### PPC again

Although it is bad practice, we sometimes compare PPC p-values from two models for the purpose of model comparison. In the y=9, n=10, $latex \hat\theta=0.99$ case, we can compute the two-sided p-value: $latex \min\{ \Pr(y_{sim} > y), \Pr(y_{sim} < y)\}$ for the binomial model, and $latex \min\{ \Pr(n_{sim} > n), \Pr(n_{sim} < n)\}$ for the NB model respectively.

    > min(pnbinom(n-y,y, 0.99), 1-pnbinom(n-y,y, 0.99) )
    [1] 0.0042662
    > min( pbinom(y,n, 0.99), 1-pbinom(y,n, 0.99))
    [1] 0.09561792

In the marginal likelihood and the log score case, we know we cannot directly compare two likelihoods or two log scores when they are on two sampling spaces. Here, the p-value is naturally normalized. Does it mean that the NB model is rejected while the binomial model passes the PPC?

Still we cannot. We should not compare p-values at all.

#### Model evaluation on the joint space

To avoid unfair comparison of marginal likelihoods and log scores across two sampling spaces, a remedy is to consider a product space: both y and n are now viewed as random variables.

The binomial/negative binomial narrative specifies two joint models $latex p(n,y\vert \theta)= 1(n=n_{obs}) p(y\vert n, \theta)$ and $latex p(n,y\vert \theta)= 1(y=y_{obs}) p(n\vert y, \theta)$.

The ratio of these two densities admits only three values: 0, infinity, or a constant y/n.
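A quick sketch of the two joint densities makes the three cases visible. The function names and the choice $latex \theta=0.9$ are illustrative assumptions; the ratio is taken as negative binomial over binomial.

```r
n_obs <- 10
y_obs <- 9

# Joint density on (n, y) when n is fixed by design (binomial experiment).
p_joint_binom <- function(n, y, theta) (n == n_obs) * dbinom(y, n, theta)

# Joint density on (n, y) when y is fixed by design (negative binomial experiment).
p_joint_nbinom <- function(n, y, theta) (y == y_obs) * dnbinom(n - y, y, theta)

# Density ratio at a point (n, y); theta = 0.9 is an arbitrary illustrative value.
ratio <- function(n, y, theta = 0.9) {
  p_joint_nbinom(n, y, theta) / p_joint_binom(n, y, theta)
}

ratio(10, 9)  # both designs allow this point: y / n
ratio(10, 8)  # only the binomial design allows it: 0
ratio(12, 9)  # only the negative binomial design allows it: Inf
```

Points that neither design allows (for example n = 12, y = 8) give 0/0, which R reports as `NaN`; they lie outside both supports, so they do not add a fourth case.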

If we observe several pairs of $latex (n, y)$, we can easily decide which margin is fixed. The harder problem is when we only observe one $latex (n,y)$. Based on the comparison of marginal likelihoods and log scores in the previous sections, it seems both metrics would still prefer the binomial model (now it is viewed as a sampling distribution on the product space).

Well, it is almost correct, except that 1) the sample log score is not meaningful if there is only one observation and 2) we need some prior on models to go from marginal likelihood to the Bayes factor. After all, under either sampling model, the event admitting nontrivial density ratios, $latex 1(y=y_{obs}) 1(n=n_{obs})$, has zero measure. It is legitimate to do model selection/comparison on the product space, but we could do whatever we want at this point without affecting any property in the almost sure sense.

#### Some causal inference

In short, the convenience of inference invariance from the likelihood principle also makes it hard to practise model selection and model evaluation. The latter two modules rely on the sampling distribution besides the likelihood.

To make this blog post more confusing, I would like to draw some remote connection to causal inference.

Suppose we have data (binary treatment z, outcome y, covariate x) from a known model M1: y = b0 + b1 z + b2 x + iid noise. If the model is correct and there are no unobserved confounders, we estimate the treatment effect of z by b1.

The unconfoundedness assumption requires that y(z=0) and y(z=1) are independent of z given x. This assumption is only a statement about causal interpretation, and never appears in the sampling distribution or the likelihood. Suppose there does exist a confounder c, and the true data generating process is M2: y | (x, z, c) = b0 + b1 z + b2 x + c + iid noise, and z | c = c + another iid noise. After marginalizing out c (because we cannot observe it in data collection), M2 becomes y | x, z = b0 + b1 z + b2 x + iid noise, which has the same form as M1 (though its coefficient on z need no longer equal the causal effect). Therefore, (M1, (x, y, z)) and (M2, (x, y, z)) form experiment-data pairs on which the likelihood principle holds. It is precisely the otherwise lovely likelihood principle that excludes any method to test the unconfoundedness assumption.
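A small simulation can illustrate the point. This is only a sketch under added assumptions: all noise terms are standard normal, z is continuous as in the M2 equations above (not binary), and the coefficient values are made up. With these choices the regression of y on z and x is a perfectly well-specified linear Gaussian model, yet its coefficient on z converges to b1 + 1/2 instead of b1.

```r
set.seed(1)
N <- 1e5

# HYPOTHETICAL true coefficients of M2.
b0 <- 1
b1 <- 2
b2 <- -1

x <- rnorm(N)
c_conf <- rnorm(N)        # unobserved confounder c
z <- c_conf + rnorm(N)    # treatment depends on c
y <- b0 + b1 * z + b2 * x + c_conf + rnorm(N)

# The analyst only sees (x, y, z) and fits the form of M1.
fit <- lm(y ~ z + x)
b1_hat <- unname(coef(fit)["z"])
resid_var <- sigma(fit)^2

c(b1_true = b1, b1_hat = b1_hat, resid_var = resid_var)
```

The fitted model looks like M1 with homoscedastic noise (residual variance near 1.5 here), so nothing in the observed-data likelihood flags the confounding; only the causal reading of the z coefficient is off.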

yuling, I think your post is interesting, but I’m not sure what is the main conclusion? It seems like “here are a bunch of issues” but can you summarize what you think we learn from this? I am not sure.

For myself, it seems that to be consistent, to compare between models, we use a mixture with a degree of credibility for each model. I think this produces a coherent comparison of models. What do you think?

Daniel, I guess I was just giving some loose examples of not-likelihood-principle-conforming procedures, and when they would be compatible with the rest of the workflow and when not.

Thank you for an interesting post, Yuling. Just a few questions for clarification, and some comments.

At the end of the second paragraph you say “Other informative priors can exist but is not relevant to our discussion here”. Actually, what prior are you using? You do not seem to specify any prior (on the parameter). Or do you refer to the choice of the binomial model as using an informative prior? Also, which estimator are you using?

Your calculations seem to imply that you are using either a flat prior (Beta(1,1), Laplace-Bayes) and the MAP estimator, or the (improper) Haldane prior (Beta(0,0)) and the posterior mean initially. But what are you using later when you have n=10, y=9 but $\hat\theta=0.99$? To obtain this estimate, would you not need n=100 and y=99 at least? If so, the PPC p-values would no longer be contradictory.

Finally, I don’t think that the likelihood principle is often phrased as an axiom in Bayesian statistics. I never heard this before and it would be highly problematic for some variations of Bayesian statistics. :-)

My take is that the likelihood principle is, as its name says, a principle that is intuitively appealing and makes sense to many people. Furthermore, it is well known that frequentist statistics does not follow this principle. So it was/is used by (some) Bayesians to claim the moral high ground by claiming that they did/do follow the likelihood principle. That was fine before the advent of MCMC, which made the application of Bayesian statistics widely possible, but is not really tenable now. Actually, it was not really tenable then either and doesn’t stand up to closer scrutiny. In Good’s classification of Bayesians[^1], there might be some categories of Bayesians that follow the likelihood principle, but objective Bayesians who use reference priors and (some) subjective Bayesians definitely do not follow that principle.

[^1]: IIRC, Good actually does not use “do you follow the likelihood principle: (a) yes, (b) no” as one of his categorising questions. And, in my opinion, a more crucial question that is missing is “(a) do you think of the prior as part of the data generating process (i.e. as the parameters as truly random variables), or (b) do the parameters have fixed but unknown true values and the prior is used to encode some prior knowledge about parameters”.

> objective Bayesians who use reference priors (…) definitely do not follow that principle

What do you mean by “do not follow that principle”? Does the use of a reference prior somehow go against the likelihood principle?

Yes, and this binomial/negative binomial model is the canonical example.

The likelihood is proportional to $p^9 (1-p)^1$, i.e. the kernel of a B(10, 2) distribution.

The reference prior for the binomial sampling model is B(0.5, 0.5), so the posterior will be B(9.5, 1.5).

The reference prior for the negative binomial sampling model is B(0, 0.5), an improper prior, and the posterior will be B(9, 1.5).

Likelihoods are proportional, so the likelihood principle stipulates that one should make the same inference about the parameter.

But as the priors depend on the sampling methods, and the posteriors differ, a Bayesian analysis using reference priors will lead to different inference depending on which sampling model is chosen, thus violating the likelihood principle.

See also the discussion in Chapter 7.4, p 232ff, of Lee (2012, Bayesian Statistics: An Introduction, 4th ed., John Wiley & Sons) or Lesaffre and Lawson (2012, Bayesian Biostatistics, John Wiley & Sons) who state on page 117 “Jeffreys rule can be derived using other principles [. . . ]. However, it has been criticized because of violating the likelihood principle since the prior depends on the expected value of the log-likelihood under the experiment (probability model for the data)”.

Thanks. I don’t remember if I had seen this argument before. I understand the likelihood principle to say something like “the inference about the parameter θ should depend on the sample data x only through its likelihood function L(θ|x)”. I don’t think it says that the inference shouldn’t depend on the model / prior.

If this is a valid objection to prior selection using reference priors then it would be an objection to any criteria for choosing a prior. The very idea of letting inference depend on the prior would be a violation of the likelihood principle. Presented with multiple analyses that used different priors and produced different inferences we should conclude that at most one is valid, and that would be assuming one of the priors is “right”.

The objection would be valid if the inference depended on the data through the prior, in the case where we let the data dictate the choice of model / reference prior. But that wouldn’t be an issue related to the use of reference priors in particular, the same violation of the likelihood principle would happen for any Bayesian analysis with a data-dependent prior.

There is a weak and a strong likelihood principle. As Yuling said in his blog entry, they essentially postulate that two likelihoods that are proportional to each other should lead to the same inference about the unknown parameter. What you seem to have in mind is part of the phrase that is typically used to “prove” that Bayesian inference complies with the likelihood principle: “The posterior distribution of the parameter depends on the data only via the likelihood, hence the inference complies with the likelihood principle”. But this is a faux argument. Unless I am missing something, to comply with the likelihood principle, the posteriors would need to be proportional if the likelihoods are proportional, which would mean that the priors would need to be proportional.

I am all in favour of doing multiple analyses with different priors as part of a sensitivity analysis. So far I only wondered how a subjective Bayesian (i.e. somebody who sees probabilities as subjective beliefs and priors as expressing one’s belief) would regard such an analysis. They would presumably think that the analyst has multiple personalities? I never thought about what the implication of doing multiple analyses with different priors is on claims on following the likelihood principle. But, yes, I agree with you, it would violate the likelihood principle. But then, as I said at the beginning, I never thought that there is much truth to the statement that Bayesians follow the likelihood principle. :)

The abstract of Birnbaum (1962) says: “The likelihood principle states that the “evidential meaning” of experimental results is characterized fully by the likelihood function, without other reference to the structure of an experiment, in contrast with standard methods in which significance and confidence levels are based on the complete experimental model.”

I guess it depends on what we understand by “evidential meaning”. I would say that the posterior distribution in Bayesian analysis is the result of combining the “evidential meaning” of the data with the prior distribution. The fact that the posterior depends on the likelihood _and_ the prior is not a problem!

A related but different question is what kind of prior selection methods are “acceptable”. Now I see better how changing the prior referencing the structure of an experiment goes “against the spirit” of the LP. But even if the LP didn’t exist, the real problem is that it goes “against the spirit” of Bayesian analysis! The prior is supposed to reflect the plausible values of the parameter and that shouldn’t depend on how we model the sampling distribution of the experiment. In the same way that the prior should not depend on reparametrizations, for example.

Berwin, my point is that PPC looks into the tail of y (the same as hypothesis testing), and does not conform to the likelihood principle. I use a point estimate theta=.99 in the example for convenience (or equivalently a prior strongly favoring theta=1), but the conclusion would not change with other informative priors.

Regarding your distinction on prior-as-part-of-the-data-generating-process versus fixed-unknown-parameters, we discussed this distinction in our recent paper https://arxiv.org/pdf/2006.12335.pdf Section 2.2.

Yuling, my point is that somebody who uses a prior that yields a point estimate of 0.99 with n=10 and y=9 in all likelihood does not worry/care about PPC. Likewise, people who care about PPC would probably not use a prior that yields a point estimate of 0.99 with n=10 and y=9. :)

But thank you for pointing out that preprint of yours!

I have lost interest in the likelihood principle as when all the assumptions are fully clarified or grasped there remains little to no guidance for applying statistics. Or at least that’s my take on Mike Evans’s clarifications of the theorem and Phil Dawid’s take that Birnbaum actually made those clarifications. Pointers to that are likely on this blog somewhere.

But much of what you seem to be addressing could just follow from sufficiency and the likelihood function being a minimal sufficient statistic for the model assumed. But without the data, the model assumed cannot be checked and so we are stuck doing math rather than statistics.