I am gratified to find the discussions on this blog indicating a much greater familiarity with my work, compared with the discussions I found on my last visit here.

I believe the most effective contribution I can make to the current discussion is in the form of three links.

1. First, to an unpublished report that complements my exchange with Rubin, “Myth, Confusion, and Science in Causal Analysis”: https://ucla.in/2EihVyD

2. Second, to a paper titled “Bayesianism and Causality, or, Why I am Only a Half-Bayesian”, in which I explain the disconnect between Bayesianism and causality: https://ucla.in/2nZN7IH

3. Finally, to Section 11.1.1 in Causality (2009), titled “Is the Causal-Statistical Dichotomy Necessary?”: https://ucla.in/2NnfGPQ#page=331

There is comfort, I admit, for researchers to dress causal inference in traditional probabilistic vocabulary; familiar words evoke familiar tools and a sense of safe passage. From a logical viewpoint, however, causality and statistics do not mix, unless one extends the meaning of “statistics” to cover the entire sphere of scientific thought (including, of course, speculations about Cinderella’s hair color, which can be decorated with Bayes priors).

But if the comfort of traditional vocabulary increases researchers’ ability to solve causal problems (like front door, external validity, mediation and missing data), so be it; I am all for it.

Judea Pearl

Hi Anon,

“Informative posterior” in this case is a posterior that is tighter than the prior but might not concentrate on a point (even in the large data limit).

In difficult cases the posterior will tighten in some aspects but reflect the full uncertainty of the prior in other aspects.
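A toy calculation may make this definition concrete. The sketch below is not taken from the papers under discussion; it assumes a deliberately simple non-causal model in which two parameters `a` and `b` have independent N(0, 1) priors and the data only inform their sum. The function name `posterior_variances` and the model itself are illustrative assumptions.

```python
from fractions import Fraction

def posterior_variances(n):
    """Exact posterior variances for a toy model (illustrative assumption).

    Prior: a, b ~ N(0, 1) independently.
    Data: n observations y_i ~ N(a + b, 1), so only the sum a + b is identified.
    The posterior precision is I + n * [[1, 1], [1, 1]]; inverting it
    (Sherman-Morrison) gives covariance I - n / (1 + 2n) * [[1, 1], [1, 1]].
    """
    shrink = Fraction(n, 1 + 2 * n)
    var_a = 1 - shrink                  # posterior variance of a alone
    cov_ab = -shrink                    # posterior covariance of a and b
    var_sum = 2 * var_a + 2 * cov_ab    # Var(a + b): shrinks towards 0
    var_diff = 2 * var_a - 2 * cov_ab   # Var(a - b): stays at its prior value
    return var_a, var_sum, var_diff

for n in (0, 10, 1000):
    var_a, var_sum, var_diff = posterior_variances(n)
    print(n, float(var_a), float(var_sum), float(var_diff))
```

In this sketch the variance of `a` falls from 1 towards 1/2 as `n` grows but never reaches 0, the variance of `a + b` goes to 0, and the variance of `a - b` stays at 2: the posterior tightens in one direction and keeps the full prior uncertainty in another.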

https://gelmanstatdev.wpengine.com/2009/07/05/disputes_about/#comment-49475

> The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy.

I’m struggling to figure out how this can be true, given the completeness of do-calculus? I assume “be applied to” effectively means this: “It is possible to obtain informative posteriors in cases where the query is non-identifiable via the do calculus.” from the blog post “Causal Inference with Bayes Rule”?

But with the assumptions needed for obtaining informative posteriors in non-identifiable cases (non-identifiable via the do-calculus –> non-identifiable, as per the completeness result), can’t information about the causal effect be obtained with the do-calculus as well?

Does there exist a causal query and a set of assumptions that result in Bayesian machinery providing us more information concerning the (causal) answer than do-calculus would?

How about in cases of partial identification and estimating bounds for the causal effect? (E.g. the IDA algorithm, which estimates a set of possible total causal effects instead of the unique total causal effect.)

Thank you, David. It was not completely clear to me that the output of the CausalBayesConstruct algorithm was the PGM defined by the DAG and the set of conditional probability distributions, and that the step “connect the two graphs linking theta_V to the corresponding variable V* in the post interventional graph, for each V excluding T” implied that the parametrization is shared.

Thanks, Daniel. Yes, I think Pearl views probability as an external stochastic process and an intervention as modifying this stochastic process – and on this basis argues that probability theory is insufficient for causal inference. If you place a Bayesian joint on the whole system, you reach a different conclusion.

Ricardo,

It is certainly useful to get your perspective, and I will follow through on your references.

Carlos: Maybe you mean that a conditionally independent and _identical_ assumption is made if theta->x_n. Yes, this is assumed.

Hi Carlos,

There is no point, it is just to demonstrate that standard probability is now being used and the arrow direction has no causal meaning.

(sorry for the delay in responding)

Can you rephrase the question, Carlos? I don’t follow, although maybe Finn can help.

Yes, that exchange between Rubin and Pearl was unfortunate, as it was clear they were talking past each other. Larry Wasserman even pointed this out. I still repeat that it is true that nobody has fundamental issues against turning the crank on probabilistic conditioning to calculate a causal effect (modulo issues of identifiability and nuisances). Pearl has done precisely that in his work with Balke and Chickering in the 90s (Bayesian posterior analysis of an unidentified causal effect is even a theme in a chapter of his book, and joint diagrams of exchangeable DAGs for different units sharing a single parameter node appear in Chickering’s papers). Conditioning on the data has always been totally fine here, because the data are different units from the one where a spurious collider might be conditioned on. That is, conditioned on the model, data points are independent, so this has no implications on choice of adjustment. Pearl didn’t get that Rubin was talking about conditioning on past data, and Rubin didn’t get that Pearl was talking about a choice of estimand (so, the model was being “conditioned on” already. Nothing is stopping us from having this model being the result of a Bayesian estimate). It’s again the conflation of estimand and estimator that plagues discussion on causal inference. Your papers help to clarify the distinction (thanks!), but I also hope it is clearer now where Pearl was coming from.

I honestly felt like I wanted to follow this, but didn’t have the mental energy and time given a bunch of other constraints over the last few weeks. However I too would like to thank you for the effort you put in here.

I found it very frustrating to talk with Pearl regarding these issues (there was a long exchange between us on this blog about 3 or 4 years back), because I came to the conclusion, just as you have, that his understanding of what probability theory and statistics are is entirely frequentist… and my understanding was Bayesian… and so we talked past each other… He even acknowledged knowing about the development of the Cox/Jaynes theory of probability as extended logic, but seemed to gloss over any actual understanding of it.

For example he would insist on a toy problem, and that I calculate some number like the probability for X to happen, and would give me some data… And I would insist that we need a model of the causality in order to calculate this probability, because the probability for X to happen is not a quantity you can calculate from data alone because it represents how much credibility your model of the causality of X occurring gives to each possible outcome…. and so without a model there is no probability. He would essentially insist that I should plug in the frequency in the data and seemed to refuse to acknowledge the idea that probability depends on the specifics of the model.

Suffice it to say that I see p(X | model, data) is *different for each possible structure that the free variable “model” could take on*. If you leave that variable free… then you have an unevaluated purely symbolic quantity… and he insisted I should plug in a number.

In any case, thanks again for putting the time and effort in here.

David (and Finn), thank you for the discussion. Despite “Casual Inference” being mentioned in the header of the blog we don’t get enough of it :-)

> The CausalBayesConstruct shows that you can convert the assumptions in a CGM to a PGM.

The CausalBayesConstruct algorithm doesn’t mention that the original and “twin” nodes satisfy P(V|parents(V)) = P(V*|parents(V*)).

Should we understand that setting those constraints is part of the algorithm and that they are part of the resulting PGM?

Just to highlight some examples, Pearl requested a simple problem to be solved with probability theory (apparently believing it outside probability theory):

https://gelmanstatdev.wpengine.com/2009/07/05/disputes_about/#comment-49482

Incidentally, Andrew Gelman apparently viewed it as outside the Rubin Causal Model:

https://gelmanstatdev.wpengine.com/2009/07/07/more_on_pearls/

Or in Bayesianism versus Dogmatism Pearl states:

“While the Bayesian paradigm teaches us indeed that one should not ignore the prior knowledge in our possession and the variables that we can observe, it does not license us to blindly condition all probabilities on those observations. Instead, it instructs us to think carefully if conditioning would advance us towards the quantity we wish estimated, or away from that quantity”

He chides Rubin for saying: “To avoid conditioning on some observed covariates,… is neither Bayesian nor scientifically sound but rather it is distinctly frequentist and ‘nonscientific ad hockery.’”

Rubin is perfectly correct in his claim that a Bayesian must always condition on everything (although to call the alternative non-scientific could be considered overreach). Pearl is, however, correct that conditioning using the model Rubin mentions can result in incorrect causal effect estimates (Pearl cites the M-bias example, Rubin uses a regression model).

So it isn’t precisely true to say “nobody has ever said that writing a likelihood function from an intervened graphical model and calculating the conditional probability is fundamentally wrong” – although I don’t know how widely held Pearl’s view actually is.

Our work, I think, helps clarify these issues. We condition on all data as Rubin advises yet reach the same conclusions as Pearl. While you can get the correct answer using Pearl’s procedures, you can get it using probability theory too. Our worked examples show how to solve problems like the bell problem and the front door rule in the Bayesian paradigm. If you don’t share these views with Pearl, then we don’t appear to have anything to argue about.
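For readers unfamiliar with the M-bias example, a minimal sketch follows. It is not the Bayesian model from the papers under discussion: it assumes a hypothetical linear-Gaussian M-graph with made-up unit coefficients, and it computes population regression coefficients exactly from the implied covariances. It only illustrates the regression scenario Pearl warns about, in which the collider M is treated as an ordinary covariate.

```python
from fractions import Fraction

# Hypothetical linear-Gaussian M-graph (all coefficients are assumptions):
#   U1 -> T, U1 -> M, U2 -> M, U2 -> Y, T -> Y
# U1, U2 and every noise term are independent with variance 1.
a, b, c, d = 1, 1, 1, 1      # strengths of U1->T, U1->M, U2->M, U2->Y
beta = Fraction(1)           # true causal effect of T on Y

# Population (co)variances implied by the structural equations
#   T = a*U1 + e_T,  M = b*U1 + c*U2 + e_M,  Y = beta*T + d*U2 + e_Y
var_t = a * a + 1
var_m = b * b + c * c + 1
cov_tm = a * b
cov_ty = beta * var_t
cov_my = beta * cov_tm + c * d

# Coefficient on T when regressing Y on T alone
coef_unadjusted = cov_ty / var_t

# Coefficient on T when regressing Y on both T and the collider M
coef_adjusted = (var_m * cov_ty - cov_tm * cov_my) / (var_t * var_m - cov_tm ** 2)

print(coef_unadjusted, coef_adjusted)
```

Under these assumed coefficients the unadjusted regression recovers the true effect, while adding M as a covariate moves the coefficient away from it. This is a statement about that particular regression model, not about conditioning on observed data within a full joint model.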

Ricardo :

The axiom system that I mentioned is usually called the Ramsey-de Finetti-Savage theory of statistics, although if interested I would look at textbooks like those of Lad, Bernardo and Smith, or Kadane (his “Principles of Uncertainty” is online). Pragmatic Bayesians prefer to concentrate on applications and best practice, so there is very little of it in Gelman’s BDA. The Ramsey-de Finetti-Savage theory is characterized by its use of probability as a decision-theoretic primitive, i.e. subjective probability, using the de Finetti representation to apply that probability specification to large spaces. Of course, causal problems involve decision making under uncertainty and are therefore covered by the theory.

I was puzzled by your comment about mathematical basis as it seems to me (a) our approach can be justified by an axiom system (and the do calculus cannot be, and in fact violates conditionality) and (b) that causality is an applied area like statistics not a part of mathematics like probability. If you really believe causality is a part of mathematics that surely has led to us talking past each other.

Your point about it being easier to estimate a marginal than a joint distribution is reasonable – in fact we make the same point ourselves as an advantage of the do-calculus – but it will violate conditionality. There are arguments about whether conditionality is sensible in practice, e.g. see the Robins-Ritov-Wasserman-Sims debate as well as Berger and Wolpert’s book.

More Anonymous :

The CausalBayesConstruct shows that you can convert the assumptions in a CGM to a PGM. You could, however, go directly to the PGM. The factorization behind regression, P(y,t|beta,theta)=P(y|t,beta)P(t|theta), is the same one used for back door adjustment. The factorization, if it applies, makes semi-supervised learning impossible and can be studied and applied without a CGM. In short, these assumptions are routinely made outside causal settings. I therefore don’t consider mapping the assumptions from the CGM into the full joint in the PGM to be exiting probability theory (although I have little enthusiasm to argue this small point – and it is perhaps our only real disagreement).

I didn’t intend to say you can infer causal quantities without causal assumptions – that would be a contradiction. Only that no mathematics outside probability theory is required.

I fully agree that using the method in applied settings is an excellent future avenue for research.

Everybody seems to be dropping out. I sincerely thank you all for your attention to our work and your patience in presenting differing viewpoints (and Prof Gelman for posting!).

Finn, those claims are much more defensible, but also they’re not quite what your arxiv articles say or what your coauthor has said. Do you plan to update your articles and website accordingly?

May I also suggest that your work could draw less of the controversy David mentions if you discuss relevant precursors, like Imbens and Rubin (Ann Statis, 1997), Chickering and Pearl (Comp Sci Stat, 1997), and Balke and Pearl (AAAI, 1994).

One thing I’ve wondered about is how causal identifiability and partial identifiability relate to functional forms of the equations at the nodes of causal graphs. In cases from many fields (chemistry, physics, online advertising… anything really), it makes sense to assume that the equation governing one or more nodes in a causal graph has a specific form, such as linear or polynomial, or even is a noisy version of a specific equation like Michaelis-Menten or Newton’s universal law of gravitation.

Once you assume a functional form, this assumption can affect what is identifiable and partly identifiable. For example, there’s a nice paper by Kuroki and Pearl (Biometrika, 2014) with discussion of how some causal effects become identifiable even in the presence of measurement bias as long as certain nodes are governed by linear equations. However, it doesn’t seem to me that there will be a useful, general theory of how the identifiability of causal graphical models relates to functional forms because the variety of functional forms that need to be considered is too great.

Instead of a general theory, I think that a promising alternative could be to take the specific causal inference problem that is of interest, set up its causal graph and functional form requirement in a Bayesian framework, and then examine the sensitivity of the effect estimate to the priors over the latent variables. Your twin-network-like approach could be a natural home for this!

That would be a boon to causal inference, especially the ability to include specific equations that are scientifically known, like those from physics, microeconomics, etc.

If I were working on your project, I would prioritize this and skip past questions of whether causal inference can be done in pure probability theory. Or… maybe I’m misunderstanding something and my suggestion is bad. Also very possible!

Firstly, thanks to everyone for such constructive suggestions and comments. I think most key points have already been addressed – but just to add my thoughts on a few things.

1) We are not at all trying to suggest that causal inference (inferring the impact of an intervention from observational data) is possible without assumptions. We are indeed using the truncated product formula to determine how the structure of the data generating process is changed by intervention. The only difference between what we are doing and Pearl’s CGMs is the explicit representation of both the system before and after intervention within a single Bayesian network, with distributional assumptions represented by hyperparameters. We obviously need to make this clearer.

2) Our work may not be particularly novel. All we have really done is construct an alternative, explicit representation of the assumptions encoded in a CGM. The fact that you can do this is pretty obvious. However, even if this is an idea that has been around since the start, we could not find it clearly written down anywhere, and we found it to be useful in terms of bridging the gap in terminology & approach to thinking about modelling between causal graphical modellers & Bayesians. Using this representation allows you to resolve disputed questions like: CG modeller: “you can’t just condition on all observed variables, that can lead to biased estimates (e.g. M-graph)”, Bayesian: “but you should always condition on all the information we have”. Although our models do resemble twin networks, they are not representing the same thing. From a philosophical point of view, this representation puts CGMs within the broader approach of: define your model (assumptions), define a prior, collect some data, compute your posterior.

3) From an applied standpoint our work is indeed no panacea. Marginalising out the nuisance parameters is likely to be intractable in many cases. If the problem is identifiable, then we can construct (using the same conditional independence relations underlying the do-calculus) a re-parameterisation that doesn’t require marginalisation over latent variables in the observational data – but then the whole thing is equivalent to just using CGMs & the do-calculus but with different notation. Our approach is more interesting for problems that are not identifiable – particularly those that are ‘almost identifiable’, for example non-linear instrumental variables. In these cases, the posterior will contain some sensitivity to the prior but you may still get useful bounds on causal effects.
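For reference, the truncated product formula mentioned in point 1 can be written as follows. This is the standard textbook form for a DAG with variables $V_1, \dots, V_n$ and parent sets $\mathrm{pa}_i$; the notation is an assumption and is not copied from the papers under discussion.

```latex
% Observational (pre-intervention) factorisation
P(v_1, \dots, v_n) = \prod_{i=1}^{n} P(v_i \mid \mathrm{pa}_i)

% Post-intervention distribution under do(T = t): drop the factor for T
P(v_1, \dots, v_n \mid do(T = t)) =
\begin{cases}
  \prod_{i :\, V_i \neq T} P(v_i \mid \mathrm{pa}_i) \big|_{T = t}
    & \text{if } v \text{ is consistent with } T = t, \\
  0 & \text{otherwise.}
\end{cases}
```

Every factor other than the one for $T$ is shared between the two expressions, which is the sense in which the pre- and post-intervention systems can be tied together inside one model.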

David,

In your CausalBayesConstruct algorithm, you write that the algorithm’s input and output are

Input: Causal graph G and intervention do(T = t)

Output: Probabilistic graphical model representing this intervention.

So, the input already includes a special ingredient you don’t get from pure probability theory: the causal graph. In my example of the two DAGs with different arrow directions, this is the point I was aiming to make.

To quote Nancy Cartwright, “No causality in, no causality out”

Overall, the way causality enters your algorithm is similar to how it enters standard twin networks, which is expected because your approach is very similar to standard twin networks.

Also, I have a suggestion that might foster better reactions to your work from causal inference researchers: In your comments, sometimes it looks like you apply the term “do calculus” to other concepts, like DAG-based causal inference as a whole or the structural interpretation of counterfactuals. At least personally, I have a difficult time following your arguments when this happens. So I suggest trying to avoid that kind of conflation.

…Actually, Ricardo has just now posted a comment that makes this point too, so it’s probably good advice.

I think I’ll stop participating in this conversation here, but it has been a good one. Thank you and good luck.

(Maybe I should allow myself one more post, since I like the discussion :-) )

Hi, David

I think one problem here is the conflation of causal modelling and do-calculus. The former is a language to express assumptions of the type “A causes B” from a primitive notion of intervention. Pearl’s Structural Causal Models and the Rubin Causal Model are examples of this. The latter, do-calculus, is a computational technique to derive identifiability results, built on top of SCM. SCMs are defined and applicable without any requirement for the do-calculus. SCM/RCM exist because “A causes B” has no counterpart in probability, and causal inference is about mathematising such claims. You may say this is not important, but I think this is not a defensible claim. Those who work with stochastic differential equations and Bayesian nonparametrics surely are happy that a mathematicisation of probability exists, even if there will always be people who think (wrongly) that is just a mathematician’s game to keep academics busy (heck, what does it even mean to condition on the outcome of a continuous random variable, which is an event of probability zero?). Similar claims apply to people doing causal inference: it is good to have a formal language to express assumptions that have no “obvious” consequences and where it may be hard to even put them together in a coherent way.

Concerning the do-calculus, it doesn’t make sense to say that the do-calculus lacks “axioms of rational decision”. I don’t even know what this means, since I don’t know which axioms you are referring to (Cox’s? SCM/RCM encompasses probability, adding the concept of intervention, so it comes for free). The do-calculus is all about reducing a causal query to a probabilistic query, if at all possible, from nonparametric assumptions of conditional independence between random variables and interventions (it does *not* reduce causal assumptions to non-causal assumptions. Instead, *given* causal assumptions, it reduces expressions with random variables and interventions to expressions without interventions, if they exist). We don’t need it to do causal inference. Really, nobody ever claimed that. But the alternative is to use further assumptions, including priors about latent common causes. Hence we have (informative) latent variable models, instrumental variables, regression discontinuity, synthetic controls, difference-in-differences, etc., all of which people do resort to since conditional independence statements alone may provide too little information. But if we don’t want to pay the price for these extra assumptions, the do-calculus is the most general tool to solve the nonparametric problem, and nothing more general exists. This is not a matter of opinion.

At the same time, nobody has ever said that writing a likelihood function from an intervened graphical model and calculating the conditional probability is fundamentally wrong (if the assumptions come from a formal language like SCM or RCM). Such ideas were there from the beginning. But this is just not a panacea given that it’s not clear which assumptions about hidden variables were unnecessary or which made the solution highly sensitive to them. And nobody wants to estimate loads of unnecessary nuisance parameters. In your example with three variables X,Y,Z where all you need is the bivariate distribution of two of them, it seems a strange advice to estimate the full joint and then marginalise the useless variables…

> Only once you have the PGM can you re-factorize or reverse arrows.

Given that the model consists of the PGM produced by the CausalBayesConstruct algorithm _plus_ the additional constraints P(V|parents(V)) = P(V*|parents(V*)), what would be the point of re-factorizing or reversing arrows if the model still has to satisfy the constraints defined by the parent-child relationships in the extended CGM?

Hi Carlos,

The arrow directions in the CGM do indeed affect the joint on the PGM. Only once you have the PGM can you re-factorize or reverse arrows. That “causal structure” can be seen through a probabilistic lens of partial exchangeability.

So I think your impression is correct.

David, I have one question regarding the reversal of arrows.

Say that for the Front Door Rule example in “Replacing the do-calculus with Bayes rule” you reverse the arrow from W to Y (and from W* to Y*). Y depends only on U; an intervention on T* should have no effect on Y*.

If the PGM is essentially the same, shouldn’t your derivation remain valid?

I have the impression that you’re introducing the causal structure when you write down your parameterization because you force that the marginal probabilities will be equal in both sides of the network. That gives the right answer in the original problem and the wrong answer when those arrows are reversed and the causality flow changes.
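To make the front door rule concrete, here is a minimal numerical sketch. It uses the standard front-door adjustment formula rather than the Bayesian derivation in the paper, and it assumes a hypothetical binary model U -> T, T -> W, W -> Y, U -> Y with made-up probabilities. The helper names (`bern`, `prob`, `front_door`, `truth`) are illustrative.

```python
from itertools import product

# Hypothetical front-door model with binary variables (numbers are made up):
#   U -> T, T -> W, W -> Y, U -> Y, with U unobserved.
p_u = {0: 0.6, 1: 0.4}
p_t_given_u = {0: 0.2, 1: 0.7}                  # P(T=1 | U=u)
p_w_given_t = {0: 0.1, 1: 0.8}                  # P(W=1 | T=t)
p_y_given_wu = {(0, 0): 0.1, (0, 1): 0.5,
                (1, 0): 0.4, (1, 1): 0.9}       # P(Y=1 | W=w, U=u)

def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals `value`."""
    return p_one if value == 1 else 1.0 - p_one

# Observational joint P(t, w, y), with U marginalised out.
joint = {}
for t, w, y in product((0, 1), repeat=3):
    joint[(t, w, y)] = sum(
        p_u[u] * bern(p_t_given_u[u], t) * bern(p_w_given_t[t], w)
        * bern(p_y_given_wu[(w, u)], y)
        for u in (0, 1)
    )

def prob(**fixed):
    """Marginal probability of the event given by keyword values for t, w, y."""
    return sum(p for (t, w, y), p in joint.items()
               if all({"t": t, "w": w, "y": y}[k] == v for k, v in fixed.items()))

def front_door(t):
    """P(Y=1 | do(T=t)) computed from observational quantities only."""
    total = 0.0
    for w in (0, 1):
        p_w_t = prob(t=t, w=w) / prob(t=t)
        inner = sum(prob(t=t2, w=w, y=1) / prob(t=t2, w=w) * prob(t=t2)
                    for t2 in (0, 1))
        total += p_w_t * inner
    return total

def truth(t):
    """P(Y=1 | do(T=t)) computed from the full model, using the hidden U."""
    return sum(bern(p_w_given_t[t], w) * p_u[u] * p_y_given_wu[(w, u)]
               for w in (0, 1) for u in (0, 1))

for t in (0, 1):
    naive = prob(t=t, y=1) / prob(t=t)          # plain conditioning on T=t
    print(t, front_door(t), truth(t), naive)
```

In this sketch the front-door estimate matches the interventional quantity computed with access to U, while plain conditioning on T does not. The agreement depends on the arrow directions assumed when writing down the model, which is the point the question above is probing.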

Thanks for sharing your thoughts, Ricardo. It has been a good discussion, and the tone of it has been really constructive.

I must say I am struggling to see the point you are making here. Our approach sits on top of axioms of rational decision making, so it has that mathematical basis; something the do-calculus lacks. It is also, as an applied approach, distinct from mathematics, e.g. prior elicitation is required. Perhaps you regard this as “informal”; I see it as a standard part of probabilistic modelling, and it is a prominent part of Bayesian decision theory foundations.

An application of the do-calculus is in contrast piecemeal. Observational data of A,B,C is collected. This is treated as an external stochastic process and an estimator applied. It could be a Bayesian estimator but such a step is fundamentally frequentist. You then read off the causal graphical model in order to modify the estimated joint of P(A,B,C) to give the causal quantities.

In terms of being a formal theory I don’t think the do calculus has the ambition of Bayesian decision theory. It can’t handle statistical estimation. After the estimation is performed it advocates violating the conditionality principle, by dropping variables. By its nature the do calculus is frequentist as it operates with external stochastic processes.

By applying the principles of Bayesian decision theory in a causal model we can condition on all available information, we can combine statistical estimation and determining causal effects in one step. We automatically have an axiom system. I just don’t understand how this can be “informal” compared to the do-calculus.

Thanks for the continued discussion, David. I don’t want to be too repetitive, so I’ll make this my final message in this thread. What More Anonymous (I think) and I point out is: you still haven’t defined what intervention is. A very mainstream view is that it is worthwhile to provide a mathematical definition of it just like it is with probability (this guy here agrees: https://gelmanstatdev.wpengine.com/2018/12/26/what-is-probability/). Lots of informal talk above boils down to “so-what-I-don’t-need-to-define-formally-everything-in-what-I-do-etc etc.”. If causal inference has any meaning as part of Statistics, nobody should buy that: to paraphrase, “none of [the above] is the foundation of [causality]; rather, [causality] is a mathematical concept which applies to various problems.” I surely don’t agree with Pearl’s vision that statisticians don’t know how to talk about causality (well, at least not all of them), but to go in this direction of avoiding a mathematical treatment is basically to admit that.

Before going, let me say this: I genuinely appreciate the views in your papers of expressing a model under intervention by explicit likelihoods, and at the very least it is a way of bringing more people to understand causal graphical models. It is not entirely novel, as it relates to the truncated factorisation you can see in some causal inference tutorials and books, but having this more explicitly linked to the representation of parameters in the Bayesian case is helpful and will help several readers to understand the concepts better.

More Anonymous (responding above to your question below), thanks for your continuing interest!

You correctly point out that if you have many draws from P(A,B,C) you could predict missing entries in your observational study or predict A if you knew C of a missing entry A etc but you can’t know how to extend to a new situation where you set C, without further assumptions. You point to an example where you might regress A on C and another where you might just look at the marginal distribution of A. I agree.

If we are to isolate our disagreement, I think probability theory is perfectly capable of distinguishing between the two scenarios you mention when you consider the joint between the observational data and the new case where you intervene. The two scenarios have the same marginal distribution over the observational data. The two scenarios differ only in an assumption of partial exchangeability between the observed data and the data relating to the intervention, i.e. what is the effect of exchanging A in the observed data with the response A* assuming that C=C*.

The fact that you can’t look at the joint of the observational P(A,B,C) and work out the causal effect of setting C does not mean you need to exit probability theory. You can simply model the full joint of the observations and the new system with the intervention.

I also agree that this assumption is not something that can be tested by looking at the historical data (if you only have access to data without interventions). AFAICT our only disagreement is if the assumption can be expressed with only probability theory.

David, thanks for staying with this discussion! It’s appreciated.

With observational data, the distribution of (A, B, C) in general does not allow one to distinguish between the two graphs that I mentioned (A right-arrow B right-arrow C and A left-arrow B right-arrow C). This is because the graphs imply the same conditional independencies (A independent of C given B, and no other independencies).

Therefore, when you are estimating the effect of A on C from observational data on A, B, C, there is no purely probabilistic way to decide whether you apply CausalBayesConstruct to A right-arrow B right-arrow C or to A left-arrow B right-arrow C. Extra-probabilistic information is needed to decide this, and is supplied by our assumptions about the causal relationships between the variables, as expressed in the DAG.

In this way, extra-probabilistic information contributes to CausalBayesConstruct.
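The equivalence argument can be checked numerically. The sketch below assumes made-up binary conditional probabilities for the chain A -> B -> C, re-expresses the same observational joint as the fork A <- B -> C, and then computes the effect of setting A on C under each causal reading. All names and numbers are illustrative assumptions.

```python
from itertools import product

def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals `value`."""
    return p_one if value == 1 else 1.0 - p_one

# Chain A -> B -> C with made-up binary conditional probabilities.
p_a = 0.3                       # P(A=1)
p_b_given_a = {0: 0.2, 1: 0.9}  # P(B=1 | A=a)
p_c_given_b = {0: 0.1, 1: 0.7}  # P(C=1 | B=b)

joint = {(a, b, c): bern(p_a, a) * bern(p_b_given_a[a], b) * bern(p_c_given_b[b], c)
         for a, b, c in product((0, 1), repeat=3)}

# Re-express the same joint as a fork A <- B -> C: P(b) P(a | b) P(c | b).
p_b = {b: sum(p for (a, bb, c), p in joint.items() if bb == b) for b in (0, 1)}
p_a_given_b = {b: sum(p for (a, bb, c), p in joint.items() if bb == b and a == 1) / p_b[b]
               for b in (0, 1)}
fork_joint = {(a, b, c): bern(p_b[1], b) * bern(p_a_given_b[b], a) * bern(p_c_given_b[b], c)
              for a, b, c in product((0, 1), repeat=3)}

def c_do_a_chain(a):
    """P(C=1 | do(A=a)) if the arrows mean A -> B -> C: A still drives B."""
    return sum(bern(p_b_given_a[a], b) * p_c_given_b[b] for b in (0, 1))

def c_do_a_fork(a):
    """P(C=1 | do(A=a)) if the arrows mean A <- B -> C: setting A cuts B -> A."""
    return sum(p_b[b] * p_c_given_b[b] for b in (0, 1))

print(max(abs(joint[k] - fork_joint[k]) for k in joint))   # observational gap
print(c_do_a_chain(0), c_do_a_chain(1))                    # differs with a
print(c_do_a_fork(0), c_do_a_fork(1))                      # does not depend on a
```

The two factorisations give the same observational joint up to floating-point error, yet the interventional answers differ, so the choice between them has to come from somewhere other than the observed distribution of (A, B, C).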

I would use a different joint distribution in each of the two cases you describe:

P(A*,A1..N,B1..N,C1..N|C*)

but regardless of the case I would condition to compute:

P(A*|C*,A1..N,B1..N,C1..N)

Because it is just conditioning you can re-factorize at will e.g.

P(A*,A1..N,B1..N,C1..N|C*) = P(A*,A1..N|C*,B1..N,C1..N)P(B1..N,C1..N|C*)

or any other way you like.

I would only do the re-factorizing after applying the CausalBayesConstruct. It would be reasonable to say that the arrow directions in the CGM imply a different joint distribution in the PGM.
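Spelled out, the conditioning step above is just the definition of conditional probability. The shorthand $D$ for the observed data is an assumption introduced here for readability, and the sum is written for a discrete $A^*$ (it becomes an integral in the continuous case).

```latex
D = (A_{1..N},\, B_{1..N},\, C_{1..N})

P(A^* \mid C^*, D)
  = \frac{P(A^*, D \mid C^*)}{P(D \mid C^*)}
  = \frac{P(A^*, D \mid C^*)}{\sum_{a} P(A^* = a,\, D \mid C^*)}
```

Any re-factorisation of $P(A^*, D \mid C^*)$ leaves this ratio unchanged, so the differences between the two cases enter only through which joint is written down in the first place.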

The graphs are still malformed… the first graph is A right-pointed-arrow B right-pointed-arrow C and the second graph should be A left-pointed-arrow B right-pointed-arrow C.

My above comment is malformed… trying again:

David, you write, “Because we are using only probability theory you can indeed reverse arrow directions and get the same result.”

So, given observational data on A, B, and C, your model gives the same effect of A on C for graph A — > B — > C and graph A C?

If so, how do you deal with the problem that the effect should be (generally) nonzero in the first graph and zero in the second?

If not, how are you choosing the directions of the arrows without invoking causality?

We seem to be getting much farther away from agreement as this discussion proceeds.

David, you write

> Because we are using only probability theory you can indeed reverse arrow directions and get the same result.

So, given observational data on A, B, and C, your model gives the same effect of A on C for graph A -> B -> C and graph A C?

If so, how do you deal with the problem that the effect should be (generally) nonzero in the first graph and zero in the second?

If not, how are you choosing the directions of the arrows without invoking causality?

We seem to be getting much farther away from agreement as this discussion proceeds.

Because we are using only probability theory you can indeed reverse arrow directions and get the same result. To be clear you need to use the correct model but you can re-factorize it at will using Bayes rule.

However, as we apply the de Finetti representation (when we use a plate), this fixes some arrow directions in a practical sense. The model has a compact form involving a product of conditionally identical densities. You may lose this form if you try to re-factorize it.

I think we mostly agree: the arrows’ presence and direction are closely tied to the partial exchangeability relationships, and these are indeed closely related to causality. However, it is possible to consider partial exchangeability without causality.

]]>Thanks Carlos,

I find your response quite reasonable and am almost inclined to agree with it. After all if all we are arguing about is if the idea is original we are not arguing about much.

However, if twin networks are able to solve these problems (e.g. the front door rule, M-bias), this has, to the best of my knowledge, not been demonstrated, and it has been forgotten by Pearl in recent discussions.

In the original statistics and medicine debate between Pearl and Rubin a key point was conditioning. Rubin wanted to do the Bayesian thing and condition on everything. Pearl warned about conditioning on the wrong variable and discussed M-bias where conditioning appeared to be a problem.

In our work we can condition on everything, as Rubin and Bayes suggest, and get the same answer as Pearl. Maybe Pearl knew this could be solved in a Bayesian way by conditioning on everything (in a twin network), but he does not mention it. Instead he asks how Bayes solves the “bell problem”, which I took to be a sincere (and reasonable) question.

]]>David, Thanks again for your response.

For your discussion about ingredients beyond probability theory, please see Carlos’s comment about the twin networks approach and your work:

The point is that both approaches qualify to the same extent regarding the “using plain probability” claim.

Also, you write

For what it is worth we don’t use arrow direction for causality…

Try switching arrow directions in your twin networks and re-running your analyses. To my understanding, sometimes you will get the right answer and other times not. The way to guarantee that you get the right answer is to use arrow direction for causality. Therefore, there is real substance to the semantic association of arrows with causality. It’s consequential.

PS, note I’m using “semantic” as in “semantic vs. syntactic” not as in “semantic = superficial”.

]]>> I still see big differences in intent of twin networks and our proposed method.

Let’s focus for a moment on the similarities. You have the same networks that they do. If you’re allowed to mutilate the graph without the need for “do-calculus or something similar” then so are they.

The left side of the networks represents the real world, the right side the counterfactual world. In their paper the link is provided by exogenous disturbances ϵ which cannot be influenced by the forcing of any endogenous variable in the model. This allows evidence from the real-world network to propagate to the counterfactual network.

As you say, estimation splits naturally into two parts. In both cases, the calculation in the “causal projection” phase comes down to things like P(Y*=1|T*=0) = P(Y*=1|T*=0,Z*=0)P(Z*=0) + P(Y*=1|T*=0,Z*=1)P(Z*=1), where the different conditional probabilities are a direct consequence of the particular model that results from the “abduction” phase (specified by the posterior distributions of the parameters or the probability distribution of response functions).
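As a rough illustration of that “causal projection” step, here is a minimal sketch with made-up numbers for a binary Z that influences both T and Y. The tables stand in for whatever the “abduction” phase produced; they are assumptions for the example, not values from either paper. The sketch also computes the plain observational conditional, to show that the two differ when Z affects treatment.

```python
# Hypothetical quantities that an "abduction" phase might have delivered.
P_Z = {0: 0.6, 1: 0.4}                      # P(Z = z)
P_T1_GIVEN_Z = {0: 0.2, 1: 0.8}             # P(T = 1 | Z = z), observational
P_Y1_GIVEN_TZ = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y = 1 | T = t, Z = z)
                 (1, 0): 0.3, (1, 1): 0.9}


def projected_probability(t):
    """P(Y*=1 | T*=t) = sum_z P(Y=1 | T=t, Z=z) P(Z=z): Z keeps its marginal."""
    return sum(P_Y1_GIVEN_TZ[(t, z)] * P_Z[z] for z in P_Z)


def observational_probability(t):
    """P(Y=1 | T=t) = sum_z P(Y=1 | T=t, Z=z) P(Z=z | T=t): Z is reweighted by T."""
    def p_t_given_z(z):
        return P_T1_GIVEN_Z[z] if t == 1 else 1.0 - P_T1_GIVEN_Z[z]
    p_t = sum(p_t_given_z(z) * P_Z[z] for z in P_Z)
    return sum(P_Y1_GIVEN_TZ[(t, z)] * p_t_given_z(z) * P_Z[z] for z in P_Z) / p_t


projected = projected_probability(0)
observed = observational_probability(0)
```

With these numbers the projected value for T*=0 is 0.26, while the observational conditional is about 0.157, because untreated units are mostly Z=0 in the observed data.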

There is a difference in the kind of model that you implement. Yours is better fleshed out and more Bayesian. It may also have its own problems, I don’t know. The point is that both approaches qualify to the same extent regarding the “using plain probability” claim.

]]>Anon:

Yes, to apply probability theory to X, you need more than probability theory. And this is true for every X.

]]>My comments here just address the question of whether there is something to causality that can’t be modeled within a purely probabilistic framework. The claim made by Pearl is that probability distributions aren’t enough; I’m pointing out that Pearl himself shows how to reduce everything he does to a purely probabilistic model.

I’m not talking about trying to learn causal structure, which I agree is very difficult. I mention “non-parametric structural equation models” simply because that’s the general framework for Pearl’s causal graphs — he’s not assuming any particular parametric form. I don’t think that commits one to actually doing everything nonparametrically. BTW, my impression is that Pearl’s emphasis is not on trying to infer causal structure, but on making causal assumptions explicit and exploring their consequences in a rigorous fashion.

]]>Thanks for the comments! I wrote a long response – hopefully it is caught in a spam filter and will appear..

]]>Thanks for your kind words!

I think we are perhaps claiming less than that.

To paraphrase Daniel Lakeland, I don’t think you can even do statistics with pure probability theory. You need some additional framework like Bayesian decision theory. So is Bayesian decision theory as applied to statistical problems enough to solve causal problems? Even there you could argue either way. You need at least one additional ingredient that isn’t (typically) in the Bayesian decision theoretic literature, i.e. you need a random variable, let’s call it y*, which will change on a treatment, let’s call it t*. (Is this separate from standard Bayesian decision theory? It is hard to say and seems like a semantic point, but it should be emphasised more in the BDT literature.)

You also need to be more careful about your assumptions in a causal setting. A really useful one, if it applies, is that you can factorise the “model”: p(y_1..n, x_1..n, y*|t*) = \int \prod_n p(y_n|x_n,beta) p(x_n|theta) p(beta) p(theta) p(y*|t*,beta) dbeta dtheta. This is useful because under such an assumption you can regress y on t to get the causal effect. It can be viewed as an assumption of partial exchangeability, where the decision t* has a partial exchangeability relationship with the observed t_1..t_n. You can swap y_{n1} with y_{n2} if t_{n1}=t_{n2}, and this partial exchangeability relationship must also extend to y* and t*.
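To make the factorisation concrete, here is a minimal sketch under assumptions that are mine rather than the paper's: the outcome and the treatment are binary, the x_n in the formula is taken to be the historical treatment, beta consists of one success probability per treatment arm, and each has an independent Beta(1, 1) prior. Under those assumptions the predictive P(y*=1|t*, data) is available in closed form and only uses the historical outcomes from the matching arm, which is the "regress y on t" statement in its simplest form.

```python
def predictive_y_star(data, t_star, prior_a=1.0, prior_b=1.0):
    """P(y* = 1 | t*, data) under a Beta(prior_a, prior_b) prior per treatment arm.

    `data` is a list of (t_n, y_n) pairs with binary entries. Because beta is
    shared between the historical outcomes and y*, conditioning on the data
    updates beta, and theta (the treatment-assignment parameter) drops out.
    """
    outcomes = [y for t, y in data if t == t_star]
    successes = sum(outcomes)
    return (prior_a + successes) / (prior_a + prior_b + len(outcomes))


# Hypothetical historical data: 10 treated units (8 successes), 10 untreated (3 successes).
data = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 3 + [(0, 0)] * 7

p_treated = predictive_y_star(data, t_star=1)
p_untreated = predictive_y_star(data, t_star=0)
effect = p_treated - p_untreated
```

If the factorisation does not hold, for example because an unobserved variable links treatment assignment and outcome, this shortcut is no longer justified and the full joint has to be modelled.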

Is this using some causal concept beyond probability theory? I am not sure this is an important question. It is pretty much standard Bayesian decision theory with special attention to exchangeability relationships that apply in a causal setting. Depending on how you define Bayesian decision theory we are perhaps using some mild extensions. For what it is worth we don’t use arrow direction for causality, but it turns out when you apply the de Finetti representation (the plate in a PGM) some arrow directions become fixed for practical purposes (draws are no longer identical if you try to reverse arrows).

Pearl has I think more than anybody studied the types of partial exchangeability that permit identifiability, and there is a lot we can learn from the do-calculus.

I appreciate all the pointers. I am a bit under pressure at the moment, but I will read them all.

I agree with you that a thorough discussion of twin networks is missing from the article. I also agree I oversimplified them in my previous comment.

I still see big differences in intent between twin networks and our proposed method. We have one joint distribution that models experimental data and future outcomes dependent on decisions we make now. Twin networks, like the do-calculus, modify an (already estimated) probability distribution to answer causal questions (historical counterfactual questions in the article). The concept of mutilation again is based on modifying a probability, not one full joint. The notion that reality is a stochastic process, and that if you apply a treatment you modify the graph and change the stochastic process, is at its heart frequentist. A big difference is that we consider a single realisation of the system and place one joint over it. I can see that if frequentist probability is used there is a clear need for the do-calculus or something similar.

Thanks again for your interest, encouragement and references.

It would be good to get to the bottom of whether this disagreement is more than semantic.

]]>Hi David — Thanks very much for your helpful response! I considered it and looked through your papers and website more.

In short, you have a project and some claims about your project. As I see it, your project is Bayesian causal inference with a twin-network-like approach and special attention to latent variables. Your project seems great! I feel very positively about it. You also have major claims about the project, for example that it shows causal inference is possible in pure probability theory. I disagree with the claims.

Let’s start with the claimed demonstrations of causal inference in pure probability theory. Most causal inference researchers would say your demonstrations already use an ingredient that is external to pure probability theory, namely the semantic association of causation with the arrows in your probabilistic graphical models (PGMs), and the particular mutilation of the PGMs to examine effects of actions. From this perspective, your demonstrations are already extraprobabilistic in nature. Therefore, they are incapable of showing that causal inference is possible in pure probability theory.

To support my position, I refer you to “Probabilistic Graphical Models” by Koller and Friedman. The authors spend the first 20 chapters of their book developing PGMs in pure probability theory, without causality. Then in chapter 21 causality is added through the short definition that a causal model is a Bayesian network which, in addition to answering probabilistic queries, can answer do() queries through mutilation.

Seeing this definition of a causal model, you may think it adds only a modicum beyond pure probability theory, and you may therefore think Pearl is making a mountain out of a molehill in his distinction between causal inference and pure probability theory. But either way, that’s a matter of opinion separate from the topic at hand.

I do think it would be good to add more on twin networks in your articles. To my eye, the “CausalBayesConstruct” algorithm is essentially the same as the procedure for constructing twin networks, which appears in many papers. You may have reinvented twin networks (that’s impressive!), but they should be acknowledged. You state that the resemblance between your approach and twin networks is superficial, but maybe you haven’t had enough time to look through the literature on them. With all the blog comments to get through, that’s understandable.

You also state

A notable practical difference is that all the “parameters” in this setup are discrete (their first example). We use continuous parameters…

Twin networks apply to both discrete and continuous variables. There may be confusion because the twin network approach is often paired with response variables / canonical partitions / principal strata, which take discrete values. Actually, response variables might simplify your computation problems. Node merging might also help.

You write

The fact that Bayesian machinery can be applied to a causal problem that isn’t identifiable with the do-calculus is also noteworthy. … The CGM community have seemed a little hostile to the idea that priors over latent variables (unobserved confounders) can help solve these sorts of problems.

I’m not sure why you are encountering hostility. For Bayesian CGM work on priors over latent variables in unidentified models, see Pearl’s “Causality” section 8.5 — which covers Chickering and Pearl (1997) — and the Koller and Friedman book. I also thought full prior distributions were used in the Balke and Pearl article that proposed twin networks, but I was wrong. Thanks to Carlos for pointing out my mistake.

Finally, if you’ve reinvented twin networks, then you may be in an excellent place to greatly advance their study and use. Reinventing something can confer a depth of understanding books and classes just don’t give. If I were you, I’d capitalize on it!!

]]>Kevin:

I discussed some of this in my review essay from a few years ago. The problem with the nonparametric structural equation approach is that it relies on identifying patterns of conditional independence, but in the problems I work on in social and environmental sciences, there are no true zeroes. So I’m skeptical of the throw-lots-of-data-into-the-computer-and-learn-causal-structure attitude.

]]>See above comment about Pearl himself reducing causality to probability in Section 3.2.2 of his book Causality. The problem with previous attempts to reduce causality to probability was a modeling issue: you need to explicitly model interventions as variables in your model. If xi is the intervention variable for variable x, then conditioning on xi = “set x to v” has the same effect as do(x = v) has on the corresponding causal model.
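A minimal sketch of the “interventions as variables” idea, with hypothetical numbers and names of my own choosing: a confounder Z affects both X and Y, and an extra parent F of X takes the values "idle", "set0" and "set1". When F is idle, X follows its usual conditional distribution; otherwise X is forced. Ordinary conditioning on F in the augmented joint then reproduces the do() result.

```python
from itertools import product

# Hypothetical model: Z -> X, Z -> Y, X -> Y, plus an intervention variable F -> X.
P_Z = {0: 0.6, 1: 0.4}
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.9}
# Any strictly positive prior on F works; it cancels once we condition on F.
P_F = {"idle": 0.5, "set0": 0.25, "set1": 0.25}


def p_x_given(x, z, f):
    """P(x | z, f): the natural mechanism when idle, a point mass when forced."""
    if f == "idle":
        return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]
    forced_value = 1 if f == "set1" else 0
    return 1.0 if x == forced_value else 0.0


def augmented_joint():
    """One joint distribution over (F, Z, X, Y)."""
    joint = {}
    for f, z, x, y in product(P_F, (0, 1), (0, 1), (0, 1)):
        p_y1 = P_Y1_GIVEN_XZ[(x, z)]
        p_y = p_y1 if y == 1 else 1.0 - p_y1
        joint[(f, z, x, y)] = P_F[f] * P_Z[z] * p_x_given(x, z, f) * p_y
    return joint


def prob_y1_given(joint, condition):
    """P(Y=1 | condition), where condition is a predicate on (f, z, x)."""
    num = sum(p for (f, z, x, y), p in joint.items() if y == 1 and condition(f, z, x))
    den = sum(p for (f, z, x, y), p in joint.items() if condition(f, z, x))
    return num / den


joint = augmented_joint()
p_intervene = prob_y1_given(joint, lambda f, z, x: f == "set1")
p_observe = prob_y1_given(joint, lambda f, z, x: f == "idle" and x == 1)
p_do_formula = sum(P_Y1_GIVEN_XZ[(1, z)] * P_Z[z] for z in P_Z)
```

Here `p_intervene` matches the adjustment formula `p_do_formula`, while `p_observe` is the usual confounded conditional. Note that the construction still needs to know which mechanism F overrides, and that is where the causal assumptions enter.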

For a brief, informal treatment of this idea, see this slide deck from a presentation I gave in 2012, Fully Bayesian Causality. You probably want to just jump to slide 20.

]]>To reiterate, Pearl himself shows how to reduce causal graphs and do actions to plain old probability theory in Section 3.2.2, “Interventions as Variables”, of _Causality_, Second Edition. This is done in terms of what amounts to a non-parametric structural equation model. He writes about “[interpreting] the causal reading of a DAG in terms of functional, rather than probabilistic, relationships,” but of course, a functional relationship is just a degenerate conditional probability distribution.

]]>Thanks for the further comments, David. I agree that if a prior *is* an appropriate assumption, then we should go for it. (In the paper I mentioned, I describe a study by Greenland on using priors about smoking, which was a latent confounder for a separate dataset concerning the effect of occupation on lung cancer development. He had a prior linking smoking and the occupation of the worker. The prior came from postulating exactly what the hidden variable was meant to be and using separate sources of information about this selection bias. This is totally fair game, although even there this may not suffice if we have other “unknown unknowns”, latent common causes we have no idea exist or what they might be.)

Concerning what else we can do: well, I think the answer is conceptually simple. Just admit what you don’t know. If the data just can’t tell the difference between two different causal effects compatible with it (and I’m not talking about statistical variability only), *report everything*. If it’s too uninformative, well, tough. Maybe it motivates performing different measurements. Maybe (with some caveat emptor) you can try to elicit additional believable assumptions to provide alternative and more precise estimates with the data you already have (while still reporting what weaker assumptions can’t tell).
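One simple way to “report everything” is to give bounds rather than a point estimate. The sketch below uses the classical no-assumption (Manski-style) bounds for a binary treatment and outcome. This is a swapped-in textbook technique chosen for illustration, not the method of the paper mentioned in this comment, and the observational probabilities are hypothetical.

```python
def no_assumption_bounds(p_joint, t):
    """Bounds on P(Y=1 | do(T=t)) from an observational joint p_joint[(t, y)].

    Units observed with T=t reveal their outcome under t. For units observed
    in the other arm, the outcome under t is unknown and could be anywhere
    between all zeros (lower bound) and all ones (upper bound).
    """
    p_observed_success = p_joint[(t, 1)]
    p_other_arm = sum(p for (tt, _), p in p_joint.items() if tt != t)
    return p_observed_success, p_observed_success + p_other_arm


def effect_bounds(p_joint):
    """Bounds on P(Y=1 | do(T=1)) - P(Y=1 | do(T=0))."""
    low1, high1 = no_assumption_bounds(p_joint, 1)
    low0, high0 = no_assumption_bounds(p_joint, 0)
    return low1 - high0, high1 - low0


# Hypothetical observational joint over (T, Y).
P_TY = {(1, 1): 0.3, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.4}

bounds_treated = no_assumption_bounds(P_TY, 1)
bounds_effect = effect_bounds(P_TY)
```

Without further assumptions the interval for the effect always has width one, so it always contains zero. That is the “too uninformative, well, tough” case, and it shows how much work any added assumption or prior is doing.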

And it’s still possible to be a full card-carrying Bayesian there. Just construct your likelihood to reflect what the data can tell about the parameters. In the paper I mentioned, a Bayesian approach is used. The likelihood is not a latent variable model: what is the point, if we don’t know what the latent variable is in order to draw informative priors from a magical hat? The likelihood is the marginal among the observables from whatever the latent variable model might be, as long as it agrees with information I can actually assess from data.

I’m less concerned about the point I made above about the irreducibility of causality to non-causal terms: even when people refuse to believe this, it looks like many do modelling as if they agree with it anyway (you wouldn’t flip those edges in the causal graph of your examples even if no observational data could distinguish among them, would you?). But the identifiability issue is serious. The folk knowledge of “identifiability doesn’t matter for Bayesians”, I’m afraid to say, is pseudo-science in this context. It’s one thing to say “look at my massive Bayesian neural net making awesome predictions even if its likelihood function is supercrazy”. In this context, identifiability really doesn’t matter. But if we have an extrapolation problem, like predicting effects of interventions in the data where interventions didn’t take place, now that’s a different game.

On an unrelated note, I would look into the problems of performing more than one intervention in a single system. People like James Robins have been doing that since the 80s (with real applications, not toy ones), and only recently people are coming to terms that much of what he was doing (and related work by Pearl and others) is directly relevant to off-policy reinforcement learning.

]]>I totally understand where you are coming from, since the whole point of something like the do-calculus (or Robins’s G-computation, or other identifiability results starting with Rubin’s ignorability + consistency conditions) is to reduce estimands expressed with interventions + random variables to something with random variables only. This is a type of reduction, but not one that starts from scratch. It starts from a causal graph (or related assumptions – I agree you don’t “need” a graph, in the same way we don’t need syntactic sugar to express independences between interventions and random variables. But it does help, doesn’t it? The machinery was right there, from the graphical model literature.). So this is not the same as saying “we reduced causality to probability”. In fact, Pearl has an entire section in his book about how all notions of “probabilistic causality” (causality defined from non-causal terms, particularly probability) led to failure. This is a point repeated many times right from the beginning.

]]>Forget what I said. Looking again at Balke and Pearl, I now understand the prior probability on the response functions as a set of parameters, instead of hyperparameters with their own prior distribution as I imagined (influenced surely by your model).

]]>Hi David,

I wouldn’t say that the similarities are superficial. If what Balke and Pearl present is “an alternative to the do calculus” so is your proposal, I think, as it creates the same augmented network. Their “response functions”, random variables that take as many values as there are deterministic functions between the parents of a node and the node, are directly related to your P(V|parents(V)).

For example, you write that “Parameterizing the conditional distribution P(t|z) requires two, ϕ0 and ϕ1 to represent P(t|Z=0) and P(t|Z=1) respectively”. My understanding is that their proposal for a functional specification (equations 2-5, where a and b stand for z and t) contains your model. If we call p1, p2, p3 and p4 the probabilities of the four mappings (they sum to one, so there are effectively three parameters), the parameters in your model can be recovered: ϕ0 = p3+p4 and ϕ1 = p2+p4.
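The relation between the two parameterisations can be checked by enumeration. The sketch below assumes that ϕ0 and ϕ1 denote P(t=1|Z=0) and P(t=1|Z=1), and that the four mappings are ordered as constant 0, identity, negation, constant 1. That ordering is my assumption, chosen because it reproduces the identities just stated; the probabilities are made up.

```python
# The four deterministic mappings from binary z to binary t (assumed ordering).
RESPONSE_FUNCTIONS = (
    lambda z: 0,      # mapping 1: t = 0 whatever z is
    lambda z: z,      # mapping 2: t = z
    lambda z: 1 - z,  # mapping 3: t = not z
    lambda z: 1,      # mapping 4: t = 1 whatever z is
)


def phi(p, z):
    """P(t=1 | Z=z): total probability of the mappings that send z to 1."""
    return sum(p_r for p_r, f in zip(p, RESPONSE_FUNCTIONS) if f(z) == 1)


# Two hypothetical priors over the mappings (p1, p2, p3, p4), each summing to one.
p_first = (0.1, 0.2, 0.3, 0.4)
p_second = (0.0, 0.3, 0.4, 0.3)

phi_first = (phi(p_first, 0), phi(p_first, 1))
phi_second = (phi(p_second, 0), phi(p_second, 1))
```

Both priors give ϕ0 = 0.7 and ϕ1 = 0.6. The map from (p1, p2, p3, p4) to (ϕ0, ϕ1) is many-to-one, which is one way to see the response-function specification as containing the conditional-probability one: the extra degree of freedom matters only for counterfactual queries, not for P(t|z).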

Their “party” example assumes that we are supplied with the model, but in principle it would be estimated from the data as you do. Note that there is no data in the example. When they introduce the notation they say that “As part of the complete specification of a counterfactual query, there are real-world observations that make up the background context.” When discussing the response-function variables they say that “The prior probabilities on these response functions P(r_b) in conjunction with f_b(a, r_b) fully parameterizes the model.”

From their conclusion: “World knowledge is represented in the language of modified causal networks, whose root nodes are unobserved, and correspond to possible functional mechanisms operating among families of observables. The prior probabilities of these root nodes are updated by the factual information transmitted with the query, and remain fixed thereafter. […] At this time the algorithm has not been implemented but, given a subjective prior distribution over the response variables, there are no new computational tasks introduced by this formalism and the inference process follows the standard techniques for computing beliefs in Bayesian networks. If prior distributions over the relevant response-function variables cannot be assessed, we have developed methods of using the standard conditional-probability specification of Bayesian networks to compute upper and lower bounds of counterfactual probabilities.”

]]>sorry Ricardo not Ricado.

]]>In all the examples we have worked through, you can obtain the same answer using either an appropriate modelling setup with conditional probability or the do-calculus, but we haven’t managed to show that this is generally the case.

Pearl has requested that solutions to toy problems are made in a Bayesian framework, we think we have made a first step in this direction.

I think it would be valuable to have a proof of the equivalence of probability theory and the do-calculus. It seems very likely that this is the case, but it is difficult to establish what the equivalent concepts are in the two frameworks.

]]>Thanks for the question. There are superficial similarities but also lots of differences. Pearl typically considers a stochastic system and asks what-if type questions of how that system would change if you modify it. In this case a list of probabilities is provided for Alice, Bob, etc. going to the party. The twin graph produced is a mechanism for computing these modifications to the system. My understanding is that it is an alternative to the do-calculus (feel free to clarify if you see it differently).

I believe Pearl’s argument that probability theory needs to be extended is on the basis that there is a need for rules to transform one stochastic system into another to answer causal questions. This seems fine, but views the world in a frequentist sense as an external stochastic system.

In contrast, we model a joint of the collected data and the outcome we care about for each possible treatment, i.e. P(y*,data|t*), where y* is a future outcome, t* is a future treatment and data contains a list of historical treatments, outcomes and other covariates. Importantly, there is only one joint distribution for the whole system. We then obtain the causal effect by standard conditioning P(y*|t*,data). As Andrew says, this is similar to how the Rubin Causal Model works; we differ in that we compute the predictive for y* for each t* as a separate conditioning operation [we have a distinct P(y*,data|t*) for every treatment t*], whereas the RCM uses joint distributions on counterfactuals. Our framework maps closer to Pearl and allows us to represent non-standard scenarios such as the front door rule and M-bias. Like Dawid, we don’t use historical counterfactuals.

Also like Rubin we consider only a single intervention, Pearl likes to have the flexibility to intervene on any node. Flexibility is good but often you practically can only intervene in a single place.

A notable practical difference is that all the “parameters” in this setup are discrete (their first example). We use continuous parameters in a modelling framework that is much more recognisable to a working statistician.

Andrew has quite often commented about a modelling preference for continuous vs discrete. I am in this setting sympathetic to the need for more flexible models, discrete latent variables seem rather limited here (although to be fair this is before Stan made such inference easy).

On the other hand “link presence or absence” can in some cases have very strong impacts in terms of partial exchangeability. Being able to assume P(y,t|theta,beta)=p(y|t,beta)P(t|theta) makes a world of difference in a causal setting but is indeed a strong assumption about the absence of a link – allowing even a weak interaction makes things considerably harder, priors will now have impact even in the presence of large datasets. Ricado’s reservations about identifiability and prior impact make some sense here, but if these are the appropriate assumptions for the problem what else can you do?

If you have further comments or questions I would be interested. I can see a need for a more extensive discussion of twin networks.

]]>Thanks Carlos, yes that does need to be corrected. Good spot.

]]>