
Course files

http://www.andrew.cmu.edu/~ddanks/NASSLLI/


Principles Underlying Causal Search Algorithms


Fundamental problem

  • As we have all heard many times…

    “Correlation is not causation!”


Fundamental problem

  • Why is this slogan correct?

    • Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables

    • Hypotheses about association or correlation make no such claims

      • Correlation or probabilistic dependence can be produced in many ways


Fundamental problem

  • Some of the possible reasons why X and Y might be associated are:

    • Sheer chance

    • X causes Y

    • Y causes X

    • Some third variable Z influences X and Y (see the simulation sketched after this list)

    • The value of X (or a cause of X) and the value of Y (or a cause of Y) can be causes/reasons for whether an individual is in the sample (sample selection bias)
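A minimal simulation of the third-variable case (not part of the original slides; it assumes NumPy, and the coefficient 0.8 and variable names are purely illustrative): a common cause Z makes X and Y correlated even though neither causes the other, and the dependence disappears once Z is adjusted for.

```python
# Illustrative simulation: Z is a common cause of X and Y.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

z = rng.normal(size=n)              # common cause
x = 0.8 * z + rng.normal(size=n)    # Z -> X (no X-Y edge anywhere)
y = 0.8 * z + rng.normal(size=n)    # Z -> Y

print(np.corrcoef(x, y)[0, 1])      # clearly non-zero (about 0.39)

# Regress Z out of both variables and re-check the correlation.
rx = x - z * (x @ z) / (z @ z)
ry = y - z * (y @ z) / (z @ z)
print(np.corrcoef(rx, ry)[0, 1])    # approximately zero
```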


Fundamental problem

  • Fundamental problem of causal search:

    • For any particular set of data, there are often many different causal structures that could have produced that data

    • Causation → Association map is many → one


Fundamental problem

  • Okay, so what can we do about this?

    • Use the data to figure out as much as possible (though it usually won’t be everything)

      • Requires developing search procedures

    • And then try to narrow the possibilities

      • Use other knowledge (e.g., time order, interventions)

      • Get better / different data (e.g., run an experiment)


Always remember…

Even if we cannot discover the whole truth,

we might be able to find some of the truth!


Markov equivalence

  • Formally, we say that:

    • Two causal graphs are members of the same Markov Equivalence Class iff they imply the exact same (un)conditional independence relations among the observed variables

      • By the Markov and Faithfulness assumptions

    • Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies
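As an aside, the ancestral-moral-graph criterion gives a compact way to check d-separation: x and y are d-separated given z iff, in the moralized subgraph on the ancestors of {x, y} ∪ z, every path from x to y passes through z. The sketch below is illustrative only (not from the slides); it assumes graphs are encoded as a dict mapping each node to the set of its parents, and that x and y are not themselves in the conditioning set.

```python
def d_separated(dag, x, y, z):
    # 1. Keep only x, y, z and their ancestors.
    relevant, frontier = set(), {x, y} | set(z)
    while frontier:
        v = frontier.pop()
        if v not in relevant:
            relevant.add(v)
            frontier |= dag[v]
    # 2. Moralize: link every node to its parents, link co-parents, drop arrows.
    und = {v: set() for v in relevant}
    for v in relevant:
        for p in dag[v]:
            und[v].add(p); und[p].add(v)
        for p in dag[v]:
            for q in dag[v]:
                if p != q:
                    und[p].add(q)
    # 3. Search for a path from x to y that avoids z.
    seen, stack = set(z), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v in seen:
            continue
        seen.add(v)
        stack.extend(und[v] - seen)
    return True

# In the collider X -> Z <- Y: X and Y are independent,
# but become dependent once we condition on Z.
collider = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
print(d_separated(collider, "X", "Y", []))     # True
print(d_separated(collider, "X", "Y", ["Z"]))  # False
```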


Markov equivalence

  • The “Fundamental Problem of Causal Inference” can be restated as:

    • For some sets of independence relations, the Markov equivalence class is not a singleton

  • Markov equivalence classes give a precise characterization of what can be inferred from independencies alone


[Figure: the Markov equivalence class for each example below, drawn as graphs over X, Y, and Z]
Markov equivalence

  • Examples:

    • X ⊥ {Y, Z} ⇒

    • X ⊥ Y | Z ⇒

    • X ⊥ Y ⇒


[Figure: two pairs of graphs over X, Y, and Z]
Markov equivalence

  • Two more examples:

    • Are these graphs Markov equivalent?

    • Are these two graphs?


Shared structure

  • What is shared by all of the graphs in a Markov equivalence class?

    • Same “skeleton”

      • I.e., they all have the same adjacency relations

    • Same “unshielded colliders”

      • I.e., X → Y ← Z with no edge between X and Z

    • Sometimes, other edges have the same direction in every member of the class

      • In the last two cases, we can infer that the true graph contains those shared directed edges.
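These first two shared features (skeleton and unshielded colliders) are exactly the Verma–Pearl criterion for Markov equivalence, so a check can be sketched directly from them. The encoding (a dict of parent sets) and function names below are illustrative assumptions, not code from the slides.

```python
def skeleton(dag):
    """Undirected adjacencies, as frozensets {X, Y}."""
    return {frozenset((p, c)) for c, parents in dag.items() for p in parents}

def unshielded_colliders(dag):
    """Triples (X, Z, Y) with X -> Z <- Y and X, Y not adjacent."""
    adj = skeleton(dag)
    found = set()
    for z, parents in dag.items():
        for a in parents:
            for b in parents:
                if a < b and frozenset((a, b)) not in adj:
                    found.add((a, z, b))
    return found

def markov_equivalent(g1, g2):
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

# The chain X -> Z -> Y and the fork X <- Z -> Y are equivalent;
# the collider X -> Z <- Y is in a class of its own.
chain    = {"X": set(), "Z": {"X"}, "Y": {"Z"}}
fork     = {"X": {"Z"}, "Z": set(), "Y": {"Z"}}
collider = {"X": set(), "Z": {"X", "Y"}, "Y": set()}
print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```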


Shared structure as patterns

  • Since every graph in a Markov equivalence class has the same adjacencies, we can represent the whole class using a pattern

    • A pattern is itself a graph, but its edges summarize the edges of every graph in the class


Shared structure as patterns

  • A pattern can have directed and undirected edges

    • It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either (i) a cycle or (ii) a new unshielded collider

  • Let’s try some examples…


Shared structure as patterns

Pattern: Nitrogen — PlantGrowth — Bees

Represented graphs:

Nitrogen → PlantGrowth → Bees

Nitrogen ← PlantGrowth → Bees

Nitrogen ← PlantGrowth ← Bees


Shared structure as patterns

Pattern: Nitrogen → PlantGrowth ← Bees

Represented graph: Nitrogen → PlantGrowth ← Bees (the collider pattern represents only itself)
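A rough sketch of how a pattern can be expanded into the DAGs it represents: orient the undirected edges in every possible way and keep the orientations that create neither a cycle nor a new unshielded collider. The edge-list encoding and helper names are assumptions for illustration; the printout reproduces the Nitrogen — PlantGrowth — Bees example above.

```python
from itertools import product

def colliders(directed, skeleton):
    """Unshielded colliders a -> c <- b with a, b not adjacent in the skeleton."""
    found = set()
    for a, c in directed:
        for b, c2 in directed:
            if c == c2 and a < b and frozenset((a, b)) not in skeleton:
                found.add((a, c, b))
    return found

def is_acyclic(directed, nodes):
    """Kahn-style check: repeatedly peel off nodes with no incoming edges."""
    remaining, edges = set(nodes), set(directed)
    while remaining:
        free = {v for v in remaining if not any(b == v for _, b in edges)}
        if not free:
            return False
        remaining -= free
        edges = {(a, b) for a, b in edges if a in remaining and b in remaining}
    return True

def represented_dags(directed, undirected, nodes):
    skeleton = {frozenset(e) for e in directed} | {frozenset(e) for e in undirected}
    base = colliders(directed, skeleton)
    undirected = list(undirected)
    for flips in product([False, True], repeat=len(undirected)):
        cand = set(directed) | {(b, a) if flip else (a, b)
                                for (a, b), flip in zip(undirected, flips)}
        if is_acyclic(cand, nodes) and colliders(cand, skeleton) == base:
            yield cand

# The first pattern above: Nitrogen - PlantGrowth - Bees, no directed edges.
nodes = ["Nitrogen", "PlantGrowth", "Bees"]
und = [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")]
for dag in represented_dags(set(), und, nodes):
    print(sorted(dag))   # three graphs; the collider orientation is excluded
```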


Formal problem of search

  • Given some dataset D, find:

    • The Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in D

  • More colloquially, find the causal graphs that could have produced data like this


Hard to find a pattern

  • “Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!”

  • Big problem: the number of independencies to test grows exponentially with the number of variables:

    • 2 variables ⇒ 1 test

    • 3 variables ⇒ 6 tests

    • 4 variables ⇒ 24 tests

    • 5 variables ⇒ 80 tests

    • 6 variables ⇒ 240 tests

    • and so on…
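These counts match one test per unordered pair of variables for every subset of the remaining n − 2 variables, i.e. C(n, 2) · 2^(n−2) tests (assuming that is how the slide's numbers were computed):

```python
from math import comb

def num_ci_tests(n):
    # One test per unordered pair, for each subset of the other n - 2 variables.
    return comb(n, 2) * 2 ** (n - 2)

for n in range(2, 7):
    print(n, num_ci_tests(n))   # 1, 6, 24, 80, 240
```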


General features of causal search

  • Huge model and parameter spaces

    • Even when we (necessarily) use prior information about the family of probability distributions.

    • Relevant statistics must be rapidly computed

  • But substantive knowledge about the domain may restrict the space of alternative models

    • Time order of variables

    • Required cause/effect relationships

    • Existence or non-existence of latent variables


Three schemata for search

  • Bayesian / score-based

    • Find the graph(s) with highest P(graph | data)

  • Constraint-based

    • Find the graph(s) that predict exactly the observed associations and independencies

  • Combined

    • Get “close” with constraint-based, and then find the best graph using score-based


Bayesian / score-based

  • Informally:

    • Give each model an initial score using “prior beliefs”

    • Update each score based on the likelihood of the data if the model were true

    • Output the highest-scoring model

  • Formally:

    • Specify P(M, v) for all models M and possible parameter values v of M

    • For any data D, P(D | M, v) can easily be calculated

    • P(M | D) ∝ ∫ P(D | M, v) P(M, v) dv


Bayesian / score-based

  • In practice, this strategy is completely computationally intractable

    • There are too many graphs to check them all

  • So, we use a greedy search strategy

    • Start with an initial graph

    • Iteratively compare the current graph’s score (∝ posterior probability) with that of each 1- or 2-step modification of that graph

      • By edge addition, deletion or reversal
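A minimal sketch of this greedy loop, assuming a user-supplied score(graph, data) function (e.g., a BIC-style score where higher is better; not defined here) and a dict-of-parent-sets graph encoding. This is an illustration only; real implementations such as GES search over equivalence classes rather than individual DAGs, as the next slide notes.

```python
from itertools import permutations

def has_cycle(dag):
    """dag: dict node -> set of parents. Depth-first cycle detection."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in dag}
    def visit(v):
        color[v] = GRAY
        for p in dag[v]:
            if color[p] == GRAY or (color[p] == WHITE and visit(p)):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and visit(v) for v in dag)

def neighbors(dag):
    """All acyclic graphs one edge addition, deletion, or reversal away."""
    for x, y in permutations(dag, 2):
        step = {v: set(ps) for v, ps in dag.items()}
        if x in dag[y]:
            step[y].discard(x)             # deletion of x -> y
        else:
            step[y].add(x)                 # addition of x -> y
        if not has_cycle(step):
            yield step
        if x in dag[y]:                    # reversal of x -> y
            rev = {v: set(ps) for v, ps in dag.items()}
            rev[y].discard(x)
            rev[x].add(y)
            if not has_cycle(rev):
                yield rev

def greedy_search(variables, data, score):
    current = {v: set() for v in variables}   # start from the empty graph
    current_score = score(current, data)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            s = score(cand, data)
            if s > current_score:
                current, current_score, improved = cand, s, True
                break                         # restart from the improved graph
    return current
```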


Bayesian / score-based

  • Problem #1: Local maxima

    • Often, greedy searches get stuck

  • Solution:

    • Greedy search over Markov equivalence classes, rather than over individual graphs (Meek)

      • Has a proof of correctness and convergence (Chickering)

      • But it gets to the right answer slowly


Bayesian / score-based

  • Problem #2: Unobserved variables

    • Huge number of graphs

    • Huge number of different parameterizations

    • No fast, general way to compute likelihoods from latent variable models

  • Partial solution:

    • Focus on a small, “plausible” set of models for which we can compute scores


Constraint-based

  • Implementation of the earlier idea

    • “Build” the Markov equivalence class that predicts the pattern of association actually found in the data

      • Compatible with a variety of statistical techniques

      • Note that we might have to introduce a latent variable to explain the pattern of statistics

    • Important constraints on search:

      • Minimize the number of statistical tests

      • Minimize the size of the conditioning sets (Why?)


Constraint-based

  • Algorithm step #1: Discover the adjacencies

    • Create the complete graph with undirected edges

    • Test all pairs X, Y for unconditional independence

      • Remove X—Y edge if they are independent

    • Test all adjacent X, Y for independence given a single neighbor N

      • Remove X—Y edge if they are independent

    • Test adjacent pairs given two neighbors, and so on…
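A rough sketch of this adjacency phase, assuming an independence oracle indep(x, y, s) that answers "is x independent of y given the set s?" (in practice, a statistical test on the data); the function and variable names are illustrative. The separating sets it records are reused in the orientation step below.

```python
from itertools import combinations

def skeleton_phase(variables, indep):
    adj = {v: set(variables) - {v} for v in variables}   # complete undirected graph
    sepset = {}                                          # separating set per removed edge
    k = 0
    while any(len(adj[v]) - 1 >= k for v in variables):
        for x, y in combinations(variables, 2):
            if y not in adj[x]:
                continue
            # Condition on size-k subsets of x's other neighbours
            # (full PC also tries y's neighbours).
            for s in combinations(sorted(adj[x] - {y}), k):
                if indep(x, y, set(s)):
                    adj[x].discard(y); adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(s)
                    break
        k += 1
    return adj, sepset
```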


Constraint-based

  • Algorithm step #2: (Try to) Orient edges

    • “Unshielded triple”: X — C — Y, but X, Y not adjacent

    • If X & Y independent given S containing C, then C must be a non-collider

      • Since we have to condition on it to achieve d-separation

    • If X & Y independent given S not containing C, then C must be a collider

      • Since the path is not active when not conditioning on C

    • And then apply further orientation rules to preserve acyclicity and the non-collider status of other nodes
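A matching sketch of the collider-orientation rule just described: an unshielded triple X — C — Y is oriented X → C ← Y exactly when C is not in the separating set recorded for X and Y. This is illustrative only, and the further (Meek-style) orientation rules mentioned in the last bullet are omitted.

```python
from itertools import combinations

def orient_colliders(adj, sepset):
    """Uses the adj and sepset produced by the skeleton phase above."""
    directed = set()                         # (parent, child) pairs
    for c in adj:
        for x, y in combinations(sorted(adj[c]), 2):
            if y in adj[x]:
                continue                     # shielded triple: no information
            if c not in sepset.get(frozenset((x, y)), set()):
                directed.add((x, c))         # X -> C <- Y
                directed.add((y, c))
    return directed
```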


Constraint-based example

  • Variables are {X, Y, Z, W}

  • Only independencies are:

    • X ⊥ Y

    • X ⊥ W | Z

    • Y ⊥ W | Z


Constraint-based example

  • Step 1: Form the complete graph using undirected edges


Constraint-based example

  • Step 2: For each pair of variables, remove the edge between them if they’re unconditionally independent

X ⊥ Y ⇒


Constraint-based example

  • Step 3: For each adjacent pair, remove the edge if they’re independent conditional on some variable adjacent to one of them

{X, Y} ⊥ W | Z ⇒


Constraint-based example

  • Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or…) variables


Constraint-based example

  • Step 5: Orientation

    • For X — Z — Y, since X ⊥ Y without conditioning on Z, make Z a collider: X → Z ← Y

    • Since Z is a non-collider between X and W, though, we must orient Z — W away from Z: Z → W
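Putting the two sketches together on this example (this reuses the hypothetical skeleton_phase and orient_colliders functions from above, with the three stated independencies hard-coded as the oracle):

```python
def indep(a, b, s):
    pair = frozenset((a, b))
    if pair == frozenset(("X", "Y")):
        return s == set()
    if pair in (frozenset(("X", "W")), frozenset(("Y", "W"))):
        return s == {"Z"}
    return False

adj, sepset = skeleton_phase(["X", "Y", "Z", "W"], indep)
print(adj)                            # skeleton: X - Z, Y - Z, Z - W
print(orient_colliders(adj, sepset))  # {('X', 'Z'), ('Y', 'Z')}: X -> Z <- Y
```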


Constraint-based output

  • Searches that allow for latent variables can also have edges of the form X o→ Y

  • This indicates one of three possibilities:

    • X → Y

    • At least one unobserved common cause of X and Y

    • Both of these


Interventions to the rescue?

  • Interventions helped us solve an earlier equivalence class problem

    • Randomization meant that: Treatment-Effect association ⇒ T → E

  • Interventions alter equivalence classes, but don’t make them all into singletons

    • The fundamental problem of search remains


Before X-intervention — [figure: the candidate causal graphs over X, Y, and Z]


After X-intervention — [figure: the candidate causal graphs over X, Y, and Z]


Search with interventions

  • Search with interventions is the same as search with observations, except

    • We adjust the graphs in the search space to account for the intervention (see the sketch after this list)

  • For multiple experiments, we search for graphs that lie in every output equivalence class

    • More complicated than this in the real world due to sampling variation
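One way to read "adjust the graphs" is graph surgery: a perfect intervention on a variable cuts every edge into it, since its value is now set from outside. A minimal, hypothetical sketch (the dict-of-parent-sets encoding and example graph are assumptions):

```python
def intervene(dag, targets):
    """dag: dict node -> set of parents; targets: the intervened variables."""
    return {v: (set() if v in targets else set(ps)) for v, ps in dag.items()}

# Example: in the chain Z -> X -> Y, intervening on X removes Z -> X
# but leaves X -> Y intact.
chain = {"Z": set(), "X": {"Z"}, "Y": {"X"}}
print(intervene(chain, {"X"}))
```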


Example

  • Observation

    • Y ⊥ Z | X ⇒ X is a non-collider between Y and Z (chain or fork)

  • Intervention on X

    • Y ⊥ {X, Z} ⇒ X does not cause Y, and Y is connected to Z only through X

  • Only possible graph: Y → X → Z
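The conclusion can be checked mechanically by reusing the hypothetical d_separated and intervene sketches from earlier: of the three graphs that fit the observational independence, only the chain Y → X → Z also predicts Y ⊥ {X, Z} after intervening on X.

```python
candidates = {
    "Y -> X -> Z": {"Y": set(), "X": {"Y"}, "Z": {"X"}},
    "Y <- X -> Z": {"Y": {"X"}, "X": set(), "Z": {"X"}},
    "Y <- X <- Z": {"Y": {"X"}, "X": {"Z"}, "Z": set()},
}
for name, g in candidates.items():
    obs_ok = d_separated(g, "Y", "Z", ["X"])       # observed: Y independent of Z given X
    cut = intervene(g, {"X"})                      # now set X by intervention
    int_ok = (d_separated(cut, "Y", "X", [])
              and d_separated(cut, "Y", "Z", []))
    print(name, obs_ok and int_ok)                 # True only for Y -> X -> Z
```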


Looking ahead…

  • Have:

    • Basic formal representation for causation

    • Fundamental causal asymmetry (of intervention)

    • Inference & reasoning methods

    • Search & causal discovery principles

  • Need:

    • Search & causal discovery methods that work in the real world

