Course files

http://www.andrew.cmu.edu/~ddanks/NASSLLI/


Principles Underlying Causal Search Algorithms


Fundamental problem

  • As we have all heard many times…

    “Correlation is not causation!”


Fundamental problem

  • Why is this slogan correct?

    • Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables

    • Hypotheses about association or correlation make no such claims

      • Correlation or probabilistic dependence can be produced in many ways


Fundamental problem

  • Some of the possible reasons why X and Y might be associated are:

    • Sheer chance

    • X causes Y

    • Y causes X

    • Some third variable Z influences X and Y

    • The value of X (or of a cause of X) and the value of Y (or of a cause of Y) can both influence whether an individual is included in the sample (sample selection bias)


Fundamental problem

  • Fundamental problem of causal search:

    • For any particular set of data, there are often many different causal structures that could have produced that data

    • Causation → Association map is many → one


Fundamental problem

  • Okay, so what can we do about this?

    • Use the data to figure out as much as possible (though it usually won’t be everything)

      • Requires developing search procedures

    • And then try to narrow the possibilities

      • Use other knowledge (e.g., time order, interventions)

      • Get better / different data (e.g., run an experiment)


Always remember…

Even if we cannot discover the whole truth,

we might be able to find some of the truth!


Markov equivalence

  • Formally, we say that:

    • Two causal graphs are members of the same Markov Equivalence Class iff they imply the exact same (un)conditional independence relations among the observed variables

      • By the Markov and Faithfulness assumptions

    • Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies (a small sketch of this check follows below)
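
As a concrete illustration of that criterion, here is a minimal path-blocking d-separation check for small graphs (a sketch; the function and variable names are ours, not from the course files):

```python
def descendants(dag, node):
    """Return `node` plus everything reachable from it along directed edges."""
    reached, frontier = {node}, {node}
    while frontier:
        frontier = {b for a, b in dag if a in frontier} - reached
        reached |= frontier
    return reached

def simple_paths(adjacencies, x, y, path=None):
    """Yield every simple path between x and y in the undirected skeleton."""
    path = [x] if path is None else path
    if x == y:
        yield path
        return
    for pair in adjacencies:
        if x in pair:
            (nxt,) = pair - {x}
            if nxt not in path:
                yield from simple_paths(adjacencies, nxt, y, path + [nxt])

def d_separated(dag, x, y, given):
    """True iff every path between x and y is blocked by `given`:
    a non-collider blocks when it is in `given`; a collider blocks
    unless it (or one of its descendants) is in `given`."""
    adjacencies = {frozenset(edge) for edge in dag}
    for path in simple_paths(adjacencies, x, y):
        blocked = False
        for a, b, c in zip(path, path[1:], path[2:]):
            if (a, b) in dag and (c, b) in dag:     # b is a collider on this path
                blocked = not (descendants(dag, b) & set(given))
            else:                                   # b is a non-collider
                blocked = b in given
            if blocked:
                break
        if not blocked:
            return False    # an active path exists, so x and y are d-connected
    return True

chain = {("X", "Z"), ("Z", "Y")}      # X -> Z -> Y
collider = {("X", "Z"), ("Y", "Z")}   # X -> Z <- Y
print(d_separated(chain, "X", "Y", {"Z"}))     # True: Z blocks the chain
print(d_separated(collider, "X", "Y", set()))  # True: the collider blocks the path
print(d_separated(collider, "X", "Y", {"Z"}))  # False: conditioning opens the collider
```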


Markov equivalence

  • The “Fundamental Problem of Causal Inference” can be restated as:

    • For some sets of independence relations, the Markov equivalence class is not a singleton

  • Markov equivalence classes give a precise characterization of what can be inferred from independencies alone


Markov equivalence

  • Examples (the graphs in each equivalence class were drawn on the slide):

    • X ⊥ {Y, Z} ⇒ …

    • X ⊥ Y | Z ⇒ …

    • X ⊥ Y ⇒ …


Markov equivalence

  • Two more examples (the graph pairs over X, Y, and Z were drawn on the slide):

    • Are these graphs Markov equivalent?

    • Are these two graphs?


Shared structure

  • What is shared by all of the graphs in a Markov equivalence class?

    • Same “skeleton”

      • I.e., they all have the same adjacency relations

    • Same “unshielded colliders”

      • I.e., X → Y ← Z with no edge between X and Z

    • Sometimes, other edges have the same direction

      • In these last two cases, we can infer that the true graph contains the shared directed edges (a check of this graphical criterion is sketched below).
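
These two shared features are exactly what the standard graphical test for Markov equivalence checks. A minimal sketch, with each DAG written as a set of (parent, child) edges (the representation and names are ours):

```python
from itertools import combinations

def skeleton(dag):
    """Undirected adjacencies of a DAG given as a set of (parent, child) edges."""
    return {frozenset(edge) for edge in dag}

def unshielded_colliders(dag):
    """All triples x -> z <- y in which x and y are not adjacent."""
    adjacent = skeleton(dag)
    parents = {}
    for a, b in dag:
        parents.setdefault(b, set()).add(a)
    return {(x, z, y)
            for z, ps in parents.items()
            for x, y in combinations(sorted(ps), 2)
            if frozenset((x, y)) not in adjacent}

def markov_equivalent(dag1, dag2):
    """Same skeleton and same unshielded colliders."""
    return (skeleton(dag1) == skeleton(dag2)
            and unshielded_colliders(dag1) == unshielded_colliders(dag2))

chain = {("X", "Z"), ("Z", "Y")}      # X -> Z -> Y
fork = {("Z", "X"), ("Z", "Y")}       # X <- Z -> Y
collider = {("X", "Z"), ("Y", "Z")}   # X -> Z <- Y
print(markov_equivalent(chain, fork))      # True: same skeleton, no colliders
print(markov_equivalent(chain, collider))  # False: the collider differs
```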


Shared structure as patterns

  • Since all graphs in a Markov equivalence class have the same adjacencies, we can represent the whole class using a pattern

    • A pattern is itself a graph, but its edges summarize the edges of every graph in the class


Shared structure as patterns

  • A pattern can have directed and undirected edges

    • It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either: (i) a cycle; or (ii) a new unshielded collider (one the pattern does not already contain)

  • Let’s try some examples…


Shared structure as patterns

Nitrogen — PlantGrowth — Bees (the pattern)

Nitrogen → PlantGrowth → Bees

Nitrogen ← PlantGrowth → Bees

Nitrogen ← PlantGrowth ← Bees


Shared structure as patterns

Nitrogen → PlantGrowth ← Bees (the pattern)

Nitrogen → PlantGrowth ← Bees (the only graph it represents)
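
The expansion rule from the previous slide can be sketched directly: orient the undirected edges in every possible way and keep the orientations that create neither a cycle nor an unshielded collider the pattern does not already contain. Running it on the Nitrogen – PlantGrowth – Bees pattern reproduces the three graphs above (a sketch under our own representation, not course code):

```python
from itertools import combinations, product

def has_cycle(edges):
    """Depth-first search for a directed cycle."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in children.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False
    return any(dfs(n) for n in list(children) if n not in visited)

def unshielded_colliders(edges, adjacent=None):
    """Triples x -> z <- y with x, y non-adjacent (adjacency defaults to the skeleton)."""
    if adjacent is None:
        adjacent = {frozenset(e) for e in edges}
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    return {(z, frozenset((x, y)))
            for z, ps in parents.items()
            for x, y in combinations(ps, 2)
            if frozenset((x, y)) not in adjacent}

def pattern_members(directed, undirected):
    """All DAGs obtained by orienting the undirected edges without creating
    a cycle or an unshielded collider the pattern does not already contain."""
    skel = {frozenset(e) for e in directed} | {frozenset(e) for e in undirected}
    base = unshielded_colliders(directed, skel)
    for choice in product((0, 1), repeat=len(undirected)):
        extra = {(a, b) if flip == 0 else (b, a)
                 for (a, b), flip in zip(undirected, choice)}
        dag = set(directed) | extra
        if not has_cycle(dag) and unshielded_colliders(dag, skel) == base:
            yield dag

pattern = [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")]
for dag in pattern_members(directed=set(), undirected=pattern):
    print(sorted(dag))
# Prints the two chains and the common-cause graph, but not the collider
# Nitrogen -> PlantGrowth <- Bees.
```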


Formal problem of search

  • Given some dataset D, find:

    • The Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in the data

  • More colloquially, find the causal graphs that could have produced data like this


Hard to find a pattern

  • “Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!”

  • Big problem: the number of independencies to test grows exponentially with the number of variables:

    • 2 variables ⇒ 1 test
    • 3 variables ⇒ 6 tests
    • 4 variables ⇒ 24 tests
    • 5 variables ⇒ 80 tests
    • 6 variables ⇒ 240 tests
    • and so on…
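
The counts above match the formula C(n, 2) · 2^(n − 2): each of the C(n, 2) unordered pairs can be tested against every subset of the remaining n − 2 variables. A small script to reproduce the table (our sketch, not part of the original slides):

```python
from math import comb

def num_ci_tests(n: int) -> int:
    # C(n, 2) unordered pairs, each testable against any of the
    # 2 ** (n - 2) subsets of the remaining n - 2 variables.
    return comb(n, 2) * 2 ** (n - 2)

for n in range(2, 7):
    print(f"{n} variables => {num_ci_tests(n)} tests")
# 2 => 1, 3 => 6, 4 => 24, 5 => 80, 6 => 240, matching the slide.
```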


General features of causal search

  • Huge model and parameter spaces

    • Even when we (necessarily) use prior information about the family of probability distributions.

    • Relevant statistics must be rapidly computed

  • But substantive knowledge about the domain may restrict the space of alternative models

    • Time order of variables

    • Required cause/effect relationships

    • Existence or non-existence of latent variables


Three schemata for search

  • Bayesian / score-based

    • Find the graph(s) with highest P(graph | data)

  • Constraint-based

    • Find the graph(s) that predict exactly the observed associations and independencies

  • Combined

    • Get “close” with a constraint-based search, and then find the best graph using a score-based search


Bayesian / score-based

  • Informally:

    • Give each model an initial score using “prior beliefs”

    • Update each score based on the likelihood of the data if the model were true

    • Output the highest-scoring model

  • Formally:

    • Specify P(M, v) for all models M and possible parameter values v of M

    • For any data D, P(D | M, v) can easily be calculated

    • P(M | D) ∝ ∫v P(D | M, v) P(M, v) dv
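
Spelled out (our elaboration, using Bayes' rule and P(M, v) = P(v | M) P(M); not an extra slide):

```latex
P(M \mid D) \;=\; \frac{P(D \mid M)\, P(M)}{P(D)},
\qquad
P(D \mid M) \;=\; \int_v P(D \mid M, v)\, P(v \mid M)\, dv
\;\;\Longrightarrow\;\;
P(M \mid D) \;\propto\; \int_v P(D \mid M, v)\, P(M, v)\, dv .
```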


Bayesian / score-based

  • In practice, this strategy is completely computationally intractable

    • There are too many graphs to check them all

  • So, we use a greedy search strategy

    • Start with an initial graph

    • Iteratively compare the current graph’s score (∝ posterior probability) with that of each 1- or 2-step modification of that graph

      • By edge addition, deletion, or reversal (see the sketch below)
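
A structural sketch of that greedy loop, restricted to single-edge (1-step) changes for brevity. The score function is a placeholder for whatever Bayesian or penalized-likelihood score is in use; the toy score at the bottom exists only to exercise the search, and all names are ours:

```python
from itertools import permutations

def is_acyclic(edges, variables):
    """Kahn-style check: repeatedly peel off nodes with no remaining parents."""
    edges, nodes = set(edges), set(variables)
    while nodes:
        parentless = {n for n in nodes if not any(child == n for _, child in edges)}
        if not parentless:
            return False              # every remaining node has a parent: a cycle
        nodes -= parentless
        edges = {(a, b) for a, b in edges if a in nodes and b in nodes}
    return True

def one_step_changes(dag, variables):
    """Single-edge additions, deletions, and reversals of the current DAG."""
    for a, b in permutations(variables, 2):
        if (a, b) in dag:
            yield dag - {(a, b)}                  # deletion
            yield (dag - {(a, b)}) | {(b, a)}     # reversal
        elif (b, a) not in dag:
            yield dag | {(a, b)}                  # addition

def greedy_search(variables, score, initial=frozenset()):
    """Hill-climb until no single-edge change improves the score."""
    current, best = frozenset(initial), score(frozenset(initial))
    improved = True
    while improved:
        improved = False
        for candidate in one_step_changes(current, variables):
            candidate = frozenset(candidate)
            if is_acyclic(candidate, variables):
                s = score(candidate)
                if s > best:
                    current, best, improved = candidate, s, True
                    break                         # greedily accept the improvement
    return current

# Toy score that just rewards closeness to a fixed target DAG (illustration only).
target = {("X", "Z"), ("Y", "Z"), ("Z", "W")}
print(greedy_search("XYZW", lambda dag: -len(set(dag) ^ target)))
# Converges to the target edge set.
```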


Bayesian / score-based

  • Problem #1: Local maxima

    • Often, greedy searches get stuck

  • Solution:

    • Greedy search over Markov equivalence classes, rather than graphs (Meek)

      • Has a proof of correctness and convergence (Chickering)

      • But it gets to the right answer slowly


Bayesian / score-based

  • Problem #2: Unobserved variables

    • Huge number of graphs

    • Huge number of different parameterizations

    • No fast, general way to compute likelihoods from latent variable models

  • Partial solution:

    • Focus on a small, “plausible” set of models for which we can compute scores


Constraint-based

  • Implementation of the earlier idea

    • “Build” the Markov equivalence class that predicts the pattern of association actually found in the data

      • Compatible with a variety of statistical techniques

      • Note that we might have to introduce a latent variable to explain the pattern of statistics

    • Important constraints on search:

      • Minimize the number of statistical tests

      • Minimize the size of the conditioning sets (Why?)


Constraint-based

  • Algorithm step #1: Discover the adjacencies

    • Create the complete graph with undirected edges

    • Test all pairs X, Y for unconditional independence

      • Remove X—Y edge if they are independent

    • Test all adjacent X, Y for independence given a single neighbor N

      • Remove X—Y edge if they are independent

    • Test adjacent pairs given two neighbors, then three, and so on (a sketch of this phase follows below)
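
A sketch of this adjacency phase. Here `independent(x, y, cond)` stands in for whatever statistical test or oracle is available; the bookkeeping and names are ours, not the course's code:

```python
from itertools import combinations

def find_skeleton(variables, independent, max_cond=None):
    """PC-style adjacency phase: start from the complete graph and delete an edge
    as soon as its endpoints test independent given some set of current
    neighbours, growing the conditioning-set size k one step at a time."""
    adjacencies = {frozenset(pair) for pair in combinations(variables, 2)}
    sepsets = {}                                  # records the separating sets
    max_cond = len(variables) - 2 if max_cond is None else max_cond
    for k in range(max_cond + 1):
        for x, y in [tuple(pair) for pair in adjacencies]:
            neighbours = {v for adj in adjacencies if x in adj
                          for v in adj if v not in (x, y)}
            for cond in combinations(sorted(neighbours), k):
                if independent(x, y, cond):
                    adjacencies.discard(frozenset((x, y)))
                    sepsets[frozenset((x, y))] = set(cond)
                    break
    return adjacencies, sepsets
```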


Constraint-based

  • Algorithm step #2: (Try to) Orient edges

    • “Unshielded triple”: X — C — Y, but X, Y not adjacent

    • If X & Y independent given S containing C, then C must be a non-collider

      • Since we have to condition on it to achieve d-separation

    • If X & Y independent given S not containing C, then C must be a collider

      • Since the path is not active when not conditioning on C

    • And then apply further orientation rules to preserve acyclicity and to avoid creating new unshielded colliders (sketched below)
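
A sketch of the orientation phase, meant to be used with the `find_skeleton` sketch above. The follow-up rule implemented here is the "do not create a new unshielded collider" orientation the slide alludes to (names are ours):

```python
from itertools import combinations

def orient_colliders(adjacencies, sepsets):
    """For every unshielded triple x - c - y, orient x -> c <- y whenever c is
    NOT in the set that separated x and y."""
    directed = set()
    nodes = {v for pair in adjacencies for v in pair}
    for c in sorted(nodes):
        neighbours = sorted({v for pair in adjacencies if c in pair
                             for v in pair if v != c})
        for x, y in combinations(neighbours, 2):
            if frozenset((x, y)) not in adjacencies:            # unshielded triple
                if c not in sepsets.get(frozenset((x, y)), set()):
                    directed |= {(x, c), (y, c)}                # collider at c
    undirected = {pair for pair in adjacencies
                  if not any(frozenset(edge) == pair for edge in directed)}
    return directed, undirected

def orient_away(directed, undirected, adjacencies):
    """If a -> b is directed, b - c is undirected, and a, c are not adjacent,
    orient b -> c: the reverse would create a new unshielded collider at b."""
    changed = True
    while changed:
        changed = False
        for pair in list(undirected):
            for b, c in (tuple(pair), tuple(pair)[::-1]):
                if any(head == b and tail != c
                       and frozenset((tail, c)) not in adjacencies
                       for tail, head in directed):
                    directed.add((b, c))
                    undirected.discard(pair)
                    changed = True
                    break
    return directed, undirected
```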


Constraint-based example

  • Variables are {X, Y, Z, W}

  • Only independencies are:

    • X ⊥ Y

    • X ⊥ W | Z

    • Y ⊥ W | Z


Constraint-based example

  • Step 1: Form the complete graph using undirected edges


Constraint-based example

  • Step 2: For each pair of variables, remove the edge between them if they’re unconditionally independent

X ⊥ Y ⇒ remove the X — Y edge


Constraint-based example

  • Step 3: For each adjacent pair, remove the edge if they’re independent conditional on some variable adjacent to one of them

{X, Y} ⊥ W | Z ⇒ remove the X — W and Y — W edges


Constraint-based example

  • Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or…) variables (no further edges are removed in this example)


Constraint-based example

  • Step 5: Orientation

    • For X – Z – Y, since X ⊥ Y without conditioning on Z, make Z a collider: X → Z ← Y

    • Since Z is a non-collider between X and W, though, we must orient Z – W away from Z, giving Z → W (the snippet below reruns this example with the earlier sketches)
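
Putting the earlier sketches together on this example (this assumes the hypothetical `find_skeleton`, `orient_colliders`, and `orient_away` helpers defined above are in scope; the oracle simply hard-codes the three stated independencies):

```python
# The only independencies in the example: X _||_ Y, X _||_ W | Z, Y _||_ W | Z.
FACTS = {(frozenset("XY"), frozenset()),
         (frozenset("XW"), frozenset("Z")),
         (frozenset("YW"), frozenset("Z"))}

def independent(x, y, cond):
    return (frozenset((x, y)), frozenset(cond)) in FACTS

adjacencies, sepsets = find_skeleton("XYZW", independent)
directed, undirected = orient_colliders(adjacencies, sepsets)
directed, undirected = orient_away(directed, undirected, adjacencies)
print(directed)    # {('X', 'Z'), ('Y', 'Z'), ('Z', 'W')}: X -> Z <- Y and Z -> W
print(undirected)  # set(): every edge ends up oriented in this example
```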


Constraint-based output

  • Searches that allow for latent variables can also have edges of the form X o→ Y

  • This indicates one of three possibilities:

    • X → Y

    • At least one unobserved common cause of X and Y

    • Both of these


Interventions to the rescue?

  • Interventions helped us solve an earlier equivalence class problem

    • Randomization meant that: Treatment-Effect association ⇒ T → E

  • Interventions alter equivalence classes, but don’t make them all into singletons

    • The fundamental problem of search remains


Before X-intervention

[Figure: the graphs over X, Y, and Z, grouped into equivalence classes]


After X-intervention

[Figure: the graphs over X, Y, and Z, grouped into equivalence classes after an intervention on X]


Search with interventions

  • Search with interventions is the same as search with observations, except

    • We adjust the graphs in the search space to account for the intervention

  • For multiple experiments, we search for the graphs that appear in the output equivalence class of every experiment

    • In the real world, this is more complicated due to sampling variation


Example

  • Observation

    • Y ⊥ Z | X ⇒ (equivalence class shown on the slide)

  • Intervention on X

    • Y ⊥ {X, Z} ⇒ (equivalence classes shown on the slide)

  • Only possible graph: Y → X → Z (checked in the snippet below)
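
This conclusion can be checked mechanically by reusing the `d_separated` sketch from earlier: model the intervention by deleting every edge into X, then test which observationally equivalent graph still makes Y independent of X and of Z (again a sketch, not course code):

```python
# The three graphs consistent with the observational data (Y _||_ Z | X):
candidates = {
    "Y -> X -> Z": {("Y", "X"), ("X", "Z")},
    "Y <- X <- Z": {("Z", "X"), ("X", "Y")},
    "Y <- X -> Z": {("X", "Y"), ("X", "Z")},
}

def intervene(dag, target):
    """Graph surgery for a hard intervention: cut every edge into the target."""
    return {(a, b) for a, b in dag if b != target}

for name, dag in candidates.items():
    post = intervene(dag, "X")
    consistent = (d_separated(post, "Y", "X", set())
                  and d_separated(post, "Y", "Z", set()))
    print(name, "consistent with the intervention data:", consistent)
# Only "Y -> X -> Z" comes out True.
```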


Looking ahead…

  • Have:

    • Basic formal representation for causation

    • Fundamental causal asymmetry (of intervention)

    • Inference & reasoning methods

    • Search & causal discovery principles

  • Need:

    • Search & causal discovery methods that work in the real world

