# Course files

http://www.andrew.cmu.edu/~ddanks/NASSLLI/

### Principles Underlying Causal Search Algorithms

Fundamental problem
• As we have all heard many times…

“Correlation is not causation!”

Fundamental problem
• Why is this slogan correct?
• Causal hypotheses make implicit claims about the effects of intervening on (manipulating) one or more variables
• Hypotheses about association or correlation make no such claims
• Correlation or probabilistic dependence can be produced in many ways
Fundamental problem
• Some of the possible reasons why X and Y might be associated are:
• Sheer chance
• X causes Y
• Y causes X
• Some third variable Z influences X and Y
• The value of X (or a cause of X) and the value of Y (or a cause of Y) can be causes/reasons for whether an individual is in the sample (sample selection bias)
Fundamental problem
• Fundamental problem of causal search:
• For any particular set of data, there are often many different causal structures that could have produced that data
• Causation → Association map is many → one
Fundamental problem
• Use the data to figure out as much as possible (though it usually won’t be everything)
• Requires developing search procedures
• And then try to narrow the possibilities
• Use other knowledge (e.g., time order, interventions)
• Get better / different data (e.g., run an experiment)
Always remember…

Even if we cannot discover the whole truth,

we might be able to find some of the truth!

Markov equivalence
• Formally, we say that:
• Two causal graphs are members of the same Markov Equivalence Class iff they imply the exact same (un)conditional independence relations among the observed variables
• By the Markov and Faithfulness assumptions
• Remember that d-separation gives a purely graphical criterion for determining all of the (un)conditional independencies
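Because d-separation is purely graphical, it can be checked mechanically. Below is a minimal Python sketch (my own representation and function names, not from the course files) using the standard criterion: X and Y are d-separated by Z iff they are disconnected in the moralized graph over the ancestors of X ∪ Y ∪ Z, once Z is deleted.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """All ancestors of `nodes` in `dag` (a dict: node -> set of parents),
    including the nodes themselves."""
    result, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(dag[v])          # walk up to the parents
    return result

def d_separated(dag, xs, ys, zs):
    """True iff xs and ys are d-separated given zs: they are disconnected
    in the moralized graph over the ancestors of xs | ys | zs, minus zs."""
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    adj = {v: set() for v in keep}
    for child in keep:
        parents = dag[child] & keep
        for p in parents:                 # undirected parent-child edges
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(parents, 2):
            adj[p].add(q); adj[q].add(p)  # "marry" co-parents
    blocked = set(zs)                     # conditioning set is deleted
    seen = set(xs) - blocked
    frontier = set(seen)
    while frontier:
        v = frontier.pop()
        if v in ys:
            return False                  # found an active path
        for w in adj[v] - blocked - seen:
            seen.add(w); frontier.add(w)
    return True

# Chain X -> Z -> Y: dependent marginally, independent given Z.
chain = {"X": set(), "Z": {"X"}, "Y": {"Z"}}
print(d_separated(chain, {"X"}, {"Y"}, set()))   # False
print(d_separated(chain, {"X"}, {"Y"}, {"Z"}))   # True
```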
Markov equivalence
• The “Fundamental Problem of Causal Inference” can be restated as:
• For some sets of independence relations, the Markov equivalence class is not a singleton
• Markov equivalence classes give a precise characterization of what can be inferred from independencies alone

[Figure: example causal graphs over X, Y, Z, grouped into Markov equivalence classes]

Markov equivalence
• Examples:
• X ⊥ {Y, Z} ⇒
• X ⊥ Y | Z ⇒
• X ⊥ Y ⇒

[Figure: the Markov equivalence class over X, Y, Z implied by each set of independencies]

Markov equivalence
• Two more examples:
• Are these graphs Markov equivalent?
• Are these two graphs?
Shared structure
• What is shared by all of the graphs in a Markov equivalence class?
• Same “skeleton”
• I.e., they all have the same adjacency relations
• Same “unshielded colliders”
• I.e., X → Y ← Z with no edge between X and Z
• Sometimes, other edges have same direction
• In these last two cases, we can infer that the true graph contains the shared directed edges.
Shared structure as patterns
• Since every Markov equivalent graph has the same adjacencies, we can represent the whole class using a pattern
• A pattern is itself a graph, but its edges summarize the edges of every graph in the equivalence class
Shared structure as patterns
• A pattern can have directed and undirected edges
• It represents all graphs that can be created by adding arrowheads to the undirected edges without creating either: (i) a cycle; or (ii) an unshielded collider
• Let’s try some examples…
Shared structure as patterns

Pattern: Nitrogen — PlantGrowth — Bees

Represents:

Nitrogen → PlantGrowth → Bees

Nitrogen ← PlantGrowth → Bees

Nitrogen ← PlantGrowth ← Bees

Shared structure as patterns

Pattern: Nitrogen → PlantGrowth ← Bees

Represents only: Nitrogen → PlantGrowth ← Bees
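The Nitrogen example can be verified mechanically. A small sketch (the representation is assumed: a pattern is given as a set of directed edges plus a list of undirected edges): enumerate every orientation of the undirected edges and keep those that create neither a cycle nor an unshielded collider the pattern does not already contain.

```python
from itertools import product

def is_acyclic(nodes, edges):
    """Kahn's algorithm over directed `edges` (a set of (tail, head) pairs)."""
    indeg = {n: 0 for n in nodes}
    for _, head in edges:
        indeg[head] += 1
    queue, seen = [n for n in nodes if indeg[n] == 0], 0
    while queue:
        u = queue.pop()
        seen += 1
        for tail, head in edges:
            if tail == u:
                indeg[head] -= 1
                if indeg[head] == 0:
                    queue.append(head)
    return seen == len(nodes)

def unshielded_colliders(edges):
    """Triples (x, c, y) with x -> c <- y and x, y non-adjacent."""
    adjacent = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    return {(x, c, y)
            for x, c in edges for y, c2 in edges
            if c == c2 and x < y and (x, y) not in adjacent}

def pattern_members(nodes, directed, undirected):
    """DAGs represented by a pattern: orient the undirected edges every
    way that creates no cycle and no new unshielded collider."""
    base = unshielded_colliders(directed)
    for flips in product([False, True], repeat=len(undirected)):
        edges = set(directed) | {(v, u) if flip else (u, v)
                                 for (u, v), flip in zip(undirected, flips)}
        if is_acyclic(nodes, edges) and unshielded_colliders(edges) == base:
            yield edges

nodes = ["Nitrogen", "PlantGrowth", "Bees"]
undirected = [("Nitrogen", "PlantGrowth"), ("PlantGrowth", "Bees")]
for dag in pattern_members(nodes, set(), undirected):
    print(sorted(dag))
```

Run on the Nitrogen — PlantGrowth — Bees pattern, this prints exactly the two chains and the fork listed above, and rejects the collider orientation, since that would introduce an unshielded collider the pattern does not contain.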

Formal problem of search
• Given some dataset D, find:
• Markov equivalence class, represented as a pattern P, that predicts exactly the independence relations found in the data
• More colloquially, find the causal graphs that could have produced data like this
Hard to find a pattern
• “Gee, how hard could this be? Just test all of the associations, find the Markov equivalence class, then write down the pattern for it. Voila! We’re doing causal learning!”
• Big problem: the # of independencies to test grows exponentially in the # of variables:
• 2 variables ⇒ 1 test
• 3 variables ⇒ 6 tests
• 4 variables ⇒ 24 tests
• 5 variables ⇒ 80 tests
• 6 variables ⇒ 240 tests
• and so on…
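These counts follow a simple formula: there are C(n, 2) unordered pairs of variables, and each pair can be tested conditional on any subset of the remaining n − 2 variables, giving C(n, 2) · 2^(n−2) tests in total. A quick check:

```python
from math import comb

def num_tests(n):
    """C(n, 2) pairs, each tested conditional on any subset of the
    remaining n - 2 variables: C(n, 2) * 2**(n - 2) tests."""
    return comb(n, 2) * 2 ** (n - 2)

for n in range(2, 7):
    print(n, num_tests(n))   # 1, 6, 24, 80, 240
```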
General features of causal search
• Huge model and parameter spaces
• Even when we (necessarily) use prior information about the family of probability distributions.
• Relevant statistics must be rapidly computed
• But substantive knowledge about the domain may restrict the space of alternative models
• Time order of variables
• Required cause/effect relationships
• Existence or non-existence of latent variables
Three schemata for search
• Bayesian / score-based
• Find the graph(s) with highest P(graph | data)
• Constraint-based
• Find the graph(s) that predict exactly the observed associations and independencies
• Combined
• Get “close” with constraint-based, and then find the best graph using score-based
Bayesian / score-based
• Informally:
• Give each model an initial score using “prior beliefs”
• Update each score based on the likelihood of the data if the model were true
• Output the highest-scoring model
• Formally:
• Specify P(M, v) for all models M and possible parameter values v of M
• For any data D, P(D | M, v) can easily be calculated
• P(M | D) ∝ ∫ P(D | M, v) P(M, v) dv
Bayesian / score-based
• In practice, this strategy is completely computationally intractable
• There are too many graphs to check them all
• So, we use a greedy search strategy
• Iteratively compare the current graph’s score (∝ posterior probability) with that of each 1- or 2-step modification of that graph
• By edge addition, deletion or reversal
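Here is a minimal sketch of that greedy loop, assuming linear-Gaussian data and a BIC score, and using only single-edge moves (the 1- and 2-step moves above generalize this). None of this code is from the course files.

```python
import numpy as np
from itertools import permutations

def bic_node(data, child, parents):
    """BIC contribution of one node in a linear-Gaussian model:
    regress `child` on `parents`, penalize (#coefficients / 2) * log n."""
    n = data.shape[0]
    y = data[:, child]
    X = (np.column_stack([data[:, sorted(parents)], np.ones(n)])
         if parents else np.ones((n, 1)))
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return -0.5 * n * np.log(resid @ resid / n) - 0.5 * X.shape[1] * np.log(n)

def bic(data, parents_of):
    # BIC decomposes over nodes, so the graph score is a sum.
    return sum(bic_node(data, v, ps) for v, ps in parents_of.items())

def creates_cycle(parents_of, u, v):
    """Would adding u -> v create a cycle, i.e., is u reachable from v?"""
    stack, seen = [v], set()
    while stack:
        w = stack.pop()
        if w == u:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(c for c, ps in parents_of.items() if w in ps)
    return False

def greedy_search(data):
    d = data.shape[1]
    parents_of = {v: set() for v in range(d)}
    best, improved = bic(data, parents_of), True
    while improved:                      # stop at a local maximum
        improved = False
        for u, v in permutations(range(d), 2):
            for op in ("add", "remove", "reverse"):
                trial = {k: set(ps) for k, ps in parents_of.items()}
                if op == "add" and u not in trial[v] and not creates_cycle(trial, u, v):
                    trial[v].add(u)
                elif op == "remove" and u in trial[v]:
                    trial[v].remove(u)
                elif op == "reverse" and u in trial[v]:
                    trial[v].remove(u)
                    if creates_cycle(trial, v, u):
                        continue
                    trial[u].add(v)
                else:
                    continue
                score = bic(data, trial)
                if score > best + 1e-9:  # greedy: accept any improvement
                    parents_of, best, improved = trial, score, True
    return parents_of

# Demo: data generated from the chain X -> Z -> Y (columns 0, 1, 2).
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
z = 2.0 * x + rng.normal(size=2000)
y = -1.5 * z + rng.normal(size=2000)
# Typically returns a member of the chain's Markov equivalence class.
print(greedy_search(np.column_stack([x, z, y])))
```

Note that BIC is score-equivalent for linear-Gaussian models, so the search may return any member of the true graph's Markov equivalence class, which is exactly the local-maxima issue the next slide discusses.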
Bayesian / score-based
• Problem #1: Local maxima
• Often, greedy searches get stuck
• Solution:
• Greedy search over Markov equivalence classes, rather than graphs (Meek)
• Has a proof of correctness and convergence (Chickering)
• But it gets to the right answer slowly
Bayesian / score-based
• Problem #2: Unobserved variables
• Huge number of graphs
• Huge number of different parameterizations
• No fast, general way to compute likelihoods from latent variable models
• Partial solution:
• Focus on a small, “plausible” set of models for which we can compute scores
Constraint-based
• Implementation of the earlier idea
• “Build” the Markov equivalence class that predicts the pattern of association actually found in the data
• Compatible with a variety of statistical techniques
• Note that we might have to introduce a latent variable to explain the pattern of statistics
• Important constraints on search:
• Minimize the number of statistical tests
• Minimize the size of the conditioning sets (Why?)
Constraint-based
• Algorithm step #1: Discover the adjacencies
• Create the complete graph with undirected edges
• Test all pairs X, Y for unconditional independence
• Remove X—Y edge if they are independent
• Test all adjacent X, Y for independence given a single neighbor N
• Remove X—Y edge if they are independent
• Test adjacent pairs given two neighbors, and so on…
Constraint-based
• Algorithm step #2: (Try to) Orient edges
• “Unshielded triple”: X — C — Y, but X, Y not adjacent
• If X & Y independent given S containing C, then C must be a non-collider
• Since we have to condition on it to achieve d-separation
• If X & Y independent given S not containing C, then C must be a collider
• Since the path is not active when not conditioning on C
• And then apply further orientation rules to avoid creating cycles or new unshielded colliders
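These two steps are essentially the skeleton and collider-orientation phases of the PC algorithm. A compact sketch, with an independence oracle `independent(x, y, S)` standing in for statistical tests (the names and representation are my own, not the course's):

```python
from itertools import combinations

def pc_skeleton(variables, independent):
    """Step 1: start complete; remove X - Y when some conditioning set S
    (drawn from X's current neighbors, smallest sets first) makes them
    independent. `independent(x, y, S)` is the oracle / statistical test."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset, size = {}, 0
    while any(len(adj[x] - {y}) >= size for x in variables for y in adj[x]):
        for x, y in combinations(variables, 2):
            if y not in adj[x]:
                continue
            for S in combinations(sorted(adj[x] - {y}), size):
                if independent(x, y, set(S)):
                    adj[x].discard(y); adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(S)
                    break
        size += 1
    return adj, sepset

def orient_colliders(adj, sepset):
    """Step 2: for each unshielded triple X - C - Y, orient X -> C <- Y
    iff C is not in the set that separated X and Y."""
    arrows = set()
    for c in adj:
        for x, y in combinations(sorted(adj[c]), 2):
            if y not in adj[x] and c not in sepset[frozenset((x, y))]:
                arrows.add((x, c)); arrows.add((y, c))
    return arrows
```

Conditioning only on current neighbors, smallest sets first, is exactly how the algorithm minimizes the number and size of the tests mentioned above.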
Constraint-based example
• Variables are {X, Y, Z, W}
• Only independencies are:
• X ⊥ Y
• X ⊥ W | Z
• Y ⊥ W | Z


Constraint-based example
• Step 1: Form the complete graph using undirected edges

[Figure: complete undirected graph over X, Y, Z, W]

Constraint-based example
• Step 2: For each pair of variables, remove the edge between them if they’re unconditionally independent

X ⊥ Y ⇒

[Figure: graph with the X—Y edge removed]

Constraint-based example
• Step 3: For each adjacent pair, remove the edge if they’re independent conditional on some variable adjacent to one of them

{X, Y} ⊥ W | Z ⇒

[Figure: graph with the X—W and Y—W edges removed]

Constraint-based example
• Step 4: Continue removing edges, checking independence conditional on 2 (or 3, or 4, or…) variables

[Figure: final skeleton X—Z, Y—Z, Z—W; no further edges are removed]

Constraint-based example
• Step 5: Orientation
• For X – Z – Y, since X ⊥ Y without conditioning on Z, make Z a collider: X → Z ← Y
• Since Z is a non-collider between X and W, though, we must orient Z – W away from Z
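For reference, feeding the example's three independencies to the PC sketch above as an oracle reproduces this result (the `oracle` helper is hypothetical, encoding just these facts):

```python
def oracle(x, y, S):
    """Independence oracle for the example's three facts."""
    facts = [("X", "Y", set()), ("X", "W", {"Z"}), ("Y", "W", {"Z"})]
    return any({x, y} == {a, b} and S == c for a, b, c in facts)

adj, sepset = pc_skeleton(["X", "Y", "Z", "W"], oracle)
print({v: sorted(adj[v]) for v in adj})        # skeleton: X—Z, Y—Z, Z—W
print(sorted(orient_colliders(adj, sepset)))   # [('X', 'Z'), ('Y', 'Z')]
```

Meek's rules would then orient Z → W, exactly as argued above: if W → Z, then X → Z ← W would be a new unshielded collider, contradicting Z ∈ sepset(X, W).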
Constraint-based output
• Searches that allow for latent variables can also have edges of the form X o→ Y
• This indicates one of three possibilities:
• X→Y
• At least one unobserved common cause of X and Y
• Both of these
Interventions to the rescue?
• Interventions helped us solve an earlier equivalence class problem
• Randomization meant that: Treatment-Effect association ⇒ T → E
• Interventions alter equivalence classes, but don’t make them all into singletons
• The fundamental problem of search remains

[Figure: Markov equivalence classes over X, Y, Z before the X-intervention, and the refined classes after the X-intervention]
Search with interventions
• Search with interventions is the same as search with observations, except
• We adjust the graphs in the search space to account for the intervention
• For multiple experiments, we search for graphs in every output equivalence class
• More complicated than this in the real world due to sampling variation
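Adjusting a graph for a hard intervention is mechanical: every edge into an intervened variable is cut, and the rest of the graph is unchanged. A one-function sketch in the parent-set representation used earlier:

```python
def intervene(parents_of, targets):
    """Mutilated graph for a hard intervention: every edge into an
    intervened variable is cut; the rest of the graph is unchanged."""
    return {v: (set() if v in targets else set(ps))
            for v, ps in parents_of.items()}
```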


Example
• Observation:
• Y ⊥ Z | X ⇒
• Intervention on X:
• Y ⊥ {X, Z} ⇒
• Only possible graph: Y → X → Z
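We can verify this conclusion by brute force, reusing `d_separated` and `intervene` from the sketches above: mutilate each observationally equivalent graph and keep the ones that predict the post-intervention independencies.

```python
# The three graphs that agree under observation (all imply Y ⊥ Z | X).
candidates = {
    "Y -> X -> Z": {"Y": set(), "X": {"Y"}, "Z": {"X"}},
    "Y <- X <- Z": {"Z": set(), "X": {"Z"}, "Y": {"X"}},
    "Y <- X -> Z": {"X": set(), "Y": {"X"}, "Z": {"X"}},
}
for name, dag in candidates.items():
    g = intervene(dag, {"X"})             # cut edges into X
    # After intervening on X, we observed Y ⊥ X and Y ⊥ Z.
    ok = (d_separated(g, {"Y"}, {"X"}, set())
          and d_separated(g, {"Y"}, {"Z"}, set()))
    print(name, "consistent" if ok else "ruled out")
# Only "Y -> X -> Z" survives: in both other graphs, the edge X -> Y
# remains active after the intervention, so Y and X stay dependent.
```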