An overview of the Markov properties of directed acyclic graphs (DAGs), covering Bayesian networks, conditional independence, Markov equivalence, and faithfulness, together with I-maps, factorization, the causal interpretation of DAGs, and the Greedy Equivalence Search algorithm.
Markov Properties of Directed Acyclic Graphs Peter Spirtes
Outline • Bayesian Networks • Simplicity and Bayesian Networks • Causal Interpretation of Bayesian Networks • Greedy Equivalence Search
Directed Acyclic Graph (DAG) [Figure: DAG with edges bd → bl, bd → ns, ng → ns] ng – no gas; bl – battery light on; bd – battery dead; ns – no start • The vertices are random variables. • All edges are directed. • There are no directed cycles.
Definition of Conditional Independence • Let P(X,Y,Z) be a joint distribution over X, Y, Z. • X is independent of Y conditional on Z in P, written as IP(X,Y|Z), if and only if P(X|Y,Z) = P(X|Z) whenever P(Y,Z) > 0.
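As a concrete check of this definition, the sketch below tests IP(X,Y|Z) on a discrete joint distribution stored as a table; the helper names (marginal, is_ci) and the XOR example are illustrative, not from the slides. It uses the equivalent product form P(x,y,z)·P(z) = P(x,z)·P(y,z), which avoids dividing by zero.

```python
from collections import defaultdict

def marginal(joint, idxs):
    """Sum a joint table {(x, y, z): p} down to the variable positions in idxs."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[i] for i in idxs)] += p
    return out

def is_ci(joint, tol=1e-9):
    """Check IP(X,Y|Z) for a table over (x, y, z) assignments:
    P(x,y,z) * P(z) == P(x,z) * P(y,z) wherever P(y,z) > 0,
    which is equivalent to P(x|y,z) = P(x|z) whenever P(y,z) > 0."""
    pz = marginal(joint, (2,))
    pxz = marginal(joint, (0, 2))
    pyz = marginal(joint, (1, 2))
    xs = {a[0] for a in joint}
    ys = {a[1] for a in joint}
    zs = {a[2] for a in joint}
    return all(
        abs(joint.get((x, y, z), 0.0) * pz[(z,)] - pxz[(x, z)] * pyz[(y, z)]) < tol
        for x in xs for y in ys for z in zs
        if pyz[(y, z)] > tol
    )

# X and Y are independent fair coins, Z = X XOR Y: conditioning on Z couples X and Y.
xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
iid = {(x, y, z): 0.125 for x in (0, 1) for y in (0, 1) for z in (0, 1)}
print(is_ci(xor))  # False: not IP(X,Y|Z)
print(is_ci(iid))  # True: fully independent variables
```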
Probabilistic Interpretation - Local Markov Condition • Distribution P satisfies the Local Markov Condition for DAG G iff each vertex is independent in P of all vertices that are neither its parents nor its descendants, conditional on its parents. [Figure: DAG with edges bd → bl, bd → ns, ng → ns] • IP(ng,{bd,bl}|∅) • IP(bl,{ng,ns}|bd) • IP(bd,ng|∅) • IP(ns,bl|{bd,ng})
Probabilistic Interpretation: I-Map • If distribution P satisfies the Local Markov Condition for G, G is an I-map (Independence-map) of P.
Graphical Entailment • G entails a conditional independence relation I if I holds in every distribution P that G is an I-map of. • Examples: in every distribution P that G is an I-map of, • IP(bd,ng|∅) • IP(bl,ng|{bd,ns})
If I is Not Entailed by G • If conditional independence relation I is not entailed by G, then I may hold in some (but not every) distribution P that G is an I-map of. • Example: G is an I-map of some P such that ~IP(ns,bl|∅) and of some P’ such that IP’(ns,bl|∅)
Factorization [Figure: DAG with edges bd → bl, bd → ns, ng → ns] • If G is an I-map of P, then P(ng,bd,bl,ns) = P(ng)P(bd)P(bl|bd)P(ns|bd,ng)
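A minimal sketch of this factorization in code, for the battery example; the CPT numbers below are made-up placeholders, not values from the slides.

```python
from itertools import product

# Hypothetical CPTs for the battery example (numbers are illustrative only).
p_ng = {0: 0.9, 1: 0.1}                               # P(ng)
p_bd = {0: 0.8, 1: 0.2}                               # P(bd)
p_bl = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}   # P(bl | bd)
p_ns = {(0, 0): {0: 0.99, 1: 0.01}, (0, 1): {0: 0.1, 1: 0.9},
        (1, 0): {0: 0.15, 1: 0.85}, (1, 1): {0: 0.05, 1: 0.95}}  # P(ns | ng, bd)

def joint(ng, bd, bl, ns):
    """P(ng, bd, bl, ns) = P(ng) P(bd) P(bl | bd) P(ns | ng, bd)."""
    return p_ng[ng] * p_bd[bd] * p_bl[bd][bl] * p_ns[(ng, bd)][ns]

# The factorized joint sums to 1 over all 16 assignments.
assert abs(sum(joint(*a) for a in product((0, 1), repeat=4)) - 1.0) < 1e-12
```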
Example of Parametric Family [Figure: DAG with edges bd → bl, bd → ns, ng → ns] • P(ng = 0) • P(bd = 0) • P(bl = 0|bd = 0) • P(bl = 0|bd = 1) • P(ns = 0|ng = 0, bd = 0) • P(ns = 0|ng = 0, bd = 1) • P(ns = 0|ng = 1, bd = 0) • P(ns = 0|ng = 1, bd = 1) • For binary variables, the dimension of the set of distributions that G is an I-map of is 8.
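The count can be checked mechanically: each binary vertex with k parents contributes 2^k free parameters, one free probability per parent configuration. A sketch, with the parent sets read off the example DAG:

```python
# Parents of each vertex in the example DAG (bd → bl, bd → ns, ng → ns).
parents = {"ng": [], "bd": [], "bl": ["bd"], "ns": ["ng", "bd"]}

# For binary variables, a vertex with k parents contributes (2 - 1) * 2**k
# free parameters: one free probability per parent configuration.
dim = sum(2 ** len(ps) for ps in parents.values())
print(dim)  # 1 + 1 + 2 + 4 = 8
```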
Outline • Bayesian Networks • Simplicity and Bayesian Networks • Causal Interpretation of Bayesian Networks • Greedy Equivalence Search
Markov Equivalence • Two DAGs G1 and G2 are Markov equivalent when they contain the same variables, and for all disjoint X, Y, Z, X is entailed to be independent from Y conditional on Z in G1 if and only if X is entailed to be independent from Y conditional on Z in G2
Markov Equivalence • Two DAGs over the same set of variables are Markov equivalent iff they have: • the same adjacencies • the same unshielded colliders (X → Y ← Z, and no X → Z or Z → X edge)
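This criterion is easy to implement directly. A sketch, with DAGs represented as hypothetical child-to-parents dictionaries and the battery DAG as the example:

```python
def skeleton(dag):
    """Undirected adjacencies of a DAG given as {child: set_of_parents}."""
    return {frozenset((a, b)) for b, ps in dag.items() for a in ps}

def unshielded_colliders(dag):
    """Triples X → Y ← Z with X and Z non-adjacent."""
    adj = skeleton(dag)
    return {
        (frozenset((x, z)), y)
        for y, ps in dag.items()
        for x in ps for z in ps
        if x != z and frozenset((x, z)) not in adj
    }

def markov_equivalent(g1, g2):
    """Same adjacencies and same unshielded colliders."""
    return (skeleton(g1) == skeleton(g2)
            and unshielded_colliders(g1) == unshielded_colliders(g2))

# Reversing bd → bl preserves the adjacencies and the single unshielded
# collider ng → ns ← bd, so the two DAGs are Markov equivalent.
g = {"ng": set(), "bd": set(), "bl": {"bd"}, "ns": {"ng", "bd"}}
g_rev = {"ng": set(), "bd": {"bl"}, "bl": set(), "ns": {"ng", "bd"}}
assert markov_equivalent(g, g_rev)
```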
Markov Equivalence [Figure: two Markov equivalent DAGs, G and G’, over ng, bl, bd, ns]
Patterns • A pattern represents a Markov equivalence class of DAGs. [Figure: DAG G, DAG G’, and Pattern(G) over ng, bl, bd, ns]
Patterns • The adjacencies in a pattern are the same as the adjacencies in each DAG in the d-separation equivalence class. • An edge is oriented as A → B in the pattern if it is oriented as A → B in every DAG in the equivalence class. • An edge is left unoriented, as A – B, in the pattern if it is oriented as A → B in some DAGs in the equivalence class and as A ← B in others.
Patterns • All of the conditional independence relations entailed by a DAG G represented by a pattern G’ can be read off of G’. • So we can speak of a pattern being an I-map of P by extension.
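The construction rule above can be implemented directly when the equivalence class is given explicitly: keep an orientation only if every DAG in the class agrees on it. A sketch (the representation and function names are illustrative):

```python
def pattern(dags):
    """Build the pattern of a Markov equivalence class, given as a list of
    DAGs ({child: set_of_parents}). Returns (directed_edges, undirected_edges)."""
    directed_in_each = [
        {(p, c) for c, ps in dag.items() for p in ps} for dag in dags
    ]
    all_edges = set().union(*directed_in_each)
    compelled = set.intersection(*directed_in_each)  # oriented A → B in every DAG
    undirected = {frozenset(e) for e in all_edges if e not in compelled}
    return compelled, undirected

# The two equivalent battery DAGs from the earlier sketch:
g = {"ng": set(), "bd": set(), "bl": {"bd"}, "ns": {"ng", "bd"}}
g_rev = {"ng": set(), "bd": {"bl"}, "bl": set(), "ns": {"ng", "bd"}}
compelled, undirected = pattern([g, g_rev])
# compelled: ng → ns and bd → ns; undirected: bd – bl
print(compelled, undirected)
```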
Faithfulness • P is faithful to G if • every conditional independence entailed by G is true in P (Markov) • every conditional independence true in P is entailed by G • For every P, there is at most one pattern (possibly none) that P is faithful to.
Unfaithfulness [Figure: pattern G and pattern G’ over X, Y, Z] • The only independence in P is IP(X,Z|∅). • Both G and G’ are patterns that are I-maps of P, but P is faithful only to G’, not G.
Unfaithfulness [Figure: the chain X → Y → Z, which entails IP(X,Z|Y), and the collider X → Y ← Z, which entails IP(X,Z|∅)] • If both IP(X,Z|Y) and IP(X,Z|∅) hold in P, then P is not faithful to any DAG.
Violations of Faithfulness • If conditional independence can be expressed as a rational function of the parameters (as in any exponential family, including the Gaussian and multinomial families), then the parameter values that violate faithfulness form a set of Lebesgue measure 0.
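A standard illustration of such a measure-zero violation is path cancellation in a linear Gaussian model: in X → Y → Z with an extra edge X → Z, choosing the direct-effect coefficient to exactly cancel the indirect path makes X and Z marginally independent, even though the DAG does not entail that independence. A sketch (the coefficients are arbitrary except for the cancellation constraint):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Linear Gaussian SEM for X → Y → Z with an extra edge X → Z.
b, c = 0.8, 0.5   # X → Y and Y → Z coefficients
d = -b * c        # X → Z coefficient chosen so the two paths cancel exactly

x = rng.normal(size=n)
y = b * x + rng.normal(size=n)
z = c * y + d * x + rng.normal(size=n)

# cov(X, Z) = b*c + d = 0: X ⊥ Z marginally, although the DAG does not entail it.
print(np.corrcoef(x, z)[0, 1])          # ≈ 0 (unfaithful independence)
# Removing Y's structural contribution to Z exposes the direct X → Z effect.
print(np.corrcoef(x, z - c * y)[0, 1])  # clearly nonzero
```

Any generic perturbation of (b, c, d) destroys the cancellation, which is why such violations are Lebesgue measure 0 in the parameter space.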
Minimality • G is Markov and Minimal for P if and only if • every conditional independence entailed by G is true in P (Markov), and • no graph that entails a proper superset of the conditional independencies entailed by G is Markov to P (Minimality). • For every P, there is at least one, and possibly more than one, pattern that is Markov and Minimal for P.
Two Markov Minimal Graphs for P • In P, both IP(X,Z|Y) and IP(X,Z|∅) hold. [Figure: the chain X → Y → Z, which entails IP(X,Z|Y), and the collider X → Y ← Z, which entails IP(X,Z|∅); each is Markov and minimal for P]
Faithfulness and Minimality • If P is faithful to G, then G is the unique Markov Minimal pattern for P.
Lattice of I-maps [Figure: lattice of patterns over X, Y, Z, ordered by entailment inclusion in one direction and dimensionality in the other]
Outline • Bayesian Networks • Simplicity and Bayesian Networks • Causal Interpretation of Bayesian Networks • Greedy Equivalence Search
Causal Interpretation of DAG • There is an edge X → Y in G if and only if there is some pair of experimental manipulations of all of the variables in G other than Y, that • differ only in what manipulation is performed on X; and • make the probability of Y different.
Causal Sufficiency • A set S of variables is causally sufficient if there are no variables not in S that are direct causes of more than one variable in S.
Causal Sufficiency [Figure: DAG with edges bd → bl, bd → ns, ng → ns] • S = {ng, ns} is causally sufficient. • S = {ng, ns, bl} is not causally sufficient: bd is a direct cause of both bl and ns but is not in S.
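The definition can be checked mechanically. A sketch, reusing the hypothetical child-to-parents representation of the battery DAG from the earlier sketches:

```python
def causally_sufficient(dag, s):
    """True iff no variable outside `s` is a direct cause (parent) of
    more than one variable in `s`. `dag` maps child -> set of parents."""
    for v in set().union(*dag.values()) - set(s):
        if sum(v in dag[w] for w in s) > 1:
            return False
    return True

dag = {"ng": set(), "bd": set(), "bl": {"bd"}, "ns": {"ng", "bd"}}
print(causally_sufficient(dag, {"ng", "ns"}))        # True
print(causally_sufficient(dag, {"ng", "ns", "bl"}))  # False: bd causes both bl and ns
```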
Causal Markov Assumption • In a population Pop with causally sufficient graph G and distribution P, each variable is independent of its non-descendants (non-effects) conditional on its parents (immediate causes) in P. • Equivalently: G is an I-map of P.
Causal Faithfulness Assumption • In a population Pop with causally sufficient graph G and distribution P, I(X,Y|Z) in P only if X is entailed to be independent of Y conditional on Z by G.
Causal Faithfulness Assumption • As currently used, it is a substantive simplicity assumption, not a merely methodological one. It says to prefer the simplest explanation; if a more complicated explanation is actually true, the assumption leads us astray.
Causal Faithfulness Assumption • It serves two roles in practice: • Aiding model choice (in which case weaker versions of the assumption will do) • Simplifying search
Causal Minimality Assumption • In a population Pop with causally sufficient graph G and distribution P, G is a minimal I-map of P.
Causal Minimality Assumption • This is a strictly weaker simplicity assumption than Causal Faithfulness. • Given the manipulation interpretation of the causal graph, and an everywhere positive distribution, the Causal Minimality Assumption is entailed by the Causal Markov Assumption.
Outline • Bayesian Networks • Simplicity and Bayesian Networks • Causal Interpretation of Bayesian Networks • Greedy Equivalence Search
Greedy Equivalence Search • Inputs: Sample from a probability distribution over a causally sufficient set of variables from a population with causal graph G. • Output: A pattern that represents G.
Assumptions • Causally sufficient set of variables • No feedback • The true causal structure can be represented by a DAG. • Causal Markov Assumption • Causal Faithfulness Assumption
Consistent scores • If model M1 contains the true distribution while model M2 does not, then in the large sample limit M1 gets the higher score. • If both M1 and M2 contain the true distribution, and M1 has fewer parameters than M2, then in the large sample limit M1 gets the higher score.
Consistent scores • So, assuming the Causal Markov and Faithfulness conditions, a consistent score assigns the true model (and its Markov equivalent models) the highest score in the limit. • Examples: the Bayesian Information Criterion (BIC), the posterior under the BDe or BGe prior.
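As an illustration, the sketch below computes a decomposable BIC score, log-likelihood minus (dim/2)·log n, for a DAG over binary data; the child-to-parents mapping is the same hypothetical representation used in the earlier sketches.

```python
import numpy as np
from itertools import product

def bic(data, dag):
    """BIC of a DAG ({child: list_of_parents}) for binary data
    ({var: np.array of 0/1}): max log-likelihood minus (dim / 2) * log(n)."""
    n = len(next(iter(data.values())))
    ll, dim = 0.0, 0
    for v, ps in dag.items():
        dim += 2 ** len(ps)  # free parameters of a binary CPT
        for cfg in product((0, 1), repeat=len(ps)):
            mask = np.ones(n, dtype=bool)
            for p, val in zip(ps, cfg):
                mask &= data[p] == val
            n_cfg = mask.sum()
            if n_cfg == 0:
                continue
            n1 = (data[v][mask] == 1).sum()
            for k in (n1, n_cfg - n1):  # plug in the MLE k / n_cfg
                if k > 0:
                    ll += k * np.log(k / n_cfg)
    return ll - 0.5 * dim * np.log(n)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 1000)
y = np.where(rng.random(1000) < 0.3, 1 - x, x)  # noisy copy of x
data = {"x": x, "y": y}
# Score equivalence: x → y and x ← y are Markov equivalent and get equal BIC.
print(bic(data, {"x": [], "y": ["x"]}), bic(data, {"y": [], "x": ["y"]}))
```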
Score Equivalence • Under BIC, BDe, and BGe, Markov equivalent DAGs are also score equivalent, i.e., they always receive the same score. (Posterior probabilities under a wide variety of priors could also be used.) • This allows GES to search over the space of patterns instead of the space of DAGs.
Two Phases of GES • Forward Greedy Search (FGS): Starting with any pattern Q, evaluate all patterns that are I-maps of Q with one more edge, and move to the one with the greatest increase in score. Iterate until a local maximum is reached. • Backward Greedy Search (BGS): Starting with a pattern Q that contains the true distribution, evaluate all patterns with one fewer edge of which Q is an I-map, and move to the one with the greatest increase in score. Iterate until a local maximum is reached.
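A schematic sketch of the two phases as hill-climbing with a consistent score; `add_one_edge`, `remove_one_edge`, and `score` are assumed helpers, and real GES (Chickering, 2002) searches equivalence classes with more refined move operators:

```python
def greedy_phase(g, neighbors, score):
    """Hill-climb: move to the best-scoring neighbor while the score improves."""
    current = score(g)
    while True:
        best_g, best_s = None, current
        for h in neighbors(g):
            s = score(h)
            if s > best_s:
                best_g, best_s = h, s
        if best_g is None:        # local maximum reached
            return g
        g, current = best_g, best_s

def ges(empty_graph, add_one_edge, remove_one_edge, score):
    """Two phases of GES: forward (edge additions), then backward (deletions).
    `add_one_edge(g)` / `remove_one_edge(g)` yield the one-edge neighbors of g."""
    g = greedy_phase(empty_graph, add_one_edge, score)  # FGS
    return greedy_phase(g, remove_one_edge, score)      # BGS
```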
Asymptotic Optimality • In the large sample limit, GES always returns a pattern that is minimal and Markov to P. • Proof: Because the score is consistent, the forward phase continues until it reaches a pattern G that is Markov to P. The backward phase preserves the Markov property, and continues until there is no pattern that is both Markov to P and entails more independencies.
Asymptotic Optimality • If P is faithful to the causal pattern G, then GES returns G. • Proof sketch: G is the unique pattern that is minimal and Markov to P. By the previous theorem, GES returns G.
When GES Fails to Find the Right Causal Pattern • In the large sample limit, GES always finds a Markov Minimal pattern. • That pattern may not be the true causal pattern if • the Causal Minimality Assumption is violated; or • there are multiple Markov Minimal patterns, and GES outputs the wrong one
One Markov Minimal Pattern for P, but Not the Causal Pattern • In P, I(X,Z|∅) holds but I(X,Z|Y) does not. [Figure: the true causal DAG over X, Y, Z and the unique Markov minimal pattern for P, which differ]
Two Markov Minimal Graphs for P • In P, both I(X,Z|Y) and I(X,Z|∅) hold. [Figure: DAG G, entailing I(X,Z|Y), and DAG G’, entailing I(X,Z|∅), over X, Y, Z] • For some parametric families, G and G’ have the same score and the same dimension. • For other parametric families, G’ has lower dimension and a higher score than G.
Multiple Minimal I-maps? • Lower the standard of success • Output all Markov Minimal patterns • Output any Markov Minimal pattern • More data • Experiments • More background knowledge • Time order, etc.
References • Chickering, D. M. (2002). Optimal Structure Identification with Greedy Search. Journal of Machine Learning Research, 3, pp. 507-554. • Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search, 2nd edition. MIT Press, Cambridge, MA.