Finding Optimal Bayesian Networks with Greedy Search

Max Chickering

Outline
  • Bayesian-Network Definitions
  • Learning
  • Greedy Equivalence Search (GES)
  • Optimality of GES

Bayesian Networks

Use B = (S, θ) to represent p(X1, …, Xn)
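
To make the notation concrete, here is a minimal illustrative sketch (not from the talk) of a network B = (S, θ) over three binary variables, computing the joint by the factorization p(X1, …, Xn) = ∏ p(Xi | Par(Xi)); the dictionary encoding and the parameter values are assumptions for the example.

```python
# Illustrative encoding of B = (S, theta) for the chain X -> Y -> Z,
# so that p(X, Y, Z) = p(X) * p(Y | X) * p(Z | Y).

# Structure S: each variable maps to its list of parents.
S = {"X": [], "Y": ["X"], "Z": ["Y"]}

# Parameters theta: p(var = 1 | parent values), keyed by parent-value tuple
# (the numbers are made up for the example).
theta = {
    "X": {(): 0.3},
    "Y": {(0,): 0.2, (1,): 0.9},
    "Z": {(0,): 0.4, (1,): 0.7},
}

def joint(assignment):
    """p(x1, ..., xn) = product over variables of p(x_i | parents(x_i))."""
    prob = 1.0
    for var, parents in S.items():
        p_one = theta[var][tuple(assignment[p] for p in parents)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

print(joint({"X": 1, "Y": 1, "Z": 0}))  # 0.3 * 0.9 * (1 - 0.7) = 0.081
```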

Markov Conditions

From the factorization: I(X, ND | Par(X)), i.e., each X is independent of its non-descendants (ND) given its parents (Par).

[Figure: a node X with its parents, descendants (Desc), and non-descendants.]

Markov Conditions + Graphoid Axioms characterize all independencies

Structure/Distribution Inclusion

p is included in S if there exists θ s.t. B = (S, θ) defines p

[Figure: within the space of all distributions, p falls inside the region representable by the structure S over X, Y, Z.]

Structure/Structure Inclusion: T ≤ S

T is included in S if every p included in T is included in S

[Figure: the set of distributions representable by T over X, Y, Z sits inside the set representable by S.]

(S is an I-map of T)

Structure/Structure Equivalence: T ≡ S

[Figure: the sets of distributions representable by S and by T over X, Y, Z coincide.]

Reflexive, Symmetric, Transitive

Equivalence

[Figure: two DAGs over A, B, C, D with the same skeleton (adjacencies) and the same v-structure (a pair of edges converging on a node whose endpoints are non-adjacent).]

Theorem (Verma and Pearl, 1990)

S ≡ T ⟺ S and T have the same skeletons and v-structures

Learning Bayesian Networks

  • Learn the structure
  • Estimate the conditional distributions

[Figure: a generative distribution p* over X, Y, Z produces iid samples (rows of observed X, Y, Z values), from which the model is learned.]

Learning Structure

  • Scoring criterion: F(D, S)
  • Search procedure: identify one or more structures with high values for the scoring function

Properties of Scoring Criteria
  • Consistent
  • Locally Consistent
  • Score Equivalent

Consistent Criterion

Criterion favors (in the limit) the simplest model that includes the generative distribution p*:

  • S includes p*, T does not include p* ⟹ F(S,D) > F(T,D)
  • Both include p*, S has fewer parameters ⟹ F(S,D) > F(T,D)

[Figure: candidate structures over X, Y, Z compared against the generative distribution p*.]

Locally Consistent Criterion

S and T differ by one edge:

[Figure: two structures over X and Y; T has the extra edge, S does not.]

If I(X, Y | Par(X)) holds in p*, then F(S,D) > F(T,D)

Otherwise F(S,D) < F(T,D)

Score-Equivalent Criterion

[Figure: equivalent structures S and T over X and Y, with the single edge oriented in opposite directions.]

S ≡ T ⟹ F(S,D) = F(T,D)

Bayesian Criterion
(Consistent, locally consistent and score equivalent)

Sh: the hypothesis that the generative distribution p* has the same independence constraints as S.

FBayes(S,D) = log p(Sh | D) = k + log p(D | Sh) + log p(Sh)

Here log p(Sh) is the structure prior (e.g., prefer simple) and log p(D | Sh) is the marginal likelihood (closed form with the usual assumptions).
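
Since a later slide notes that the Bayesian score approaches BIC plus a constant, a BIC-style approximation is the easiest version to sketch. The following implementation for discrete data is illustrative only; the function name, the (n, num_vars) array layout, and the arity map are assumptions, not the talk's code.

```python
import numpy as np

def bic_score(data, dag, arity):
    """BIC approximation to the Bayesian criterion: maximized log-likelihood
    minus (log n / 2) times the number of free parameters.
    data: (n, num_vars) int array; dag: column -> list of parent columns;
    arity: column -> number of states."""
    n = data.shape[0]
    score = 0.0
    for x, parents in dag.items():
        # Joint counts of (parent configuration, value of x).
        counts = {}
        for row in data:
            key = tuple(row[p] for p in parents)
            cell = counts.setdefault(key, np.zeros(arity[x]))
            cell[row[x]] += 1
        # Log-likelihood term: sum of N * log(N / N_total) over non-empty cells.
        for cell in counts.values():
            nz = cell[cell > 0]
            score += float((nz * np.log(nz / cell.sum())).sum())
        # Penalty: (arity(x) - 1) free parameters per parent configuration.
        q = int(np.prod([arity[p] for p in parents])) if parents else 1
        score -= 0.5 * np.log(n) * (arity[x] - 1) * q
    return score

# Toy usage on random binary data, scoring the chain 0 -> 1 -> 2:
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 3))
print(bic_score(data, {0: [], 1: [0], 2: [1]}, {0: 2, 1: 2, 2: 2}))
```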

Search Procedure
  • Set of states
  • Representation for the states
  • Operators to move between states
  • Systematic Search Algorithm

Greedy Equivalence Search

  • Set of states: equivalence classes of DAGs
  • Representation for the states: essential graphs
  • Operators to move between states: forward and backward operators
  • Systematic search algorithm: two-phase greedy

Representation: Essential Graphs

[Figure: a DAG over A, B, C, D, E, F and its essential graph, in which compelled edges remain directed and reversible edges become undirected.]

GES Operators

Forward Direction – single edge additions

Backward Direction – single edge deletions

Two-Phase Greedy Algorithm

  • Phase 1: Forward Equivalence Search (FES)
      • Start with the all-independence model
      • Run greedy search using the forward operators
  • Phase 2: Backward Equivalence Search (BES)
      • Start with the local max from FES
      • Run greedy search using the backward operators (a sketch of the full loop follows)
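
A minimal sketch of the two-phase loop, under the assumption that `score` implements F(D, S) and that `neighbors_fwd` / `neighbors_bwd` enumerate the forward and backward operator results; real GES searches over essential graphs, which this skeleton abstracts away.

```python
def ges(score, neighbors_fwd, neighbors_bwd, empty_state):
    """Two-phase greedy: FES (forward operators) then BES (backward operators)."""
    state = empty_state                                # all-independence model
    for neighbors in (neighbors_fwd, neighbors_bwd):   # Phase 1, then Phase 2
        improved = True
        while improved:                                # climb until a local max
            improved = False
            best, best_score = state, score(state)
            for candidate in neighbors(state):
                s = score(candidate)
                if s > best_score:
                    best, best_score, improved = candidate, s, True
            state = best                               # take the best neighbor
    return state
```
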
Forward Operators
  • Consider all DAGs in the current state
  • For each DAG, consider all single-edge additions (acyclic)
  • Take the union of the resulting equivalence classes
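
For the per-DAG step, here is a hedged sketch of enumerating the acyclic single-edge additions, reusing the parent-set dictionary encoding assumed in the earlier examples:

```python
def is_acyclic(dag):
    """Depth-first check that following parent links never revisits the stack."""
    on_stack, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in on_stack:
            return False                     # back edge: a cycle
        on_stack.add(v)
        ok = all(visit(p) for p in dag[v])
        done.add(v)
        return ok
    return all(visit(v) for v in dag)

def single_edge_additions(dag):
    """Yield every DAG obtained by adding one edge x -> y that keeps acyclicity."""
    for x in dag:
        for y in dag:
            if x != y and x not in dag[y] and y not in dag[x]:
                new = {v: set(ps) for v, ps in dag.items()}
                new[y].add(x)                # tentatively add x -> y
                if is_acyclic(new):
                    yield new
```
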
Forward-Operators Example

Current State: [figure: an essential graph over A, B, C]

All DAGs in the current state: [figure: the DAGs over A, B, C consistent with the essential graph]

All DAGs resulting from single-edge addition: [figure: each of those DAGs with one acyclic edge added]

Union of corresponding essential graphs: [figure: the resulting set of neighboring states]

Backward Operators
  • Consider all DAGs in the current state
  • For each DAG, consider all single-edge deletions
  • Take the union of the resulting equivalence classes
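
The backward analogue is simpler, since deleting an edge can never create a cycle; again the parent-set dictionary encoding is an assumption:

```python
def single_edge_deletions(dag):
    """Yield every DAG obtained by deleting one edge x -> y (always acyclic)."""
    for y in dag:
        for x in dag[y]:
            new = {v: set(ps) for v, ps in dag.items()}
            new[y].discard(x)                # delete edge x -> y
            yield new
```
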
Backward-Operators Example

Current State: [figure: an essential graph over A, B, C]

All DAGs in the current state: [figure: the DAGs over A, B, C consistent with the essential graph]

All DAGs resulting from single-edge deletion: [figure: each of those DAGs with one edge removed]

Union of corresponding essential graphs: [figure: the resulting set of neighboring states]

DAG Perfect

DAG-perfect distribution p: there exists a DAG G such that

I(X,Y | Z) in p ⟺ I(X,Y | Z) in G

Non-DAG-perfect distribution q: a distribution over A, B, C, D whose independencies are I(A,D | B,C) and I(B,C | A,D); no single DAG encodes exactly this pair.

[Figure: candidate DAGs over A, B, C, D, each capturing one of the two independencies but not both.]

DAG-Perfect Consequence: Composition Axiom Holds in p*

If X is dependent on a set Y given Z, then X is dependent on some singleton Y ∈ Y given Z.

[Figure: X dependent on the set {A, B, C, D}; composition singles out one member, e.g. C, on which X remains dependent.]

Optimality of GES

If p* is DAG-perfect wrt some G*:

[Figure: p*, perfect wrt G* (whose equivalence class is S*), generates n iid samples of X, Y, Z; running GES on the data returns a state S.]

For large n, S = S*

Optimality of GES

  • Proof Outline
  • After the first phase (FES), the current state includes S*
  • After the second phase (BES), the current state equals S*

[Figure: all-independence model → FES → state includes S* → BES → state equals S*.]

FES Maximum Includes S*

Assume: the local max does NOT include S*. Take any DAG G from S.

Markov conditions characterize independencies, so in p* there exists an X not independent of its non-descendants given its parents:

[Figure: X with parent E and non-descendants A, B, C, D] ⟹ ¬I(X, {A,B,C,D} | E) in p*

p* is DAG-perfect ⟹ the composition axiom holds:

[Figure: the same graph, singling out C] ⟹ ¬I(X, C | E) in p*

Locally consistent: adding the C→X edge improves the score, and its equivalence class is a neighbor.

BES Identifies S*

  • Current state always includes S*: local consistency of the criterion
  • Local maximum is S*: Meek's conjecture

Meek's Conjecture

For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of

  (1) covered edge reversals in G
  (2) single-edge additions to G

such that after each change G ≤ H, and after all changes G = H.
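
Both transformation steps are easy to mechanize. Below is a hedged sketch of detecting covered edges (X→Y is covered when Par(Y) = Par(X) ∪ {X}, so reversing it stays within the same equivalence class) and reversing one; the encoding and helper names are illustrative assumptions.

```python
def covered_edges(dag):
    """Edges x -> y with Par(y) = Par(x) + {x}; reversible within the class."""
    return [(x, y) for y in dag for x in dag[y] if dag[y] == dag[x] | {x}]

def reverse_edge(dag, x, y):
    """Return a copy of the DAG with x -> y replaced by y -> x."""
    new = {v: set(ps) for v, ps in dag.items()}
    new[y].discard(x)
    new[x].add(y)
    return new

# A -> B with Par(A) = {} and Par(B) = {A} is covered:
g = {"A": set(), "B": {"A"}}
print(covered_edges(g))   # [('A', 'B')]
```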

Meek's Conjecture (Example)

[Figure: H over A, B, C, D with independencies I(A,B) and I(C,B | A,D); a sequence of covered edge reversals and single-edge additions transforms G into H, preserving G ≤ H after each step.]

Meek's Conjecture and BES: S* ≤ S

Assume: the local max S is not S*. Take any DAG H from S and any DAG G from S*.

By Meek's conjecture, a sequence of covered edge reversals and single-edge additions transforms G into H, with G ≤ H after each step:

[Figure: G → Add, Rev, Rev, Add, Rev → H]

Run in reverse, the same sequence becomes covered edge reversals and single-edge deletions transforming H into G:

[Figure: H → Del, Rev, Rev, Del, Rev → G]

The first deletion in the reversed sequence yields an equivalence class that is a neighbor of S in BES and still includes S*:

[Figure: S → neighbor of S in BES → … → S*]

Discussion Points

  • In practice, GES is as fast as DAG-based search: the neighborhood of an essential graph can be generated and scored very efficiently
  • When the DAG-perfect assumption fails, we still get optimality guarantees: as long as composition holds in the generative distribution, the local maximum is inclusion-minimal

Thanks!

My Home Page: http://research.microsoft.com/~dmax

Relevant Papers:

  • "Optimal Structure Identification with Greedy Search" (JMLR submission): contains detailed proofs of Meek's conjecture and the optimality of GES
  • "Finding Optimal Bayesian Networks" (UAI 2002 paper with Chris Meek): contains the extension of the GES optimality results to the non-DAG-perfect case

Bayesian Criterion is Locally Consistent

  • The Bayesian score approaches BIC + constant
  • BIC is decomposable
  • The difference in score is therefore the same for any DAGs that differ by the Y→X edge, provided X has the same parents (see the sketch below)

[Figure: two networks differing by the Y→X edge; the complete network always includes p*.]
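
Decomposability is what makes this work: the score is a sum of per-family terms, so the delta for adding Y→X touches only X's family. A minimal sketch, assuming some `local_score(data, x, parents)` family term such as the BIC term sketched earlier:

```python
def score_delta_add_edge(local_score, data, x, parents_of_x, y):
    """Score change from adding y -> x. Because the score decomposes into
    per-family terms, this delta is identical for any two DAGs that agree
    on Par(x), regardless of the rest of the graph."""
    return (local_score(data, x, parents_of_x | {y})
            - local_score(data, x, parents_of_x))
```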

Bayesian Criterion is Consistent

Assume the conditionals are unconstrained multinomials or linear regressions.

Geiger, Heckerman, King and Meek (2001): network structures are curved exponential models.

Haughton (1988): the Bayesian criterion is then consistent.

Bayesian Criterion is Score Equivalent

S ≡ T ⟹ F(S,D) = F(T,D)

[Figure: S is X→Y and T is Y→X; the hypotheses Sh and Th both assert no independence constraints, so Sh = Th.]

Active Paths

  • Z-active path between X and Y (non-standard):
      • Neither X nor Y is in Z
      • Every pair of colliding edges meets at a member of Z
      • No other pair of edges meets at a member of Z

[Figure: a path from X to Y whose collider lies in Z.]

G ≤ H: if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H.

Active Paths (continued)

[Figure: an active path X … A … Z, W … B … Y.]

  • X–Y: out of X and in to Y
  • X–W: out of both X and W
  • Any sub-path between A, B ∈ Z is also active
  • A–B and B–C active, at least one out of B ⟹ active path between A and C

Simple Active Paths

If an active path contains the edge Y→X, then there exists an active path in which either

  (1) the edge appears exactly once: A … Y→X … B

OR

  (2) the edge appears exactly twice: A … Y→X … X←Y … B

To simplify the discussion, assume (1) only; the proofs for (2) are almost identical.

Typical Argument: Combining Active Paths

[Figure: in G, Z is a sink node adjacent to X and Y; G' adds the X→Y edge; G ≤ H.]

G': suppose there is an active path (AP) in G' between A and B, with X not in the conditioning set (CS), that has no corresponding AP in H. Then Z is not in the CS.

Proof Sketch

Two DAGs G, H with G < H. Identify either:

  • a covered edge X→Y in G that has the opposite orientation in H
  • a new edge X→Y to be added to G such that G remains included in H

The Transformation

Choose any node Y that is a sink in H.

  • Case 1a: Y is a sink in G, and ∃X ∈ ParH(Y) with X ∉ ParG(Y)
  • Case 1b: Y is a sink in G with the same parents in both graphs
  • Case 2a: ∃X s.t. Y→X is covered
  • Case 2b: ∃X s.t. Y→X and ∃W that is a parent of Y but not of X
  • Case 2c: every Y→X has Par(Y) ⊆ Par(X)

[Figure: the local configurations around Y, X, and W for each case.]

Preliminaries (G ≤ H)

  • The adjacencies in G are a subset of the adjacencies in H
  • If X→Y←Z is a v-structure in G but not in H, then X and Z are adjacent in H
  • Any new active path that results from adding X→Y to G includes X→Y

Proof Sketch: Case 1

Y is a sink in G.

Case 1a: ∃X ∈ ParH(Y) with X ∉ ParG(Y). Add X→Y to G, giving G'.

[Figure: X→Y in H; X and Y non-adjacent in G; X→Y added in G'.]

Suppose there is some new active path between A and B (through Z, Y, X) not in H:

  • Y is a sink in G', so it must be in the CS
  • Neither X nor the next node Z is in the CS
  • In H: active paths AP(A,Z) and AP(X,B), plus Z→Y←X, combine into the corresponding path

Case 1b: parents identical. Remove Y from both graphs; the proof is similar.

Proof Sketch: Case 2

Y is not a sink in G.

Case 2a: there is a covered edge Y→X: reverse the edge.

Case 2b: there is a non-covered edge Y→X such that W is a parent of Y but not a parent of X. Add W→X, giving G'.

[Figure: W→Y→X in G; the W→X edge added in G'; the corresponding structure in H.]

Suppose there is some new active path between A and B not in H:

  • Y must be in the CS, else replace W→X by W→Y→X (the path is not new)
  • If X is not in the CS, then H has active paths A–W and X–B, plus W→Y←X, so the path exists in H

[Figure: the combined paths through W, Y, X, Z in G' and in H.]

Case 2c: The Difficult Case

All non-covered edges Y→Z have Par(Y) ⊆ Par(Z).

[Figure: G and H over W1, W2, Y, Z1, Z2.]

Adding W1→Y: G is no longer ≤ H (there is a Z2-active path between W1 and W2)

Adding W2→Y: G ≤ H still holds

Choosing Z

[Figure: G and H, each with the descendants of Y in G marked.]

D is the maximal G-descendant of Y in H.

Z is any maximal child of Y such that D is a descendant of Z in G.

Choosing Z (Example)

[Figure: G and H over W1, W2, Y, Z1, Z2.]

Descendants of Y in G: Y, Z1, Z2

Maximal descendant in H: D = Z2

Maximal child of Y in G that has D = Z2 as a descendant: Z2

Add W2→Y

Difficult Case: Proof Intuition

[Figure: G and H over A, W, Y, Z, B with maximal descendant D; "B or CS" marks where paths may terminate.]

  1. W is not in the CS
  2. Y is not in the CS, else the path is active in H
  3. In G, the next edges must point away from Y until B or the CS is reached
  4. In G, neither Z nor its descendants are in the CS, else the path was active before the addition
  5. From (1), (2), (4): active paths (A,D) and (B,D) in H
  6. Choice of D: a directed path from D to B or the CS in H

Optimality of GES

Definition: p is DAG-perfect wrt G if the independence constraints in p are precisely those in G.

Assumption: the generative distribution p* is perfect wrt some G* defined over the observable variables.

S* = the equivalence class containing G*

Under the DAG-perfect assumption, GES results in S*.

Important Definitions
  • Bayesian Networks
  • Markov Conditions
  • Distribution/Structure Inclusion
  • Structure/Structure Inclusion